Abstract
Digital processing of a nucleotide sequence requires it to be mapped to a numerical sequence in which the choice of nucleotide to numeric mapping affects how well its biological properties can be preserved and reflected from nucleotide domain to numerical domain. Digital spectral analysis of nucleotide sequences unfolds a period3 power spectral value which is more prominent in an exon sequence as compared to that of an intron sequence. The success of a period3 based exon and intron classification depends on the choice of a threshold value. The main purposes of this article are to introduce novel codes for 1sequence numerical representations for spectral analysis and compare them to existing codes to determine appropriate representation, and to introduce novel thresholding methods for more accurate period3 based exon and intron classification of an unknown sequence. The main findings of this study are summarized as follows: Among sixteen 1sequence numerical representations, the KQuaternary Code I offers an attractive performance. A windowed 1sequence numerical representation (with window length of 9, 15, and 24 bases) offers a possible speed gain over nonwindowed 4sequence Voss representation which increases as sequence length increases. A winner threshold value (chosen from the best among two defined threshold values and one other threshold value) offers a top precision for classifying an unknown sequence of specified fixed lengths. An interpolated winner threshold value applicable to an unknown and arbitrary length sequence can be estimated from the winner threshold values of fixed length sequences with a comparable performance. In general, precision increases as sequence length increases. The study contributes an effective spectral analysis of nucleotide sequences to better reveal embedded properties, and has potential applications in improved genome annotation.
Keywords:
DNA sequence; numerical representation; nucleotide to numeric mapping; exon and intron sequences; coding and noncoding sequences; threshold value; thresholding; exon and intron classification; period3; spectral analysis; discrete Fourier transform; gene detection; genome annotation1. Introduction
It is known that the coding regions of a nucleotide sequence exhibit a period3 spectral property because of the presence of codon structure [1], and this property can be used to identify regions of interest. A nucleotide sequence is a discrete sequence consisting of nucleotides C, G, A, and T in which digital spectral analysis can be used to reveal its hidden periodicities, spectral features, and genome structure. Digital spectral analysis is usually carried out by the discrete Fourier transform (DFT) which is a digital signal processing (DSP) technique. Genomic signal processing (GSP) involves the processing and analysis of a digital signal in the form of a numerical sequence mapped from a nucleotide sequence [1,2]. A general description on the biological aspects of this article can be found in [1,2]. To perform digital spectral analysis, each nucleotide of a nucleotide sequence has to be converted to a numerical value through a mapping. Such a mapping is called numerical representation and its choice affects how well the biological properties of a nucleotide sequence can be revealed in numerical domain. A nucleotide sequence can be numerically represented in the form of Rsequence, a survey on mappings with R ≥ 1 is given in [3,4]. 1sequence numerical representation (with R = 1) is the most compact form of mapping in which one nucleotide is mapped to one fixed numerical value to form a single sequence. 1sequence numerical representations and their relative performances are studied in this article for unknown exon and intron sequence classification with an improved accuracy.
In this article, nine 1sequence numerical representations (Codes 19) [516] are identified though a literature search, which can be grouped under positiveintegervalue, realvalue, and complexvalue numerical representations. In addition, seven 1sequence complexvalue numerical representations (Codes 1016) are derived. In this article, all these sixteen numerical representations and the Voss representation [17] are compared based on the genomes of twelve organisms (including the human).
Genome annotation involves a process of identifying the locations of the coding regions (and associated genes) in a genome. Coding regions exhibit the period3 property in the spectral domain which is less apparent in sequences other than exon sequences, and can therefore be used to detect exon sequences and to distinguish exon regions from intron regions in genomic sequences. The performance depends on an appropriate choice of a period3 threshold value to classify between an exon sequence and an intron sequence. In [17,18], the cumulative distributions of coding and noncoding sequences are used to determine a period3 threshold value. In this article, two threshold value determination methods are defined. Given a set of exon and intron sequences of a fixed length, all the three thresholding values are used to determine the winner threshold value applicable to any nucleotide sequence of the same length. Also, given a number of exon and intron sequence sets of different lengths, the winner threshold values can be interpolated to yield an interpolated winner threshold value applicable to an unknown nucleotide sequence of an arbitrary length. Some commonly used symbols (or abbreviations) are explained in Table 1.
Table 1. List of symbols
2. Methods and results
2.1. Numerical representation
There are a number of numerical representations for nucleotide sequences. In this article, we shall focus on direct and simple numerical representations which satisfy the following requirements: (a) Single sequence mapping for a nucleotide sequence; (b) Fixed value mapping for each nucleotide; and (c) Accessible to digital signal processing analysis. Sixteen 1sequence numerical representations (Codes 116) which satisfy these three requirements are defined in Table 2.
Table 2. List of sixteen numerical representation codes
It is known that exons are rich in nucleotides C and G and introns are rich in nucleotides A and T [8,19]. For ease of viewing, CG and AT are paired when defining each of the sixteen numerical representations. The first group of numerical representations consists of five positiveintegervalue numerical representations (as Codes 15 in Table 2) which shall be described as follows: The Integer Number representation [5,6] is obtained by mapping numerals {1, 3, 2, 0} respectively to the four nucleotides as C = 1, G = 3, A = 2, and T = 0. The Single Galois Indicator representation [7,8] maps the CGAT nucleotides to a Galois field of four, GF(4), which is formed by assigning numerical values to the nucleotides as C = 1, G = 3, A = 0, and T = 2 in a nucleotide sequence. This representation suggests that C < G and A < T. In the Paired Nucleotide Atomic Number representation [9], the paired nucleotides are assigned with atomic numbers as A, G = 62 and C, T = 42 respectively. In the Atomic Number representation [9], a numerical sequence is formed by assigning atomic numbers to each nucleotide as C = 58, G = 78, A = 70, and T = 66 in a nucleotide sequence. The Molecular Mass representation [10] of a nucleotide sequence is formed by mapping the four nucleotides to their molecular masses as C = 110, G = 150, A = 134, and T = 125, respectively.
The second group of numerical representations consists of three realvalue numerical representations (as Codes 68 in Table 2) which can be described as: The ElectronIon Interaction Pseudopotential (EIIP) represents the distribution of the free electrons energies along a nucleotide sequence. In the EIIP representation, a single EIIP indicator sequence [11,12] is formed by substituting the EIIP of the nucleotides as C = 0.1340, G = 0.0806, A = 0.1260, and T = 0.1335 in a nucleotide sequence. In the Paired Numeric representation [6,8], nucleotides (CG, AT) are to be paired in a complementary manner and values of 1 and +1 are to be used to denote, respectively, CG and AT nucleotide pairs. In the Real Number representation [6,8,13], the nucleotide mappings are C = 0.5, G = 0.5, A = 1.5, and T = 1.5, which bears complementary property. The third group consists of one complexvalue numerical representation (as Code 9 in Table 2) called the Complex Number representation [2,5,6,8,1416], it reflects the complementary nature of CG and AT pairs by mapping nucleotides as C = 1j, G = 1+j, A = 1+j, and T = 1j.
In addition to the above nine 1sequence numerical representations, seven 1sequence numerical representations (listed as Codes 1016 in Table 2) are derived in which each nucleotide of a sequence is mapped to either a single realvalue element (± 1) or a single imaginaryvalue element (± j). Each of the Codes 1016, namely, the KTwinPair Code, the KBipolarPair Codes I and II, and the KQuaternary Codes IIV has its equivalent numerical representations. We define a numerical representation R2 to be an equivalent numerical representation of R1 if R2 gives rise to the same power spectrum as that of R1. An equivalent numerical representation can be obtained by multiplying the numerical represented value of each of the four bases [C, G, A, T] of a nucleotide sequence by the same constant which includes any of 1, and ± j. In particular, a complementary numerical representation obtained by inverting the signs of all the four bases [C, G, A, T] in a numerical representation is an equivalent numerical representation. Other forms of equivalent numerical representation exist, for example, the KTwinPair Code [C, G, A, T] = [1, 1, j, j] has an equivalent form [C, G, A, T] = [j, j, 1, 1]. The KQuaternary Code III (Code 15) [C, G, A, T] = [j, 1, 1, j] is identical to the pentanary code in [20] (with [20] specifies that x[n] = 0 for an unknown nucleotide at position n) which was derived for the threedimensional DNA walk for graphical representation of nucleotide sequences. The KBipolarPair Code II (Code 12) [C, G, A, T] = [1, 1, j, j] has an equivalent numerical representation of [C,G,A,T] = [j, j, 1, 1] adopted in [21]. The KQuaternary Code IV (Code 16) [C, G, A, T] = [j, 1, j, 1] has a complementary numerical representation of [C, G, A, T] = [j, 1, j, 1] mentioned in [22]. In [23], complex numerical representations with simultaneous nonzero real part and nonzero imaginary part were adopted which are different from those of the Codes 1016 in which either a real value or an imaginary value is used for each nucleotide to numeric mapping. It should be noted that the numerical representations in [2123] were formulated for sequence alignment but not for period3 power spectral analysis.
The Voss representation (i.e., the Code 17) is a commonly used numerical representation for period3 spectral classification of exon and intron sequences [1,2,17,24,25]. The Voss representation is a 4sequence representation (R = 4) in which each of the four nucleotides of a nucleotide sequence is represented by a separate numerical sequence (as Csequence, Gsequence, Asequence, and Tsequence) such that the n^{th }position of the Csequence is coded by 1 if the n^{th }nucleotide of the sequence is C, otherwise it is coded by 0. In a similar manner, the coding procedure just described applies to the remaining three numerical sequences. A threshold value can be determined from the cumulative distributions of the signaltonoise ratio of the peak at f = 1/3 for each set of exon and intron sequences [17]. Using the Voss representation, two known threshold values (T_{c}, T_{4}) are adopted for comparison in this article. The T_{c }thresholding is determined from the exon and intron cumulative distribution functions [17] and the T_{4 }thresholding is set to a fixed value four [17]. The Codes 116 are to be compared to the Voss representation (Code 17) in Section 2.4.
2.2. Spectral analysis
The purpose of the spectral analysis of a numerically represented nucleotide sequence is to compute its period3 spectral component located at a frequency equal to 2π/3 in the DFT spectrum. Given a numerical sequence, x[l] for l = 1 to L, its finitelength DFT sequence, X[k] for k = 1 to L, and its inverse DFT, x[l] for l = 1 to L, are defined by
In this article, a rectangular windowing of length N bases is used and each consecutive window is rightshifted by three bases with an overlap window length of N6 bases between two adjacent windows. The rectangular windowing is adopted to avoid distortion on the cumulative DFT spectrum. We define the mean cumulative DFT spectrum, X_{T}[k], of all the windowed sequences, X_{m}[k] for m = 1 to M, as
The power spectrum S[k] is obtained from Equation 4 as
From Equation 5, we define the normalized power spectrum S_{n}[k] of a numerically represented nucleotide sequence of length L as
The period3 power spectral value (or period3 value) P_{3 }can be obtained from the normalized power spectrum of a numerically represented sequence at k = N/3+1 as
In all computer simulations, for a window length of WL equals to a sequence length of SL, one complete DFT analysis is adopted. For WL less than SL, each consecutive window is rightshifted by three bases with an overlap window length of "WL6" bases between two adjacent windows. The period3 values P_{3 }computed by Equation 7 for 4583 exon sequences and that for 4583 intron sequences (both of sequence length 1250 bases, window length 24 bases, and window rightshift 3 bases) obtained by the KQuaternary Code I are plotted in Figures 1 and 2. From these figures, it can be observed that the period3 values of exon sequences are higher than those of the intron sequences. This is an important property which will be used to classify exon sequences from intron sequences.
Figure 1. Period3 power spectral values of 4583 exon sequences using KQuaternary Code I. SL = 1250 bases, WL = 24 bases, and WRS = 3 bases.
Figure 2. Period3 power spectral values of 4583 intron sequences using KQuaternary Code I. SL = 1250 bases, WL = 24 bases, and WRS = 3 bases.
The log_{10 }power spectrum S_{n}[k] for 1 ≤ k ≤ N of an exon sequence and that of an intron sequence (both of sequence length 1250 bases, window length 24 bases, and window rightshift 3 bases) obtained by the KQuaternary Code I are plotted in Figures 3 and 4. As shown in Figure 3 (or Figure 4), the exon (or intron) period3 value at 2π/3 defined by Equation 7 has in fact been able to reveal a prominent power spectral value; also, the exon period3 value is higher than the intron period3 value. The DC power spectral value is usually the highest. Besides period3 and DC power spectral value, power spectral values of other periodicities are clearly shown in Figures 3 and 4.
Figure 3. Log_{10}(Exon power spectrum) S[k] for k = 1 to 24 of exon sequence # 4583 using KQuaternary Code I. SL = 1250 bases, WL = 24 bases, and WRS = 3 bases.
Figure 4. Log_{10}(Intron power spectrum) S[k] for k = 1 to 24 of intron sequence # 4583 using KQuaternary Code I. SL = 1250 bases, WL = 24 bases, and WRS = 3 bases.
2.3. Winner threshold value
In this section, the statistics of the period3 values computed from a training set of exon sequences and intron sequences are used to define the threshold values of two novel thresholding methods to classify an untrained sequence to be either an exon sequence or an intron sequence. As explained in Table 1, meanP_{3e }and sdP_{3e }represent, respectively, the mean and standard deviation of the period3 values obtained from the exon sequences of a training set; and meanP_{3i }and sdP_{3i }represent, respectively, the mean and standard deviation of the period3 values obtained from the intron sequences of the same training set. We define two threshold values called the mid threshold value, T_{m}, and the proportional threshold value, T_{p}, as
In both definitions, the exon cluster is centred at meanP_{3e }and the intron cluster is centred at meanP_{3i}. In Equation 8, the mid threshold value is determined by the midpoint between the exon cluster and the intron cluster, whereas in Equation 9, the proportional threshold value is determined in proportion to the standard deviations of the two clusters.
Besides the above two methods for determining a threshold value, the crossover point of the cumulative distribution of all the exon period3 values, F(P_{3e}), and the complementary cumulative distribution of all the intron period3 values, F_{c}(P_{3i}), of a set of exon and intron training sequences can be used to determine a threshold value. We define such a cumulative distribution threshold value, T_{c}, as
Figure 5 shows a plot of F(P_{3e}) and F_{c}(P_{3i}) versus corresponding exon/intron period3 power spectral value. Each of the three threshold values T_{m}, T_{p}, and T_{c }can be used to classify an unknown nucleotide sequence. For illustration, let T_{t }be any of T_{m}, T_{p}, and T_{c}, if an unknown nucleotide sequence has a computed period3 power spectral value greater than or equal T_{t}, the unknown nucleotide sequence is classified as an exon sequence; otherwise it is classified as an intron sequence.
Figure 5. Cumulative distribution threshold value (T_{c}) = 3.881141 at crossover point of exon CDP3 PSV (red) and intron CCDP3 PSV (blue) using KQuaternary Code I. Exon/intron sequence number = 3437, SL = 1250 bases, WL = 24 bases, and WRS = 3 bases.
The performance is measured objectively in terms of the exon classification (%) which is the percentage of correct classification of exon sequences, the intron classification (%) which is the percentage of correct classification of intron sequences, and the precision (%). These three terms are defined in Equations 1113 as
We define a winner threshold value, T_{w}, as the threshold value chosen among T_{m}, T_{p}, and T_{c }that yields the top classification (or the highest precision).
2.4. Classification and speed performances of codes 117
In this article, all the exon and intron sequences of the human and eleven model organisms were downloaded from the UCSC Genomes [2629] as inputs to be processed by Matlab programs.
2.4.1. Classification performance of short sequences
The short sequence group I listed in Table 3 downloaded from the UCSC Human genome [2629] consists of the thirteen short sequence sets ranging from 50 bases to 650 bases (at intervals of 50 bases) as [50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650]. For each of the thirteen short sequence sets, the training sequence numbers are 1 to 10005 and the testing sequence numbers are 10006 to 13340 with a train/test ratio of three for both exon and intron sequences within a set. Each of the thirteen sequence sets is trained and tested using a combination of sixteen numerical representations, four window lengths, and four threshold values (T_{m}, T_{p}, T_{c}, and T_{4}). Figures 6(ab) and 7 (ac) plots the top classifications obtained using each of the thirteen untrained subsets of 3335 exon sequences and 3335 intron sequences of the short sequence group I. Figure 6 (a, top) displays a total of 13 × 4 precision values obtained using the thirteen sequence sets in thirteen combined columns with each combined column consists of four subcolumns of precision values. Each of the four subcolumn precision values shown in Figure 6 (a, top) corresponds to a top classification with its code index, window length index, threshold index, and threshold value displayed, respectively, in Figures 6 (b, bottom) and Figure 7 (ac). In each of the five subplots in Figures 6 (ab) and 7 (ac), the first subcolumn corresponds to the top classification obtained by the top performer among the Codes 1 to 16, the second subcolumn corresponds to the top classification obtained by the Code 10, the third subcolumn corresponds to the top classification obtained by the Code 13, and the fourth subcolumn corresponds to the top classification obtained by the Code 17. The Codes 10 and 13 are found to be among the top performers and therefore they are displayed in two separate subcolumns. As the Codes 116 are to be compared to the Code 17 (described in Section 2.1), therefore the Code 17 is also displayed as one subcolumn.
Table 3. UCSC Human genome consisting of 2 short sequence groups.
Figure 6. (ab) Precision (a, top) and code index (b, bottom) of top classifications of 13 short sequence sets obtained by best of Codes 116 (dark blue), Code 10 (light blue), Code 13 (yellow), and Code 17 (red).
Figure 7. (ac) WL index (a, top), threshold index (b, middle), and threshold values (c, bottom) of top classifications of 13 short sequence sets obtained by best of Codes 116 (dark blue), Code 10 (light blue), Code 13 (yellow), and Code 17 (red).
The classification performance shown in Figures 6(ab) and 7 (ac) and summarized in Table 4 indicate that for the thirteen short sequence sets: (a) The top performer for 8 of these thirteen sequence sets is the Code 13, followed by the Code 17 as the top performer for the remaining 5 sequence sets (with the Code 13 ranks 2^{nd }in 4 cases and the Code 10 ranks 2^{nd }in 1 case, each of these 5 cases is also shown in Table 4 on the row following that of the top performer). (b) The window length of 9 bases yields 7 top classifications; the window length of 15 bases yields 2 top classifications; and for the window length equals to the sequence length, 4 top classifications are obtained. (c) The T_{c }thresholding yields 8 top classifications; followed by the T_{p }thresholding which yields 5 top classifications. In general, top classifications can often be achieved by the Code 13, using WL = 9 bases and T_{p }thresholding for SL = 50 to 450 bases; and using WL = 9 bases and T_{c }thresholding for SL = 500 to 650 bases.
Table 4. Code (Figure 6b, bottom), WL (bases) (Figure 7a, top), threshold method (Figure 7b, middle), and precision (%) (Figure 6a, top) of top classifications of 13 short sequence sets
2.4.2. Classification performance of long sequences
The long sequence group I listed in Table 5 downloaded from the UCSC Human genome [2629] consists of the thirteen long sequence sets ranging from 650 bases to 1250 bases (at intervals of 50 bases) as [650, 700, 750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200, 1250]. For each of the thirteen long sequence sets, the training sequence numbers are 1 to 3437 and the testing sequence numbers are 3438 to 4583 with a train/test ratio of three for both exon and intron sequences within a set. Each of the thirteen sequence sets is trained and tested using a combination of sixteen numerical representations, four window lengths, and four threshold values (T_{m}, T_{p}, T_{c}, and T_{4}). The top classification results of Codes 116 are compared to those obtained by the Code 17 [17] (as described in Section 2.1). Figures 8(ab) and 9 (ac) plots the top classifications obtained using each of the thirteen untrained subsets consisting of 1146 exon sequences and 1146 intron sequences of the long sequence group I. The notations and explanations of Figures 6(ab) and 7 (ac) regarding the short sequence set I in Table 3 apply to Figures 8(ab) and 9 (ac) which display the corresponding results of the long sequence set I in Table 5.
Table 5. UCSC Human genome consisting of 2 long sequence groups
Figure 8. (ab) Precision (a, top) and code index (b, bottom) of top classifications of 13 long sequence sets obtained by best of Codes 116 (dark blue), Code 10 (light blue), Code 13 (yellow), and Code 17 (red).
Figure 9. (ac) WL index (a, top), threshold index (b, middle), and threshold value (c, bottom) of top classifications of 13 long sequence sets obtained by best of Codes 116 (dark blue), Code 10 (light blue), Code 13 (yellow), and Code 17 (red).
The classification performance shown in Figures 8(ab) and 9 (ac) and summarized in Table 6 indicates that for the thirteen long sequence sets: (a) The top performer for the thirteen sequence sets is the Code 13, except for SL = 1100 bases in which the Code 1 (with WL = 24 bases and T_{4 }thresholding) is marginally higher than that of the Code 13 (with WL = 9 bases and T_{p }thresholding as shown in Table 6 on the row following that of the top performer). (b) The window length of 9 bases yields 5 top classifications; the window length of 15 bases yields 4 top classifications; and for window length equals to 24 bases, 3 top classifications are obtained. (c) The T_{p }thresholding yields 6 top classifications; followed by the T_{4 }thresholding which yields 4 top classifications; and the T_{c }thresholding which yields 3 top classifications. In general, top classifications are achievable often using a combination of the Code 13, WL = 9 bases, and T_{p }thresholding.
Table 6. Code (Figure 8b, bottom), WL (bases) (Figure 9a, top), threshold method (Figure 9b, middle), and precision (%) (Figure 8a, top) of top classifications of 13 long sequence sets
2.4.3. Speed performance
In this article, the DFTbased spectral analysis of a numerical sequence is computed using the FFT. In general, the computational complexity of a 4sequence numerical representation (such as the Voss representation [17]) is always four times that of a 1sequence numerical representation if either windowing or nonwindowing is used in the comparison. The FFT computational complexity for a numerical sequence of length L is L × log_{2}(L) and for a windowed sequence of WL = N is N × log_{2}(N). As each consecutive window is rightshifted by three bases, the number of windows NW = 1 + fix((LN)/3) where fix(X) rounds the elements of X to the nearest integer. For a nonwindowed 4sequence numerical representation of SL = L, the computational complexity is 4L × log_{2}(L) whereas for a windowed 1sequence numerical representation of SL = L, the computational complexity is NW × N × log_{2}(N). The ratio of 4L × log_{2}(L) over NW × N × log_{2}(N) gives a speed improvement ratio. A plot of the speed improvement ratio against SL for WL = [9,15,24] bases is shown in Figure 10. As seen from Figure 10, it can be observed that a windowed 1sequence representation with WL = [9,15,24] bases offers a possible speed advantage over a nonwindowed 4sequence representation and the speed improvements increases as SL increases. Also shown in Figure 10, for WL = SL, N = L, the speed improvement ratio is four.
Figure 10. Speed improvement ratio of 1sequence representation over 4sequence representation for window length = 9, 15, 24, SL bases in FFT spectral analysis.
2.5. Interpolated winner threshold value
With the use of the winner threshold value described in Section 2.3, a top precision for an unknown nucleotide sequence of length equals to any of the thirteen long or short sequence sets can be obtained. However, if the length of an unknown nucleotide sequence is neither any of the thirteen short sequence sets nor any of the thirteen long sequence set, an interpolated winner threshold value can be determined as described in this section. For illustration, let us consider the KQuaternary Code I but the interpolated winner threshold methodology also applies to other numerical representations. For a set of exon and intron training sequences, the winner threshold value, T_{w}(P_{3}), is first chosen among the three threshold values, T_{m}(P_{3}), T_{p}(P_{3}), and T_{c}(P_{3}) that yields a top precision during testing. For each window length, the winner threshold value, T_{w}(P_{3}), versus sequence length of the set of thirteen short sequences with SL = 50 to 650 bases is plotted in Figure 11 and similarly in Figure 12 for the set of thirteen long sequences with SL = 650 to 1250 bases. From Figures 11 and 12, given an unknown nucleotide sequence of arbitrary length (within the length of trained sequences), an interpolated winner threshold value, T_{i}(P_{3}), at any of the four specified window lengths can be obtained. Using an interpolated winner threshold value, T_{i}(P_{3}), if the unknown nucleotide sequence has a period3 power spectral value greater than or equal T_{i}(P_{3}), the unknown nucleotide sequence is classified as an exon sequence; otherwise it is classified as an intron sequence.
Figure 11. Winner threshold value T_{w}(P_{3}) versus SL of 13 short sequence sets with WL = 9, 15, 24, SL bases using KQuaternary Code I.
Figure 12. Winner threshold value T_{w}(P_{3}) versus SL of 13 long sequence sets with WL = 9, 15, 24, SL bases using KQuaternary Code I.
2.5.1. Short sequences
For each WL, an interpolated threshold value for an arbitrary sequence length can be estimated using cubic spline interpolation from the corresponding discrete winner threshold values shown in Figure 11. For each of the four WL, the interpolated threshold values and the corresponding precisions obtained on testing each of the twelve untrained sequence sets (under the short sequence group II of Table 3) of sequence lengths [90, 120, 180, 210, 270, 330, 390, 420, 480, 510, 570, 630] bases, are displayed in Figures 13 and 14 using thin red bars. For comparison, the winner threshold values and the corresponding precisions of the thirteen short sequence sets for each of the four WL are displayed in Figures 13 and 14 using thick blue bars. From Figures 13 and 14, it can be observed that the interpolated winner threshold values and their precisions perform well and follow the general trends.
Figure 13. Interpolated winner threshold values (thin red bar) of 12 untrained short sequence sets obtained from winner threshold values (thick blue bar) of 13 short sequence sets using KQuaternary Code I.
Figure 14. Precisions (thin red bar) of 12 untrained short sequence sets (obtained using interpolated winner threshold values) versus precisions (thick blue bar) of 13 short sequence sets (obtained using winner threshold values), both using KQuaternary Code I.
2.5.2. Long sequences
In a similar manner to the short sequence sets, for each WL, an interpolated winner threshold value for an arbitrary sequence length can be determined using cubic spline interpolation from the corresponding discrete winner threshold values of the thirteen long sequence sets shown in Figure 12. For each of the four WL, twelve untrained sequence sets (under the long sequence group II of Table 5) of sequence lengths [690, 720, 780, 810, 870, 930, 990, 1020, 1080, 1110, 1170, 1230] bases are tested, the interpolated winner threshold values and the corresponding precisions obtained are displayed in Figures 15 and 16 using thin red bars. For comparison, the winner threshold values and the corresponding precisions for each of the four WL of the thirteen long sequence sets are displayed in Figures 15 and 16 using thick blue bars. From Figures 15 and 16, it can be observed that the interpolated winner threshold values and their precisions work well and follow the general trends.
Figure 15. Interpolated winner threshold values (thin red bar) of 12 untrained long sequence sets obtained from winner threshold values (thick blue bar) of 13 long sequence sets using KQuaternary Code I.
Figure 16. Precisions (thin red bar) of 12 untrained long sequence sets (obtained using interpolated winner threshold values) versus precisions (thick blue bar) of 13 long sequence sets (obtained using winner threshold values), both using KQuaternary Code I.
2.6. Classification performances of 12 organisms
To demonstrate the efficiency of the numerical representations and thresholding methods when applied to other model organisms, the UCSC genomes [2629] of twelve organisms (including the human) listed in Table 7 are tested. Each of the genomes of the twelve organisms is trained and tested with identical SL of 300 bases, four window lengths, and four threshold values, the results are compared based on the best of the Codes 116 (dark blue), the Code 10 (light blue), the Code 13 (yellow), and the Code 17 (red) [17] (as described in Section 2.1) as shown in Figures 1718. For the twelve organisms, Figures 17(ab) plot precision (a, top) and the corresponding code index (b, bottom); and Figure 18(ac) plot window length index (a, top), threshold index (b, middle), and threshold value (c, bottom) of the top classifications obtained. The results summarized in Table 8 indicate that top classifications are obtained by the Code 13 in seven organisms; followed by the Code 17 with four top classifications (with the Code 13 ranks 2^{nd }in 2 cases and the Code 10 ranks 2^{nd }in 1 case, each of these 3 cases is shown in Table 8 on the row following that of the top performer); and the Code 6 with one top classification. The results also indicate that the Code 13 with (a) WL = 24 bases and T_{p }thresholding is the top choice for organisms 1, 3, 4; (b) WL = 24 bases and T_{4 }thresholding is the top choice for organism 7; (c) WL = 15 bases and T_{c }thresholding is the top choice for organisms 11 and 12; and (d) WL = 300 bases and T_{m }thresholding is the top choice for organism 2.
Table 7. UCSC genome of 12 organisms
Figure 17. (ab) Precision (a, top) and code index (b, bottom) of top classifications of 12 organisms obtained by best of Codes 116 (dark blue), Code 10 (light blue), Code 13 (yellow), and Code 17 (red).
Figure 18. (ac) WL index (a, top), threshold index (b, middle), and threshold value (c, bottom) of top classifications of 12 organisms obtained by best of Codes 116 (dark blue), Code 10 (light blue), Code 13 (yellow), and Code 17 (red).
Table 8. Code index (Figure 17b, bottom), WL (bases) (Figure 18a, top), threshold method (Figure 18b, middle), and precision (%) (Figure 17a, top) of top classifications of 12 organisms (SL = 300 bases)
3. Discussion
In this article, the ability of the KQuaternary Code I (the Code 13) through the use of the discrete Fourier transform to capture the periodicities of an exon sequence or an intron sequence across the entire spectrum at a resolution defined by the window length is shown in Figures 3 and 4. Such a spectral tracking ability is shared among all of the sixteen numerical representations. As seen from Figures 3 and 4, there are three prominent peaks located at frequency equal to 0, 2π/3, and 4π/3. The peak at 2π/3 corresponds to the period3 power spectral value, and the peak at 0 corresponds to the power spectral value at DC which usually exhibits the highest value within a spectrum. The power spectral values at other frequencies are lower and different but all serve to reflect their actual power spectral properties across the spectrum. For the Codes 916, their numerical represented sequences x[n] are complex; therefore, DFT spectrum shows nonsymmetrical peaks at frequency equal to 2π/3 and 4π/3. However, for the Codes 18 in which their numerical represented sequences x[n] are real and therefore symmetrical peaks at frequency equal to 2π/3 and 4π/3 are obtained from DFT analysis.
The interpolated winner threshold values, T_{i}(P_{3}), and their corresponding precisions shown in Figures 13 and 14 for short sequences, and in Figures 15 and 16 for long sequences of the human genome indicate that T_{i}(P_{3}) obtained using cubic spline interpolation from either Figure 11 or Figure 12 can yield a similar precision as compared to that of the winner threshold value, T_{w}(P_{3}), of its nearby SL. It should be noted that each of T_{m}(P_{3}), T_{p}(P_{3}), and T_{c}(P_{3}) required to determine T_{w}(P_{3}) has to be computed directly from the training portion of each short/long sequence set whereas T_{i}(P_{3}) requires minimal computation but can achieve a comparable precision.
Given an unknown human nucleotide sequence of an arbitrary length L, if L is equal to the length of any of the thirteen short sequence sets or any of the thirteen long sequence sets, the choice of a suitable set of code, WL, and threshold can be obtained as a tablelookup from either Table 4 or 6. If L is not equal to the length of any of these thirteen short or long sequence sets, the closest SL, its code, and WL can also be obtained from either Table 4 or 6. For the latter case, once the code and WL are determined, its threshold value can then be determined using the interpolated winner threshold described in Section 2.5. Besides the human genome, the methodologies described in this article can be applied to the genome of other model organisms as verified by the results shown in Figures 17(ab) and 18 (ac) and Tables 7 and 8.
4. Conclusions
In this article, two methods for determining threshold values have been defined, and together with the cumulative distribution threshold value, determine the winner threshold value for classifying an unknown nucleotide sequence of a fixed length. An interpolated winner threshold value has also been introduced to classify an unknown nucleotide sequence of an arbitrary length with a comparable performance to that obtained by the winner threshold value of its nearby SL (in classifying an unknown nucleotide sequence of a fixed length). In general, precision increases as sequence length increases, and classification performance depends on a suitable choice of numerical representation and window length. Sixteen 1sequence numerical representations have been presented and compared to classify untrained exon and intron sequences in the spectral domain, in which the KQuaternary Code I yields attractive performance. When comparing each of the sixteen windowed 1sequence numerical representations using WL = [9,15,24] bases to a nonwindowed 4sequence numerical representation (such as the Voss representation), the speed improvement ratio increases as SL increases which favours long nucleotide sequence analysis. The results obtained indicate the methodologies introduced in this article for exon and intron sequence classification are applicable to the genomes of the human and other model organisms. Overall, the study has developed novel methodologies in numerical representation for improved nucleotide to numeric mapping, spectral analysis for effective period3 spectral value computation, and thresholding for more accurate classification of unknown exon and intron sequences of fixed and arbitrary length.
Competing interests
The authors declare that they have no competing interests.
References

PP Vaidyanathan, Genomics and proteomics: A signal processor's tour. IEEE Circ. Syst. Mag. 4th Q, 6–29 (2004)

D Anastassiou, Genomic signal processing. IEEE Signal Process. Mag. 18, 820 (2001)

HK Kwan, SB Arniker, Numerical representation of DNA sequences, 307–310 in Proceedings of IEEE International Conference on Electro/Information Technology (EIT), Windsor, Ontario, Canada, 79 June 2009

SB Arniker, HK Kwan, Graphical representation of DNA sequences, 311–314 in Proceedings of IEEE International Conference on Electro/Information Technology (EIT), Windsor, Ontario, Canada, 79 June 2009

PD Cristea, in BIOS’02: Genetic Signal Representation and Analysis, SPIE International Conference on Biomedical Optics Symposium, Molecular Analysis and Informatics, San Jose, CA, USA, 2124 January 2002. Proceedings of Society of PhotoOptical Instrumentation Engineers (SPIE) Conference, vol. 4623 (SPIE, 2002), 77–84

M Akhtar, J Epps, E Ambikairajah, On DNA numerical representations for period3 based exon prediction. 4 pages in Proceedings of IEEE Workshop on Genomic Signal Processing and Statistics (GENSIPS), Tuusula, Finland, 1012 June 2007

GL Rosen, Signal Processing for Biologicallyinspired Gradient Source Localization and DNA Sequence Analysis (PhD dissertation, Georgia Institute of Technology, Atlanta, 2006)

M Akhtar, J Epps, E Ambikairajah, Signal processing in sequence analysis: Advances in eukaryotic gene prediction. IEEE J Sel Top Signal Process 2, 310–321 (2008)

T Holden, R Subramaniam, R Sullivan, E Cheng, C Sneider, G Tremberger Jr., A Flamholz, DH Leiberman, TD Cheung, ATCG nucleotide fluctuation of Deinococcus radiodurans radiation genes. in Instruments, Methods, and Missions for Astrobiology X. Proceedings of Society of PhotoOptical 2 Instrumentation Engineers (SPIE) Conference, vol. 6694 (SPIE, 12 September 2007), pp. 6694171–66941710

HE Stanley, SV Buldyrev, AL Goldberger, ZD Goldberger, S Havlin, SM Ossadnik, CK Peng, M Simmons, Statistical mechanics in biology: how ubiquitous are longrange correlations? Physica A 205, 214–253 (1994). PubMed Abstract  Publisher Full Text

AS Nair, SS Pillai, A coding measure scheme employing electronion interaction pseudo potential (EIIP). Bioinformation 1, 197–202 (2006). PubMed Abstract  PubMed Central Full Text

I Cosic, Macromolecular Bioactivity: is it resonant interaction between macromolecules? Theory and applications. IEEE Trans Biomed Eng 41, 1101–1114 (1994). PubMed Abstract  Publisher Full Text

N Chakravarthy, A Spanias, LD Lasemidis, K Tsakalis, Autoregressive modeling and feature analysis of DNA sequences. EURASIP Journal of Genomic Signal Processing 1, 13–28 (2004)

PD Cristea, Conversion of nucleotides sequences into genomic signals. J Cell Mol Med 6, 279–303 (2002). PubMed Abstract  Publisher Full Text

PD Cristea, Representation and analysis of DNA sequences. in Genomic Signal Processing and Statistics, EURASIP Book Series in Signal Processing and Communications, volume 2 (Hindawi Publishing Corporation, 2005), , ed. by ER Dougherty, I Shmulevich, J Chen, ZJ Wang, pp. 15–65

AK Brodzik, O Peters, Symbolbalanced Quaternionic periodicity transform for latent pattern detection in DNA sequences. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 5, Philadelphia, USA, March 2005, 373–376

S Tiwari, S Ramachandran, A Bhattacharya, S Bhattacharya, R Ramaswamy, Prediction of probable genes by Fourier analysis of genomic sequences. Bioinformatics (CABIOS) 13(3), 263–270 (1997). Publisher Full Text

S Datta, A Asif, A fast DFT based gene prediction algorithm for identification of protein coding regions. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 5, Philadelphia, USA, March 2005, 653–656

BYM Kwan, JYY Kwan, HK Kwan, R Atwal, OT Shen, 4 pages in Proceedings of TENCON, Hong Kong, China, 2006

JA Berger, SK Mitra, M Carli, A Neri, New approaches to genome sequence analysis based on digital signal processing. 4 pages in Proceedings of IEEE Workshop on Genomic Signal Processing and Statistics (GENSIPS), Raleigh, North Carolina, October 2002

EA Cheever, DB Searls, W Karunaratne, GC Overton, Using signal processing techniques for DNA sequence comparison, 173–174 in Proceedings of the 1989 Fifteenth Annual Northeast Bioengineering Conference, Boston, MA, 2728 March 1989

S Rajasekaran, H Nick, PM Pardalos, S Sahni, G Shaw, Efficient algorithms for local alignment search. J Comb Optim 5, 117–124 (2001). Publisher Full Text

GD Avenio, M Grigioni, G Orefici, R Creti, SWIFT (sequencewide investigation with Fourier transform): a software tool for identifying proteins of a given class from the unannotated genome sequence. Bioinformatics 21(13), 2943–2949 (2005). PubMed Abstract  Publisher Full Text

CC Yin, SST Yau, Prediction of protein coding regions by the 3base periodicity analysis of a DNA sequence. J Theor Biol 247, 687–694 (2007). PubMed Abstract  Publisher Full Text

J Tuqan, A Rushdi, A DSP approach for finding the codon bias in DNA sequences. IEEE J Sel Topics Signal Process 2(3), 343–356 (2008)

D Karolchik, AS Hinrichs, TS Furey, KM Roskin, CW Sugnet, D Haussler, WJ Kent, The UCSC Table Browser data retrieval tool. Nucleic Acids Res, D493–496 (2004). 

J Goecks, A Nekrutenko, J Taylor, The Galaxy Team, Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology 11(8) Article R86 BioMed Central Full Text

D Blankenberg, G Von Kuster, N Coraor, G Ananda, R Lazarus, M Mangan, A Nekrutenko, J Taylor, Galaxy: a webbased genome analysis tool for experimentalists. Curr Protoc Mol Biol Chapter 19(Unit 19.10), 1–21 (2010)

B Giardine, C Riemer, RC Hardison, R Burhans, L Elnitski, P Shah, Y Zhang, D Blankenberg, I Albert, J Taylor, W Miller, WJ Kent, A Nekrutenko, Galaxy: a platform for interactive largescale genome analysis. Genome Res 15(10), 1451–1455 (2005). PubMed Abstract  Publisher Full Text  PubMed Central Full Text