Novel methodologies for spectral classification of exon and intron sequences
1 Department of Electrical and Computer Engineering, University of Windsor, 401 Sunset Avenue, Windsor, ON N9B 3P4, Canada
2 Faculty of Medicine, University of Ottawa, 451 Smyth Road, Ottawa, ON K1H 8M5, Canada
3 School of Medicine, Queen's University, 80 Barrie Street, Kingston, ON K7L 3N6, Canada
EURASIP Journal on Advances in Signal Processing 2012, 2012:50 doi:10.1186/1687-6180-2012-50Published: 28 February 2012
Digital processing of a nucleotide sequence requires it to be mapped to a numerical sequence in which the choice of nucleotide to numeric mapping affects how well its biological properties can be preserved and reflected from nucleotide domain to numerical domain. Digital spectral analysis of nucleotide sequences unfolds a period-3 power spectral value which is more prominent in an exon sequence as compared to that of an intron sequence. The success of a period-3 based exon and intron classification depends on the choice of a threshold value. The main purposes of this article are to introduce novel codes for 1-sequence numerical representations for spectral analysis and compare them to existing codes to determine appropriate representation, and to introduce novel thresholding methods for more accurate period-3 based exon and intron classification of an unknown sequence. The main findings of this study are summarized as follows: Among sixteen 1-sequence numerical representations, the K-Quaternary Code I offers an attractive performance. A windowed 1-sequence numerical representation (with window length of 9, 15, and 24 bases) offers a possible speed gain over non-windowed 4-sequence Voss representation which increases as sequence length increases. A winner threshold value (chosen from the best among two defined threshold values and one other threshold value) offers a top precision for classifying an unknown sequence of specified fixed lengths. An interpolated winner threshold value applicable to an unknown and arbitrary length sequence can be estimated from the winner threshold values of fixed length sequences with a comparable performance. In general, precision increases as sequence length increases. The study contributes an effective spectral analysis of nucleotide sequences to better reveal embedded properties, and has potential applications in improved genome annotation.