Exons and Introns Classification in Human and Other Organisms

This paper studies the relative performance of spectral classification of short exon and intron sequences from human and eleven model organisms. In the simulations, all combinations of sixteen one-sequence numerical representations, four threshold values, and four window lengths are considered. Sequences of 150 bases are chosen, and for each organism a total of 16,000 sequences are used for training and testing. Results indicate that an appropriate combination of one-sequence numerical representation, threshold value, and window length is essential for achieving the best spectral classification results. For fixed-length sequences, the precision of exon and intron classification differs across organisms because of their genomic differences. In general, precision increases as sequence length increases.
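As an illustration of how a numerical representation, a window length, and a threshold interact, the following minimal sketch scores a sequence by its period-3 spectral peak. It is not the procedure used in the paper. The EIIP mapping, the peak-to-mean power ratio, the non-overlapping windows, the function names `period3_ratio` and `classify`, and the default values `window_len=30` and `threshold=4.0` are all assumptions chosen for the example.

```python
import cmath

# Assumed one-sequence representation: EIIP values per nucleotide.
# The sixteen representations compared in the paper are not listed here.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def period3_ratio(window):
    """Power at frequency 1/3 divided by the mean power over DFT bins 1..n//2."""
    x = [EIIP[base] for base in window.upper()]
    n = len(x)
    mean = sum(x) / n
    x = [v - mean for v in x]  # remove the DC component

    def power(freq):
        # Squared magnitude of the Fourier sum at `freq` cycles per base.
        s = sum(v * cmath.exp(-2j * cmath.pi * freq * i) for i, v in enumerate(x))
        return abs(s) ** 2

    background = [power(k / n) for k in range(1, n // 2 + 1)]
    avg = sum(background) / len(background)
    if avg < 1e-12:  # essentially flat signal, no spectral content
        return 0.0
    return power(1.0 / 3.0) / avg


def classify(seq, window_len=30, threshold=4.0):
    """Label a sequence 'exon' if its mean windowed period-3 ratio reaches the threshold.

    Both default values are hypothetical, not taken from the paper.
    """
    if len(seq) < window_len:
        raise ValueError("sequence is shorter than the window length")
    # Non-overlapping windows; a trailing partial window is ignored.
    windows = [seq[i:i + window_len]
               for i in range(0, len(seq) - window_len + 1, window_len)]
    score = sum(period3_ratio(w) for w in windows) / len(windows)
    return "exon" if score >= threshold else "intron"


if __name__ == "__main__":
    print(classify("ATG" * 50))  # synthetic sequence with perfect period-3 structure
    print(classify("AC" * 75))   # synthetic sequence with period-2 structure only
```

The two synthetic 150-base inputs are extreme cases that only show the mechanics. On real sequences the score depends on the chosen mapping, window length, and threshold, which is why the study evaluates all of their combinations.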



