MIM: A Species Independent Approach for Classifying Coding and Non-Coding DNA Sequences in Bacterial and Archaeal Genomes

A number of competing methodologies have been developed to identify genes and classify DNA sequences as coding or non-coding. This classification is fundamental to gene-finding and gene-annotation tools and is one of the most challenging tasks in bioinformatics and computational biology. An information-theoretic measure based on mutual information has shown good accuracy in classifying DNA sequences as coding or non-coding. In this paper we describe a species-independent, iterative approach that distinguishes coding from non-coding sequences using the mutual information measure (MIM). A set of sixty prokaryotes is used to extract universal training data. To facilitate comparison with the published results of other researchers, a test set of 51 bacterial and archaeal genomes was used to evaluate MIM. The evaluation demonstrates that MIM produces superior results while remaining species-independent.
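
The abstract does not spell out how MIM is computed, so the following is only a minimal sketch of the textbook mutual information between nucleotides separated by a fixed lag, which is the quantity such measures are typically built on. The function name `mutual_information` and the lag parameter `k` are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the iterative training procedure described above.

```python
from collections import Counter
from math import log2


def mutual_information(seq: str, k: int = 1) -> float:
    """Mutual information (in bits) between nucleotides k positions apart.

    Illustrative only: this is the standard definition
    I = sum_{a,b} p(a,b) * log2(p(a,b) / (p(a) * p(b))),
    estimated from pair counts in a single sequence.
    """
    seq = seq.upper()
    # All (x_i, x_{i+k}) pairs in the sequence.
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    if not pairs:
        return 0.0

    n = len(pairs)
    joint = Counter(pairs)                 # counts of (a, b)
    left = Counter(a for a, _ in pairs)    # marginal counts of the first symbol
    right = Counter(b for _, b in pairs)   # marginal counts of the second symbol

    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p(a,b) / (p(a) p(b)) simplifies to c * n / (left[a] * right[b]).
        mi += p_ab * log2(c * n / (left[a] * right[b]))
    return mi
```

A classifier in this spirit might compare such values, computed at several lags, between a candidate sequence and reference coding and non-coding training sets; how MIM actually combines them is described in the body of the paper, not here.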



