Statistics of Exon Lengths in Animals, Plants, Fungi, and Protists

Eukaryotic protein-coding genes are interrupted by spliceosomal introns, which are removed from RNA transcripts before translation into protein. Exon-intron structures differ considerably among eukaryotic species, and the evolution of these structures raises many questions. We address some of these questions through statistical analysis of whole genomes. For every protein-coding gene in a genome, we study correlations between the net length of all exons in the gene, the number of exons, and the average exon length. We also average these features over each chromosome and study correlations between those averages at the chromosomal level. Our data show universal features of exon-intron structures common to animals, plants, fungi, and protists (specifically, Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Cryptococcus neoformans, Homo sapiens, Mus musculus, Oryza sativa, and Plasmodium falciparum). We have verified a linear correlation between the number of exons in a gene and the length of the protein it encodes: protein length increases in proportion to the number of exons. In contrast, the average exon length always decreases as the number of exons grows. Finally, clustering chromosomes by their average properties and by the parameters of the linear regression between the number of exons in a gene and the net length of those exons demonstrates that these average chromosome properties are genome-specific features.
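The per-gene quantities described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (gene_features, pearson, linear_fit) are hypothetical, the exon lengths are made-up toy values rather than genome data, and ordinary least squares is assumed as the regression method.

```python
from math import sqrt


def gene_features(exon_lengths):
    """Return (number of exons, net exon length, average exon length) for one gene."""
    n = len(exon_lengths)
    net = sum(exon_lengths)
    return n, net, net / n


def _centered_sums(xs, ys):
    """Return the means of xs and ys and the centered sums Sxx, Syy, Sxy."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return mx, my, sxx, syy, sxy


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    _, _, sxx, syy, sxy = _centered_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mx, my, sxx, _, sxy = _centered_sums(xs, ys)
    slope = sxy / sxx
    return slope, my - slope * mx


# Toy exon lengths in nucleotides, one inner list per gene (illustrative only).
toy_genes = [
    [600],
    [400, 350],
    [300, 250, 350],
    [250, 200, 300, 300],
]

features = [gene_features(g) for g in toy_genes]
counts = [f[0] for f in features]     # number of exons per gene
net_lengths = [f[1] for f in features]  # net exon length per gene
avg_lengths = [f[2] for f in features]  # average exon length per gene

slope, intercept = linear_fit(counts, net_lengths)
r_net = pearson(counts, net_lengths)
r_avg = pearson(counts, avg_lengths)

print(f"net length ~ {slope:.1f} * exons + {intercept:.1f}, r = {r_net:.3f}")
print(f"correlation of average exon length with exon count: r = {r_avg:.3f}")
```

In this toy set the net exon length grows with the number of exons while the average exon length shrinks, which is the qualitative pattern the abstract reports. The same functions could be applied to per-chromosome averages by passing one averaged value per chromosome instead of one value per gene.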




