A Novel Approach for Protein Classification Using Fourier Transform

Discovering new biological knowledge from high-throughput biological data is a major challenge in bioinformatics today. To address this challenge, we developed a new approach for protein classification. Proteins that are evolutionarily, and thereby functionally, related are said to belong to the same class. Identifying the class of a protein is of fundamental importance for documenting the diversity of the known protein universe. It also provides a means of determining the functional roles of newly discovered protein sequences. Our goal is to predict the functional classification of novel protein sequences based on a set of features extracted from each protein sequence. The proposed technique uses datasets extracted from the Structural Classification of Proteins (SCOP) database. A set of spectral-domain features based on the Fast Fourier Transform (FFT) is used. The proposed classifier uses a multilayer back-propagation (MLBP) neural network for protein classification. The maximum classification accuracy is about 91% when the classifier is applied to the full four levels of the SCOP database, and it reaches a maximum of 96% when classification is limited to the family level. The classification results reveal that the spectral domain contains information that can be used for classification with high accuracy. In addition, the results emphasize that sequence similarity measures are of great importance, especially at the family level.
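As a rough illustration of what spectral-domain features of a protein sequence might look like, the sketch below maps each residue to a number, takes the discrete Fourier transform of the resulting signal, and averages the magnitudes into a fixed number of frequency bands. The abstract does not state the numeric encoding, the exact features, or the network inputs, so everything here is an assumption made for illustration: the Kyte-Doolittle hydropathy scale as the encoding, the band-averaging scheme, the number of bands, and the function names `encode`, `dft_magnitudes`, and `spectral_features`. A naive DFT is used to keep the example dependency-free; an FFT routine would yield the same values more efficiently.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values. Using this scale as the numeric
# encoding is an assumption; the abstract does not name one.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence):
    """Map each residue to a number; unknown residues map to 0.0."""
    return [HYDROPATHY.get(res, 0.0) for res in sequence.upper()]


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) form)."""
    n = len(signal)
    mags = []
    for k in range(n):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        mags.append(abs(total))
    return mags


def spectral_features(sequence, n_bins=8):
    """Fixed-length vector: mean spectral magnitude in n_bins bands.

    Hypothetical feature scheme, chosen only so that sequences of
    different lengths produce vectors of the same size.
    """
    signal = encode(sequence)
    if not signal:
        raise ValueError("sequence must not be empty")
    mean = sum(signal) / len(signal)
    centered = [x - mean for x in signal]  # remove the DC component
    # A real-valued signal has a symmetric spectrum, so keep one half.
    half = dft_magnitudes(centered)[: len(signal) // 2 + 1]
    features = []
    for b in range(n_bins):
        lo = b * len(half) // n_bins
        hi = (b + 1) * len(half) // n_bins
        band = half[lo:hi]
        # Divide by the sequence length so long and short proteins are comparable.
        features.append(sum(band) / len(band) / len(signal) if band else 0.0)
    return features


if __name__ == "__main__":
    print(spectral_features("MKTAYIAKQRQISFVKSHFSRQ"))
```

A vector of this kind could then serve as the input layer of a multilayer neural network classifier, with one output unit per SCOP category.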




