Virulent-GO: Prediction of Virulent Proteins in Bacterial Pathogens Utilizing Gene Ontology Terms

Predicting bacterial virulent protein sequences can assist in identifying and characterizing novel virulence-associated factors and in discovering drug and vaccine targets among proteins indispensable to pathogenicity. Gene Ontology (GO) annotation, which describes the functions of genes and gene products using a controlled vocabulary of terms, has been shown to be effective for a variety of tasks such as gene expression studies, GO annotation prediction, and protein subcellular localization. In this study, we propose Virulent-GO, a sequence-based method that mines informative GO terms as features for predicting bacterial virulent proteins. Each protein in the datasets used by the existing method VirulentPred is annotated by using BLAST to find homologs with known accession numbers, which are then used to retrieve GO terms. After investigating various popular classifiers under the same five-fold cross-validation scheme, Virulent-GO, using only GO term features, achieves an accuracy of 82.5%, slightly better than the 81.8% of VirulentPred, which uses five kinds of sequence-based features. On the independent test set, Virulent-GO also yields better results (82.0%) than VirulentPred (80.7%). When each kind of feature is evaluated individually with SVM, the GO term feature performs considerably better than each of the five kinds of sequence-based features.
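
The GO-term feature representation and the five-fold partition described above can be illustrated with a minimal sketch. It assumes each GO term becomes one binary feature (present or absent among the terms retrieved for a protein); the abstract does not specify the exact encoding or how informative terms are selected, so this is only one plausible reading. The protein identifiers, the toy annotations, and the helper names `build_vocabulary`, `encode`, and `five_fold_indices` are hypothetical and are not taken from Virulent-GO.

```python
from typing import Dict, List, Set


def build_vocabulary(annotations: Dict[str, Set[str]]) -> List[str]:
    """Collect every GO term seen in the annotations, in sorted order."""
    terms: Set[str] = set()
    for go_terms in annotations.values():
        terms.update(go_terms)
    return sorted(terms)


def encode(go_terms: Set[str], vocabulary: List[str]) -> List[int]:
    """Binary vector: 1 if the protein carries the GO term, else 0."""
    return [1 if term in go_terms else 0 for term in vocabulary]


def five_fold_indices(n_samples: int, k: int = 5) -> List[List[int]]:
    """Assign sample i to fold i % k (a simple, unshuffled partition)."""
    folds: List[List[int]] = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)
    return folds


# Hypothetical toy annotations, standing in for GO terms retrieved
# through BLAST homologs. IDs and term assignments are illustrative only.
annotations = {
    "protA": {"GO:0009405", "GO:0005576"},
    "protB": {"GO:0003677", "GO:0005737"},
    "protC": set(),  # no homolog with GO annotation was found
}

vocabulary = build_vocabulary(annotations)
features = {name: encode(terms, vocabulary) for name, terms in annotations.items()}
folds = five_fold_indices(n_samples=12)

for name, vector in features.items():
    print(name, vector)
```

The resulting vectors could be passed to any standard classifier such as an SVM. A protein with no annotated homolog maps to the all-zero vector, a case that a real pipeline would need to handle explicitly.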



