Comparison of Domain and Hydrophobicity Features for the Prediction of Protein-Protein Interactions using Support Vector Machines

Protein domain structure has been widely used as the most informative sequence feature for computationally predicting protein-protein interactions. However, a recent study reported a very high accuracy of 94% using a hydrophobicity feature. In this study, we therefore compare and verify the usefulness of protein domain structure and hydrophobicity properties as sequence features. Using Support Vector Machines (SVM) as the learning system, our results indicate that both features achieve an accuracy of nearly 80%. Furthermore, domain structure had a receiver operating characteristic (ROC) score of 0.8480 with a running time of 34 seconds, while hydrophobicity had a ROC score of 0.8159 with a running time of 20,571 seconds (5.7 hours). These results indicate that protein-protein interactions can be predicted from domain structure with reliable accuracy and acceptable running time.
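To make the ROC score concrete, the following is a minimal sketch of how such a score can be computed from classifier outputs, assuming the ROC score means the area under the ROC curve. The function name `roc_auc` and the toy labels and decision values are hypothetical illustrations; they are not data or code from this study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive example
    receives the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = interacting pair, 0 = non-interacting pair.
# The scores stand in for SVM decision values.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(f"ROC score: {auc:.4f}")
```

A score of 1.0 means every interacting pair is ranked above every non-interacting pair, while 0.5 corresponds to random ranking. Unlike accuracy, this measure does not depend on a single decision threshold.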




