Predicting Protein-Protein Interactions from Protein Sequences Using Phylogenetic Profiles
In this study, we develop a high-accuracy method for predicting
protein-protein interactions. The key advantage of the proposed
method is that it uses only the sequence information of proteins to
predict interactions. The method extracts the phylogenetic profiles
of proteins from their sequence information. By combining the
phylogenetic profiles of two proteins through checking for the
existence of homologs in different species, and then fitting this
combined profile to a statistical model, it is possible to predict
the interaction status of the two proteins.
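
As a rough illustration of the idea above, the following minimal sketch builds binary phylogenetic profiles (1 if a homolog exists in a species, 0 otherwise) and combines the profiles of two proteins into one feature vector. The species list, the toy homolog hits, the function names, and the choice of concatenation as the combining scheme are all assumptions made for illustration; the abstract does not specify how the profiles are combined.

```python
from typing import Dict, List, Set

# Hypothetical reference species; the genomes used in the study are not listed here.
SPECIES: List[str] = ["E. coli", "S. cerevisiae", "C. elegans", "D. melanogaster"]


def phylogenetic_profile(homolog_species: Set[str],
                         species: List[str] = SPECIES) -> List[int]:
    """Return a binary vector: 1 if a homolog exists in that species, else 0."""
    return [1 if s in homolog_species else 0 for s in species]


def combine_profiles(a: List[int], b: List[int]) -> List[int]:
    """Combine two profiles by concatenation (an assumed scheme, not the paper's)."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same species list")
    return a + b


# Toy homolog hits, e.g. as they might come out of a sequence similarity search.
hits: Dict[str, Set[str]] = {
    "P1": {"E. coli", "S. cerevisiae"},
    "P2": {"S. cerevisiae", "D. melanogaster"},
}

profile_p1 = phylogenetic_profile(hits["P1"])
profile_p2 = phylogenetic_profile(hits["P2"])
pair_features = combine_profiles(profile_p1, profile_p2)
print(pair_features)
```

Each protein pair then becomes one fixed-length row in a dataset, which is the form a statistical classifier expects.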
For this purpose, we apply a collection of pattern recognition
techniques to the dataset of combined phylogenetic profiles of
protein pairs. To find the classification method that best predicts
the interaction status of protein pairs, we applied Support Vector
Machines, feature extraction using ReliefF, Naive Bayes
classification, k-Nearest Neighbor classification, Decision Trees,
and Random Forest classification. Random Forest classification
outperformed all other methods, with a prediction accuracy of 76.93%.
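
A comparison of this kind could be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn as a stand-in toolkit, random binary features in place of real combined profiles, and a synthetic labeling rule, so its scores say nothing about the accuracy reported above. ReliefF is omitted because scikit-learn does not provide it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: each row is a concatenated pair of binary profiles.
rng = np.random.default_rng(0)
n_pairs, n_species = 200, 20
X = rng.integers(0, 2, size=(n_pairs, 2 * n_species))

# Synthetic label (assumption): a pair "interacts" when the two profiles
# agree in more than half of the species. Real labels would come from a
# curated interaction database.
agreement = (X[:, :n_species] == X[:, n_species:]).sum(axis=1)
y = (agreement > n_species // 2).astype(int)

classifiers = {
    "SVM": SVC(kernel="rbf"),
    "Naive Bayes": BernoulliNB(),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each classifier.
scores = {
    name: cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    for name, clf in classifiers.items()
}

for name, acc in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {acc:.3f}")
```

Evaluating every classifier on the same cross-validation folds keeps the comparison fair, since each method sees identical training and test splits.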
@article{"International Journal of Information, Control and Computer Sciences:60187",
  author = "Omer Nebil Yaveroglu and Tolga Can",
  title = "Predicting Protein-Protein Interactions from Protein Sequences Using Phylogenetic Profiles",
  keywords = "Protein Interaction Prediction, Phylogenetic Profile, SVM, ReliefF, Decision Trees, Random Forest Classification",
  volume = "3",
  number = "8",
  pages = "2068-7",
}