Selecting Negative Examples for Protein-Protein Interaction

Proteomics is one of the largest research areas in bioinformatics and medical science. An ambitious goal of proteomics is to elucidate the structure, interactions, and functions of all proteins within cells and organisms. Predicting protein-protein interactions (PPIs) is one of the crucial problems in current research. Genomic data offer a great opportunity, and at the same time many challenges, for identifying these interactions. Many methods have already been proposed for this purpose. For in-silico identification, most methods require both positive and negative examples of protein interaction, and the quality of these examples is crucial to the final prediction accuracy. Positive examples are relatively easy to obtain from well-known databases, but generating negative examples is not a trivial task. Current PPI identification methods generate negative examples based on assumptions that are likely to affect their prediction accuracy. Hence, if more reliable negative examples are used, PPI prediction methods may achieve higher accuracy. To address this issue, we propose a graph-based method for generating negative examples that is simple and more accurate than existing approaches. An interaction graph of the protein sequences is created. The basic assumption is that the longer the shortest path between two protein sequences in the interaction graph, the less likely they are to interact. A well-established PPI detection algorithm is employed with our negative examples, and in most cases the accuracy increases by more than 10% compared with the negative-pair selection method used in the paper describing that algorithm.
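The shortest-path assumption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `min_distance` threshold, the decision to skip pairs with no connecting path, and the protein labels A to E are all assumptions made for illustration. The graph is built from known interacting pairs, breadth-first search gives the shortest path length between proteins, and pairs at or beyond the threshold are returned as candidate negatives, most distant first.

```python
from collections import deque
from itertools import combinations


def build_graph(positive_pairs):
    """Build an undirected interaction graph as an adjacency dict."""
    graph = {}
    for a, b in positive_pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bfs_distances(graph, source):
    """Return shortest path lengths (in edges) from source to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def select_negative_pairs(positive_pairs, min_distance):
    """Pick protein pairs whose shortest path is at least min_distance.

    Assumption: pairs with no connecting path are skipped, because the
    text does not say how disconnected proteins should be treated.
    """
    graph = build_graph(positive_pairs)
    all_dist = {protein: bfs_distances(graph, protein) for protein in graph}
    negatives = []
    for a, b in combinations(sorted(graph), 2):
        d = all_dist[a].get(b)
        if d is not None and d >= min_distance:
            negatives.append((a, b, d))
    # Most distant pairs first; the stable sort keeps name order within ties.
    negatives.sort(key=lambda item: -item[2])
    return negatives


# Hypothetical positive interactions forming a chain A-B-C-D-E.
positives = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
negatives = select_negative_pairs(positives, min_distance=3)
print(negatives)  # [('A', 'E', 4), ('A', 'D', 3), ('B', 'E', 3)]
```

Running one breadth-first search per protein is adequate for a small illustration; for a proteome-scale graph, a dedicated graph library or a distance cutoff inside the search would likely be preferable.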



