Detecting Remote Protein Evolutionary Relationships via String Scoring Method

The volume of data produced by biological research has grown many times over, and managing it now requires extensive use of computational techniques. Among these data, protein sequence similarity is key to detecting evolutionary relationships between proteins. Sequence similarity typically implies homology, which in turn may imply structural and functional similarity. In this work, we propose a learning method for detecting remote protein homology. The method uses a transformation that converts each protein sequence into a fixed-dimensional feature vector. Each feature vector records the sensitivity of a protein sequence to a set of amino acid substrings generated from the protein sequences of interest. These features are then used in conjunction with support vector machines to detect remote protein homology. The proposed method is tested and evaluated on two different benchmark protein datasets, where it delivers improvements over most existing homology detection methods.
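The transformation described above can be illustrated with a minimal sketch. It is not the paper's scoring scheme: the abstract does not define how "sensitivity" is computed, so the sketch assumes a length-normalized overlapping occurrence count. The function names (`build_substring_set`, `count_overlapping`, `to_feature_vector`) and the parameters `k` and `top_n` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_substring_set(sequences, k=3, top_n=5):
    """Collect the top_n most frequent length-k substrings across the sequences.

    Assumption: frequency-based selection stands in for whatever substring
    generation the original method uses.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    # Sort by descending count, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [sub for sub, _ in ranked[:top_n]]


def count_overlapping(seq, sub):
    """Count occurrences of sub in seq, allowing overlaps."""
    return sum(1 for i in range(len(seq) - len(sub) + 1) if seq.startswith(sub, i))


def to_feature_vector(seq, substrings):
    """Map a sequence of any length to a vector with one score per substring.

    Assumption: the score is the overlapping occurrence count divided by the
    sequence length; the paper's actual sensitivity score may differ.
    """
    n = max(len(seq), 1)  # guard against empty sequences
    return [count_overlapping(seq, sub) / n for sub in substrings]


# Toy usage: sequences of different lengths map to vectors of equal dimension.
training = ["ACDEACD", "CDEAC"]
substrings = build_substring_set(training, k=3, top_n=4)
vectors = [to_feature_vector(seq, substrings) for seq in training]
```

Because every sequence maps to a vector of the same dimension, the resulting vectors could be passed to any standard support vector machine implementation for the classification step.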




