SAF: A Substitution and Alignment Free Similarity Measure for Protein Sequences

The literature reports a large number of approaches for measuring the similarity between protein sequences. Most of these approaches estimate this similarity using alignment-based techniques that do not necessarily yield biologically plausible results, for two reasons. First, for the case of non-alignable (i.e., not yet definitively aligned and biologically approved) sequences such as multi-domain, circular permutation and tandem repeat protein sequences, alignment-based approaches do not succeed in producing biologically plausible results. This is due to the nature of the alignment, which is based on the matching of subsequences in equivalent positions, while non-alignable proteins often have similar and conserved domains in non-equivalent positions. Second, the alignment-based approaches lead to similarity measures that depend heavily on the parameters set by the user for the alignment (e.g., gap penalties and substitution matrices). For easily alignable protein sequences, it's possible to supply a suitable combination of input parameters that allows such an approach to yield biologically plausible results. However, for difficult-to-align protein sequences, supplying different combinations of input parameters yields different results. Such variable results create ambiguities and complicate the similarity measurement task. To overcome these drawbacks, this paper describes a novel and effective approach for measuring the similarity between protein sequences, called SAF for Substitution and Alignment Free. Without resorting either to the alignment of protein sequences or to substitution relations between amino acids, SAF is able to efficiently detect the significant subsequences that best represent the intrinsic properties of protein sequences, those underlying the chronological dependencies of structural features and biochemical activities of protein sequences. Moreover, by using a new efficient subsequence matching scheme, SAF more efficiently handles protein sequences that contain similar structural features with significant meaning in chronologically non-equivalent positions. To show the effectiveness of SAF, extensive experiments were performed on protein datasets from different databases, and the results were compared with those obtained by several mainstream algorithms.




References:
[1] S. Vinga, and J. Almeida, "Alignment-free sequence comparison - a
review," BIOINFORMATICS, 4, vol. 19, 2003, pp. 513-523.
[2] M.W. Berry, and R.D. Fierro, "Low-rank orthogonal decompositions for
information retrieval applications," Numerical Linear Algebra with
Applications, 3, vol. 4, 1996, pp. 301-327.
[3] S. Karlin, and G. Ghandour, "Comparative statistics for DNA and protein
sequences: Single sequence analysis," Proc. Natl. Acad. Sci. USA, 17, vol.
82, 1985, pp. 5800-5804.
[4] M. Ganapathiraju, J. Klein-Seetharaman, N. Balakrishnan, and R. Reddy,
"Characterization of protein secondary structure," Signal Processing
Magazine, IEEE, 3, vol. 21, May 2004, pp. 78-87.
[5] A. Amir, M. Lewenstein, and E. Porat, "Faster algorithms for string
matching with k mismatches," J. Algorithms, 2, vol. 50, 2004, pp.
257-275.
[6] M. Brand, "Fast low-rank modifications of the thin singular value
decomposition," Linear Algebra and Its Applications, 1, vol. 415, 2006,
pp. 20-30.
[7] http://www.ncbi.nlm.nih.gov/BLAST.
[8] http://www.ebi.ac.uk/fasta33.
[9] http://www.ebi.ac.uk/clustalw.
[10] E. L. L. Sonnhammer, and V. Hollich, "Scoredist: A simple and robust
protein sequence distance estimator," BMC Bioinformatics 2005, vol. 6
pp. 108.
[11] A. Kelil, S. Wang, and R. Brzezinski, "A new alignment-independent
algorithm for clustering protein sequences," in Proc. 7th IEEE
International Conference on BioInformatics and BioEngineering.
Cambridge, Harvard University, Massachusetts, USA, 2007, pp. 27-34.
[12] A. Kelil, S. Wang, and R. Brzezinski, "CLUSS2: An
alignment-independent algorithm for clustering protein families with
multiple biological functions," International Journal of Computational
Biology and Drug Design, 2008. (In press).
[13] K.P. Wu, H.N. Lin, T.Y. Sung, and W.L. Hsu, "A New Similarity
Measure among Protein Sequences," in Proc. Computational Systems
Bioinformatics, Stanford, CA, USA, 2003, pp. 347-352.
[14] A. Bogan-Marta, N. Laskaris, M. A. Gavrielides, I. Pitas, K. Lyroudia, "A
novel efficient protein similarity measure based on n-gram modeling," in
Proc. 2nd International Conference on Computational Intelligence in
Medicine and Healthcare. Costa da Caparica, Lisbon, Portugal, 2005.
[15] R.L. Tatusov, N.D. Fedorova, J.D. Jackson, A.R. Jacobs, B. Kiryutin,
E.V. Koonin, D.M. Krylov, R. Mazumder, S.L. Mekhedov, A.N.
Nikolskaya, B.S. Rao, S. Smirnov, A.V. Sverdlov, S. Vasudevan, Y.I.
Wolf, J.J. Yin, and D.A. Natale, "The COG database: An updated version
includes eukaryotes," BMC Bioinformatics, 4, vol. 41, 2003.
[16] K. ONeill, W. Klimke, and T. Tatusova, "Protein clusters: A collection of
proteins grouped by sequence similarity and function," NCBI, October 04,
2007.
[17] N. C├┤té, A. Fleury, E. Dumont-Blanchette, T. Fukamizo, M. Mitsutomi,
and R. Brzezinski, "Two exo-β-D-glucosaminidases/exochitosanases
from actinomycetes define a new subfamily within family 2 of glycoside
hydrolases," Biochem. J., vol. 394, 2006, pp. 675-686.
[18] T. Fukamizo, A. Fleury, N. C├┤té, M. Mitsutomi, and R. Brzezinski,
"Exo-β-D-glucosaminidase from Amycolatopsis orientalis: Catalytic
residues, sugar recognition specificity, kinetics, and synergism,"
Glycobiology, vol. 16, 2006, pp. 1064-1072.
[19] www.cazy.org.
[20] D. Higgins, "Multiple Alignment," in The Phylogenetic Handbook, M.
Salemi, and A. M. Vandamme, Ed. Cambridge University Press, 2004, pp.
45-71.
[21] A. Kelil, S. Wang, R. Brzezinski, and F. Alain, "CLUSS: Clustering of
protein sequences based on a new similarity measure," BMC
Bioinformatics, 2007, vol. 8, pp. 286.
[22] W. C. Lo, and P. C. Lyu, "CPSARST: An efficient circular permutation
search tool applied to the detection of novel protein structural
relationships," Genome Biology, vol. 9, 2008.
[23] S. Uliel, A. Fliess, and R. Unger, "Naturally occurring circular
permutations in proteins," Protein Eng., vol. 14, 2001, pp. 533-542.
[24] http://www.ncbi.nlm.nih.gov.
[25] http://prospectus.usherbrooke.ca/CLUSS.