Clustering Protein Sequences with Tailored General Regression Model Technique

Cluster analysis divides data into groups that are meaningful, useful, or both. Analysis of biological data is creating a new generation of epidemiologic, prognostic, diagnostic, and treatment modalities. Clustering of protein sequences is a current research topic in computer science. Linear relations are valuable in rule discovery for a given data set, for example "if value X goes up by 1, value Y goes down by 3". Classical linear regression captures the linear relation between two sequences. However, if we need to cluster a large repository of protein sequences into groups whose members have strong linear relationships with each other, comparing the sequences pair by pair is prohibitively expensive. In this paper, we propose a new technique, the General Regression Model Technique Clustering Algorithm (GRMTCA), to handle the problem of clustering linear sequences. GRMT provides a measure, GR*, that indicates the degree of linearity of multiple sequences without having to compare each pair of them.
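To make the pairwise baseline concrete, the following is a minimal sketch of what comparing sequences one by one with classical linear regression could look like. It is not the GRMT measure GR*, whose definition is not given here. It assumes the sequences have already been encoded as equal-length lists of numbers; the function names `r_squared` and `pairwise_linearity` and the example values are illustrative assumptions only.

```python
from itertools import combinations


def r_squared(x, y):
    """Coefficient of determination of the simple linear regression y = a + b*x.

    For simple linear regression this equals the squared Pearson correlation.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        # Assumed convention: a constant sequence shows no linear relation.
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def pairwise_linearity(sequences):
    """Score every pair of sequences; n sequences need n*(n-1)/2 regressions."""
    return {
        (i, j): r_squared(sequences[i], sequences[j])
        for i, j in combinations(range(len(sequences)), 2)
    }


# Hypothetical numeric sequences: in the second, Y goes down 3 when X goes up 1.
x = [1, 2, 3, 4]
y = [10, 7, 4, 1]
z = [2, 9, 1, 5]
w = [3, 5, 7, 9]
scores = pairwise_linearity([x, y, z, w])
```

With four sequences the sketch already runs six regressions, and the count grows quadratically with the size of the repository. This is the cost that a single measure over multiple sequences is meant to avoid.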



