An Efficient and Generic Hybrid Framework for High Dimensional Data Clustering

Clustering in high dimensional space is a difficult problem which is recurrent in many fields of science and engineering, e.g., bioinformatics, image processing, pattern reorganization and data mining. In high dimensional space some of the dimensions are likely to be irrelevant, thus hiding the possible clustering. In very high dimensions it is common for all the objects in a dataset to be nearly equidistant from each other, completely masking the clusters. Hence, performance of the clustering algorithm decreases. In this paper, we propose an algorithmic framework which combines the (reduct) concept of rough set theory with the k-means algorithm to remove the irrelevant dimensions in a high dimensional space and obtain appropriate clusters. Our experiment on test data shows that this framework increases efficiency of the clustering process and accuracy of the results.




References:
[1] K. Jain, M. N. Murty, and P. J. Flynn. "Data clustering: a review". ACM
Computing Surveys (CSUR) 31(3):264-323, 1999.
[2] M. K. Jiawei Han. "Data Mining: Concepts and Techniques", Chapter 8,
pages 335-393. Morgan Kaufmann Publishers, 2001.
[3] Carlotta Domeniconi, Dimitris Papadopoulos, Dimitrios Gunopulos, Sheng
Ma, "Subspace Clustering of High Dimensional Data", 517-521, SIAM,
1998.
[4] Kun Niu, Shubo Zhang, and Junliang Chen, "Subspace clustering through
attribute clustering", Front. Electr. Electron. Eng. China 2008, 3(1): 44-
48.
[5] A. K. Jain and R. C. Dubes. "Algorithms for clustering data". Prentice-
Hall, Inc., 1988.
[6] M. Zait and H. Messatfa. "A comparative study of clustering methods".
Future Generation Computer Systems, 13(2-3):149-159, November 1997.
[7] J. Ghosh. Handbook of Data Mining, chapter Scalable Clustering Methods
for Data Mining. Lawrence Erlbaum Assoc, 2003.
[8] J. Han, M. Kamber, and A. K. H. Tung. "Geographic Data Mining and
Knowledge Discovery", chapter Spatial clustering methods in data
mining: A survey, pages 188-217. Taylor and Francis, 2001.
[9] I. H. Witten and E. Frank. Data Mining: Practical Machine Leaning Tools
and Techniques with Java Implementations, chapter 6.6, pages 210-228.
Morgan Kaufmann, 2000.
[10] C. C. Aggarwal and P. S. Yu. "Finding generalized projected clusters in
high dimensional spaces". In Proceedings of the 2000 ACM SIGMOD
international conference on Management of data, pages 70-81. ACM
Press, 2000.
[11] P. Berkhin. Survey of clustering data mining techniques. Technical report,
Accrue Software, San Jose, CA, 2002.
[12] L. ErtÄoz, M. Steinbach, and V. Kumar. "Finding clusters of deferent
sizes, shapes, and densities in noisy, high dimensional data". In
Proceedings of the 2003 SIAM International Conference on Data Mining,
2003.
[13] Lance Parsons, Ehtesham Haque, Huan Liu. "Subspace Clustering for
High Dimensional Data: A Review"; Supported in part by grants from
Prop 301 (No. ECR A601) and CEINT 2004.
[14] Skowron, A., Pawlak, Z., Komorowski, J., & Polkowski, L. "A Rough set
perspective on data and knowledge". Handbook of data mining and
knowledge discovery, pp. 134-149, Oxford University Press, 2002.
[15] Anil K. Jain; "Data Clustering: 50 Years Beyond K-Means"; To appear in
Pattern Recognition Letters, 2009.
[16] Drineas, P., Frieze, A., Kannan, R., Vempala, S., & Vinay, V. Clustering
large graphs via the singular value decomposition. Machine Learning,
56(1-3), 9-33, 1999.
[17] R.C. Belwal, J. Varshney, S. Ahmed Khan, A. Sharma & M. Bhattacharya,
" Hiding Sensitive Association Rules Efficiently By Introducing
New Variable Hiding Counter IEEE International Conference on
Service, Operations & Logistics and Informatics, proceedings, Vol. I, pp:
130-134, 12th - 15th Oct. Beijing, China 2008.
[18] G. Liu, J. Li, K. Sim, and L. Wong. "Distance based subspace clustering
with flexible dimension partitioning". In Proc. IEEE ICDE, pages 1250-
1254, 2007.
[19] Sunita Jahirabadkar, and Parag Kulkarni, ISC - Intelligent Subspace
Clustering, A Density based Clustering approach for High Dimensional
Dataset, World Academy of Science, Engineering and Technology 55
2009.
[20] Hans-Peter Kriegel, Peer Krger, Matthias Renz, Sebastian Wurst, "A
Generic Framework for Efficient Subspace Clustering of High-
Dimensional Data ", In proc. 5th IEEE International Conference of Data
Mining (ICDM), Houston, TX, 2005.
[21] McQueen J (1967) some methods for classification and analysis of
multivariate observations. In: Proceedings of the fifth Berkeley
symposium on mathematical statistics and probability, pp 281-297.
[22] http://www.uni-koeln.de/themen/statistik/data/cluster/milk.dat.
[23] Dunn, 1974. Dunn, J. (1974) well separated clusters and optimal fuzzy
partitions. Journal of Cybernetics ,4, 95-104.
[24] Davies & Bouldin, 1979. Davies, D.L., Bouldin, D.W., (2000) A cluster
separation measure. IEEE Trans. Pattern Anal. Machine Intell., 1(4), 224-
227.
[25] Arun Jagota. Novelty detection on a very large number of memories
stored in a Hopfield-style network. In Proceedings of the International
Joint Conference on Neural Networks (IJCNN-91), 1991.