A Distributed Algorithm for Intrinsic Cluster Detection over Large Spatial Data

Clustering algorithms help to understand the hidden information present in datasets. A dataset may contain intrinsic and nested clusters, the detection of which is of utmost importance. This paper presents a Distributed Grid-based Density Clustering algorithm capable of identifying arbitrary shaped embedded clusters as well as multi-density clusters over large spatial datasets. For handling massive datasets, we implemented our method using a 'sharednothing' architecture where multiple computers are interconnected over a network. Experimental results are reported to establish the superiority of the technique in terms of scale-up, speedup as well as cluster quality.




References:
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques. India:
Morgan Kaufmann Publishers, 2004.
[2] M. Ester, H. P. Kriegel, J. Sander and X. Xu, "A Density-Based
Algorithm for Discovering Clusters in Large Spatial Databases with
Noise", in International Conference on Knowledge Discovery in
Databases and Data Mining (KDD-96), Portland, Oregon, 1996, pp.
226-231.
[3] C. Hsu and M. Chen, "Subspace Clustering of High Dimensional Spatial
Data with Noises", PAKDD, 2004, pp. 31-40.
[4] W. Wang, J. Yang, and R. R. Muntz, "STING: A Statistical Information
Grid Approach to Spatial data Mining", in Proc. 23rd International
Conference on Very Large Databases, (VLDB), Athens, Greece,
Morgan Kaufmann Publishers, 1997, pp. 186 - 195.
[5] G. Sheikholeslami, S. Chatterjee and A. Zhang, "Wavecluster: A Multiresolution
Clustering approach for very large spatial database", in
SIGMOD'98, Seattle, 1998.
[6] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, "Automatic
subspace clustering of high dimensional data for data mining
applications", in SIGMOD Record ACM Special Interest Group on
Management of Data, 1998, pp. 94-105.
[7] H. S. Nagesh, S. Goil and A. N. Choudhary, "A scalable parallel
subspace clustering algorithm for massive data sets", in Proc.
International Conference on Parallel Processing, 2000, pp. 477.
[8] L. Ertoz, M. Steinbach and V. Kumar, "Finding Clusters of Different
Sizes, Shapes, and Densities in Noisy, High Dimensional Data", in SIAM
International Conference on Data Mining (SDM '03), 2003.
[9] G. Karypis, Han and V. Kumar, "CHAMELEON: A hierarchical
clustering algorithm using dynamic modeling", IEEE Computer, 32(8),
pp 68-75, 1999.
[10] Y. Zhao, S. Mei, X. Fan, S. Jun-de. 2003. Clustering Datasets
Containing Clusters of Various Densities. Journal of Beijing University
of Posts and Telecommunications, 26(2):42-47.
[11] H. S. Kim, S. Gao, Y. Xia, G. B. Kim and H. Y. Bae, "DGCL: An
Efficient Density and Grid Based Clustering Algorithm for Large Spatial
Database", Advances in Web-Age Information Management (WAIM'06),
pp. 362-371, 2006.
[12] M. Ankerst, M. M. Breuing, H. P. Kriegel and J. Sander, "OPTICS:
Ordering Points To Identify the Clustering Structure", in ACMSIGMOD,
pp. 49-60, 1999.
[13] S. Roy and D. K. Bhattacharyya, "An Approach to Find Embedded
Clusters Using Density Based Techniques", in Proc. ICDCIT, LNCS
3816, pp. 523-535, 2005.
[14] B. Borah, D. K. Bhattacharyya and R. K. Das, "A Parallel Density-Based
Data Clustering Technique on Distributed Memory Multicomputers", in
Proc. ADCOM, Ahmedabad, 2004.
[15] I. S. Dhilon and D. S. Modha, "A Data-Clustering Algorithm on
Distributed Memory Multiprocessors", in International Conference on
Knowledge Discovery and Data Mining (SIGKDD 99), 1999.
[16] X. Xu, J. Jager and H. P. Kriegel, "A Fast Parallel Clustering Algorithm
for Large Spatial Databases", Data Mining and Knowledge Discovery, 3,
Kluwer Academic Publisher, pp. 263-290, 1999.
[17] E. Januzaj, H. P. Kriegel and M. Pfeifle, "Towards Effective and
Efficient Distributed Clustering.Workshop on Clustering Large Data
Sets", ICDM'03.Melbourne, Florida, 2003.
[18] D. Foti, D. Lipari, C. Pizzuti and D. Talia, "Scalable Parallel Clustering
for Data Mining on Multicomputers",15 IPDPS workshops, pp. 390-398,
2000.
[19] E.K. Johnson and H. Kargupta, "Collective Hierarchical Clustering from
Distributed, Heterogeneous Data", Large Scale Parallel data Mining,
LNCS 1759, Springer, 2000.
[20] S. Sarmah, R. Das and D. K. Bhattacharyya, "Intrinsic Cluster Detection
Using Adaptive Grids", in Proc. ADCOM'07, Guwahati, 2007.
[21] Available: http//steve.hollasch.net /cgindex/math /barycentric.html