DCBOR: A Density Clustering Based on Outlier Removal

Data clustering is an important data exploration technique with many applications in data mining. We present an enhanced version of the well known single link clustering algorithm. We will refer to this algorithm as DCBOR. The proposed algorithm alleviates the chain effect by removing the outliers from the given dataset. So this algorithm provides outlier detection and data clustering simultaneously. This algorithm does not need to update the distance matrix, since the algorithm depends on merging the most k-nearest objects in one step and the cluster continues grow as long as possible under specified condition. So the algorithm consists of two phases; at the first phase, it removes the outliers from the input dataset. At the second phase, it performs the clustering process. This algorithm discovers clusters of different shapes, sizes, densities and requires only one input parameter; this parameter represents a threshold for outlier points. The value of the input parameter is ranging from 0 to 1. The algorithm supports the user in determining an appropriate value for it. We have tested this algorithm on different datasets contain outlier and connecting clusters by chain of density points, and the algorithm discovers the correct clusters. The results of our experiments demonstrate the effectiveness and the efficiency of DCBOR.




References:
[1] M. Ankerst , M. M. Breunig and H-P. Kriegel , "OPTICS: Ordering Points
to Identify the Clustering Structure". in Proc. ACM SIGMOD, 1999, pp.
49-60.
[2] M. Emre Celebi , Y. Alp Aslandogan , and P. R. Bergstresser
"Mining biomedical images with density-based clustering." In ITCC -05:
Proceedings of the International Conference on Information Technology:
Coding and Computing, volume I, pages 163-168, Washington, DC, USA,
2005. IEEE Computer Society.
[3] M. Ester , H.-P. Kriegel , J. Sander and X. Xu , "A Density-Based
Algorithm for Discovering Clusters in Large Spatial Databases with
Noise". in Proc. Knowledge Discovery and Data Mining, 1996, pp. 226-
231.
[4] L. Ertoz , M. Steinbach and V. Kumar , "A new shared nearest neighbor
clustering algorithm and its applications", AHPCRC, Tech. Rep. 134,
2002.
[5] S. Guha, R. Rastogi , and K. Shim, "CURE: An Efficient Clustering
Algorithms for Large Databases." Proc. ACM SIGMOD Int. Conf. on
Management of Data. Seattle, WA, 1998, pp. 73-84.
[6] V. Hautamaeki , S. Cherednichenko , I. Kaerkkaeinen , T. Kinnunen , and
P. Fraenti "Improving K-Means by Outlier Removal", LNCS Springer
Berlin / Heidelberg, may 2005, pp. 978-987.
[7] A. Hinneburg and D. A. Keim , "An Efficient Approach to Clustering in
Large Multimedia Databases with Noise," in Proc. Knowledge Discovery
and Data Mining, 1998, pp. 58-65.
[8] A. K. Jain and, R. C.Dubes "Algorithms for Clustering Data." Prentice
Hall, 1988.
[9] A. K. Jain , M. N. Murty , and P. J. Flynn , "Data Clustering: A Review",
ACM Computing Surveys, vol. 31, no 3, pp. 264-323, Sep. 1999.
[10] G. Karypis , E. H. Han and V. Kumar, "CHAMELEON: A Hierarchical
Clustering Algorithm Using Dynamic Modeling." Computer,32, pp. 68-
75, 1999.
[11] A. McCallum , K. Nigam and L. H. Ungar "Efficient Clustering of
HighDimensional Data Sets with Application to Reference Matching." In
Proceedings of KDD-2000. pp.169-178.
[12] R. T. Ng and J. Han "Efficient and Effective Clustering Methods for
Spatial Data Mining". Proc. 20th Int. Conf. on Very Large Data Bases.
Morgan Kaufmann Publishers, San Francisco, CA, 1994, p. 144-155.
[13] J. Sander , M. Ester , H-P. Kriegel , and X. Xu "Density-based clustering
in spatial databases: The algorithm gdbscan and its applications." Data
Mining and Knowledge Discovery, 2(2):169-194, 1998.
[14] R. Sibson, SLINK: an optimally efficient algorithm for the single-link
cluster method. The Comp. Journal, 1973, 16(1), p. 30-34.
[15] T. Zhang , R. Ramakrishnan and M. Linvy BIRCH: An Efficient Data
Clustering Method for Very Large Databases. Proc. ACM SIGMOD Int.
Conf. on Management of Data. ACM Press, New York, 1996, p. 103-114.