DCBOR: A Density Clustering Based on Outlier Removal
Data clustering is an important data exploration technique
with many applications in data mining. We present an enhanced
version of the well known single link clustering algorithm. We will
refer to this algorithm as DCBOR. The proposed algorithm alleviates
the chain effect by removing the outliers from the given dataset.
So this algorithm provides outlier detection and data clustering
simultaneously. This algorithm does not need to update the distance
matrix, since the algorithm depends on merging the most k-nearest
objects in one step and the cluster continues grow as long as possible
under specified condition. So the algorithm consists of two phases;
at the first phase, it removes the outliers from the input dataset. At
the second phase, it performs the clustering process. This algorithm
discovers clusters of different shapes, sizes, densities and requires
only one input parameter; this parameter represents a threshold for
outlier points. The value of the input parameter is ranging from 0 to
1. The algorithm supports the user in determining an appropriate
value for it. We have tested this algorithm on different datasets
contain outlier and connecting clusters by chain of density points,
and the algorithm discovers the correct clusters. The results of
our experiments demonstrate the effectiveness and the efficiency of
DCBOR.
[1] M. Ankerst , M. M. Breunig and H-P. Kriegel , "OPTICS: Ordering Points
to Identify the Clustering Structure". in Proc. ACM SIGMOD, 1999, pp.
49-60.
[2] M. Emre Celebi , Y. Alp Aslandogan , and P. R. Bergstresser
"Mining biomedical images with density-based clustering." In ITCC -05:
Proceedings of the International Conference on Information Technology:
Coding and Computing, volume I, pages 163-168, Washington, DC, USA,
2005. IEEE Computer Society.
[3] M. Ester , H.-P. Kriegel , J. Sander and X. Xu , "A Density-Based
Algorithm for Discovering Clusters in Large Spatial Databases with
Noise". in Proc. Knowledge Discovery and Data Mining, 1996, pp. 226-
231.
[4] L. Ertoz , M. Steinbach and V. Kumar , "A new shared nearest neighbor
clustering algorithm and its applications", AHPCRC, Tech. Rep. 134,
2002.
[5] S. Guha, R. Rastogi , and K. Shim, "CURE: An Efficient Clustering
Algorithms for Large Databases." Proc. ACM SIGMOD Int. Conf. on
Management of Data. Seattle, WA, 1998, pp. 73-84.
[6] V. Hautamaeki , S. Cherednichenko , I. Kaerkkaeinen , T. Kinnunen , and
P. Fraenti "Improving K-Means by Outlier Removal", LNCS Springer
Berlin / Heidelberg, may 2005, pp. 978-987.
[7] A. Hinneburg and D. A. Keim , "An Efficient Approach to Clustering in
Large Multimedia Databases with Noise," in Proc. Knowledge Discovery
and Data Mining, 1998, pp. 58-65.
[8] A. K. Jain and, R. C.Dubes "Algorithms for Clustering Data." Prentice
Hall, 1988.
[9] A. K. Jain , M. N. Murty , and P. J. Flynn , "Data Clustering: A Review",
ACM Computing Surveys, vol. 31, no 3, pp. 264-323, Sep. 1999.
[10] G. Karypis , E. H. Han and V. Kumar, "CHAMELEON: A Hierarchical
Clustering Algorithm Using Dynamic Modeling." Computer,32, pp. 68-
75, 1999.
[11] A. McCallum , K. Nigam and L. H. Ungar "Efficient Clustering of
HighDimensional Data Sets with Application to Reference Matching." In
Proceedings of KDD-2000. pp.169-178.
[12] R. T. Ng and J. Han "Efficient and Effective Clustering Methods for
Spatial Data Mining". Proc. 20th Int. Conf. on Very Large Data Bases.
Morgan Kaufmann Publishers, San Francisco, CA, 1994, p. 144-155.
[13] J. Sander , M. Ester , H-P. Kriegel , and X. Xu "Density-based clustering
in spatial databases: The algorithm gdbscan and its applications." Data
Mining and Knowledge Discovery, 2(2):169-194, 1998.
[14] R. Sibson, SLINK: an optimally efficient algorithm for the single-link
cluster method. The Comp. Journal, 1973, 16(1), p. 30-34.
[15] T. Zhang , R. Ramakrishnan and M. Linvy BIRCH: An Efficient Data
Clustering Method for Very Large Databases. Proc. ACM SIGMOD Int.
Conf. on Management of Data. ACM Press, New York, 1996, p. 103-114.
[1] M. Ankerst , M. M. Breunig and H-P. Kriegel , "OPTICS: Ordering Points
to Identify the Clustering Structure". in Proc. ACM SIGMOD, 1999, pp.
49-60.
[2] M. Emre Celebi , Y. Alp Aslandogan , and P. R. Bergstresser
"Mining biomedical images with density-based clustering." In ITCC -05:
Proceedings of the International Conference on Information Technology:
Coding and Computing, volume I, pages 163-168, Washington, DC, USA,
2005. IEEE Computer Society.
[3] M. Ester , H.-P. Kriegel , J. Sander and X. Xu , "A Density-Based
Algorithm for Discovering Clusters in Large Spatial Databases with
Noise". in Proc. Knowledge Discovery and Data Mining, 1996, pp. 226-
231.
[4] L. Ertoz , M. Steinbach and V. Kumar , "A new shared nearest neighbor
clustering algorithm and its applications", AHPCRC, Tech. Rep. 134,
2002.
[5] S. Guha, R. Rastogi , and K. Shim, "CURE: An Efficient Clustering
Algorithms for Large Databases." Proc. ACM SIGMOD Int. Conf. on
Management of Data. Seattle, WA, 1998, pp. 73-84.
[6] V. Hautamaeki , S. Cherednichenko , I. Kaerkkaeinen , T. Kinnunen , and
P. Fraenti "Improving K-Means by Outlier Removal", LNCS Springer
Berlin / Heidelberg, may 2005, pp. 978-987.
[7] A. Hinneburg and D. A. Keim , "An Efficient Approach to Clustering in
Large Multimedia Databases with Noise," in Proc. Knowledge Discovery
and Data Mining, 1998, pp. 58-65.
[8] A. K. Jain and, R. C.Dubes "Algorithms for Clustering Data." Prentice
Hall, 1988.
[9] A. K. Jain , M. N. Murty , and P. J. Flynn , "Data Clustering: A Review",
ACM Computing Surveys, vol. 31, no 3, pp. 264-323, Sep. 1999.
[10] G. Karypis , E. H. Han and V. Kumar, "CHAMELEON: A Hierarchical
Clustering Algorithm Using Dynamic Modeling." Computer,32, pp. 68-
75, 1999.
[11] A. McCallum , K. Nigam and L. H. Ungar "Efficient Clustering of
HighDimensional Data Sets with Application to Reference Matching." In
Proceedings of KDD-2000. pp.169-178.
[12] R. T. Ng and J. Han "Efficient and Effective Clustering Methods for
Spatial Data Mining". Proc. 20th Int. Conf. on Very Large Data Bases.
Morgan Kaufmann Publishers, San Francisco, CA, 1994, p. 144-155.
[13] J. Sander , M. Ester , H-P. Kriegel , and X. Xu "Density-based clustering
in spatial databases: The algorithm gdbscan and its applications." Data
Mining and Knowledge Discovery, 2(2):169-194, 1998.
[14] R. Sibson, SLINK: an optimally efficient algorithm for the single-link
cluster method. The Comp. Journal, 1973, 16(1), p. 30-34.
[15] T. Zhang , R. Ramakrishnan and M. Linvy BIRCH: An Efficient Data
Clustering Method for Very Large Databases. Proc. ACM SIGMOD Int.
Conf. on Management of Data. ACM Press, New York, 1996, p. 103-114.
@article{"International Journal of Information, Control and Computer Sciences:59133", author = "A. M. Fahim and G. Saake and A. M. Salem and F. A. Torkey and M. A. Ramadan", title = "DCBOR: A Density Clustering Based on Outlier Removal", abstract = "Data clustering is an important data exploration technique
with many applications in data mining. We present an enhanced
version of the well known single link clustering algorithm. We will
refer to this algorithm as DCBOR. The proposed algorithm alleviates
the chain effect by removing the outliers from the given dataset.
So this algorithm provides outlier detection and data clustering
simultaneously. This algorithm does not need to update the distance
matrix, since the algorithm depends on merging the most k-nearest
objects in one step and the cluster continues grow as long as possible
under specified condition. So the algorithm consists of two phases;
at the first phase, it removes the outliers from the input dataset. At
the second phase, it performs the clustering process. This algorithm
discovers clusters of different shapes, sizes, densities and requires
only one input parameter; this parameter represents a threshold for
outlier points. The value of the input parameter is ranging from 0 to
1. The algorithm supports the user in determining an appropriate
value for it. We have tested this algorithm on different datasets
contain outlier and connecting clusters by chain of density points,
and the algorithm discovers the correct clusters. The results of
our experiments demonstrate the effectiveness and the efficiency of
DCBOR.", keywords = "Data Clustering, Clustering Algorithms, Handling
Noise, Arbitrary Shape of Clusters.", volume = "2", number = "9", pages = "3121-6", }