In large datasets, identifying exceptional or rare cases
with respect to a group of similar cases is a very significant
problem. The traditional problem (Outlier Mining) is to find
exceptional or rare cases in a dataset irrespective of their class
labels; such cases are considered rare events with respect to the whole
dataset. In this research, we pose the problem of Class Outliers
Mining and present a method for finding such outliers. The general
definition of this problem is “given a set of observations with class
labels, find those that arouse suspicions, taking into account the
class labels”. We introduce a novel definition of an outlier, the Class
Outlier, and propose the Class Outlier Factor (COF), which measures
the degree to which a data object is a Class Outlier. Our work
includes a new algorithm for mining Class Outliers, experimental
results on real-world datasets from various domains, and a comparison
with other related methods.
@article{"International Journal of Information, Control and Computer Sciences:60902",
  author = "Nabil M. Hewahi and Motaz K. Saad",
  title = "Class Outliers Mining: Distance-Based Approach",
  abstract = "In large datasets, identifying exceptional or rare cases
with respect to a group of similar cases is a very significant
problem. The traditional problem (Outlier Mining) is to find
exceptional or rare cases in a dataset irrespective of their class
labels; such cases are considered rare events with respect to the whole
dataset. In this research, we pose the problem of Class Outliers
Mining and present a method for finding such outliers. The general
definition of this problem is ``given a set of observations with class
labels, find those that arouse suspicions, taking into account the
class labels''. We introduce a novel definition of an outlier, the Class
Outlier, and propose the Class Outlier Factor (COF), which measures
the degree to which a data object is a Class Outlier. Our work
includes a new algorithm for mining Class Outliers, experimental
results on real-world datasets from various domains, and a comparison
with other related methods.",
  keywords = "Class Outliers, Distance-Based Approach, Outliers Mining",
  volume = "1",
  number = "9",
  pages = "2816-14",
}