Class Outliers Mining: Distance-Based Approach

In large datasets, identifying exceptional or rare cases with respect to a group of similar cases is a significant problem. The traditional problem (Outlier Mining) is to find exceptional or rare cases in a dataset irrespective of their class labels; such cases are considered rare events with respect to the whole dataset. In this research, we pose the problem of Class Outliers Mining and present a method for finding such outliers. The general definition of this problem is “given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels”. We introduce a novel definition of outlier, the Class Outlier, and propose the Class Outlier Factor (COF), which measures the degree to which a data object is a Class Outlier. Our work includes a new algorithm for mining Class Outliers, experimental results on real-world datasets from various domains, and a comparison with related methods.
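To make the problem statement concrete, the following is a minimal sketch of one distance-based way to flag observations whose class label looks suspicious. It is not the Class Outlier Factor, whose definition is not given here. The function name `class_disagreement_scores`, the parameter `k`, the Euclidean distance, and the toy data are all assumptions made for illustration. Each observation is scored by the fraction of its k nearest neighbours that carry a different class label.

```python
from math import dist  # Euclidean distance, Python 3.8+


def class_disagreement_scores(points, labels, k=3):
    """Score each point by the share of its k nearest neighbours
    whose class label differs from its own (illustrative only)."""
    scores = []
    for i, p in enumerate(points):
        # Indices of all other points, nearest first.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        if not neighbours:
            # A single-point dataset has no neighbours to compare with.
            scores.append(0.0)
            continue
        differing = sum(1 for j in neighbours if labels[j] != labels[i])
        scores.append(differing / len(neighbours))
    return scores


# Toy data (hypothetical): the point at index 3 is labelled "b"
# but lies inside the cluster of "a" points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
labels = ["a", "a", "a", "b", "b", "b", "b"]

scores = class_disagreement_scores(points, labels, k=3)
most_suspicious = scores.index(max(scores))
```

In this toy data the point at index 3 would not stand out in a conventional outlier search, since it lies in a dense region, yet it receives the highest score once class labels are taken into account.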



