FCNN-MR: A Parallel Instance Selection Method Based on Fast Condensed Nearest Neighbor Rule

Instance selection (IS) techniques reduce the size of a training set in order
to improve the performance of data mining methods. Recently, to process very
large data sets, several proposed methods divide the training set into
disjoint subsets and apply an IS algorithm to each subset independently. In
this paper, we analyze the limitations of these methods and give our
viewpoint on how to divide and conquer within the IS procedure. Then, based
on the fast condensed nearest neighbor (FCNN) rule, we propose an instance
selection method for large data sets built on the MapReduce framework.
Besides preserving prediction accuracy and reduction rate, it has two
desirable properties: first, it reduces the workload on the aggregation node;
second, and most importantly, it produces the same result as the sequential
version, which other parallel methods cannot achieve. We evaluate the
performance of FCNN-MR on one small data set and two large data sets. The
experimental results show that it is effective and practical.
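For concreteness, the following is a minimal sketch of the sequential FCNN1
rule (Angiulli, 2005) on which the method builds. The Python/NumPy code is
illustrative only: the function name and the brute-force distance computation
are our own simplifications, not the paper's implementation.

import numpy as np

def fcnn(X, y):
    """Sequential FCNN1 rule (after Angiulli, 2005), sketched.

    X: (n, d) array of training instances; y: (n,) array of labels.
    Returns indices of a training-set-consistent subset S, i.e. every
    training point is correctly classified by its nearest neighbor in S.
    """
    S = []
    # Seed S with, for each class, the instance nearest to the class centroid.
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        centroid = X[idx].mean(axis=0)
        S.append(idx[np.argmin(np.linalg.norm(X[idx] - centroid, axis=1))])
    added = True
    while added:
        added = False
        # Assign every point to its nearest selected instance (its Voronoi
        # cell owner with respect to the current subset S).
        D = np.linalg.norm(X[:, None, :] - X[S][None, :, :], axis=2)
        owner = np.argmin(D, axis=1)  # position within S
        reps = []
        for j, p in enumerate(S):
            # Points in the cell of p that the current subset misclassifies.
            cell = np.flatnonzero((owner == j) & (y != y[p]))
            if cell.size:
                # Representative: the misclassified point closest to p.
                reps.append(cell[np.argmin(
                    np.linalg.norm(X[cell] - X[p], axis=1))])
        if reps:
            S.extend(reps)
            added = True
    return np.asarray(S)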
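The divide-and-conquer schemes discussed above can be summarized by the
following generic sketch (again our own illustration, not any specific
published method): each disjoint partition is condensed independently and
the selections are merged.

def partitioned_is(X, y, k):
    """Naive parallel IS: condense k disjoint partitions independently,
    then union the selections (illustrative sketch only)."""
    rng = np.random.default_rng(0)
    perm = rng.permutation(len(y))   # random disjoint partitioning
    selected = []
    for part in np.array_split(perm, k):
        # Local FCNN sees only this partition's neighborhood structure.
        selected.extend(part[fcnn(X[part], y[part])])
    return np.asarray(selected)

Because each run sees only its own partition's Voronoi structure, the merged
subset generally differs from fcnn(X, y) computed on the full set; avoiding
exactly this discrepancy is the property FCNN-MR claims over other parallel
methods.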