Feature Selection with Kohonen Self Organizing Classification Algorithm

This paper presents a feature selection procedure based on a one-dimensional Self Organizing Map (SOM). The algorithm first classifies the input dataset in a similarity space. From this classification, a set of positive and negative features is computed for each class, and these sets are returned as the result of the procedure. The procedure is evaluated on an in-house dataset from a Knowledge Discovery from Text (KDT) application and on a set of publicly available datasets used in international feature selection competitions; these datasets come from KDT, drug discovery, and other applications. The correct classification, which is known for the training and validation datasets, is used to optimize the parameters for positive and negative feature extraction. The process is made feasible for large and sparse datasets, such as those obtained in KDT applications, by compression techniques for storing the similarity matrix and by speed-up techniques for the Kohonen algorithm that exploit the sparsity of the input matrix. With these improvements, and by using the grid, the methodology can be applied to massive datasets.
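The pipeline described above (classification in a similarity space with a one-dimensional SOM, then per-class positive and negative features) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine similarity measure, the deterministic initialization from evenly spaced samples, the fixed presentation order, the decay schedules, the mean-difference score, and the parameters `pos_threshold` and `neg_threshold` are all assumptions made for illustration. The sketch uses dense arrays and omits the compression and sparsity speed-ups mentioned in the abstract.

```python
import numpy as np

def cosine_similarity_matrix(X):
    # Similarity space: each sample is described by its similarity to all samples.
    # Cosine similarity is an assumption; the abstract does not name the measure.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    Xn = X / norms
    return Xn @ Xn.T

def train_som_1d(S, n_units, epochs=50, lr0=0.5, sigma_end=0.1):
    # Online Kohonen training of units arranged on a line.
    n = S.shape[0]
    # Assumed initialization: units start at evenly spaced samples (deterministic).
    init_idx = np.linspace(0, n - 1, n_units).round().astype(int)
    W = S[init_idx].astype(float).copy()
    positions = np.arange(n_units)
    sigma0 = max(n_units / 2.0, 1.0)
    for t in range(epochs):
        frac = t / epochs
        lr = lr0 * (1.0 - frac)                          # linearly decaying learning rate
        sigma = sigma0 * (sigma_end / sigma0) ** frac    # shrinking neighborhood width
        for x in S:  # fixed presentation order, a simplification
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))  # best matching unit
            h = np.exp(-((positions - bmu) ** 2) / (2.0 * sigma ** 2))
            W += lr * h[:, None] * (x - W)               # pull the BMU and its neighbors
    return W

def select_features(X, n_units, pos_threshold=0.5, neg_threshold=0.5):
    X = np.asarray(X, dtype=float)
    S = cosine_similarity_matrix(X)
    W = train_som_1d(S, n_units)
    # Each sample is assigned to the class of its nearest unit.
    d2 = ((S[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    selected = {}
    for c in range(n_units):
        inside = labels == c
        if not inside.any() or inside.all():
            continue  # skip empty classes and the degenerate single-class case
        # Assumed score: mean feature value inside the class minus outside it.
        score = X[inside].mean(axis=0) - X[~inside].mean(axis=0)
        selected[c] = {
            "positive": np.flatnonzero(score >= pos_threshold).tolist(),
            "negative": np.flatnonzero(score <= -neg_threshold).tolist(),
        }
    return labels, selected

# Toy data: two groups of samples (rows) with disjoint feature sets (columns).
X_demo = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
])
labels_demo, selected_demo = select_features(X_demo, n_units=2)
```

In this sketch a feature is positive for a class when it is markedly more frequent inside the class than outside it, and negative when the opposite holds. When the correct classification is known, as for the training and validation datasets, the two thresholds could be tuned against it.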



