Feature Selection with Kohonen Self Organizing Classification Algorithm
This paper presents a one-dimensional Self Organizing Map (SOM)
algorithm for feature selection. The algorithm first classifies the
input dataset in a similarity space. From this classification, a set of
positive and negative features is computed for each class, and these
features are returned as the result of the procedure. The procedure is
evaluated on an in-house dataset from a Knowledge Discovery from Text
(KDT) application and on a set of publicly available datasets used in
international feature selection competitions. These datasets come
from KDT, drug discovery, and other applications. The known correct
classification of the training and validation datasets is used to
optimize the parameters for positive and negative feature extraction.
The process becomes feasible for large and sparse datasets, such as
those obtained in KDT applications, by combining compression
techniques for storing the similarity matrix with speed-up techniques
for the Kohonen algorithm that exploit the sparsity of the input
matrix. With the use of the grid, these improvements make the
methodology applicable to massive datasets.
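
The pipeline outlined in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's implementation: the abstract does not specify the similarity measure, the SOM training schedule, or the criterion for positive and negative features. Cosine similarity, a linearly decaying learning rate and neighbourhood width, and ranking features by the difference between the class mean and the overall mean are all choices made here for illustration. All function names are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(X):
    # Assumption: cosine similarity stands in for the paper's similarity space.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    Xn = X / norms
    return Xn @ Xn.T

def train_som_1d(S, n_units, n_epochs=50, lr0=0.5, seed=0):
    # One-dimensional SOM: units sit on a chain at positions 0..n_units-1.
    rng = np.random.default_rng(seed)
    n_samples = S.shape[0]
    # Initialise unit weights from randomly chosen input rows.
    idx = rng.choice(n_samples, size=n_units, replace=n_units > n_samples)
    W = S[idx].astype(float)
    positions = np.arange(n_units)
    sigma0 = max(n_units / 2.0, 1.0)
    total_steps = n_epochs * n_samples
    step = 0
    for _ in range(n_epochs):
        for i in rng.permutation(n_samples):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # assumed linear decay
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # assumed linear shrink
            # Best matching unit: closest weight vector to the sample.
            bmu = np.argmin(np.linalg.norm(W - S[i], axis=1))
            # Gaussian neighbourhood along the chain.
            h = np.exp(-((positions - bmu) ** 2) / (2.0 * sigma ** 2))
            W += lr * h[:, None] * (S[i] - W)
            step += 1
    return W

def assign_classes(S, W):
    # Each item is assigned to the class of its best matching unit.
    d = np.linalg.norm(S[:, None, :] - W[None, :, :], axis=2)
    return np.argmin(d, axis=1)

def positive_negative_features(X, labels, k=2):
    # Hypothetical criterion: features whose class mean lies furthest above
    # (positive) or below (negative) the overall mean.
    overall = X.mean(axis=0)
    result = {}
    for c in np.unique(labels):
        diff = X[labels == c].mean(axis=0) - overall
        order = np.argsort(diff)
        result[int(c)] = {
            "positive": order[::-1][:k].tolist(),
            "negative": order[:k].tolist(),
        }
    return result
```

A typical call sequence would be `S = cosine_similarity_matrix(X)`, `W = train_som_1d(S, n_units)`, `labels = assign_classes(S, W)`, and finally `positive_negative_features(X, labels, k)`. The dense similarity matrix used here would not scale to the large, sparse datasets the abstract targets; the compression and sparsity-aware speed-ups it mentions are not reproduced in this sketch.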
@article{"International Journal of Information, Control and Computer Sciences:61339", author = "Francesco Maiorana", title = "Feature Selection with Kohonen Self Organizing Classification Algorithm", abstract = "In this paper a one-dimension Self Organizing Map
algorithm (SOM) to perform feature selection is presented. The
algorithm is based on a first classification of the input dataset on a
similarity space. From this classification for each class a set of
positive and negative features is computed. This set of features is
selected as result of the procedure. The procedure is evaluated on an
in-house dataset from a Knowledge Discovery from Text (KDT)
application and on a set of publicly available datasets used in
international feature selection competitions. These datasets come
from KDT applications, drug discovery as well as other applications.
The knowledge of the correct classification available for the training
and validation datasets is used to optimize the parameters for positive
and negative feature extractions. The process becomes feasible for
large and sparse datasets, as the ones obtained in KDT applications,
by using both compression techniques to store the similarity matrix
and speed up techniques of the Kohonen algorithm that take
advantage of the sparsity of the input matrix. These improvements
make it feasible, by using the grid, the application of the
methodology to massive datasets.", keywords = "Clustering algorithm, Data mining, Feature
selection, Grid, Kohonen Self Organizing Map.", volume = "2", number = "9", pages = "3169-6", }