Clustering Multivariate Empiric Characteristic Functions for Multi-Class SVM Classification

A dissimilarity measure between the empiric characteristic functions of the subsamples associated with the different classes in a multivariate data set is proposed. This measure can be computed efficiently, and it depends on all the cases of each class. It may be used to find groups of similar classes, which could be joined for further analysis, or to perform an agglomerative hierarchical cluster analysis of the set of classes. The resulting tree can serve to build a family of binary classification models, offering an alternative approach to the multi-class SVM problem. We have compared this dendrogram-based SVM approach with the one-against-one SVM approach on four publicly available data sets, three of them being microarray data. The two approaches performed equivalently, but the dendrogram-based one requires fewer binary SVM models.
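The pipeline described above can be illustrated with a minimal Python sketch. The abstract does not give the formula of the proposed measure, so the sketch assumes one common choice: the squared difference between two empiric characteristic functions, integrated against a Gaussian weight. This integral has a closed form as averages of a Gaussian kernel over all pairs of cases, so it uses every case of each class. The function names, the bandwidth sigma, the average-linkage rule, and the toy data are all assumptions made for illustration and are not taken from the paper.

```python
import math
from itertools import combinations


def mean_kernel(xs, ys, sigma=1.0):
    """Average Gaussian kernel value over all pairs (x, y)."""
    total = 0.0
    for x in xs:
        for y in ys:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
            total += math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total / (len(xs) * len(ys))


def ecf_dissimilarity(xs, ys, sigma=1.0):
    """Assumed measure: integral of |ecf_x(t) - ecf_y(t)|^2 under a
    Gaussian weight on t, written in its closed kernel form."""
    return (mean_kernel(xs, xs, sigma) + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))


def class_dendrogram(samples, sigma=1.0):
    """Average-linkage agglomeration of classes (linkage rule assumed).

    samples maps a class label to its list of cases; the result is a
    nested tuple whose leaves are the class labels."""
    labels = sorted(samples)
    dist = {frozenset(pair): ecf_dissimilarity(samples[pair[0]],
                                               samples[pair[1]], sigma)
            for pair in combinations(labels, 2)}

    def average_link(p, q):
        return sum(dist[frozenset((a, b))] for a in p for b in q) / (len(p) * len(q))

    # Each cluster is keyed by the tuple of labels it contains.
    clusters = {(label,): label for label in labels}
    while len(clusters) > 1:
        p, q = min(combinations(clusters, 2), key=lambda pq: average_link(*pq))
        clusters[p + q] = (clusters.pop(p), clusters.pop(q))
    return next(iter(clusters.values()))


def count_binary_models(tree):
    """One binary classifier per internal node of the dendrogram."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + count_binary_models(tree[0]) + count_binary_models(tree[1])


# Toy data (hypothetical): four classes forming two well-separated pairs.
samples = {
    "a": [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3)],
    "b": [(0.5, 0.2), (0.7, 0.0), (0.4, 0.4)],
    "c": [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1)],
    "d": [(5.6, 5.3), (5.4, 5.5), (5.8, 5.0)],
}
tree = class_dendrogram(samples)
k = len(samples)
n_dendrogram = count_binary_models(tree)   # k - 1 internal nodes
n_one_vs_one = k * (k - 1) // 2            # one model per pair of classes
```

A binary tree over K classes has K - 1 internal nodes, so a dendrogram-based scheme trains K - 1 binary models where one-against-one trains K(K - 1)/2. With the four toy classes this is 3 against 6. In a full implementation each internal node would hold a binary SVM separating the classes in its left subtree from those in its right subtree.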



