Evaluation of the Impact of Dataset Characteristics for Classification Problems in Biological Applications

The availability of high-dimensional biological datasets, such as those from gene expression, proteomic, and metabolic experiments, can be leveraged for the diagnosis and prognosis of diseases. Many classification methods have been studied in this area to predict disease states and to distinguish between predefined classes, such as patients with a specific disease versus healthy controls. However, most existing research focuses on a single dataset. A generic comparison between classifiers is lacking, although it could give biologists and bioinformaticians a guideline for selecting a suitable algorithm for new datasets. In this study, we compare the performance of popular classifiers, namely Support Vector Machine (SVM), Logistic Regression, k-Nearest Neighbor (k-NN), Naive Bayes, Decision Tree, and Random Forest, on mock datasets. We mimic common biological scenarios by simulating various proportions of truly discriminating biomarkers and different effect sizes thereof. The results show that SVM performs stably and reaches a higher area under the receiver operating characteristic curve (AUC) than the other methods. This may be explained by the ability of SVM to minimize the probability of error. Moreover, Decision Tree, with its good applicability for diagnosis and prognosis, shows good performance in our experimental setup. Logistic Regression and Random Forest, however, depend strongly on the ratio of discriminators and perform better when the number of discriminators is higher.
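The simulation idea described above can be illustrated with a minimal sketch. It is not the study's actual setup: the Gaussian feature model, the sample sizes, the function names, and the parameter values are all assumptions made for illustration, and a simple difference-of-class-means scorer stands in for the six classifiers so that the example needs only the Python standard library. It shows how the proportion of discriminating features and their effect size could be varied, and how AUC can be computed from the resulting scores.

```python
import random


def make_mock_dataset(n_per_class, n_features, frac_discriminating, effect_size, rng):
    """Two-class Gaussian data (assumed model, not the paper's generator).

    The first k features are shifted by effect_size in class 1; all other
    features are pure noise with mean 0 and unit variance in both classes.
    """
    k = round(n_features * frac_discriminating)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [
                rng.gauss(effect_size if (label == 1 and j < k) else 0.0, 1.0)
                for j in range(n_features)
            ]
            X.append(row)
            y.append(label)
    return X, y


def auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid_direction(X, y):
    """Per-feature difference of class means.

    Hypothetical stand-in scorer; it is not one of the classifiers compared
    in the study.
    """
    n_features = len(X[0])
    means = {}
    for label in (0, 1):
        rows = [x for x, l in zip(X, y) if l == label]
        means[label] = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
    return [means[1][j] - means[0][j] for j in range(n_features)]


def evaluate(frac_discriminating, effect_size, seed=0):
    """Train on one mock dataset, score an independent one, and return the test AUC."""
    rng = random.Random(seed)
    X_train, y_train = make_mock_dataset(50, 100, frac_discriminating, effect_size, rng)
    X_test, y_test = make_mock_dataset(50, 100, frac_discriminating, effect_size, rng)
    w = centroid_direction(X_train, y_train)
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for x in X_test]
    return auc(scores, y_test)


if __name__ == "__main__":
    # Sweep the two dataset characteristics named in the abstract.
    for frac in (0.01, 0.1, 0.5):
        for effect in (0.2, 0.5, 1.0):
            print(f"frac={frac:<4} effect={effect:<3} AUC={evaluate(frac, effect):.3f}")
```

To compare actual classifiers, the stand-in scorer would be replaced by each trained model's predicted scores on the test set, with the same AUC computation applied to every method.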




