Latent Topic Based Medical Data Classification

This paper presents a classification process for medical data based on latent topic discovery, demonstrated on the ACM KDD Cup 2008 data set. The data set is extremely imbalanced: the target class makes up only 0.6% of the samples, while outliers account for the remaining 99.4%. We use it as an example to show how latent topic discovery and noise reduction techniques can handle such a heavily biased data set. Our experiment faces two major challenges: (1) the outliers are extremely widely distributed, and (2) positive samples are far fewer than negative ones. We propose a process flow suited to these issues and obtain a best AUC of 0.98.
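As an illustration of why AUC is a natural metric at this class ratio, the following minimal sketch computes a rank-based AUC on a toy set with the same 0.6% / 99.4% split. The function name `auc`, the toy sample size of 1000, and the two hypothetical scorers are assumptions made for illustration; this is not the evaluation code used in the experiments.

```python
def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    receives a higher score than a randomly chosen negative (ties count 0.5)."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied samples share the average of their 1-based ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    # Mann-Whitney U statistic of the positives, normalised to [0, 1].
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy data with the abstract's class ratio: 6 positives, 994 negatives.
labels = [1] * 6 + [0] * 994

# Hypothetical scorer A: the same score for every sample (always "negative").
constant_scores = [0.0] * len(labels)
accuracy_all_negative = sum(1 for y in labels if y == 0) / len(labels)

# Hypothetical scorer B: separates the two classes perfectly.
perfect_scores = [float(y) for y in labels]

print("accuracy of always-negative:", accuracy_all_negative)
print("AUC of constant scorer:", auc(constant_scores, labels))
print("AUC of perfect scorer:", auc(perfect_scores, labels))
```

A classifier that labels every sample as negative reaches 99.4% accuracy on such data while ranking no better than chance, which is why a ranking-based measure is more informative here than accuracy.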




