Breast Cancer Survivability Prediction via Classifier Ensemble

This paper presents a classifier ensemble approach for
predicting the survivability of breast cancer patients using the
latest database version of the Surveillance, Epidemiology, and End
Results (SEER) Program of the National Cancer Institute. The system
consists of two main components: a feature selection component and a
classifier ensemble component. The feature selection component divides the
features in the SEER database into four groups. It then searches these
groups for the features that maximize the
weighted average F-score of a given classification algorithm. The
ensemble component uses three different classifiers, each of which
models a different set of features from SEER, chosen by the feature
selection module. On top of them, another classifier makes
the final decision based on the output decisions and confidence
scores of the underlying classifiers. Several classification
algorithms were examined; the best setup uses
decision tree, Bayesian network, and Naïve Bayes algorithms for the
underlying classifiers and Naïve Bayes for the classifier ensemble
step. The system outperforms all published systems to date when
evaluated on exactly the same SEER data (1973-2002).
It achieves a weighted average F-score of 87.39%, compared to 85.82% and
81.34% for the other published systems. When the data is extended to
cover the whole database (1973-2014), the overall weighted
average F-score rises to 92.4% on the held-out test set.
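
The abstract does not define the weighted average F-score. A common reading, assumed here, is the mean of the per-class F1 scores with each class weighted by its share of the true labels. The sketch below illustrates that reading only; the function name and the class labels are hypothetical and do not come from the paper.

```python
from collections import Counter


def weighted_f_score(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (assumed definition)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)    # number of true instances per class
    predicted = Counter(y_pred)  # number of predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of the data
    return score


# Hypothetical labels, for illustration only.
y_true = ["survived", "survived", "survived", "not_survived"]
y_pred = ["survived", "survived", "not_survived", "not_survived"]
print(round(weighted_f_score(y_true, y_pred), 4))
```

Weighting by class support keeps the majority class from being under-represented in the summary score, but it also means a weak minority-class F1 can be masked, so per-class scores are worth reporting alongside it.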



