Novel Hybrid Method for Gene Selection and Cancer Prediction

Microarray data profile gene expression on a whole-genome scale and therefore provide a good way to study associations between gene expression and the occurrence or progression of cancer. A growing number of researchers have recognized that microarray data are helpful for predicting cancer samples. However, the number of gene expression features is much larger than the sample size, which makes this task very difficult. Identifying the significant genes that cause cancer has therefore become an urgent, and also a popular and hard, research topic. Many feature selection algorithms proposed in the past focus on improving cancer predictive accuracy at the expense of ignoring the correlations between features. In this work, a novel framework (named SGS) is presented for stable gene selection and efficient cancer prediction. The proposed framework first applies a clustering algorithm to find gene groups in which the genes have high correlation coefficients, then selects the significant genes within each group with the Bayesian Lasso and the important gene groups with the group Lasso, and finally builds a prediction model on the shrunken gene space with an efficient classification algorithm (such as SVM, 1NN, or regression). Experimental results on real-world data show that the proposed framework often outperforms existing feature selection and prediction methods such as SAM, IG, and Lasso-type prediction models.
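To make the three-stage pipeline concrete, the following is a minimal sketch in Python with scikit-learn, under stated assumptions: ordinary Lasso stands in for the Bayesian Lasso within-group step, group importance is scored by the summed magnitude of within-group coefficients rather than a true group Lasso fit, and the data matrix is synthetic. It is an illustration of the overall flow, not the authors' implementation.

```python
# Sketch of an SGS-style pipeline: (1) cluster correlated genes,
# (2) sparse selection inside each group, (3) keep top groups, train SVM.
# Assumptions: Lasso replaces Bayesian Lasso; group scores replace group Lasso.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a microarray matrix: 60 samples, 500 genes.
X, y = make_classification(n_samples=60, n_features=500,
                           n_informative=20, random_state=0)
X = StandardScaler().fit_transform(X)

# Step 1: cluster genes so that each group holds highly correlated genes.
corr = np.corrcoef(X.T)                      # gene-by-gene correlation
dist = 1.0 - np.abs(corr)                    # correlation distance
groups = AgglomerativeClustering(n_clusters=50, metric="precomputed",
                                 linkage="average").fit_predict(dist)

# Step 2: within each group, keep genes with non-zero sparse coefficients.
selected, group_scores = [], {}
for g in np.unique(groups):
    idx = np.where(groups == g)[0]
    coef = Lasso(alpha=0.01).fit(X[:, idx], y).coef_
    selected.extend(idx[coef != 0].tolist())
    group_scores[g] = np.abs(coef).sum()     # proxy for group importance

# Step 3: keep genes from the highest-scoring groups, then classify.
top_groups = sorted(group_scores, key=group_scores.get, reverse=True)[:10]
final_genes = [j for j in selected if groups[j] in top_groups]
acc = cross_val_score(SVC(kernel="linear"), X[:, final_genes], y, cv=5).mean()
print(f"{len(final_genes)} genes selected, CV accuracy = {acc:.3f}")
```

In this sketch the number of gene clusters, the Lasso penalty, and the number of retained groups are arbitrary illustrative values; in practice they would be tuned, for example by cross-validation on the training data.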



