A Hybrid Approach for Selection of Relevant Features for Microarray Datasets

Developing an accurate classifier for high-dimensional microarray datasets is challenging because only a small number of samples is available. It is therefore important to determine a set of relevant genes that classifies the data well. Traditionally, gene selection methods select the top-ranked genes according to their discriminatory power. These genes are often correlated with each other, which results in redundancy. In this paper, we propose a hybrid method that combines feature ranking with a wrapper method (a genetic algorithm with a multiclass SVM) to identify a set of relevant genes that classifies the data more accurately. A new fitness function for the genetic algorithm is defined that focuses on selecting the smallest set of genes that provides maximum accuracy. Experiments have been carried out on four well-known datasets. The proposed method provides better results than those reported in the literature in terms of both classification accuracy and number of genes selected.
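The abstract does not state the fitness function itself, so the following is only a minimal sketch of one plausible form that rewards high accuracy and small gene subsets. The weighted-sum formula, the name `subset_fitness`, the `weight` parameter, and the toy accuracy function are all assumptions made for illustration; in the setting described above, `accuracy_of` would be replaced by the accuracy of a multiclass SVM trained on the selected genes.

```python
from typing import Callable, Sequence


def subset_fitness(mask: Sequence[int],
                   accuracy_of: Callable[[Sequence[int]], float],
                   weight: float = 0.8) -> float:
    """Score a candidate gene subset encoded as a 0/1 mask.

    Assumed (hypothetical) form, not taken from the paper:
        weight * accuracy + (1 - weight) * fraction_of_genes_dropped
    """
    selected = [i for i, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0  # an empty subset cannot classify anything
    accuracy = accuracy_of(selected)  # e.g. cross-validated classifier accuracy
    compactness = 1.0 - len(selected) / len(mask)
    return weight * accuracy + (1.0 - weight) * compactness


def toy_accuracy(selected: Sequence[int]) -> float:
    """Toy stand-in for a classifier: genes 0 and 2 are the informative ones."""
    return 0.5 + 0.25 * len({0, 2} & set(selected))


# Both subsets reach the same toy accuracy, but the smaller one scores higher.
small = subset_fitness([1, 0, 1, 0, 0], toy_accuracy)
large = subset_fitness([1, 1, 1, 1, 1], toy_accuracy)
```

A genetic algorithm would evaluate this score for every chromosome (mask) in the population. Restricting the mask to the top-ranked genes from the ranking step keeps the search space manageable.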




