Reducing SAGE Data Using Genetic Algorithms

Serial Analysis of Gene Expression (SAGE) is a powerful technique for quantifying gene expression in cells or tissues. Gene expression profiles of a cell or tissue across several different states are difficult for biologists to analyze because of the large number of genes typically involved. Feature selection, a machine learning technique, can alleviate this problem: it reduces the features (genes) in a given SAGE data set and retains only the relevant genes. In this study, we used a genetic algorithm to perform feature selection and evaluated the classification accuracy of the selected features with the K-nearest neighbor method. To validate the proposed method, we tested it on two SAGE data sets. The results indicate that the number of features in the original SAGE data sets can be significantly reduced while achieving higher classification accuracy.
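
The approach described above, a genetic algorithm searching over gene subsets with K-nearest neighbor accuracy as the quality measure, can be illustrated with a minimal sketch. The abstract does not specify the chromosome encoding, the genetic operators, the fitness function, the value of K, or the validation scheme, so everything below is an assumption made for illustration: a binary mask per gene, tournament selection, one-point crossover, bit-flip mutation, elitism, leave-one-out validation with K = 3, and a small hypothetical penalty on the fraction of retained genes. The function names `knn_loocv_accuracy`, `fitness`, and `ga_select` are likewise hypothetical.

```python
import math
import random
from collections import Counter


def knn_loocv_accuracy(X, y, mask, k=3):
    """Leave-one-out accuracy of k-NN using only features where mask[j] == 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty gene subset cannot classify anything
    correct = 0
    for i, xi in enumerate(X):
        # Euclidean distance from sample i to every other sample.
        dists = []
        for m, xm in enumerate(X):
            if m != i:
                d = math.sqrt(sum((xi[j] - xm[j]) ** 2 for j in cols))
                dists.append((d, y[m]))
        dists.sort(key=lambda t: t[0])
        votes = Counter(label for _, label in dists[:k])
        if votes.most_common(1)[0][0] == y[i]:
            correct += 1
    return correct / len(X)


def fitness(X, y, mask, penalty=0.01):
    # Assumed fitness: accuracy minus a small cost for the fraction of genes kept.
    return knn_loocv_accuracy(X, y, mask) - penalty * sum(mask) / len(mask)


def ga_select(X, y, pop_size=20, generations=15, p_cross=0.8, p_mut=0.05, seed=0):
    """Return a 0/1 mask over features found by a simple genetic algorithm."""
    rng = random.Random(seed)
    n = len(X[0])
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(pop, key=lambda m: fitness(X, y, m))
    for _ in range(generations):
        scored = [(fitness(X, y, m), m) for m in pop]

        def tournament():
            # Pick two individuals at random and keep the fitter one.
            a, b = rng.sample(scored, 2)
            return (a if a[0] >= b[0] else b)[1]

        nxt = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(nxt) < pop_size:
            p1, p2 = tournament(), tournament()
            if n > 1 and rng.random() < p_cross:
                cut = rng.randint(1, n - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation: toggle each gene with probability p_mut.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            nxt.append(child)
        pop = nxt
        best = max(pop, key=lambda m: fitness(X, y, m))
    return best
```

Here `X` is a list of samples (each a list of per-gene expression values, for example tag counts) and `y` the corresponding class labels. Calling `ga_select(X, y)` returns a mask whose 1-entries mark the retained genes, and `knn_loocv_accuracy(X, y, mask)` reports the accuracy of that subset. For real SAGE libraries with thousands of tags, this pure-Python evaluation would be slow; a vectorized distance computation would be the natural replacement.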



