Serial Analysis of Gene Expression (SAGE) is a powerful
quantification technique for generating gene expression data from
cells or tissues. Gene expression profiles of cells or tissues in
several different states are difficult for biologists to analyze
because of the large number of genes typically involved. Feature
selection, a machine learning technique, can mitigate this problem: it
reduces the features (genes) in a given SAGE data set and retains only
the relevant genes. In this study, we used a genetic algorithm to
implement feature selection and evaluated the classification accuracy
of the selected features with the K-nearest neighbor method. To
validate the proposed method, we tested it on two SAGE data sets. The
results show that the number of features in the original SAGE data
sets can be significantly reduced while achieving higher
classification accuracy.
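The approach summarized above can be illustrated with a minimal sketch: a genetic algorithm searches over binary masks (1 = keep the gene, 0 = drop it), and each mask is scored by the accuracy of a nearest-neighbor classifier restricted to the kept features. The abstract does not give the algorithm's settings, so everything specific here is an assumption made for illustration: K = 1, leave-one-out evaluation, the small penalty on mask size, tournament selection, one-point crossover, elitism, the rates, and the synthetic toy data. It is not the authors' implementation.

```python
import random


def knn_loocv_accuracy(X, y, mask):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on the masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature set cannot classify anything
    correct = 0
    for i, xi in enumerate(X):
        best_dist, best_label = None, None
        for k, xk in enumerate(X):
            if k == i:
                continue  # leave sample i out
            dist = sum((xi[j] - xk[j]) ** 2 for j in cols)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += int(best_label == y[i])
    return correct / len(X)


def fitness(X, y, mask, penalty=0.01):
    """Accuracy minus a small penalty on the fraction of features kept (assumed form)."""
    return knn_loocv_accuracy(X, y, mask) - penalty * sum(mask) / len(mask)


def ga_select(X, y, pop_size=20, generations=20,
              crossover_rate=0.9, mutation_rate=0.05, seed=0):
    """Return the best feature mask found; requires at least two features."""
    rng = random.Random(seed)
    n = len(X[0])
    # Include the full feature set so the result never scores below that baseline.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    for _ in range(generations):
        scored = [(fitness(X, y, m), m) for m in pop]
        elite = max(scored, key=lambda s: s[0])[1]

        def pick():
            # Binary tournament: the fitter of two random individuals wins.
            a, b = rng.sample(scored, 2)
            return a[1] if a[0] >= b[0] else b[1]

        children = [elite[:]]  # elitism: carry the best mask forward unchanged
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
    return max(pop, key=lambda m: fitness(X, y, m))


# Synthetic toy data: feature 0 separates the two classes, features 1-7 are noise.
data_rng = random.Random(1)
X, y = [], []
for i in range(20):
    label = i % 2
    X.append([10.0 * label + data_rng.random()]
             + [data_rng.random() for _ in range(7)])
    y.append(label)

best = ga_select(X, y)
print("selected features:", [j for j, keep in enumerate(best) if keep])
print("leave-one-out accuracy:", knn_loocv_accuracy(X, y, best))
```

On real SAGE data the rows of `X` would be libraries and the columns tag counts. The size penalty is one simple way to favor smaller gene subsets when accuracies tie, and other fitness definitions are equally plausible.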
@article{"International Journal of Information, Control and Computer Sciences:59529",
  author = "Cheng-Hong Yang and Tsung-Mu Shih and Li-Yeh Chuang",
  title = "Reducing SAGE Data Using Genetic Algorithms",
  abstract = "Serial Analysis of Gene Expression is a powerful quantification technique for generating cell or tissue gene expression data. The profile of the gene expression of cell or tissue in several different states is difficult for biologists to analyze because of the large number of genes typically involved. However, feature selection in machine learning can successfully reduce this problem. The method allows reducing the features (genes) in specific SAGE data, and determines only relevant genes. In this study, we used a genetic algorithm to implement feature selection, and evaluate the classification accuracy of the selected features with the K-nearest neighbor method. In order to validate the proposed method, we used two SAGE data sets for testing. The results of this study conclusively prove that the number of features of the original SAGE data set can be significantly reduced and higher classification accuracy can be achieved.",
  keywords = "Serial Analysis of Gene Expression, Feature selection, Genetic Algorithm, K-nearest neighbor method.",
  volume = "3",
  number = "5",
  pages = "1392-5",
}