A Novel Prediction Method for Tag SNP Selection using Genetic Algorithm based on KNN

Single nucleotide polymorphisms (SNPs) hold much promise as a basis for disease-gene association. However, research is limited by the cost of genotyping the tremendous number of SNPs. Therefore, it is important to identify a small subset of informative SNPs, the so-called tag SNPs. This subset consists of selected SNPs of the genotypes, and accurately represents the rest of the SNPs. Furthermore, an effective evaluation method is needed to evaluate prediction accuracy of a set of tag SNPs. In this paper, a genetic algorithm (GA) is applied to tag SNP problems, and the K-nearest neighbor (K-NN) serves as a prediction method of tag SNP selection. The experimental data used was taken from the HapMap project; it consists of genotype data rather than haplotype data. The proposed method consistently identified tag SNPs with considerably better prediction accuracy than methods from the literature. At the same time, the number of tag SNPs identified was smaller than the number of tag SNPs in the other methods. The run time of the proposed method was much shorter than the run time of the SVM/STSA method when the same accuracy was reached.





References:
[1] D. Brinza and A. Zelikovsky, "2SNP: scalable phasing based on 2-SNP
haplotypes," Bioinformatics, vol. 22, pp. 371-3, Feb 1 2006.
[2] S. Buch, C. Schafmayer, H. Volzke, C. Becker, A. Franke, H. von
Eller-Eberstein, C. Kluck, I. Bassmann, M. Brosch, F. Lammert, J. F.
Miquel, F. Nervi, M. Wittig, D. Rosskopf, B. Timm, C. Holl, M. Seeger,
A. ElSharawy, T. Lu, J. Egberts, F. Fandrich, U. R. Folsch, M. Krawczak,
S. Schreiber, P. Nurnberg, J. Tepel, and J. Hampe, "A genome-wide
association scan identifies the hepatic cholesterol transporter ABCG8 as
a susceptibility factor for human gallstone disease," Nat Genet, vol. 39,
pp. 995-9, Aug 2007.
[3] B. W. Zanke, C. M. Greenwood, J. Rangrej, R. Kustra, A. Tenesa, S. M.
Farrington, J. Prendergast, S. Olschwang, T. Chiang, E. Crowdy, V.
Ferretti, P. Laflamme, S. Sundararajan, S. Roumy, J. F. Olivier, F.
Robidoux, R. Sladek, A. Montpetit, P. Campbell, S. Bezieau, A. M.
O'Shea, G. Zogopoulos, M. Cotterchio, P. Newcomb, J. McLaughlin, B.
Younghusband, R. Green, J. Green, M. E. Porteous, H. Campbell, H.
Blanche, M. Sahbatou, E. Tubacher, C. Bonaiti-Pellie, B. Buecher, E.
Riboli, S. Kury, S. J. Chanock, J. Potter, G. Thomas, S. Gallinger, T. J.
Hudson, and M. G. Dunlop, "Genome-wide association scan identifies a
colorectal cancer susceptibility locus on chromosome 8q24," Nat Genet,
vol. 39, pp. 989-94, Aug 2007.
[4] Y. J. Yoo, J. Tang, R. A. Kaslow, and K. Zhang, "Haplotype inference for
present absent genotype data using previously identified haplotypes and
haplotype patterns." vol. 23: Oxford Univ Press, 2007, p. 2399.
[5] O. Delaneau, C. Coulonges, P. Y. Boelle, G. Nelson, J. L. Spadoni, and J.
F. Zagury, "ISHAPE: new rapid and accurate software for haplotyping,"
BMC Bioinformatics, vol. 8, p. 205, 2007.
[6] S. B. Gabriel, S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy, B.
Blumenstiel, J. Higgins, M. DeFelice, A. Lochner, M. Faggart, S. N.
Liu-Cordero, C. Rotimi, A. Adeyemo, R. Cooper, R. Ward, E. S. Lander,
M. J. Daly, and D. Altshuler, "The structure of haplotype blocks in the
human genome," Science, vol. 296, pp. 2225-9, Jun 21 2002.
[7] M. J. Daly, J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander,
"High-resolution haplotype structure in the human genome," Nat Genet,
vol. 29, pp. 229-32, Oct 2001.
[8] N. Patil, A. J. Berno, D. A. Hinds, W. A. Barrett, J. M. Doshi, C. R.
Hacker, C. R. Kautzer, D. H. Lee, C. Marjoribanks, D. P. McDonough, B.
T. Nguyen, M. C. Norris, J. B. Sheehan, N. Shen, D. Stern, R. P.
Stokowski, D. J. Thomas, M. O. Trulson, K. R. Vyas, K. A. Frazer, S. P.
Fodor, and D. R. Cox, "Blocks of limited haplotype diversity revealed by
high-resolution scanning of human chromosome 21," Science, vol. 294,
pp. 1719-23, Nov 23 2001.
[9] X. Ke and L. R. Cardon, "Efficient selective screening of haplotype tag
SNPs," Bioinformatics, vol. 19, pp. 287-8, Jan 22 2003.
[10] K. Zhang, M. Deng, T. Chen, M. S. Waterman, and F. Sun, "A dynamic
programming algorithm for haplotype block partitioning," Proc Natl
Acad Sci U S A, vol. 99, pp. 7335-9, May 28 2002.
[11] K. Zhang and L. Jin, "HaploBlockFinder: haplotype block analyses,"
Bioinformatics, vol. 19, pp. 1300-1, Jul 1 2003.
[12] G. C. Johnson, L. Esposito, B. J. Barratt, A. N. Smith, J. Heward, G. Di
Genova, H. Ueda, H. J. Cordell, I. A. Eaves, F. Dudbridge, R. C. Twells,
F. Payne, W. Hughes, S. Nutland, H. Stevens, P. Carr, E.
Tuomilehto-Wolf, J. Tuomilehto, S. C. Gough, D. G. Clayton, and J. A.
Todd, "Haplotype tagging for the identification of common disease
genes," Nat Genet, vol. 29, pp. 233-7, Oct 2001.
[13] T. U. M. Phuong, Z. Lin, and R. B. Altman, "CHOOSING SNPs USING
FEATURE SELECTION." vol. 4, 2006, pp. 241-257.
[14] V. Bafna, B. V. Halldorsson, R. Schwartz, A. G. Clark, and S. Istrail,
"Haplotypes and informative SNP selection algorithms: don't block out
information," ACM New York, NY, USA, 2003, pp. 19-27.
[15] B. V. Halldorsson, V. Bafna, R. Lippert, R. Schwartz, F. M. De La Vega,
A. G. Clark, and S. Istrail, "Optimal haplotype block-free selection of
tagging SNPs for genome-wide association studies," Genome Res, vol.
14, pp. 1633-40, Aug 2004.
[16]E. Halperin, G. Kimmel, and R. Shamir, "Tag SNP selection in genotype
data for maximizing SNP prediction accuracy," Bioinformatics, vol. 21
Suppl 1, pp. i195-203, Jun 2005.
[17]P. H. Lee and H. Shatkay, "BNTagger: improved tagging SNP selection
using Bayesian networks," Bioinformatics, vol. 22, pp. e211-9, Jul 15
2006.
[18] Z. Liu, S. Lin, and M. Tan, "Genome-wide tagging SNPs with
entropy-based Monte Carlo method," J Comput Biol, vol. 13, pp.
1606-14, Nov 2006.
[19] C. S. Carlson, M. A. Eberle, M. J. Rieder, Q. Yi, L. Kruglyak, and D. A.
Nickerson, "Selecting a maximally informative set of single-nucleotide
polymorphisms for association analyses using linkage disequilibrium,"
Am J Hum Genet, vol. 74, pp. 106-20, Jan 2004.
[20] K. Zhang, Z. Qin, T. Chen, J. S. Liu, M. S. Waterman, and F. Sun,
"HapBlock: haplotype block partitioning and tag SNP selection software
using a set of dynamic programming algorithms," Bioinformatics, vol.
21, pp. 131-4, Jan 1 2005.
[21] J. He and A. Zelikovsky, "MLR-tagging: informative SNP selection for
unphased genotypes based on multiple linear regression,"
Bioinformatics, vol. 22, pp. 2558-61, Oct 15 2006.
[22] J. He and A. Zelikovsky, "Informative SNP selection methods based on
SNP prediction," IEEE Trans Nanobioscience, vol. 6, pp. 60-7, Mar
2007.
[23] K. Zhang, T. Chen, M. S. Waterman, and F. Sun, "A Set of Dynamic
Programming Algorithms for Haplotype Block Partitioning and Tag SNP
Selection via Haplotype Data or Genotype Data," pp. 1-26.
[24] B. Devlin and N. Risch, "A comparison of linkage disequilibrium
measures for fine-scale mapping," Genomics, vol. 29, pp. 311-22, Sep 20
1995.
[25] H. I. Avi-Itzhak, X. Su, and F. M. De La Vega, "Selection of minimum
subsets of single nucleotide polymorphisms to capture haplotype block
diversity," Pac Symp Biocomput, pp. 466-77, 2003.
[26] J. H. Holland, Adaptation in natural and artificial systems: MIT Press
Cambridge, MA, USA, 1992.
[27] E. Fix and J. Hodges, "Discriminatory Analysis-Nonparametric
Discrimination: Consistency Properties," Storming Media, 1951.
[28] D. E. Goldberg and K. Deb, "A comparative analysis of selection schemes
used in genetic algorithms." vol. 1, 1991, pp. 69-93.
[29] G. A. Thorisson, A. V. Smith, L. Krishnan, and L. D. Stein, "The
International HapMap Project Web site," Genome Res, vol. 15, pp.
1592-3, Nov 2005.
[30] J. He, K. Westbrooks, and A. Zelikovsky, "Linear reduction method for
predictive and informative tag SNP selection," Int J Bioinform Res Appl,
vol. 1, pp. 249-60, 2005.