An Improved Fast Search Method Using Histogram Features for DNA Sequence Database

In this paper, we propose an efficient hierarchical DNA sequence search method to improve the search speed while the accuracy is being kept constant. For a given query DNA sequence, firstly, a fast local search method using histogram features is used as a filtering mechanism before scanning the sequences in the database. An overlapping processing is newly added to improve the robustness of the algorithm. A large number of DNA sequences with low similarity will be excluded for latter searching. The Smith-Waterman algorithm is then applied to each remainder sequences. Experimental results using GenBank sequence data show the proposed method combining histogram information and Smith-Waterman algorithm is more efficient for DNA sequence search.




References:
[1] J. C. Venter, M. D. Adams, E. W. Myers, etc., "The sequence of the
human genome", Science, vol. 291, no. 5507, pp. 1304 -1351, 2001.
[2] F. S. Collins, M. Morgan, A. Patrinos, "The human genome project:
lessons from Large-Scale Biology", Science, vol. 300, no. 5617, pp.
286-290, 2003.
[3] S.B.Needlman and C.D.Wunsch, "A general method applicable to the
search for similarities in the amino acid sequence of two proteins",
Journal of Molecular Biology, vol. 48, pp. 443- 453, 1970.
[4] T. F. Smith and M. S.Waterman, "Identification of common molecular
subsequences", Journal of Molecular Biology, vol. 47, pp. 195- 197,
1981.
[5] S. F. Altscgul, W.Gish, W. Miller, E. W. Myers and D. J. Lipman, "Basic
local alignment search tool", Journal of Molecular Biology, vol. 215, pp.
403- 410, 1990.
[6] D. Lipman and W. R. Pearson, "Rapid and sensitive protein similarity
searches", Science, vol. 227, pp. 1435-1441, 1985.
[7] GenBank, ftp://ftp.ncbi.nih.gov/genbank/
[8] http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html.
[9] M. Li, and B. Ma, "PatternHunter II: highly sensitive and fast homology
search", Genome Informatics, vol. 14, pp. 164-175, 2003.
[10] B. Ma, J. Tromp, and M. Li, "PatternHunter: faster and more sensitive
homology search", Bioinformatics, vol. 18, no. 3, pp. 440- 445, 2002.
[11] Q. Chen, K. Kotani, F. Lee, and T. Ohmi, "A Fast Retrieval of DNA
Sequences Using Histogram Information", 2009 Int-l Conf. on Future
Information Technology and Management Engineering (FITME 2009),
pp. 529-532, Sanya, China, Dec., 2009.