Fast Database Indexing for Large Protein Sequence Collections Using Parallel N-Gram Transformation Algorithm

With the rapid development in the field of life sciences and the flooding of genomic information, the need for faster and scalable searching methods has become urgent. One of the approaches that were investigated is indexing. The indexing methods have been categorized into three categories which are the lengthbased index algorithms, transformation-based algorithms and mixed techniques-based algorithms. In this research, we focused on the transformation based methods. We embedded the N-gram method into the transformation-based method to build an inverted index table. We then applied the parallel methods to speed up the index building time and to reduce the overall retrieval time when querying the genomic database. Our experiments show that the use of N-Gram transformation algorithm is an economical solution; it saves time and space too. The result shows that the size of the index is smaller than the size of the dataset when the size of N-Gram is 5 and 6. The parallel N-Gram transformation algorithm-s results indicate that the uses of parallel programming with large dataset are promising which can be improved further.




References:
[1] R. Bader, "OpenMP - Parallel programming on shared memory systems
": Leibniz-rechenzentrum. Retrieved September 14,2008 from:
http://www.lrz-muenchen.de/services/software/parallel/openmp/, 2008.
[2] B. Barney, "Introduction to Parallel Computing." vol. 2008: Livermore
Computing,National Laboratory. Retrieved July 4, 2008 from:
https://computing.llnl.gov/tutorials/parallel_comp, 2007.
[3] O. Beng Chin, P. Hwee Hwa, W. Hao, W. Limsoon, and Y. Cui, "Fast
filter-and-refine algorithms for subsequence selection," in Database
Engineering and Applications Symposium, 2002. Proceedings.
International, 2002, pp. 243-254.
[4] A. Califano and I. Rigoutsos, "FLASH: a fast look-up algorithm for
string homology," in Computer Vision and Pattern Recognition, 1993.
Proceedings CVPR '93., 1993 IEEE Computer Society Conference on,
New York, NY, USA, 1993, pp. 353-359.
[5] G. Cooper, M. Raymer, T. Doom, D. Krane, and N. Futamura, "Indexing
genomic databases," in Bioinformatics and Bioengineering, 2004. BIBE
2004. Proceedings. Fourth IEEE Symposium on, 2004, pp. 587-591.
[6] Q. Cory, "Introduction to programming shared-memory and distributedmemory
parallel computers," Crossroads, vol. 8, pp. 16-22, 2002.
[7] H. Ela, P. A. Malcolm, and W. I. Robert, "A Database Index to Large
Biological Sequences," in Proceedings of the 27th International
Conference on Very Large Data Bases: Morgan Kaufmann Publishers
Inc., 2001.
[8] H. Ela, P. A. Malcolm, and W. I. Robert, "Database indexing for large
DNA and protein sequence collections," The VLDB Journal, vol. 11, pp.
256-271, 2002.
[9] Z. Elberrichi and B. Aljohar, "N-grams in Texts Categorization,"
Scientific Journal of King Faisal University (Basic and Applied
Sciences), vol. Vol.8 No.2, pp. 25-38, 2007.
[10] C. Fondrat and P. Dessen, "A rapid access motif database (RAMdb) with
a search algorithm for the retrieval patterns in nucleic acids or protein
databanks," Comput. Appl. Biosci., vol. 11, pp. 273-279, June 1, 1995
1995.
[11] I. Foster, Designing and Building Parallel Programs. Boston, MA, USA:
Addison-Wesley Longman Publishing Co., Inc, 1995.
[12] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel
Computing, Second Edition ed.: Addison Wesley, 2003.
[13] U. Hobohm and C. Sander, "A Sequence Property Approach to
Searching Protein Databases," Journal of Molecular Biology, vol. 251,
pp. 390-399, 1995.
[14] L. Hsiao Ping, T. Yin Te, S. Ching Hua, S. Tzu Fang, and T. Chuan Yi,
"An IDC-based algorithm for efficient homology filtration with
guaranteed seriate coverage," in Bioinformatics and Bioengineering,
2004. BIBE 2004. Proceedings. Fourth IEEE Symposium on, 2004, pp.
395-402.
[15] M. Huerta, F. Haseltine, and Y. Liu. vol. 2008: National Institute of
Mental Health (NIH) , Biomedical information science and technology
initiative (BISTI), Retrieved February 6, 2008, from:
http://www.bisti.nih.gov/CompuBioDef.pdf., 2000.
[16] M. N. Hwang and J. Kim, "Protein Sequence Search based on N-gram
Indexing," Bioinformatics and Biosystems, vol. Vol. 1, pp. 53-57, 2006.
[17] X. Jiang, P. Zhang, X. Liu, and S. S. T. Yau, "Survey on index based
homology search algorithms," Springer Science and Business /media, pp.
40:185-212, 23 March 2007 2007.
[18] T. Kahveci and A. K. Singh, "An Efficient Index Structure for String
Databases," in Proceedings of the37th VLDB conference, Roma, Italy,
2001, pp. 351--360.
[19] T. Kahveci and A. K. Singh, "MAP: searching large genome databases,"
in Pacific Symposium on Biocomputing, Hawaii, 2003, pp. 303-314.
[20] W. J. Kent, "BLAT---The BLAST-Like Alignment Tool," Genome Res.,
pp. GR-2292R, March 20, 2002.
[21] M. S. Kim, K. Y. Whang, J. G. Lee, and M. J. Lee, "n-Gram/2L: A
Space and Time Efficient Two-Level n-Gram Inverted Index Structure,"
in VLDB, Trondheim, Norway, 2005, pp. 325-336.
[22] C. D. Manning, P. Raghavan, and H. Schutze, Introduction to
Information Retrieval: Cambridge University Press, 2008.
[23] S. Microsystems, "Multithreaded programming guide," Sun
Microsystems, Inc Business. Retrieved September 14,2008 from:
http://docsun.cites.uiuc.edu/sun_docs/C/solaris_9/SUNWdev/MTP/toc.h
tml 2002.
[24] Z. B. Miled, N. Li, M. Mahoui, and O. Bukhres, "Information Retrieval
in Biomedical Research," Wiley encyclopedia of biomedical
engineering, 2006.
[25] Z. Ning, A. J. Cox, and J. C. Mullikin, "SSAHA: A Fast Search Method
for Large DNA Databases," Genome Res., vol. 11, pp. 1725-1729,
October 1, 2001 2001.
[26] T. H. Ong, K. L. Tan, and H. Wang, "Indexing Genomic Databases for
Fast Homology Searching," in Proceedings of the 13th International
Conference on Database and Expert Systems Applications, 2002.
[27] B. Parhami, Introduction to Parallel Processing Algorithms and
Architectures New York: Kluwer Academic Publishers, 2002.
[28] C. Weimin and K. Aberer, "Efficient querying on genomic databases by
using metric space indexing techniques," in Proceedings of the 8th
International Workshop on Database and Expert Systems Applications:
IEEE Computer Society, 1997.
[29] H. Williams and J. Zobel, "Indexing nucleotide databases for fast query
evaluation," in Advances in Database Technology ÔÇö EDBT '96, 1996,
pp. 275-288.
[30] H. E. WILLIAMS, "Effective query filtering for fast homology
searching," Pacific Symposium on Biocomputing, pp. 214 - 225 1999.
[31] H. E. Williams and J. Zobel, "Indexing and retrieval for genomic
databases," Knowledge and Data Engineering, IEEE Transactions on,
vol. 14, pp. 63-78, 2002.
[32] C. Xia, L. Shuai Cheng, O. Beng Chin, and K. H. T. Anthony, "Piers: an
efficient model for similarity search in DNA sequence databases,"
SIGMOD Rec., vol. 33, pp. 39-44, 2004.
[33] L. Yip Chi and B. Kao, "A study on n-gram indexing of musical
features," in International Conference on Multimedia and Expo, 2000
IEEE New York, NY, USA, 2000, pp. 869-872 vol.2.