Approximation Algorithm for the Shortest Approximate Common Superstring Problem

The Shortest Approximate Common Superstring (SACS) problem is : Given a set of strings f={w1, w2, ... , wn}, where no wi is an approximate substring of wj, i ≠ j, find a shortest string Sa, such that, every string of f is an approximate substring of Sa. When the number of the strings n>2, the SACS problem becomes NP-complete. In this paper, we present a greedy approximation SACS algorithm. Our algorithm is a 1/2-approximation for the SACS problem. It is of complexity O(n2*(l2+log(n))) in computing time, where n is the number of the strings and l is the length of a string. Our SACS algorithm is based on computation of the Length of the Approximate Longest Overlap (LALO).




References:
[1] D. Maier, J. A. Storer, A note on complexity of the superstring problem,
TR-233, Princeton University, Dept. EECS : (1977).
[2] D. Maier, The complexity of some problems on subsequences and
supersequences, J. of ACM, Vol. 25, N┬░2 : (April 1978), p322-336.
[3] M. R. Garey, D. S. Johnson, Computers and intractability, Freeman
(Ed.) : (1979).
[4] J. K. Gallant, D. Maier, J. Storer, On finding minimal length
superstrings, J. Computer and System Sciences, Vol. 20, N┬░1 : (1980),
p50-58.
[5] D. S. Hirschberg, Recent results on the complexity of commonsubsequence
problems, Time Warps, String Edits and Macromolecules
The Theory and Practice of Sequence Comparison, Sankoff and Kruskal
(Eds.), Addison-Wesley Pub. Inc. : (1983), p325-330.
[6] H. Peltola, H. Söderlund, E. Ukkonen, SEQAID : A DNA sequence
assembling program based on a mathematical model, Nucleic Acids
Research, Vol. 12, N┬░1 : (1984), p307-321.
[7] S. Dear, R. Staden, A sequence assembly and editing program for
efficient management of large projects, Nucleic Acids Research, Vol. 19
: (1991), p3907-3911.
[8] X. Huang, A contig assembly program based on sensitive detection of
string overlaps, Genomics, N┬░14 : (1992), p18-25.
[9] J. Kececioglu, E. Myers, Combinatorial algorithms for DNA sequence
assembly, Algorithmica, Vol. 13, N┬░12, Springer International Eds. :
(1995), p7-51.
[10] G. G. Sutton, O. White, M. D. Adams, A. R. Kerlavage, TIGR
Assembler: A new tool for assembing large shotgun sequencing projects,
Genome Science & Technology, Vol. 1, N┬░1, Mary Ann Liebert, Inc. :
(1995), p9-19.
[11] M. Elloumi, DNA Sequence Assembly Algorithms Based on Clustering
Approaches, The 2000 International Conference on Mathematics and
Engineering Techniques in Medicine and Biological Sciences,
METMBS'2000 (Las Vegas, Nevada, USA) : (June 2000).
[12] J. Cohen, Bioinformatics-An Introduction For Computer Scientists,
ACM Computing Surveys, Vol 36, No 2, June 2004, p122-158.
[13] S. Rahmann, The Shortest Common Supersequence Problem In a
Microarray Production Setting, Bioinformatics, Vol 19, (2003), p ii156-
ii161.
[14] E. Ukkonen, A Linear time algorithm for finding approximate shortest
common superstrings, Algorithmica, N┬░5: (1990), p313-323.
[15] J. Kececioglu, Exact and approximation algorithms for DNA sequence
reconstruction, Ph.D. dissertation, Technical Report, N┬░91-26, Dept. of
Computer Science, The University of Arisona, Tucson, AZ 85721 :
(1991).
[16] J. Tarhio, E. Ukkonen, A greedy approximation algorithm for
constructing shortest common superstrings, Theo. Comput. Sci., N┬░57:
(1988), p131-145.
[17] S. Teng, F. Yao, Approximation shortest superstrings, 34th IEEE
Symposium on Foundation of Computer Science : (1993).
[18] T. A. Jenkyns, The greedy travelling salesman's problem, Networks N┬░9
: (1979), p363-373.
[19] J. K. Gallant, The complexity of the overlap method for sequencing
biopolymers, J. Theor. Biol., N┬░101 : (1982), p1-17.
[20] J. Van Leeuwen, Graph algorithms, Handbook of Theoretical Computer
Science, Vol. A : Algorithms and Complexity, J. Van Leeuwen (Ed.),
Elsevier Science Pub. B. V. : (1990), p527-631.
[21] R. E. Bellman, Dynamic Programming, Princeton University Press, New
Jersey : (1957)
[22] R. E. Bellman, S. E. Dreyfus, Applied Dynamic Programming, Princeton
University Press, New Jersey : (1962)
[23] R. A. Wagner, M. J. Fischer, The string-to-string correction problem, J.
of ACM, Vol 21 No 1 : (1974), p168-173.
[24] P. H. Sellers, The theory and computation of Evolutionary distances :
Pattern recognition, J. Algorithms, No 1 : (1980) p359-373.
[25] M. Elloumi, An algorithm for the approximate string-matching problem,
Atlantic Symposium on Computational Biology, Genome Information
Systems & Technology, CBGI'2001, (Durham, North Carolina,
U.S.A.) : (March 2001).
[26] J. S. Vitter, Ph. Flajolet, Average-case analysis of algorithms and data
structures, Handbook of Theoretical Computer Science, Vol. A :
Algorithms and Complexity, J. Van Leeuwen (Ed.), Elsevier Science
Pub. B. V. : (1990), p431-524.