Abstract: A practical and simple self-indexing data structure, Partitioned Elias-Fano (PEF) - Compressed Suffix Arrays (CSA), is built in linear time for the CSA based on PEF indexes. Moreover, the PEF-CSA is compared with two classical compressed indexing methods, Ferragina and Manzini implementation (FMI) and Sad-CSA on different type and size files in Pizza & Chili. The PEF-CSA performs better on the existing data in terms of the compression ratio, count, and locates time except for the evenly distributed data such as proteins data. The observations of the experiments are that the distribution of the φ is more important than the alphabet size on the compression ratio. Unevenly distributed data φ makes better compression effect, and the larger the size of the hit counts, the longer the count and locate time.
Abstract: As a structure for processing string problem, suffix
array is certainly widely-known and extensively-studied. But if the
string access pattern follows the “90/10" rule, suffix array can not take
advantage of the fact that we often find something that we have just
found. Although the splay tree is an efficient data structure for small
documents when the access pattern follows the “90/10" rule, it
requires many structures and an excessive amount of pointer
manipulations for efficiently processing and searching large
documents. In this paper, we propose a new and conceptually powerful
data structure, called splay suffix arrays (SSA), for string search. This
data structure combines the features of splay tree and suffix arrays into
a new approach which is suitable to implementation on both
conventional and clustered computers.
Abstract: Approximate tandem repeats in a genomic sequence are
two or more contiguous, similar copies of a pattern of nucleotides.
They are used in DNA mapping, studying molecular evolution
mechanisms, forensic analysis and research in diagnosis of inherited
diseases. All their functions are still investigated and not well
defined, but increasing biological databases together with tools for
identification of these repeats may lead to discovery of their specific
role or correlation with particular features. This paper presents a new
approach for finding approximate tandem repeats in a given sequence,
where the similarity between consecutive repeats is measured using
the Hamming distance. It is an enhancement of a method for finding
exact tandem repeats in DNA sequences based on the Burrows-
Wheeler transform.