Distributed Splay Suffix Arrays: A New Structure for Distributed String Search
As a structure for processing string problem, suffix
array is certainly widely-known and extensively-studied. But if the
string access pattern follows the “90/10" rule, suffix array can not take
advantage of the fact that we often find something that we have just
found. Although the splay tree is an efficient data structure for small
documents when the access pattern follows the “90/10" rule, it
requires many structures and an excessive amount of pointer
manipulations for efficiently processing and searching large
documents. In this paper, we propose a new and conceptually powerful
data structure, called splay suffix arrays (SSA), for string search. This
data structure combines the features of splay tree and suffix arrays into
a new approach which is suitable to implementation on both
conventional and clustered computers.
[1] I. H. Written, A. Moffat, and T. C. Bell. Managing Gigabytes:
Compressing and Indexing Documents and Images. Van Nostrand
Reinhold, 1994.
[2] W. B. Cavnar. Using an n-gram based document representation with a
vector processing retrieval model. In Proc. Of the 3rd Text Retrieval
Conference (TREC-3), pages 269-277. NIST special publication, 1995.
[3] G. A. Stephen. String Searching Algorithms. World Scientific Publishing,
1994.
[4] M. Crochemore and W. Rytter. Text Algorithms. Oxford Univ. Press,
New York, 1994
[5] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge
Univ. Press, 1997
[6] E. M. McCreighl. A Space-Economical Suffix Tree Construction
Algorithm. Journal of the ACM, 23, pp. 262-272, 1976.
[7] U. Manber and G. Myers Suffix Arrays: A New Method for On-Line
String Searches. SIAM J. Comput. 22(5), pp 935-948, 1993.
[8] G. H. Gonnet, R. A. Baeza-Yates, et al. New indices for text: PAT trees
and PAT arrays. In W. B. Frakes and R. A. Baeza-Yates, editors,
Information Retrieval: Data Structure and Algorithms, pages 66-82.
Prentice-Hall, New Jersey, 1992.
[9] M. Nagao and S. Mori. A new method of n-gram statistics for large
number of n and automatic extraction of words and phrase form large text
data of Japanese. In Proc of COLING-94, pages 611-615,1994.
[10] A. Apostolico. The myriad virtues of subword trees. In Combinatorial
Algorithms on Words, pages 85-96. Springer-Verlag, 1985.
[11] A. Fellah. Concurrent and Distributed Data Structures for Multikeys
Sorting on Computer Clusters. IEEE Proc. Of the 16th Intern. Symp. On
High Performance Comput. Systems and Applications. Pp. 281-,
Moncton, Canada, 2002.
[12] A. Fellah, A. Maamir and M. Abaza. Distributed Data Structures for
Multikey Sorting. Intern. J. of Parallel and Distributed Systems and
Networks, Vol. 2(2), pp. 62-68, 1999.
[13] Fellah, A. and Mawson, R. Distributed multidimensional suffix arrays for
string search. Communications, Computers and signal Processing, 2003.
PACRIM. 2003 IEEE Pacific Rim Conference on , Volume: 2, pp.
792-795, 2003.
[14] K. Sadakane. A fast algorithm for making suffix arrays and for
burrows-wheeler transformation. In Proc. DCC-98, pages 129-138, 1998.
[15] D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer
Programming. Addison-Wesley, 1973.
[16] J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching
strings. In the 8th Annual ACM-SIAM Sympo. On Descrete Algorithms,
pages 360-369, 1997.
[17] J. Kitajima, G. Navarro, B. Ribeiro, and N. Ziviani. Distributed generation
of suffix arrays: a quicksort-based approach. In Proc. WSP-97, pages
53-69. Carleton University Press, 1997.
[18] J. Kitajima, B. Ribeiro, and N. Ziviani. Network and memory analysis in
distributed parallel generation of PAT arrays. In Proc. 8th Brazilian Symp.
On Comp. Arch. - High-Performance Processing, pages 193-202.
Brazilian Comp. Soc., 1996.
[19] G. Navarro, J. Kitajima, B. Ribeiro, and N. Ziviani. Distributed generation
of suffix arrays. In Proc. CPM-97, LNCS 1264, pages 102-115, 1997.
[20] Kitajima, J.P., Navarro, G. A fast distributed suffix array generation
algorithm. String Processing and
[21] Information Retrieval Symposium,1999 and International Workshop on
Groupware, Sept,1999,22-24 Pages:97-104
[22] D. D. Sleator and R. E. Tarjan. Self-adjusting Binary Search Trees.
Journal of the ACM 32, pp. 652-686, 1985.
[1] I. H. Written, A. Moffat, and T. C. Bell. Managing Gigabytes:
Compressing and Indexing Documents and Images. Van Nostrand
Reinhold, 1994.
[2] W. B. Cavnar. Using an n-gram based document representation with a
vector processing retrieval model. In Proc. Of the 3rd Text Retrieval
Conference (TREC-3), pages 269-277. NIST special publication, 1995.
[3] G. A. Stephen. String Searching Algorithms. World Scientific Publishing,
1994.
[4] M. Crochemore and W. Rytter. Text Algorithms. Oxford Univ. Press,
New York, 1994
[5] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge
Univ. Press, 1997
[6] E. M. McCreighl. A Space-Economical Suffix Tree Construction
Algorithm. Journal of the ACM, 23, pp. 262-272, 1976.
[7] U. Manber and G. Myers Suffix Arrays: A New Method for On-Line
String Searches. SIAM J. Comput. 22(5), pp 935-948, 1993.
[8] G. H. Gonnet, R. A. Baeza-Yates, et al. New indices for text: PAT trees
and PAT arrays. In W. B. Frakes and R. A. Baeza-Yates, editors,
Information Retrieval: Data Structure and Algorithms, pages 66-82.
Prentice-Hall, New Jersey, 1992.
[9] M. Nagao and S. Mori. A new method of n-gram statistics for large
number of n and automatic extraction of words and phrase form large text
data of Japanese. In Proc of COLING-94, pages 611-615,1994.
[10] A. Apostolico. The myriad virtues of subword trees. In Combinatorial
Algorithms on Words, pages 85-96. Springer-Verlag, 1985.
[11] A. Fellah. Concurrent and Distributed Data Structures for Multikeys
Sorting on Computer Clusters. IEEE Proc. Of the 16th Intern. Symp. On
High Performance Comput. Systems and Applications. Pp. 281-,
Moncton, Canada, 2002.
[12] A. Fellah, A. Maamir and M. Abaza. Distributed Data Structures for
Multikey Sorting. Intern. J. of Parallel and Distributed Systems and
Networks, Vol. 2(2), pp. 62-68, 1999.
[13] Fellah, A. and Mawson, R. Distributed multidimensional suffix arrays for
string search. Communications, Computers and signal Processing, 2003.
PACRIM. 2003 IEEE Pacific Rim Conference on , Volume: 2, pp.
792-795, 2003.
[14] K. Sadakane. A fast algorithm for making suffix arrays and for
burrows-wheeler transformation. In Proc. DCC-98, pages 129-138, 1998.
[15] D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer
Programming. Addison-Wesley, 1973.
[16] J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching
strings. In the 8th Annual ACM-SIAM Sympo. On Descrete Algorithms,
pages 360-369, 1997.
[17] J. Kitajima, G. Navarro, B. Ribeiro, and N. Ziviani. Distributed generation
of suffix arrays: a quicksort-based approach. In Proc. WSP-97, pages
53-69. Carleton University Press, 1997.
[18] J. Kitajima, B. Ribeiro, and N. Ziviani. Network and memory analysis in
distributed parallel generation of PAT arrays. In Proc. 8th Brazilian Symp.
On Comp. Arch. - High-Performance Processing, pages 193-202.
Brazilian Comp. Soc., 1996.
[19] G. Navarro, J. Kitajima, B. Ribeiro, and N. Ziviani. Distributed generation
of suffix arrays. In Proc. CPM-97, LNCS 1264, pages 102-115, 1997.
[20] Kitajima, J.P., Navarro, G. A fast distributed suffix array generation
algorithm. String Processing and
[21] Information Retrieval Symposium,1999 and International Workshop on
Groupware, Sept,1999,22-24 Pages:97-104
[22] D. D. Sleator and R. E. Tarjan. Self-adjusting Binary Search Trees.
Journal of the ACM 32, pp. 652-686, 1985.
@article{"International Journal of Information, Control and Computer Sciences:58093", author = "Tu Kun and Gu Nai-jie and Bi Kun and Liu Gang and Dong Wan-li", title = "Distributed Splay Suffix Arrays: A New Structure for Distributed String Search", abstract = "As a structure for processing string problem, suffix
array is certainly widely-known and extensively-studied. But if the
string access pattern follows the “90/10" rule, suffix array can not take
advantage of the fact that we often find something that we have just
found. Although the splay tree is an efficient data structure for small
documents when the access pattern follows the “90/10" rule, it
requires many structures and an excessive amount of pointer
manipulations for efficiently processing and searching large
documents. In this paper, we propose a new and conceptually powerful
data structure, called splay suffix arrays (SSA), for string search. This
data structure combines the features of splay tree and suffix arrays into
a new approach which is suitable to implementation on both
conventional and clustered computers.", keywords = "suffix arrays, splay tree, string search, distributedalgorithm", volume = "1", number = "10", pages = "3158-5", }