Compression of Semistructured Documents

EGOTHOR is a search engine that indexes the Web and allows us to search the Web documents. Its hit list contains URL and title of the hits, and also some snippet which tries to shortly show a match. The snippet can be almost always assembled by an algorithm that has a full knowledge of the original document (mostly HTML page). It implies that the search engine is required to store the full text of the documents as a part of the index. Such a requirement leads us to pick up an appropriate compression algorithm which would reduce the space demand. One of the solutions could be to use common compression methods, for instance gzip or bzip2, but it might be preferable if we develop a new method which would take advantage of the document structure, or rather, the textual character of the documents. There already exist a special compression text algorithms and methods for a compression of XML documents. The aim of this paper is an integration of the two approaches to achieve an optimal level of the compression ratio




References:
[1] Adiego, J., Feunte, P.: On the Use of Words as Source Alphabet Symbols
in PPM. Data Compression Conference, IEEE CS Press, Los Alamitos,
CA, USA (2006) 435.
[2] Arnavut, Z.: Move-to-front and inversion coding. Data Compression
Conference, IEEE CS Press, Los Alamitos, CA, USA (2000) 193-202.
[3] Burrows, M., Wheeler, D. J.: A Block Sorting Loseless Data Compression
Algorithm. Technical report, Digital Equipment Corporation, Palo Alto,
CA, U.S.A (2003).
[4] Dvorsky, J., Pokorny, J., Snasel, V.: Word-based Compression Methods
for Large Text Documents. Data Compression Conference, IEEE CS
Press, Los Alamitos, CA, USA (1999) 523.
[5] Gailly, J. L.: Gzip program and documentation (1993). Source code
aviable from ftp://prep.ai.mit.edu/pub/gnu/gzip-*.tar
[6] Galambos, L.: Dynamization in IR Systems. Mieczyslaw A. Klopotek,
Slawomir T. Wierzchon, Krzysztof Trojanowski (Eds.): IIPWM, Proc. of
the Int. IIS: IIPWM-04, Poland, 2004. ASC Springer 2004, ISBN 3-540-
21331-7.
[7] Galambos, L.: EGOTHOR. http://www.egothor.org/
[8] Cheney, J.: Compressing XML with Multiplexed Hierarchical PPM Models.
Data Compression Conference, IEEE CS Press, Los Alamitos, CA,
USA (2001) 163.
[9] Chernik, K., Lansky, J., Galambos, L.: Syllable-based compression for
XML documents. In: Snasel, V., Richta, K., and Pokorny, J.: Proceedings
of the Dateso 2006 Annual International Workshop on DAtabases, TExts,
Specifications and Objects. CEUR-WS, Vol. 176, (2006) 21-31
[10] Isal, R.Y.K., Moffat, A.: Word-based Block-sorting Text Compression.
Proc. 24th Australasian Computer Science Conference, Gold Coast,
Australia, (2001) 92-99
[11] Lansky, J., Zemlicka, M.: Compression of a Dictionary. In: Snasel, V.,
Richta, K., and Pokorny, J.: Proceedings of the Dateso 2006 Annual
International Workshop on DAtabases, TExts, Specifications and Objects.
CEUR-WS, Vol. 176, (2006) 11-20
[12] Lansky, J., Zemlicka, M.: Compression of Small Text Files Using
Syllables. Data Compression Conference, IEEE CS Press, Los Alamitos,
CA, USA (2006) 458.
[13] Lansky, J., Zemlicka, M.: Text Compression: Syllables. In: Richta,
K., Snasel, V., Pokorny, J.: Proceedings of the Dateso 2005 Annual
International Workshop on DAtabases, TExts, Specifications and Objects.
CEUR-WS, Vol. 129, (2005) 32-45
[14] Liefke, H., Suciu, D.: XMill: an Efficient Compressor for XML Data.
In Proc. ACM SIGMOD Conference (2000) 153-164
[15] Megginson, D.: SAX: A Simple API for XML. http://www.
saxproject.org
[16] Min, J. K., Park, M. J., Chung, C. W.: XPRESS: A Queriable Compression
for XML Data. SIGMOD 2003, San Diego, CA, USA (2003)
122-133
[17] Moffat, A., Neal, R. M., Witten, I. H.: Arithmetic Coding Revisited.
ACM Transactions on Information Systems, 16, (1998) 256-294
[18] O-Neill, E. T., Lavoie, B. F., Bennett, R.: Trends in the Evolution of the
Public Web: 1998-2002. D-Lib Magazine (2003) 1082-9873
[19] Seward, J.: On the Performance of BWT Sorting Algorithms. DCC,
IEEE CS Press, Los Alamitos, CA, USA (2000) 173.
[20] Seward, J.: The bzip2 and libbzip2 official home page. http://
sources.redhat.com/bzip2/
[21] Tolani, P., Haritsa, J. R.: XGrind: A Query-friendly XML Compressor.
In Proc. IEEE International Conference on Data Engineering (2002).
[22] Turpin, A., Moffat, A.: Housekeeping for prefix coding. IEEE Transaction
on Communications, 48(4), (2000) 622-628.
[23] Welch, T. A.: A technique for high performance data compression. IEEE
Computer, 17(6) (1984) 8-19.
[24] Witten, I., Moffat, A., Bell, T.: Managing Gigabytes: Compressing and
Indexing Documents and Images. Van Nostrand Reinhold (1994).
[25] World Wide Web Consorcium: Extensive Markup Language (XML).
http://www.w3.org/XML/
[26] World Wide Web Consorcium: HyperText Markup Language (HTML).
http://www.w3.org/MarkUp/