Boosting Text Compression with Word-Based Statistical Encoding

Authors

  • Antonio Fariña
  • Gonzalo Navarro
  • José R. Paramá
Abstract

Semistatic word-based byte-oriented compressors are known to be attractive alternatives for compressing natural language texts. With compression ratios around 30-35%, they allow fast direct searching of the compressed text. In this article we reveal that these compressors have even more benefits. We show that most state-of-the-art compressors benefit from compressing not the original text, but the compressed representation obtained by a word-based byte-oriented statistical compressor. For example, p7zip with dense-coding preprocessing achieves better compression ratios and much faster compression than p7zip alone. We reach compression ratios below 17% on typical large English texts, a level previously obtained only by the slow PPM compressors. Furthermore, searches perform much faster if the final compressor operates over word-based compressed text. We show that typical self-indexes also profit from our preprocessing step: they achieve much better space and time performance when indexing is preceded by a compression step. Apart from using the well-known Tagged Huffman code, we present a new suffix-free Dense-Code-based compressor that compresses slightly better. We also show how some self-indexes can handle non-suffix-free codes. As a result, the compressed/indexed text requires around 35% of the space of the original text and allows indexed searches for both words and phrases.
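
The two-stage idea in the abstract (word-based byte-oriented coding first, a general-purpose compressor second) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes End-Tagged Dense Code as the dense code, a simplified tokenizer that alternates runs of word and non-word characters, and Python's standard `lzma` module as a stand-in for p7zip. The function names are hypothetical, and the vocabulary, which a real compressor must also store, is simply returned alongside the payload.

```python
import lzma
import re
from collections import Counter


def tokenize(text):
    # Simplifying assumption: split into alternating runs of word and
    # non-word characters, so joining the tokens reproduces the text exactly.
    return re.findall(r"\w+|\W+", text)


def etdc_codeword(rank):
    # End-Tagged Dense Code: the last byte of a codeword is in 128..255,
    # every earlier byte is in 0..127. Rank 0 is the most frequent token.
    out = [128 + rank % 128]
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)
        rank //= 128
    return bytes(reversed(out))


def dense_encode(text):
    tokens = tokenize(text)
    # Semistatic model: a first pass counts tokens, then the most frequent
    # tokens receive the shortest codewords.
    vocab = [tok for tok, _ in Counter(tokens).most_common()]
    code = {tok: etdc_codeword(rank) for rank, tok in enumerate(vocab)}
    return vocab, b"".join(code[tok] for tok in tokens)


def dense_decode(vocab, payload):
    out, acc = [], 0
    for b in payload:
        if b < 128:
            # Continuation byte: accumulate the high-order part of the rank.
            acc = acc * 128 + b + 1
        else:
            # Tagged byte: the codeword ends here.
            out.append(vocab[acc * 128 + (b - 128)])
            acc = 0
    return "".join(out)


def two_stage_compress(text):
    # Second stage: a general-purpose compressor over the byte codewords.
    vocab, payload = dense_encode(text)
    return vocab, lzma.compress(payload)


def two_stage_decompress(vocab, blob):
    return dense_decode(vocab, lzma.decompress(blob))
```

Comparing `len(lzma.compress(text.encode("utf-8")))` with the size of the blob returned by `two_stage_compress(text)`, plus the cost of storing the vocabulary, gives a rough feel for the effect on a given corpus. The sketch makes no claim about reproducing the ratios reported in the abstract.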

Similar resources

Morphological Analysis and Diacritical Arabic Text Compression

Morphological analysis of Arabic words allows decreasing the storage requirements of Arabic dictionaries, more efficient encoding of diacritical Arabic text, faster spelling and efficient optical character recognition. All these factors allow efficient storage and archival of multilingual digital libraries that include Arabic texts. This paper presents a lossless compression algorithm based...

Compact In-Memory Models for Compression of Large Text Databases

For compression of text databases, semi-static word-based models are a pragmatic choice. They provide good compression with a model of moderate size, and allow independent decompression of stored documents. Previous experiments have shown that, where there is not sufficient memory to store a full word-based model, encoding rare words as sequences of characters can still allow good compression, w...

A Dictionary-Based Multi-Corpora Text Compression System

In this paper we introduce StarZip, a multi-corpora text compression system, together with its transform engine StarNT. StarNT achieves a better compression ratio than almost all other recent efforts based on BWT and PPM. StarNT is a fast, dictionary-based lossless text transform. The main idea is to recode each English word with a representation of no more than three symbols. This transfo...

Word-Based Text Compression

Today there are many universal compression algorithms, but in most cases it is better to use a specific algorithm for specific data: JPEG for images, MPEG for movies, etc. For textual documents there are special methods based on the PPM algorithm or methods with non-character access, e.g. word-based compression. In the past, several papers describing variants of word-based compression using Huffman encodin...

LIPT: A Reversible Lossless Text Transform to Improve Compression Performance

Lossless compression researchers have developed highly sophisticated approaches, such as Huffman encoding, arithmetic encoding, the Lempel-Ziv family, Dynamic Markov Compression (DMC), Prediction by Partial Matching (PPM), and Burrows-Wheeler Transform (BWT) based algorithms. We propose an alternative approach in this paper to develop a reversible transformation that can be applied to a source ...

Journal:
  • Comput. J.

Volume 55  Issue 

Pages  -

Publication date 2012