Burrows Wheeler - Alternatives to Move to Front

Authors

  • Peter M. Fenwick
  • Mark R. Titchener
  • Michelle Lorenz
Abstract

The Burrows Wheeler Transform (BWT), first published in 1994, is a relatively new approach to text compression and has already been shown to produce excellent results. However, much research has been directed towards improving the efficiency of the Move to Front (MTF) algorithm, with varying degrees of complexity. This paper examines a relatively simple technique that uses a dual modeling system and achieves very promising results. Because the MTF produces a highly skewed symbol distribution with little contextual information, the data can be used more effectively by splitting the alphabet into two models before the final encoding. One method, suggested by Fenwick, relies on a cache system in which the most probable symbols are stored in a prominent foreground model and the bulk of the remaining symbols are stored in a larger background model. A similar approach has been suggested by Balkenhol, but with the background encoded in the original ASCII alphabet instead of the conventional MTF. Both methods produce superior results to the conventional MTF data stream, with overall averages of 2.345 and 2.361 bits/byte over the Calgary Corpus respectively. In comparison with Wirth’s results, the overall average of Fenwick’s method is approximately equal, if not slightly superior. This is very promising for a relatively simple technique that has much potential for further refinement. Information loss during the Burrows Wheeler Transform is also examined using Deterministic Information Theory measuring techniques. By using the information measurement tools developed by Titchener, a clearer understanding of informational changes over the BWT has been established. Most importantly, the MTF loses information, making it amenable to compression; however, it also contains little contextual structure and closely resembles the data stream after applying the BWT algorithm.
These findings suggest it is viable to focus efforts on developing a better representation of the MTF stream, or even on omitting it entirely. Since this paper was submitted, further work has been conducted using a cache directly on the BWT stream without the MTF recoding, achieving preliminary averages of 2.612 bits/byte over the Calgary Corpus, with much promise for further improvements.
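
As an illustration of why the MTF output is highly skewed, the two stages named in the abstract can be sketched as follows. This is a minimal, unoptimized sketch and not the authors' implementation: the function names `bwt`, `mtf_encode` and `mtf_decode`, the naive rotation sort, and the `"\0"` end-of-string sentinel are all assumptions made for the example. The foreground/background cache models discussed in the abstract are not shown.

```python
def bwt(text, eos="\0"):
    """Naive Burrows Wheeler Transform: sort all rotations, keep the last column.

    `eos` is an assumed sentinel that must not occur in `text`.
    """
    if eos in text:
        raise ValueError("sentinel character occurs in input")
    s = text + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def mtf_encode(data, alphabet):
    """Move to Front: emit each symbol's current rank, then move it to rank 0."""
    table = list(alphabet)
    ranks = []
    for ch in data:
        rank = table.index(ch)
        ranks.append(rank)
        table.insert(0, table.pop(rank))
    return ranks


def mtf_decode(ranks, alphabet):
    """Inverse of mtf_encode, given the same initial alphabet order."""
    table = list(alphabet)
    out = []
    for rank in ranks:
        ch = table.pop(rank)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


if __name__ == "__main__":
    transformed = bwt("banana")
    alphabet = sorted(set(transformed))
    ranks = mtf_encode(transformed, alphabet)
    print(repr(transformed), ranks)
    # Round trip: MTF is lossless given the initial alphabet order.
    assert mtf_decode(ranks, alphabet) == transformed
```

Runs of identical symbols in the BWT output become runs of zeros after MTF, which is the source of the skew toward small ranks on longer, more repetitive inputs.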

Related resources

High-performance BWT-based Encoders

In 1994, Burrows and Wheeler [5] developed a data compression algorithm which performs significantly better than Lempel-Ziv based algorithms. Since then, a lot of work has been done in order to improve their algorithm, which is based on a reversible transformation of the input string, called BWT (the Burrows-Wheeler transformation). In this paper, we propose a compression scheme based on BWT, M...

Second step algorithms in the Burrows-Wheeler compression algorithm

In this paper we fix our attention on the second step algorithms of the Burrows–Wheeler compression algorithm, which in the original version is the Move To Front transform. We discuss many of its replacements presented so far, and compare compression results obtained using them. Then we propose a new algorithm that yields a better compression ratio than the previous ones.

Improvements to the Burrows-Wheeler Compression Algorithm: After BWT Stages

The lossless Burrows-Wheeler Compression Algorithm has received considerable attention over recent years for both its simplicity and effectiveness. It is based on a permutation of the input sequence − the Burrows-Wheeler Transform − which groups symbols with a similar context close together. In the original version, this permutation was followed by a Move-To-Front transformation and a final ent...

Incremental frequency count - a post BWT-stage for the Burrows-Wheeler compression algorithm

The stage after the Burrows-Wheeler Transform (BWT) has a key function inside the Burrows-Wheeler compression algorithm as it transforms the BWT output from a local context into a global context. This paper presents the Incremental Frequency Count stage, a post-BWT stage. The new stage is paired with a run length encoding stage between the BWT and entropy coding stage of the algorithm. It offer...

The Engineering of a Compression Boosting Library: Theory vs Practice in BWT Compression

Data Compression is one of the most challenging arenas both for algorithm design and engineering. This is particularly true for Burrows and Wheeler Compression, a technique that is important in itself and for the design of compressed indexes. There has been considerable debate on how to design and engineer compression algorithms based on the BWT paradigm. In particular, Move-to-Front Encoding is...

Syllable-Based Burrows-Wheeler Transform

The Burrows-Wheeler Transform (BWT) is a compression method which reorders an input string into a form that is better suited to a subsequent compression stage. Usually the Move-To-Front transform followed by Huffman coding is applied to the permuted string. The original method [3] from 1994 was designed for an alphabet compression. In 2001, versions working with word and n-gram alphabets were presented. The newe...

Publication date: 2003