Burrows Wheeler - Alternatives to Move to Front
Authors
Abstract
The Burrows Wheeler Transform (BWT), first published in 1994, is a relatively new approach to text compression that has already proven to produce excellent results. Much research has since been directed towards improving the efficiency of the Move to Front (MTF) algorithm, with varying degrees of complexity. This paper examines a relatively simple technique that uses a dual modeling system and achieves very promising results. Because the MTF produces a highly skewed symbol distribution with little contextual information, the data can be used more effectively by splitting the alphabet into two models before the final encoding. One method, suggested by Fenwick, relies on a cache system in which the most probable symbols are stored in a prominent foreground model and the bulk of the remaining symbols are stored in a larger background model. A similar approach has been suggested by Balkenhol, in which the background is encoded in the original ASCII alphabet instead of the conventional MTF. Both methods produce results superior to the conventional MTF data stream, with overall averages of 2.345 and 2.361 bits/byte respectively over the Calgary Corpus. Compared with Wirth’s results, the overall average of Fenwick’s method is approximately equal, if not slightly superior. This is very promising for a relatively simple technique that has much potential for further refinement. Information loss during the Burrows Wheeler Transform is also examined using Deterministic Information Theory measuring techniques. Using the information measurement tools developed by Titchener, a clearer understanding of informational changes over the BWT has been established. Most importantly, MTF loses information, making it amenable to compression; however, it also contains little contextual structure and closely resembles the data stream after applying the BWT algorithm.
These findings suggest it is viable to focus efforts on developing a better representation of the MTF stream, or even on omitting it entirely. Since this paper was submitted, further work has been conducted using a cache directly on the BWT stream without the MTF recoding, achieving preliminary averages of 2.612 bits/byte over the Calgary Corpus, with much promise for further improvements.
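For readers unfamiliar with the two stages discussed above, the following is a minimal sketch of a BWT followed by MTF recoding. It is not the implementation used in the paper: the function names `bwt` and `move_to_front`, the `\x00` sentinel, and the naive rotation sort are all assumptions chosen for brevity, and the foreground/background cache models are not shown.

```python
def bwt(text, sentinel="\x00"):
    """Naive Burrows Wheeler Transform (illustrative only).

    Appends a sentinel assumed not to occur in `text`, sorts every
    rotation of the result, and returns the last column.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def move_to_front(data, alphabet):
    """Recode each symbol as its current position in a recency list."""
    table = list(alphabet)
    ranks = []
    for symbol in data:
        index = table.index(symbol)
        ranks.append(index)
        # Move the symbol just seen to the front of the list.
        table.insert(0, table.pop(index))
    return ranks


if __name__ == "__main__":
    transformed = bwt("banana")
    ranks = move_to_front(transformed, sorted(set(transformed)))
    print(repr(transformed), ranks)
```

Runs of identical symbols in the BWT output become runs of zeros after MTF, which is the source of the skewed symbol distribution that the dual-model approach tries to exploit.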
Similar resources
High-performance BWT-based Encoders
In 1994, Burrows and Wheeler [5] developed a data compression algorithm which performs significantly better than Lempel-Ziv based algorithms. Since then, a lot of work has been done in order to improve their algorithm, which is based on a reversible transformation of the input string, called BWT (the Burrows-Wheeler transformation). In this paper, we propose a compression scheme based on BWT, M...
Second step algorithms in the Burrows-Wheeler compression algorithm
In this paper we fix our attention on the second step algorithms of the Burrows–Wheeler compression algorithm, which in the original version is the Move To Front transform. We discuss many of its replacements presented so far, and compare compression results obtained using them. Then we propose a new algorithm that yields a better compression ratio than the previous ones.
Improvements to the Burrows-Wheeler Compression Algorithm: After BWT Stages
The lossless Burrows-Wheeler Compression Algorithm has received considerable attention over recent years for both its simplicity and effectiveness. It is based on a permutation of the input sequence − the Burrows-Wheeler Transform − which groups symbols with a similar context close together. In the original version, this permutation was followed by a Move-To-Front transformation and a final ent...
Incremental frequency count - a post BWT-stage for the Burrows-Wheeler compression algorithm
The stage after the Burrows-Wheeler Transform (BWT) has a key function inside the Burrows-Wheeler compression algorithm as it transforms the BWT output from a local context into a global context. This paper presents the Incremental Frequency Count stage, a post-BWT stage. The new stage is paired with a run length encoding stage between the BWT and entropy coding stage of the algorithm. It offer...
The Engineering of a Compression Boosting Library: Theory vs Practice in BWT Compression
Data Compression is one of the most challenging arenas both for algorithm design and engineering. This is particularly true for Burrows and Wheeler Compression, a technique that is important in itself and for the design of compressed indexes. There has been considerable debate on how to design and engineer compression algorithms based on the BWT paradigm. In particular, Move-to-Front Encoding is...
Syllable-Based Burrows-Wheeler Transform
The Burrows-Wheeler Transform (BWT) is a compression method which reorders an input string into a form that is better suited to a subsequent compression stage. Usually the Move-To-Front transform and then Huffman coding are applied to the permuted string. The original method [3] from 1994 was designed for alphabet compression. In 2001, versions working with word and n-gram alphabets were presented. The newe...