Block Sorting Text Compression — Final Report
Author
Abstract
A recent development in text compression is a “block sorting” algorithm which permutes the input text according to a special sort procedure and then processes the permuted text with Move-to-Front and a final statistical compressor. The technique combines good speed with excellent compression performance. This report investigates the block sorting compression algorithm, in particular trying to understand its operation and limitations. Various approaches are investigated in an attempt to improve the compression with block sorting, most of which involve a hierarchy of coding models to allow fast adaptation to local contexts. The best technique involves a new “structured” coding model, especially designed for compressing data with skew symbol distributions. Block sorting compression is found to be related to work by Shannon in 1951 on the prediction of English text. The work confirms block sorting as a good text compression technique, with compression approaching that of the currently best compressors while being much faster than other compressors of comparable performance.
Preface
This is the third report of a series on block sorting text compression (the previous members are Technical Reports 111 and 120, refs [10] and [11]). A shorter version was presented as a paper at ACSC’96 (ref [12]). This present report was prepared as a comprehensive account of my experience with block sorting compression, including several interesting curiosities, but grew too large for publication as a paper. Important material will be extracted to form probably two journal papers, but this text remains the comprehensive and coherent picture of the work. While it largely replaces the two earlier reports, those do contain some useful material which is omitted here. For example, Technical Report 111 includes extensive logs of the compression process and output, which could be useful for people interested in the details of the operation.
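To make the first two stages named in the abstract concrete, here is a minimal Python sketch of a block sort followed by Move-to-Front. It is an illustration under stated assumptions, not the report's implementation: the function names are hypothetical, the block sort is the naive sort-all-rotations form of the Burrows-Wheeler transform, an end-of-string sentinel (`"\0"`) is assumed so the transform can be inverted without storing a row index, and the final statistical compressor is omitted entirely.

```python
def block_sort(text, eos="\0"):
    """Naive block sort: sort all rotations of text+eos, keep the last column.

    Assumption: `eos` is a sentinel that does not occur in `text`.
    Simple and quadratic in memory; real implementations use a smarter sort.
    """
    assert eos not in text
    s = text + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_block_sort(last, eos="\0"):
    """Invert block_sort by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The row that ends with the sentinel is the original text+eos.
    row = next(r for r in table if r.endswith(eos))
    return row[:-1]


def move_to_front(data, alphabet):
    """Replace each symbol by its current position in a recency list."""
    table = list(alphabet)
    ranks = []
    for ch in data:
        idx = table.index(ch)
        ranks.append(idx)
        table.insert(0, table.pop(idx))  # move the symbol to the front
    return ranks


def inverse_move_to_front(ranks, alphabet):
    """Rebuild the symbols from Move-to-Front ranks."""
    table = list(alphabet)
    out = []
    for idx in ranks:
        ch = table.pop(idx)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


if __name__ == "__main__":
    permuted = block_sort("banana")
    alphabet = sorted(set(permuted))
    ranks = move_to_front(permuted, alphabet)
    print(repr(permuted), ranks)
    restored = inverse_block_sort(inverse_move_to_front(ranks, alphabet))
    print(restored)
```

The block sort groups symbols that share a following context, so Move-to-Front tends to emit many small ranks; that skewed rank stream is what the final statistical coder would then compress.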
1 The report is available by anonymous FTP from ftp.cs.auckland.ac.nz /out/peter-f/TechRep130.ps
Similar resources
Improvements to the Block Sorting Text Compression Algorithm
This report presents some further work on the recently described “Block Sorting” lossless or text compression algorithm. It is already known that it is a context-based compressor of unbounded order, but those contexts are completely restructured by the sort phase of the compression. The report examines the effects of those context changes. It is shown that the requirements on the final compress...
The Burrows-Wheeler Transform for Block Sorting Text Compression: Principles and Improvements
A recent development in text compression is a “block sorting” algorithm which permutes the input text according to a special sort procedure and then processes the permuted text with Move-to-Front and a final statistical compressor. The technique combines good speed with excellent compression performance. This paper investigates the fundamental operation of the algorithm and presents some improv...
Experiments with a Block Sorting Text Compression Algorithm
This report presents some preliminary work on a recently described “Block Sorting” lossless or text compression algorithm. While having little apparent relationship to established techniques, it has a performance which places it definitely among the best-known compressors. The original paper did little more than present the algorithm, with strong advice for efficient implementation...
Text Compression using Recency Rank with Context and Relation to Context Sorting, Block Sorting and PPM*
Recently block sorting compression scheme was developed and relation to statistical scheme was studied, but theoretical analysis of performance has not been studied well. Context sorting is a compression scheme based on context similarity and it is regarded as an online version of the block sorting and it is asymptotically optimal. However, the compression speed is slower and the real performan...
Enhanced Word-Based Block-Sorting Text Compression
The Block Sorting process of Burrows and Wheeler can be applied to any sequence in which symbols are (or might be) conditioned upon each other. In particular, it is possible to parse text into a stream of words, and then employ block sorting to identify and so exploit any conditioning relationships between words. In this paper we build upon the previous work of two of the authors, describing se...