Practical and Effective Re-Pair Compression
Abstract
Re-Pair is an efficient grammar compressor that operates by recursively replacing high-frequency character pairs with new grammar symbols. The most space-efficient linear-time algorithm computing Re-Pair uses (1 + ε)n + √n words on top of the re-writable text (of length n and stored in n words), for any constant ε > 0; in practice, however, this solution uses complex sub-procedures that prevent it from being practical. In this paper, we present an implementation of the above-mentioned result making use of more practical solutions; our tool further improves the working space to (1.5 + ε)n words (text included), for some small constant ε. As a second contribution, we focus on compact representations of the output grammar. The lower bound for storing a grammar with d rules is log(d!) + 2d ≈ d log d + 0.557d bits, and the most efficient encoding algorithm in the literature uses at most d log d + 2d bits and runs in O(d^1.5) time. We describe a linear-time heuristic maximizing the compressibility of the output Re-Pair grammar. On real datasets, our grammar encoding uses, on average, only 2.8% more bits than the information-theoretic minimum. In half of the tested cases, our compressor improves on the output size of 7-Zip with maximum compression rate turned on.
1998 ACM Subject Classification: E.4 Coding and Information Theory, E.1 Data Structures, F.2.2 Nonnumerical Algorithms and Problems
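To make the pair-replacement idea in the abstract concrete, here is a minimal Python sketch of a naive Re-Pair loop. It is only an illustration, not the space-efficient linear-time algorithm the paper implements: it recounts all pairs on every round, so it runs in quadratic time in the worst case. The function names (`count_pairs`, `replace_pair`, `repair`, `expand`, `grammar_lower_bound_bits`) and the choice to represent nonterminals as integers are assumptions made for this example. The last helper evaluates the quoted lower bound log(d!) + 2d, taking logarithms in base 2, so it can be compared numerically with d log d + 0.557d.

```python
import math
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    i = 0
    while i < len(seq) - 1:
        counts[(seq[i], seq[i + 1])] += 1
        # In a run such as "aaa", the pair starting at i+1 overlaps the
        # one starting at i, so skip it.
        if seq[i] == seq[i + 1] and i + 2 < len(seq) and seq[i + 2] == seq[i]:
            i += 2
        else:
            i += 1
    return counts


def replace_pair(seq, pair, new_symbol):
    """Replace occurrences of `pair` with `new_symbol`, scanning left to right."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(text):
    """Naive Re-Pair: returns (final sequence, list of rules).

    Terminals are one-character strings; nonterminal k is the integer k
    and expands to the pair stored in rules[k] (an assumed representation).
    """
    seq = list(text)
    rules = []
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:  # no pair repeats, so no further rule pays off
            break
        new_symbol = len(rules)
        rules.append(pair)
        seq = replace_pair(seq, pair, new_symbol)
    return seq, rules


def expand(symbol, rules):
    """Recursively expand a symbol back into the text it derives."""
    if isinstance(symbol, str):
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def grammar_lower_bound_bits(d):
    """log2(d!) + 2d, using lgamma(d + 1) = ln(d!)."""
    return math.lgamma(d + 1) / math.log(2) + 2 * d


if __name__ == "__main__":
    text = "abracadabra abracadabra"
    seq, rules = repair(text)
    assert "".join(expand(s, rules) for s in seq) == text
    print(len(text), "symbols ->", len(seq), "symbols,", len(rules), "rules")
```

The constant 0.557 follows from Stirling's approximation: log2(d!) ≈ d log2 d − d log2 e, and 2 − log2 e ≈ 0.557.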
Similar resources
Re-Pair Compression of Inverted Lists
Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: We use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompressi...
Reference Sequence Construction for Relative Compression of Genomes
Relative compression, where a set of similar strings are compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform and also supports rapid random access to the underlying data. The main difficulty of relative compression is in selecting an appropriate reference sequence. In...
Effective Variable-Length-to-Fixed-Length Coding via a Re-Pair Algorithm
We address the problem of improving variable-length-to-fixed-length codes (VF codes). A VF code is an encoding scheme that uses a fixed-length code, and thus, one can easily access the compressed data. However, conventional VF codes usually have an inferior compression ratio to that of variable-length codes. Although a method proposed by T. Uemura et al. in 2010 achieves a good compression ratio com...
Capacitive Flux Compression Generator (RESEARCH NOTE)
Conventional Flux Compression Generators (FCG's) are used to generate high power DC pulses. A new kind of (FCG's) with series capacitance called Capacitive Flux Compression Generator (CFCG) will be introduced and explained in this paper. This new kind is used to generate modulated high power pulses. There are some problems to establish a capacitance in high power and high frequency applications...
Byte pair encoding: a text compression scheme that accelerates pattern matching
Byte pair encoding (BPE) is a simple universal text compression scheme. Decompression is very fast and requires small work space. Moreover, it is easy to decompress an arbitrary part of the original text. However, it has not been so popular since the compression is rather slow and the compression ratio is not as good as other methods such as Lempel-Ziv type compression. In this paper, we bring ...
Journal title:
- CoRR
Volume: abs/1704.08558
Publication date: 2017