Practical and Effective Re-Pair Compression

Authors

  • Philip Bille
  • Inge Li Gørtz
  • Nicola Prezza
Abstract

Re-Pair is an efficient grammar compressor that operates by recursively replacing high-frequency character pairs with new grammar symbols. The most space-efficient linear-time algorithm computing Re-Pair uses (1 + ε)n + √n words on top of the re-writable text (of length n and stored in n words), for any constant ε > 0; in practice, however, this solution relies on complex sub-procedures that prevent it from being practical. In this paper, we present an implementation of the above-mentioned result that makes use of more practical solutions; our tool further improves the working space to (1.5 + ε)n words (text included), for some small constant ε. As a second contribution, we focus on compact representations of the output grammar. The lower bound for storing a grammar with d rules is log(d!) + 2d ≈ d log d + 0.557d bits, and the most efficient encoding algorithm in the literature uses at most d log d + 2d bits and runs in O(d^1.5) time. We describe a linear-time heuristic maximizing the compressibility of the output Re-Pair grammar. On real datasets, our grammar encoding uses, on average, only 2.8% more bits than the information-theoretic minimum. In half of the tested cases, our compressor improves the output size of 7-Zip with maximum compression rate turned on.

1998 ACM Subject Classification: E.4 Coding and Information Theory, E.1 Data Structures, F.2.2 Nonnumerical Algorithms and Problems
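As an illustration of the pair-replacement idea and of the lower bound quoted in the abstract, here is a minimal Python sketch. It is a naive, quadratic-time version written only for clarity, not the linear-time, space-efficient algorithm or the grammar encoding of the paper. All function names (`count_pairs`, `repair`, `expand`, `grammar_lower_bound_bits`) are hypothetical, and the choice to represent nonterminals as integers and terminals as characters is an assumption of this sketch.

```python
import math
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent pair in seq."""
    counts = Counter()
    last_start = {}  # pair -> start index of its last counted occurrence
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # An occurrence starting right after a counted one overlaps it
        # (this only happens for pairs like "aa" inside "aaa").
        if last_start.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_start[pair] = i
    return counts


def repair(text):
    """Naive Re-Pair: returns (final sequence, rules).

    rules maps each new integer symbol to the pair it replaces.
    """
    seq = list(text)
    rules = {}
    next_symbol = 0
    while True:
        best = count_pairs(seq).most_common(1)
        if not best or best[0][1] < 2:
            break  # no pair occurs at least twice
        pair = best[0][0]
        rules[next_symbol] = pair
        # Replace occurrences of the pair left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
    return seq, rules


def expand(seq, rules):
    """Decompress by expanding nonterminals back into characters."""
    out = []
    stack = list(reversed(seq))
    while stack:
        symbol = stack.pop()
        if symbol in rules:
            left, right = rules[symbol]
            stack.append(right)
            stack.append(left)
        else:
            out.append(symbol)
    return "".join(out)


def grammar_lower_bound_bits(d):
    """log2(d!) + 2d, the lower bound stated in the abstract."""
    return math.lgamma(d + 1) / math.log(2) + 2 * d
```

For example, `repair("abab")` replaces the pair `("a", "b")` with a single new symbol, and `expand` applied to the result recovers `"abab"`. The approximation d log d + 0.557d follows from Stirling's formula, since log2(d!) ≈ d log2 d − d log2 e and 2 − log2 e ≈ 0.557.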


Similar resources

Re-Pair Compression of Inverted Lists

Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: We use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompressi...


Reference Sequence Construction for Relative Compression of Genomes

Relative compression, where a set of similar strings are compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform and also supports rapid random access to the underlying data. The main difficulty of relative compression is in selecting an appropriate reference sequence. In...


Effective Variable-Length-to-Fixed-Length Coding via a Re-Pair Algorithm

We address the problem of improving variable-length-to-fixed-length codes (VF codes). A VF code is an encoding scheme that uses a fixed-length code, and thus, one can easily access the compressed data. However, conventional VF codes usually have an inferior compression ratio to that of variable-length codes. Although a method proposed by T. Uemura et al. in 2010 achieves a good compression ratio com...


Capacitive Flux Compression Generator (RESEARCH NOTE)

Conventional Flux Compression Generators (FCGs) are used to generate high-power DC pulses. A new kind of FCG with series capacitance, called the Capacitive Flux Compression Generator (CFCG), is introduced and explained in this paper. This new kind is used to generate modulated high-power pulses. There are some problems to establish a capacitance in high power and high frequency applications...


Byte pair encoding: a text compression scheme that accelerates pattern matching

Byte pair encoding (BPE) is a simple universal text compression scheme. Decompression is very fast and requires small work space. Moreover, it is easy to decompress an arbitrary part of the original text. However, it has not been so popular since the compression is rather slow and the compression ratio is not as good as other methods such as Lempel-Ziv type compression. In this paper, we bring ...



Journal:
  • CoRR

Volume: abs/1704.08558

Publication date: 2017