Fast Relative Lempel-Ziv Self-index for Similar Sequences

Authors

  • Huy Hoang Do
  • Jesper Jansson
  • Kunihiko Sadakane
  • Wing-Kin Sung
Abstract

Recent advances in biotechnology and web technology are generating huge collections of similar strings. People now face the problem of storing them compactly while supporting fast pattern searching. One compression scheme called relative Lempel-Ziv compression uses textual substitutions from a reference text as follows: Given a (large) set S of strings, represent each string in S as a concatenation of substrings from a reference string R. This basic scheme gives a good compression ratio when every string in S is similar to R, but does not provide any pattern searching functionality. Here, we describe a new data structure that supports fast pattern searching.
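
The basic scheme described in the abstract (representing each string as a concatenation of substrings of a reference R) can be illustrated with a minimal sketch. This is not the paper's data structure and provides no pattern searching; it is a naive greedy parse, and the names `rlz_compress`, `rlz_decompress` and the literal-character fallback for symbols absent from R are assumptions made for illustration only.

```python
from typing import List, Tuple, Union

# A phrase is either (start in reference, length) or a single literal character.
# The literal fallback is an assumption of this sketch, not taken from the paper.
Phrase = Union[Tuple[int, int], str]


def rlz_compress(reference: str, s: str) -> List[Phrase]:
    """Greedily parse s into the longest substrings that occur in reference."""
    phrases: List[Phrase] = []
    i = 0
    while i < len(s):
        start, length = -1, 0
        # Extend the candidate one character at a time while it still occurs in reference.
        while i + length < len(s):
            pos = reference.find(s[i:i + length + 1])
            if pos == -1:
                break
            start, length = pos, length + 1
        if length == 0:
            phrases.append(s[i])  # character not present in reference: keep it as a literal
            i += 1
        else:
            phrases.append((start, length))
            i += length
    return phrases


def rlz_decompress(reference: str, phrases: List[Phrase]) -> str:
    """Rebuild the original string by copying each phrase out of reference."""
    return "".join(
        p if isinstance(p, str) else reference[p[0]:p[0] + p[1]] for p in phrases
    )


if __name__ == "__main__":
    R = "ACGTACGGTA"
    S = "ACGGTACGTN"
    parsed = rlz_compress(R, S)
    print(parsed)
    print(rlz_decompress(R, parsed) == S)
```

The repeated `str.find` calls make this sketch quadratic or worse; a practical implementation would typically search an index built over R (for example a suffix array) instead.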

Similar resources

A Compressed Self-Index for Genomic Databases

Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals’ genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv (RLZ) compression takes advantage of this property: it stores the first genome uncompressed or as an FM-index, then compresses the other genomes with a varian...

On compressing and indexing repetitive sequences

We introduce LZ-End, a new member of the Lempel-Ziv family of text compressors, which achieves compression ratios close to those of LZ77 but performs much faster at extracting arbitrary text substrings. We then build the first self-index based on LZ77 (or LZ-End) compression, which in addition to text extraction offers fast indexed searches on the compressed text. This self-index is particularl...

A measure of relative entropy between individual sequences with application to universal classification

A new notion of empirical informational divergence (relative entropy) between two individual sequences is introduced. If the two sequences are independent realizations of two finite-order, finite-alphabet, stationary Markov processes, the empirical relative entropy converges to the relative entropy almost surely. This new empirical divergence is based on a version of the Lempel-Ziv data compress...

Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval

Self-indexes – data structures that simultaneously provide fast search of and access to compressed text – are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our ‘RLZ’ approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the...

Approximate String Matching with Lempel-Ziv Compressed Indexes

A compressed full-text self-index for a text T is a data structure that requires reduced space and is capable of searching for patterns P in T. Furthermore, the structure can reproduce any substring of T, so it actually replaces T. Despite the explosion of interest in self-indexes in recent years, there has not been much progress on search functionalities beyond the basic exact search. In this paper...

Publication date: 2012