Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval

نویسندگان

  • Shanika Kuruppu
  • Simon J. Puglisi
  • Justin Zobel
چکیده

Self-indexes – data structures that simultaneously provide fast search of and access to compressed text – are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our ‘RLZ’ approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of r sequences totaling N bases, with a total of s point mutations from a base sequence of length n, this representation requires just nHk(T ) + s log n + s log N s + O(s) bits. At the cost of negligible extra space, access to consecutive symbols requires O( + log n) time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections

Compression techniques that support fast random access are a core component of any information system. Current stateof-the-art methods group documents into fixed-sized blocks and compress each block with a general-purpose adaptive algorithm such as gzip. Random access to a specific document then requires decompression of a block. The choice of block size is critical: it trades between compressi...

متن کامل

RLZAP: Relative Lempel-Ziv with Adaptive Pointers

Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. With Kuruppu et al.’s (SPIRE 2010) original implementation, a reference genome is selected and then the other genomes are greedily parsed into phrases exactly matching substrings of the reference. Deorowicz and Grabowski (Bioinformatics, ...

متن کامل

Fast Relative Lempel-Ziv Self-index for Similar Sequences

Recent advances in biotechnology and web technology are generating huge collections of similar strings. People now face the problem of storing them compactly while supporting fast pattern searching. One compression scheme called relative Lempel-Ziv compression uses textual substitutions from a reference text as follows: Given a (large) set S of strings, represent each string in S as a concatena...

متن کامل

A Compressed Self-Index for Genomic Databases

Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals’ genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv (RLZ) compression takes advantage of this property: it stores the first genome uncompressed or as an FM-index, then compresses the other genomes with a varian...

متن کامل

LZW Data Compression on Large Scale and Extreme Distributed Systems

Results on the parallel complexity of Lempel-Ziv data compression suggest that the sliding window method is more suitable than the LZW technique on shared memory parallel machines. When instead we address the practical goal of designing distributed algorithms with low communication cost, sliding window compression does not seem to guarantee robustness if we scale up the system. The possibility ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010