A compressed dynamic self-index for highly repetitive text collections

Authors

Takaaki Nishimoto

Yoshimasa Takabatake

Yasuo Tabei

Abstract

We present a novel compressed dynamic self-index for highly repetitive text collections. Signature encoding, an existing self-index of this type, has the major drawback that pattern search is slow for short patterns. We obtain faster pattern search by leveraging the idea behind a truncated suffix tree (TST) to develop the first compressed dynamic self-index, called the TST-index, that supports not only fast pattern search but also dynamic update operations for highly repetitive texts. Experiments with a benchmark dataset show that the TST-index significantly improves pattern search performance.
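To make the truncation idea concrete, here is a minimal sketch of a k-truncated suffix trie in Python. It is not the TST-index from the paper: it is neither compressed nor dynamic, and the class name `TruncatedSuffixTrie`, the parameter `k`, and the method `find` are hypothetical names chosen for illustration. It only shows why indexing substrings up to a fixed length can answer short-pattern queries by a single walk from the root.

```python
class TruncatedSuffixTrie:
    """Toy k-truncated suffix trie (illustrative only, uncompressed, static).

    Every suffix of the text is inserted, but only its first k characters,
    so the trie contains exactly the substrings of length <= k.
    """

    def __init__(self, text: str, k: int):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.root = {"children": {}, "starts": []}
        for i in range(len(text)):
            node = self.root
            # Insert the suffix starting at i, truncated to k characters.
            for ch in text[i:i + k]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "starts": []}
                )
                # Record that a substring ending at this node starts at i.
                node["starts"].append(i)

    def find(self, pattern: str) -> list:
        """Return all start positions of pattern (requires 1 <= len <= k)."""
        if not pattern or len(pattern) > self.k:
            raise ValueError("pattern length must be between 1 and k")
        node = self.root
        for ch in pattern:
            node = node["children"].get(ch)
            if node is None:
                return []
        return list(node["starts"])


index = TruncatedSuffixTrie("abracadabra", k=3)
print(index.find("abr"))  # [0, 7]
print(index.find("ra"))   # [2, 9]
```

The sketch stores explicit position lists at every node, which costs space proportional to k times the text length. A practical index for repetitive collections would need a compressed representation and update operations, which this toy version leaves out.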

Related resources

Self-Indexing Based on LZ77

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that sou...

Self-Index Based on LZ77

On compressing and indexing repetitive sequences

We introduce LZ-End, a new member of the Lempel-Ziv family of text compressors, which achieves compression ratios close to those of LZ77 but performs much faster at extracting arbitrary text substrings. We then build the first self-index based on LZ77 (or LZ-End) compression, which in addition to text extraction offers fast indexed searches on the compressed text. This self-index is particularl...

Faster Compressed Suffix Trees for Repetitive Text Collections

Recent compressed suffix trees targeted to highly repetitive text collections reach excellent compression performance, but operation times in the order of milliseconds. We design a new suffix tree representation for this scenario that still achieves very low space usage, only slightly larger than the best previous one, but supports the operations within microseconds. This puts the data structur...

Compressed Suffix Trees for Repetitive Texts

We design a new compressed suffix tree specifically tailored to highly repetitive text collections. This is particularly useful for sequence analysis on large collections of genomes of closely related species. We build on an existing compressed suffix tree that applies statistical compression, and modify it so that it works on the grammar-compressed version of the longest common prefix array, whose d...

Journal:

CoRR

Volume abs/1711.02855

Publication date 2017
