Querying highly similar sequences

Authors

Carl Barton

Mathieu Giraud

Costas S. Iliopoulos

Thierry Lecroq

Laurent Mouchard

Solon P. Pissis

Abstract

In this paper, we present a solution to the extreme similarity sequencing problem. The problem consists of finding occurrences of a pattern p in a set S(0), S(1), ..., S(k) of sequences of equal length, where S(i), for all 1≤i≤k, differs from S(0) by a constant number of errors (around 10 in practice). We present an asymptotically fast O(n + occ log occ)-time algorithm, as well as a practical O(nk/w)-time algorithm for solving this problem, where n is the length of a sequence, occ is the number of candidate occurrences reported by our technique, w is the size of the machine word, and the total number of errors is bounded by k, the number of sequences.
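
To make the problem statement concrete, the following is a minimal Python sketch of the querying task only. It is not the authors' O(n + occ log occ) or O(nk/w) algorithm, which the abstract does not detail. It assumes, for illustration, that errors are substitutions and that each S(i) is stored as a dictionary of differences against S(0). The names `find_all` and `query_similar` are hypothetical.

```python
from typing import Dict, List, Tuple


def find_all(text: str, pattern: str) -> List[int]:
    """Return every start index where pattern occurs in text (overlaps allowed)."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def query_similar(reference: str,
                  variants: List[Dict[int, str]],
                  pattern: str) -> List[Tuple[int, int]]:
    """Return sorted (sequence index, start position) pairs for all occurrences.

    Sequence 0 is the reference S(0). variants[i - 1] maps a position to the
    substituted character of S(i). Assumption: substitutions only, so all
    sequences keep the same length.
    """
    m, n = len(pattern), len(reference)
    if m == 0 or m > n:
        return []

    # Search the reference once; most windows of every S(i) are identical to it.
    ref_hits = find_all(reference, pattern)
    results = [(0, pos) for pos in ref_hits]

    for i, subs in enumerate(variants, start=1):
        # A substitution at position p affects windows starting in [p - m + 1, p].
        touched = set()
        for p in subs:
            for s in range(max(0, p - m + 1), min(p, n - m) + 1):
                touched.add(s)

        # Windows that avoid every substitution inherit the reference result.
        results.extend((i, pos) for pos in ref_hits if pos not in touched)

        # Windows overlapping a substitution are re-verified character by character.
        for s in touched:
            window = "".join(subs.get(j, reference[j]) for j in range(s, s + m))
            if window == pattern:
                results.append((i, s))

    return sorted(results)


if __name__ == "__main__":
    # S(1) turns position 6 into 'G', which creates a new occurrence at 4;
    # S(2) turns position 1 into 'T', which destroys the occurrence at 0.
    print(query_similar("ACGTACCTAC", [{6: "G"}, {1: "T"}], "ACG"))
```

With a constant number of differences per sequence, only a number of windows proportional to the pattern length has to be re-checked for each S(i); everything else is answered from the single scan of S(0).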

Similar resources

Fuzzy Querying of Semi-structured Data

Querying XML data is a well-explored topic thanks to powerful query languages such as XPath and XQuery. Both were designed to support the evaluation of binary predicates, which can be proven to be a limited approach to effective querying of XML data. In this paper, a fuzzy extension of the XPath query language is proposed. Its goal is to achieve more flexible querying through vague queries, whi...

Prefix-querying with an L1 distance metric for time-series subsequence matching under time warping

This paper discusses the way of processing time-series subsequence matching under time warping. Time warping enables sequences to be found with similar patterns even when they are of different lengths. The prefix-querying method is the first index-based approach that efficiently performs time-series subsequence matching under time warping without false dismissals. This method employs the L dist...

Querying Large Similar Sequences in a Compressed Format Efficiently

With the advances in next-generation sequencing technologies, the amount of genomic sequence data being produced continues to grow at an exponential rate. A unique characteristic of these sequences is that they are over 99% similar, and therefore highly compressible using their differences with respect to a reference sequence. Still, an increasingly pressing challenge is how to efficiently quer...

SIMAP: the similarity matrix of proteins

Similarity Matrix of Proteins (SIMAP) (http://mips.gsf.de/simap) provides a database based on a pre-computed similarity matrix covering the similarity space formed by >4 million amino acid sequences from public databases and completely sequenced genomes. The database is capable of handling very large datasets and is updated incrementally. For sequence similarity searches and pairwise alignments...

Developing a BIM-based Spatial Ontology for Semantic Querying of 3D Property Information

With the growing dominance of complex and multi-level urban structures, current cadastral systems, which are often developed based on 2D representations, are not capable of providing unambiguous spatial information about urban properties. Therefore, the concept of 3D cadastre is proposed to support 3D digital representation of land and properties and facilitate the communication of legal owners...

Querying Highly Similar Structured Sequences via Binary Encoding and Word Level Operations

In the post-genomic era there has been an explosion in the amount of genomic data available and the primary research problems have moved from being able to produce interesting biological data to being able to efficiently process and store this information. In this paper we present efficient data structures and algorithms for the High Similarity Sequencing Problem. In the High Similarity Sequenc...

Journal:

International journal of computational biology and drug design

Volume 6, Issue 1-2

Publication date: 2013
