Querying highly similar sequences
Abstract
In this paper, we present a solution to the extreme similarity sequencing problem: finding the occurrences of a pattern p in a set S(0), S(1), ..., S(k) of sequences of equal length, where each S(i), 1 ≤ i ≤ k, differs from S(0) by a constant number of errors (around 10 in practice). We present an asymptotically fast O(n + occ log occ)-time algorithm, as well as a practical O(nk/w)-time algorithm, where n is the length of a sequence, occ is the number of candidate occurrences reported by our technique, w is the size of the machine word, and the total number of errors is bounded by k, the number of sequences.
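To make the problem statement concrete, the following is a minimal illustrative baseline, not the algorithm of the paper. It assumes that every error is a single-character substitution (consistent with the sequences having equal length) and that each S(i) is stored as a dictionary of differences against S(0). The names `query`, `find_all`, `char_at`, and the `Diffs` representation are hypothetical. The idea is to search S(0) once, carry over the hits that no difference touches, and re-verify only the windows that overlap a difference.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: position -> substituted character in S(i).
Diffs = Dict[int, str]


def find_all(text: str, pattern: str) -> List[int]:
    """Return every start index of pattern in text (overlaps allowed)."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def char_at(reference: str, diffs: Diffs, pos: int) -> str:
    """Character of the variant at pos: the substitution if any, else S(0)."""
    return diffs.get(pos, reference[pos])


def query(reference: str, variants: List[Diffs], pattern: str) -> List[Tuple[int, int]]:
    """Return sorted (sequence index, start) pairs; index 0 is S(0)."""
    m, n = len(pattern), len(reference)
    if m == 0 or m > n:
        return []

    # Search the reference sequence once.
    ref_hits = find_all(reference, pattern)
    results = [(0, s) for s in ref_hits]

    for i, diffs in enumerate(variants, start=1):
        # Window starts whose m characters overlap at least one difference.
        touched = set()
        for pos in diffs:
            for s in range(max(0, pos - m + 1), min(pos, n - m) + 1):
                touched.add(s)

        # Hits in S(0) that no difference touches also occur in S(i).
        results.extend((i, s) for s in ref_hits if s not in touched)

        # Windows overlapping a difference are verified character by character.
        for s in touched:
            if all(char_at(reference, diffs, s + j) == pattern[j] for j in range(m)):
                results.append((i, s))

    return sorted(results)


if __name__ == "__main__":
    s0 = "ACGTACGAAC"
    # S(1) substitutes T at position 7; S(2) substitutes C at position 3.
    print(query(s0, [{7: "T"}, {3: "C"}], "GTA"))
```

With a constant number of differences per sequence, this sketch verifies only O(m) windows per difference after the single scan of S(0). It does not attempt the word-level parallelism behind the O(nk/w) bound stated above.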
Similar resources
Fuzzy Querying of Semi-structured Data
Querying XML data is a well-explored topic thanks to powerful query languages such as XPath and XQuery. Both were designed to support the evaluation of binary predicates, which can be proven to be a limited approach to effective querying of XML data. In this paper, a fuzzy extension of the XPath query language is proposed. Its goal is to achieve more flexible querying through vague queries, whi...
Prefix-querying with an L1 distance metric for time-series subsequence matching under time warping
This paper discusses the way of processing time-series subsequence matching under time warping. Time warping enables sequences to be found with similar patterns even when they are of different lengths. The prefix-querying method is the first index-based approach that efficiently performs time-series subsequence matching under time warping without false dismissals. This method employs the L dist...
Querying Large Similar Sequences in a Compressed Format Efficiently
With the advances in next-generation sequencing technologies, the amount of genomic sequence data being produced continues to grow at an exponential rate. A unique characteristic of these sequences is that they are over 99% similar, and therefore highly compressible using their differences with respect to a reference sequence. Still, an increasingly pressing challenge is how to efficiently quer...
SIMAP: the similarity matrix of proteins
Similarity Matrix of Proteins (SIMAP) (http://mips.gsf.de/simap) provides a database based on a pre-computed similarity matrix covering the similarity space formed by >4 million amino acid sequences from public databases and completely sequenced genomes. The database is capable of handling very large datasets and is updated incrementally. For sequence similarity searches and pairwise alignments...
Developing a BIM-based Spatial Ontology for Semantic Querying of 3D Property Information
With the growing dominance of complex and multi-level urban structures, current cadastral systems, which are often developed based on 2D representations, are not capable of providing unambiguous spatial information about urban properties. Therefore, the concept of 3D cadastre is proposed to support 3D digital representation of land and properties and facilitate the communication of legal owners...
Querying Highly Similar Structured Sequences via Binary Encoding and Word Level Operations
In the post-genomic era there has been an explosion in the amount of genomic data available and the primary research problems have moved from being able to produce interesting biological data to being able to efficiently process and store this information. In this paper we present efficient data structures and algorithms for the High Similarity Sequencing Problem. In the High Similarity Sequenc...
Journal title:
- International Journal of Computational Biology and Drug Design
Volume 6, Issue 1-2
Pages: -
Publication date: 2013