Suffix Tree Characterization of Maximal Motifs in Biological Sequences

Authors

  • Maria Federico
  • Nadia Pisanti
Abstract

Finding motifs in biological sequences is one of the most intriguing problems for designers of string algorithms, owing both to its numerous applications in molecular biology and to the challenging aspects of the computational problem. Indeed, when dealing with biological sequences it is necessary to work with approximations (that is, to identify fragments that are not necessarily identical, but just similar, according to a given similarity notion), and this complicates the problem. Existing algorithms run in time linear with respect to the input size. Nevertheless, the output size can be very large due to the approximation (namely, exponential in the approximation degree). This often makes the output unreadable, in addition to slowing down the inference itself. A high degree of redundancy has been detected in the set of motifs that satisfy traditional requirements, even for exact motifs. Moreover, it has been observed many times that only a subset of these motifs, namely the maximal motifs, could be enough to provide the information of all of them. In this paper, we aim at removing such redundancy. We extend some notions of maximality already defined for exact motifs to the case of approximate motifs with Hamming distance, and we give a characterization of maximal motifs on the suffix tree. Given that this data structure is used by a whole class of motif extraction tools, we show how these tools can be modified to include the maximality requirement without changing the asymptotic complexity.
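
To make the notions of Hamming-distance approximation and output blow-up concrete, here is a minimal brute-force sketch. It is not the suffix tree method of the paper; the function names (`hamming`, `approx_occurrences`, `neighborhood_size`) and the default DNA alphabet size of 4 are assumptions chosen for illustration. It shows what an approximate occurrence is, and how the number of candidate strings within distance e of a fixed string of length k grows with e.

```python
from math import comb


def hamming(u: str, v: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(u, v))


def approx_occurrences(motif: str, text: str, e: int) -> list[int]:
    """Start positions where text matches motif with at most e mismatches.

    Brute force over every window (illustrative only, not suffix-tree based).
    """
    k = len(motif)
    return [i for i in range(len(text) - k + 1)
            if hamming(motif, text[i:i + k]) <= e]


def neighborhood_size(k: int, e: int, sigma: int = 4) -> int:
    """Count strings of length k within Hamming distance e of a fixed string.

    Choose i mismatch positions, then one of (sigma - 1) other letters at each.
    """
    return sum(comb(k, i) * (sigma - 1) ** i for i in range(e + 1))


if __name__ == "__main__":
    print(approx_occurrences("ACG", "ACGTACCACT", 0))  # exact occurrences
    print(approx_occurrences("ACG", "ACGTACCACT", 1))  # up to one mismatch
    for e in range(4):
        print(e, neighborhood_size(8, e))
```

Under these assumptions, allowing one mismatch already turns a single exact occurrence of `ACG` into three approximate ones, and the neighborhood of a length-8 DNA string grows from 1 to 25 to 277 candidates as e goes from 0 to 2. This growth is one way to see why restricting the output to maximal motifs is attractive.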

Similar resources

Structural Analysis of Gapped Motifs of a String

We investigate the structure of the set of gapped motifs (repeated patterns with don’t cares) of a given string of symbols. A natural equivalence classification is introduced for the motifs, based on their pattern of occurrences, and another classification for the occurrence patterns, based on the induced motifs. Quadratic–time algorithms are given for finding a maximal representative for an eq...

Algorithms for Extracting Structured Motifs Using a Suffix Tree with an Application to Promoter and Regulatory Site Consensus Identification

This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. Structured motifs may be described as an ordered collection of p ≥ 1 "boxes" (each box corresponding to one part of the structured motif), p substitution rates (one for each box) and p - 1 intervals of distance (one for each pair of successive boxes in the collection). The con...

Finding DNA Motifs: A Probabilistic Suffix Tree Approach

We address the problem of de novo motif identification. That is, given a set of DNA sequences we try to identify motifs in the dataset without having any prior knowledge about existence of any motifs in the dataset. We propose a method based on Probabilistic Suffix Trees (PSTs) to identify fixed-length motifs from a given set of DNA sequences. Our experiments reveal that our approach successful...

Application of suffix trees for the acquisition of common motifs with gaps in a set of strings

The inference of common motifs in a set of strings is a well-known problem with many applications in biological sciences. We study a new variant of this problem that offers a solution with the added flexibility in the length of the common motifs to be found. We present algorithms that allow stretching of the length of the motifs as well as elasticity in the length of gaps between the motifs. Th...

STEME: efficient EM to find motifs in large data sets

MEME and many other popular motif finders use the expectation-maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughput biological techniques. A suffix tree is a data structure that can efficiently index a set of s...

Publication date: 2008