Simple and Flexible Detection of Contiguous Repeats Using a Suffix Tree (Preliminary Version)
Authors
Abstract
We study the problem of detecting all occurrences of (primitive) tandem repeats and tandem arrays in a string. We first give a simple time- and space-optimal algorithm to find all tandem repeats, and then modify it to become a time- and space-optimal algorithm for finding only the primitive tandem repeats. Both of these algorithms are then extended to handle tandem arrays. The contribution of this paper is both pedagogical and practical, giving simple algorithms and implementations based on a suffix tree, using only standard tree traversal techniques.
Theoretical Computer Science 270 (2002) 843–856. www.elsevier.com/locate/tcs. Jens Stoye, Dan Gusfield. Department of Computer Science, University of California, Davis, Davis, CA 95616, USA. Received December 1999; revised August 2000; accepted February 2001. Communicated by A. Apostolico.
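To make the terminology concrete, here is a minimal brute-force sketch. It is not the suffix-tree algorithm of the paper. It assumes the usual definitions: a tandem repeat is a substring of the form ww, and it is primitive when w is not itself a repetition of a shorter string. The function names `tandem_repeats`, `is_primitive`, and `primitive_tandem_repeats` are hypothetical, chosen only for illustration.

```python
def tandem_repeats(s):
    """Return all (start, period) pairs with s[start:start+2*period] == w + w.

    Naive enumeration for illustration only; NOT the suffix-tree method.
    """
    n = len(s)
    found = []
    for period in range(1, n // 2 + 1):
        for start in range(n - 2 * period + 1):
            left = s[start:start + period]
            right = s[start + period:start + 2 * period]
            if left == right:
                found.append((start, period))
    return found


def is_primitive(w):
    """True if w is not a power u^k with k >= 2.

    A non-empty w is primitive exactly when its first occurrence in w + w
    after offset 0 is at offset len(w).
    """
    return (w + w).find(w, 1) == len(w)


def primitive_tandem_repeats(s):
    """Keep only the tandem repeats whose repeated unit is primitive."""
    return [(i, p) for (i, p) in tandem_repeats(s) if is_primitive(s[i:i + p])]


if __name__ == "__main__":
    print(tandem_repeats("abab"))            # [(0, 2)]
    print(tandem_repeats("aaaa"))            # [(0, 1), (1, 1), (2, 1), (0, 2)]
    print(primitive_tandem_repeats("aaaa"))  # [(0, 1), (1, 1), (2, 1)]
```

In "aaaa" the repeat with unit "aa" is reported by `tandem_repeats` but filtered out as non-primitive, since "aa" is itself a repetition of "a". This enumeration is cubic in the worst case, which is the kind of cost a suffix-tree approach is meant to avoid.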
Similar resources
Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of the Depth
Suffix trees are the most frequently used data structures in algorithms on words. In this paper, we consider the depth of a compact suffix tree, also known as the PAT tree, under some simple probabilistic assumptions. For a biased memoryless source, we prove that the limiting distribution for the depth in a PAT tree is the same as the limiting distribution for the depth in a PATRICIA trie, even...
Homologous Synteny Block Detection Based on Suffix Tree Algorithms
A synteny block represents a set of contiguous genes located within the same chromosome and well conserved among various species. Through long evolutionary processes and genome rearrangement events, large numbers of synteny blocks remain highly conserved across multiple species. Understanding the distribution of conserved gene blocks helps evolutionary biologists trace the diversity of lif...
RepMaestro: scalable repeat detection on disk-based genome sequences
MOTIVATION: We investigate the problem of exact repeat detection on large genomic sequences. Most existing approaches based on suffix trees and suffix arrays (SAs) are limited either to small sequences or those that are memory resident. We introduce RepMaestro, a software that adapts existing in-memory-enhanced SA algorithms to enable them to scale efficiently to large sequences that are disk re...
A Simple Parallel Cartesian Tree Algorithm and its Application to Suffix Tree Construction
We present a simple linear work and space, and polylogarithmic time parallel algorithm for generating multiway Cartesian trees. As a special case, the algorithm can be used to generate suffix trees from suffix arrays on arbitrary alphabets in the same bounds. In conjunction with parallel suffix array algorithms, such as the skew algorithm, this gives a rather simple linear work parallel algorit...
CCC-Bicluster Analysis for Time Series Gene Expression Data
Many biclustering problems have been shown to be NP-complete. However, when the goal is to identify biclusters in time series expression data, the problem can be restricted to finding only maximal biclusters with contiguous columns. This restriction leads to a tractable problem. Its motivation is the fact that biological processes start and conclude in an identifiable contiguous p...