Encoded Expansion: An Efficient Algorithm to Discover Identical String Motifs
Abstract
A major task in computational biology is the discovery of short recurring string patterns known as motifs. Most schemes for discovering motifs are either stochastic or combinatorial in nature. Stochastic approaches do not guarantee finding the correct motifs, while combinatorial schemes tend to have exponential time complexity with respect to motif length. To alleviate the cost, the combinatorial approach exploits dynamic data structures such as trees or graphs. Recently, Karci (2009, Efficient automatic exact motif discovery algorithms for biological sequences, Expert Systems with Applications 36:7952-7963) devised a deterministic algorithm that finds all the identical copies of string motifs of all sizes [Formula: see text] in theoretical time complexity of [Formula: see text] and a space complexity of [Formula: see text], where [Formula: see text] is the length of the input sequence and [Formula: see text] is the length of the longest possible string motif. In this paper, we present a significant improvement on Karci's original algorithm. The algorithm that we propose reports all identical string motifs of sizes [Formula: see text] that occur at least [Formula: see text] times. Our algorithm starts with string motifs of size 2, and at each iteration it expands the candidate string motifs by one symbol, throwing out those that occur fewer than [Formula: see text] times in the entire input sequence. We use a simple array and data encoding to achieve a theoretical worst-case time complexity of [Formula: see text] and a space complexity of [Formula: see text]. Encoding of the substrings can speed up the process of comparison between string motifs. Experimental results on random and real biological sequences confirm that our algorithm does indeed have linear time complexity and that it is more scalable in terms of sequence length than the existing algorithms.
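The expand-and-prune idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `find_repeated_motifs`, the parameters `min_count` and `max_len`, the default DNA alphabet, and the use of a dictionary in place of the paper's array are all assumptions made for illustration. Each candidate motif is encoded as an integer in base `len(alphabet)`, so comparing two motifs reduces to comparing two integers.

```python
from collections import defaultdict


def find_repeated_motifs(seq, min_count, max_len, alphabet="ACGT"):
    """Return {motif: start positions} for every exact motif of length
    2..max_len that occurs at least min_count times in seq.

    Illustrative sketch only; names and data layout are assumptions.
    Overlapping occurrences are counted.
    """
    code = {ch: i for i, ch in enumerate(alphabet)}
    base = len(alphabet)
    n = len(seq)
    results = {}

    # Start with all motifs of size 2, grouped by their integer encoding.
    groups = defaultdict(list)
    for i in range(n - 1):
        key = code[seq[i]] * base + code[seq[i + 1]]
        groups[key].append(i)

    length = 2
    while groups and length <= max_len:
        # Prune: discard candidates that occur fewer than min_count times.
        survivors = {k: pos for k, pos in groups.items() if len(pos) >= min_count}
        for pos in survivors.values():
            results[seq[pos[0]:pos[0] + length]] = pos

        # Expand: extend each surviving occurrence by one symbol and
        # update its encoding with a single multiply-and-add.
        groups = defaultdict(list)
        for key, pos in survivors.items():
            for i in pos:
                if i + length < n:
                    groups[key * base + code[seq[i + length]]].append(i)
        length += 1

    return results


if __name__ == "__main__":
    for motif, positions in find_repeated_motifs("ACGTACGTAC", 2, 4).items():
        print(motif, positions)
```

Only occurrences of surviving candidates are extended, so the work per iteration shrinks as infrequent motifs are thrown out. Python integers grow without overflow, which keeps the encoding exact for long motifs at the cost of slower arithmetic; a fixed-width encoding would need a bound on motif length.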
Similar resources
An Efficient Routing Algorithm to Lifetime Expansion in Wireless Sensor Networks
This paper proposes an efficient network architecture to improve energy consumption in Wireless Sensor Networks (WSN). The proposed architecture uses a mobile data collector in a partitioned network. The mobile data collector moves to the center of each logical partition after each decision period. The mobile data collector must declare its new location by broadcasting a packet to all sensor node...
An Efficient Bi-objective Genetic Algorithm for the Single Batch-Processing Machine Scheduling Problem with Sequence Dependent Family Setup Time and Non-identical Job Sizes
This paper considers the problem of minimizing makespan and maximum tardiness simultaneously for scheduling jobs under non-identical job sizes, dynamic job arrivals, incompatible job families, and sequence-dependent family setup time on the single batch-processor, where split size of jobs is allowed between batches. First, a new Mixed Integer Linear Programming (MILP) model is proposed for t...
Development of an Efficient Hybrid Method for Motif Discovery in DNA Sequences
This work presents a hybrid method for motif discovery in DNA sequences. The proposed method, called SPSO-Lk, borrows the concept of Chebyshev polynomials and uses stochastic local search to improve the performance of the basic PSO algorithm as a motif finder. The Chebyshev polynomial concept encourages us to use a linear combination of previously discovered velocities beyond that proposed b...
Incremental Paradigms of Motif Discovery
We examine the problem of extracting maximal irredundant motifs from a string. A combinatorial argument poses a linear bound on the total number of such motifs, thereby opening the way to the quest for the fastest and most efficient methods of extraction. The basic paradigm explored here is that of iterated updates of the set of irredundant motifs in a string under consecutive unit symbol exten...