Approximation Algorithm for the Shortest Superstring Problem

Author

  • Chris Armen
Abstract

Given a collection of strings S = {s_1, ..., s_n} over an alphabet Σ, a superstring of S is a string containing each s_i as a substring; that is, for each i, 1 ≤ i ≤ n, the superstring contains a block of |s_i| consecutive characters that match s_i exactly. The shortest superstring problem is the problem of finding a superstring of minimum length. The shortest superstring problem has applications in both data compression and computational biology. In data compression, the problem is a part of a general model of string compression proposed by Gallant, Maier and Storer (JCSS '80). Much of the recent interest in the problem is due to its application to DNA sequence assembly. The problem has been shown to be NP-hard; in fact, it was shown by Blum et al. (JACM '94) to be MAX SNP-hard. The first O(1)-approximation was also due to Blum et al., who gave an algorithm that always returns a superstring no more than 3 times the length of an optimal solution. Several researchers have published results that improve on the approximation ratio; of these, the best previous result is our algorithm ShortString, which achieves a 2 3/4-approximation (WADS '95). We present our new algorithm, G-ShortString, which achieves a ratio of 2 2/3. It generalizes the ShortString algorithm, but the analysis differs substantially from that of ShortString. Our previous work identified classes of strings that have a nested periodic structure, and which must be present in the worst case for our algorithms. We introduced machinery to describe these strings and proved strong structural properties about them. In this paper we extend this study to strings that exhibit a more relaxed form of the same structure, and we use this understanding to obtain our improved result.
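
To make the problem definition concrete, the following Python sketch checks the superstring property and finds an exact shortest superstring by trying every ordering of the input. It is not ShortString or G-ShortString, whose details are not given in the abstract; it is a brute-force illustration that is only feasible for very small inputs, and the function names (`is_superstring`, `overlap`, `shortest_superstring_bruteforce`) are hypothetical.

```python
from itertools import permutations


def is_superstring(candidate, strings):
    """Return True if every string occurs in candidate as a contiguous block."""
    return all(s in candidate for s in strings)


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring_bruteforce(strings):
    """Exact answer by exhaustive search over orderings (exponential time)."""
    # Remove duplicates and strings contained in another input string;
    # they are covered automatically and never change the optimum.
    unique = list(dict.fromkeys(strings))
    reduced = [s for s in unique if not any(s != t and s in t for t in unique)]
    if not reduced:
        return ""
    best = None
    for order in permutations(reduced):
        # Merge consecutive strings along their maximum overlap.
        merged = order[0]
        for nxt in order[1:]:
            merged += nxt[overlap(merged, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


if __name__ == "__main__":
    example = ["abc", "bcd", "cde"]
    result = shortest_superstring_bruteforce(example)
    print(result, is_superstring(result, example))
```

Because the problem is NP-hard, exhaustive search like this does not scale, which is the motivation for approximation algorithms with a guaranteed ratio such as those discussed in the abstract.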

Similar resources

Selecting the Shortest Superstring in DNA Using the Particle Swarm Algorithm

A DNA string can be regarded as a very long string over a four-letter alphabet. Many scientists are attempting to decode this string. Because the string is very long, shorter sections of it that overlap one another are decoded instead. There is no information about the correct position of these sections on the main DNA string. It seems that the shortest string (substring of the main DNA string) i...

Approximating the Shortest Superstring Problem Using de Bruijn Graphs

The best known approximation ratio for the shortest superstring problem is 2 11/23 (Mucha, 2012). In this note, we improve this bound for the case when the length of all input strings is equal to r, for r ≤ 7. For example, for strings of length 3 we get a 1 1/3-approximation. An advantage of the algorithm is that it is extremely simple both to implement and to analyze. Another advantage is tha...

Approximation Solutions for Time-Varying Shortest Path Problem

Time-varying network optimization problems have traditionally been solved by specialized algorithms. These algorithms have NP-complement time complexity. This paper considers the time-varying shortest path problem, which can be optimally solved in O(T(m + n)) time, where T is a given integer. For this problem with arbitrary waiting times, we propose an approximation algorithm, whic...

An Experimental Comparison of Approximation Algorithms for the Shortest Common Superstring Problem

The paper presents an experimental comparison of a 4-approximation algorithm with a 3-approximation algorithm for the Shortest Common Superstring (SCS) problem. It has two main objectives. One is to show that, even though the quotient between the two approximation ratios is 4/3 in the worst case, the average quotient of the results is approximately 1, independent of the instance size. The second object...

Approximating Shortest Superstring Problem Using de Bruijn Graphs

The best known approximation ratio for the shortest superstring problem is 2 11/23 (Mucha, 2012). In this note, we improve this bound for the case when the length of all input strings is equal to r, for r ≤ 7. E.g., for strings of length 3 we get a 1 1/3-approximation. An advantage of the algorithm is that it is extremely simple both to implement and to analyze. Another advantage is that it is...

A linear time algorithm for Shortest Cyclic Cover of Strings

Merging words according to their overlap yields a superstring. This basic operation makes it possible to infer long strings from a collection of short pieces, as in genome assembly. To capture a maximum of overlaps, the goal is to infer the shortest superstring of a set of input words. The Shortest Cyclic Cover of Strings (SCCS) problem asks, instead of a single linear superstring, for a set of cyclic str...

Publication date: 1995