A Fast Suffix-Sorting Algorithm

Authors

  • Rudolf Ahlswede
  • Bernhard Balkenhol
  • Christian Deppe
  • Martin Fröhlich

Abstract

We present an algorithm to sort all suffixes of x = (x_1, ..., x_n) ∈ X^n lexicographically, where X = {0, ..., q−1}. Fast and efficient sorting of large amounts of data according to suffix structure (suffix sorting) is useful in many fields of application, foremost in data compression, where it is used, for example, for the Burrows-Wheeler Transformation (BWT for short), a block-sorting transformation ([3],[9]). Larsson [4] describes the relationship between the BWT on the one hand and suffix trees and context trees on the other. Sadakane [8] then suggests a well-referenced method to compute the BWT more time-efficiently, and the algorithms based on suffix trees have since been improved ([6],[5],[1]). In [3] it was observed that for an input string of size n, this transformation can be computed in O(n) time and space using suffix trees. Because suffix trees are considered greedy in space (even small factors hidden in the O-notation may decide the feasibility of an algorithm), sorting has also been accomplished by alternative non-linear methods: Manber and Myers [7] introduced an algorithm with O(n log n) worst-case time and 8n bytes of space, and in [2] an algorithm based on Quicksort is suggested, which is fast on average but has worst-case complexity O(n log n). Most prominent in this case are the Bentley-Sedgewick algorithm, which requires 4n bytes, and Sadakane's combination of the Manber-Myers algorithm with the Bentley-Sedgewick algorithm, which has O(n log n) worst-case time using 9n bytes [8]. Reducing the space requirement by means of an upper bound on n seems trivial. However, it turns out that achieving an improvement while retaining an acceptable worst-case time complexity involves a considerable amount of engineering work. This paper proposes an algorithm that is efficient in the terms described above and ideal for handling large blocks of input data.
We assume that the cardinality of the alphabet (q) is smaller than the length of the text string (n). Our algorithm computes the suffix sorting in O(n) space and O(n log n) time in the worst case. It also has the property that it sorts the suffixes lexicographically according to their prefixes of length t2 = log_q n2 in linear time in the worst case. After this initial sorting by prefixes of length t2, we use a Quicksort variant to sort the remaining part, which gives the worst-case time of O(n log n). It is also possible to modify our algorithm to use Heapsort, which gives a worst-case time of O(n (log n)).
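
The two-phase idea described in the abstract (first order the suffixes by a short prefix, then refine each group with a comparison sort) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `suffix_sort` and the parameter `t` are hypothetical, `t` merely stands in for the prefix length from the abstract, and Python's built-in sort replaces both the linear-time initial sorting and the Quicksort variant, so the stated time and space bounds do not apply to this sketch.

```python
from collections import defaultdict


def suffix_sort(x, t):
    """Return the start positions of all suffixes of x in lexicographic order.

    x is a string or a list of symbols; t is an illustrative prefix length
    (an assumption of this sketch, not the value used in the paper).
    """
    n = len(x)

    # Phase 1: group suffixes by their first t symbols.
    # Suffixes shorter than t simply get a shorter key.
    buckets = defaultdict(list)
    for i in range(n):
        buckets[tuple(x[i:i + t])].append(i)

    # Phase 2: visit the groups in key order and sort each group by the
    # symbols that follow the shared prefix. The built-in sort is used here
    # in place of the Quicksort variant mentioned in the abstract.
    order = []
    for key in sorted(buckets):
        group = buckets[key]
        group.sort(key=lambda i: x[i + t:])
        order.extend(group)
    return order
```

For example, `suffix_sort("banana", 2)` returns `[5, 3, 1, 0, 4, 2]`, the positions of the suffixes "a", "ana", "anana", "banana", "na", "nana". Grouping by a short prefix first keeps the expensive comparisons confined to suffixes that actually share that prefix.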

Similar resources

Improving the Speed of LZ77 Compression by Hashing and Suffix Sorting

Two new algorithms for improving the speed of the LZ77 compression are proposed. One is based on a new hashing algorithm named two-level hashing that enables fast longest match searching from a sliding dictionary, and the other uses suffix sorting. The former is suitable for small dictionaries and it significantly improves the speed of gzip, which uses a naive hashing algorithm. The latter is s...

Faster suffix sorting

We propose a fast and memory efficient algorithm for lexicographically sorting the suffixes of a string, a problem that has important applications in data compression as well as string matching. Our algorithm eliminates much of the overhead of previous specialized approaches while maintaining their robustness for all kinds of input. For input size n, our algorithm operates in only two integer a...

Linear-time Suffix Sorting - A New Approach for Suffix Array Construction

This thesis presents a new approach for linear-time suffix sorting. It introduces a new sorting principle that can be used to build the first non-recursive linear-time suffix array construction algorithm named GSACA. Although GSACA cannot hold up with the performance of state of the art suffix array construction algorithms, the algorithm introduces a lot of new ideas for suffix array constructi...

An Algorithm for Suffix Sorting and Its Applications

The suffix tree is a data structure that has found applications in various important problems, such as genetic sequencing, pattern matching and computational biology. Its derivative data structure, the suffix array, is another representation with the added advantage of a small memory footprint. We propose a simple O(n log n) time divide-and-conquer sort-and-merge algorithm for solving the suffix...

Parallel Suffix Sorting

We present a parallel algorithm for lexicographically sorting the suffixes of a string. Suffix sorting has applications in string processing, data compression and computational biology. The ordered list of suffixes of a string stored in an array is known as Suffix Array, an important data structure in string processing and computational biology. Our focus is on deriving a practical implementati...

Journal:
  • Electronic Notes in Discrete Mathematics

Volume 21

Publication date: 2005