A hard-disk based suffix tree implementation
Abstract
Suffix trees are highly useful structures for computational genomics and combinatorial pattern matching. Because computational genomics uses small alphabets, specialised hard-disk based suffix trees have been designed for it, but the problem of creating an efficient hard-disk based suffix tree for large and unbounded alphabet sizes remains essentially unsolved. We have designed a hybrid suffix tree, residing partly on hard disk and partly in RAM, which takes advantage of memory mapping, a method for treating data on a hard disk transparently as though it were in memory. Memory mapping is provided by many modern operating systems. Through the use of memory mapping, the implementation loads only a small part of the suffix tree into working memory, which allows it to load faster while maintaining a fairly efficient query speed. The implementation is based on Ukkonen's suffix tree construction algorithm.
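The memory-mapping idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fixed-size node record (`edge_start`, `edge_end`, `first_child`, `next_sibling`), the file name `nodes.bin`, and the helpers `write_nodes` and `read_node` are all assumptions made for illustration. The sketch only shows how a node stored on disk can be read through a memory map without loading the whole file explicitly.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical on-disk node layout (an assumption, not the paper's format):
# four little-endian 32-bit ints: edge_start, edge_end, first_child, next_sibling.
NODE = struct.Struct("<iiii")


def write_nodes(path, nodes):
    """Write each node tuple as one fixed-size record."""
    with open(path, "wb") as f:
        for node in nodes:
            f.write(NODE.pack(*node))


def read_node(mm, index):
    """Decode the record at `index`; the OS pages in only the bytes touched."""
    return NODE.unpack_from(mm, index * NODE.size)


# Three toy records; -1 marks "no child" / "no sibling".
nodes = [(0, 0, 1, -1), (0, 3, -1, 2), (1, 3, -1, -1)]
path = os.path.join(tempfile.mkdtemp(), "nodes.bin")
write_nodes(path, nodes)

with open(path, "rb") as f:
    # Length 0 maps the whole file; pages are loaded lazily on access.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        second = read_node(mm, 1)
    finally:
        mm.close()
```

With fixed-size records, a node index translates directly into a byte offset, so following a child or sibling link costs one offset computation plus whatever page the operating system needs to bring in.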
Similar resources
Faster Protein Classification Using Suffix Trees
Motivation: Position-specific scoring matrices have been used extensively to recognize highly conserved protein regions. We present a method for accelerating these searches using a suffix tree data structure computed from the sequences to be searched. Methods: Building on earlier work that allows evaluation of a scoring matrix to be stopped early, the suffix tree-based method excludes many prot...
Investigation of Local-Alignment Search for Large Biological Se
One of the most important applications of a string database is storing large gene sequences. Typically the database should support queries like exact matching and pattern matching. The main hindrance to any direct approach for solving this problem is the large amount of unstructured data (of the order of GB) involved. A promising index structure for this problem is the suffix tree. In 2001, Hun...
Generalized Suffix Trees for Biological Sequence Data: Applications and Implementation
This paper addresses applications of suffix trees and generalized suffix trees (GSTs) to biological sequence data analysis. We define a basic set of suffix tree and GST operations needed to support sequence data analysis. While those definitions are straightforward, the construction and manipulation of disk-based GST structures for large volumes of sequence data requires intricate design. GST pr...
ERA: Efficient Serial and Parallel Suffix Tree Construction for Very Long Strings
The suffix tree is a data structure for indexing strings. It is used in a variety of applications such as bioinformatics, time series analysis, clustering, text editing and data compression. However, when the string and the resulting suffix tree are too large to fit into the main memory, most existing construction algorithms become very inefficient. This paper presents a disk-based suffix tree ...
Obtaining Provably Good Performance from Suffix Trees in Secondary Storage
Designing external memory data structures for string databases is of significant recent interest due to the proliferation of biological sequence data. The suffix tree is an important indexing structure that provides optimal algorithms for memory-bound data. However, string B-trees provide the best known asymptotic performance in external memory for substring search and update operations. Work on...