Some Thoughts on Using Annotated Suffix Trees for Natural Language Processing
Abstract
The paper defines an annotated suffix tree (AST), a data structure used to calculate and store the frequencies of all fragments of a given string or collection of strings. The AST is associated with a string-to-text scoring that takes all fuzzy matches into account. We show how the AST and the AST scoring can be used for Natural Language Processing tasks.
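The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it uses an uncompressed suffix trie for simplicity, and the names `Node`, `build_ast`, `fragment_frequency`, and `score` are hypothetical. The scoring rule shown here (average conditional probability along each partial match, then averaged over all suffixes of the query) is an assumption, chosen as one plausible way to let partial matches contribute; the abstract itself does not state the formula.

```python
class Node:
    """One trie node: child links plus the frequency annotation."""

    def __init__(self):
        self.children = {}
        self.freq = 0


def build_ast(strings):
    """Insert every suffix of every string; each visited node's count grows by 1.

    After construction, the annotation of the node reached by a fragment
    equals the number of occurrences of that fragment in the collection.
    """
    root = Node()
    for s in strings:
        for start in range(len(s)):
            root.freq += 1  # the root counts the inserted suffixes
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    return root


def fragment_frequency(root, fragment):
    """Return how often `fragment` occurs, or 0 if it never occurs."""
    node = root
    for ch in fragment:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.freq


def score(root, query):
    """Assumed scoring: partial matches of every query suffix contribute.

    For each suffix of `query`, walk the trie as far as characters match,
    average the conditional probabilities freq(child) / freq(parent) along
    the matched path, then average those values over all suffixes.
    """
    if not query:
        return 0.0
    total = 0.0
    for start in range(len(query)):
        node, matched, prob_sum = root, 0, 0.0
        for ch in query[start:]:
            child = node.children.get(ch)
            if child is None:
                break
            prob_sum += child.freq / node.freq
            node, matched = child, matched + 1
        if matched:
            total += prob_sum / matched
    return total / len(query)


tree = build_ast(["banana"])
print(fragment_frequency(tree, "ana"))  # 2 occurrences in "banana"
print(score(tree, "ana") > score(tree, "axa"))  # the closer match scores higher
```

Under this assumed rule, a query that shares no characters with the text scores 0, while a query that only partly matches still receives a nonzero score, which is the sense in which fuzzy matches are taken into account here. A practical implementation would compress single-child chains into edges to save memory.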
Related resources
Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of the Depth
Suffix trees are the most frequently used data structures in algorithms on words. In this paper, we consider the depth of a compact suffix tree, also known as the PAT tree, under some simple probabilistic assumptions. For a biased memoryless source, we prove that the limiting distribution for the depth in a PAT tree is the same as the limiting distribution for the depth in a PATRICIA trie, even...
Sparse compact directed acyclic word graphs
The suffix tree of string w represents all suffixes of w, and thus it supports full indexing of w for exact pattern matching. On the other hand, a sparse suffix tree of w represents only a subset of the suffixes of w, and therefore it supports sparse indexing of w. There has been a wide range of applications of sparse suffix trees, e.g., natural language processing and biological sequence analy...
Parallel Suffix Arrays for Linguistic Pattern Search
The paper presents the results of an analysis of the merits and problems of using suffix arrays as an index data structure for annotated natural-language corpora. It shows how multiple suffix arrays can be combined to represent layers of annotation, and how this enables matches for complex linguistic patterns to be identified in the corpus quickly and, for a large subclass of patterns, with gre...
On-Line Linear-Time Construction of Word Suffix Trees
Suffix trees are the key data structure for text string matching, and are used in a wide range of application areas such as bioinformatics and data compression. Sparse suffix trees are a kind of suffix tree that represents only a subset of the suffixes of the input string. In this paper we study word suffix trees, which are one variation of sparse suffix trees. Let D be a dictionary of words and w be a string i...
Suffix Trees as Language Models
Suffix trees are data structures that can be used to index a corpus. In this paper, we explore how some properties of suffix trees naturally provide the functionality of an n-gram language model with variable n. We explain how we leverage these properties of suffix trees for our Suffix Tree Language Model (STLM) implementation and explain how a suffix tree implicitly contains the data needed fo...