Ternary Tree Optimalization for n-gram Indexing

Authors

  • Daniel Robenek
  • Jan Platos
  • Václav Snásel
Abstract

N-gram indexing is used in many practical applications, such as spam detection, plagiarism detection, and comparison of DNA reads. Many data structures can be used for this purpose, each with different characteristics. This article uses the ternary search tree. It first introduces an improvement to the ternary tree that can save up to 43% of the required memory. In the second part, a new data structure, named the ternary forest, is proposed. The efficiency of the ternary forest is tested and compared with that of the ternary search tree and the two-level indexing ternary search tree.
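For readers unfamiliar with the base structure, the following is a minimal sketch of a plain ternary search tree used to count character n-grams. It is a textbook version only: it does not implement the memory optimization or the ternary forest described in the abstract, and all names (`Node`, `TernarySearchTree`, `index_ngrams`) are illustrative assumptions rather than anything taken from the article.

```python
from typing import Optional


class Node:
    """One tree node: a character plus three child links."""
    __slots__ = ("ch", "lo", "eq", "hi", "count")

    def __init__(self, ch: str) -> None:
        self.ch = ch
        self.lo: Optional["Node"] = None  # keys whose current char is smaller
        self.eq: Optional["Node"] = None  # continuation with the next char
        self.hi: Optional["Node"] = None  # keys whose current char is larger
        self.count = 0                    # occurrences of the key ending here


class TernarySearchTree:
    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def insert(self, key: str) -> None:
        """Insert key, or increment its counter if already present."""
        if not key:
            return
        if self.root is None:
            self.root = Node(key[0])
        node, i = self.root, 0
        while True:
            ch = key[i]
            if ch < node.ch:
                if node.lo is None:
                    node.lo = Node(ch)
                node = node.lo
            elif ch > node.ch:
                if node.hi is None:
                    node.hi = Node(ch)
                node = node.hi
            else:
                i += 1
                if i == len(key):
                    node.count += 1
                    return
                if node.eq is None:
                    node.eq = Node(key[i])
                node = node.eq

    def count(self, key: str) -> int:
        """Return how many times key was inserted (0 if absent)."""
        node, i = self.root, 0
        while node is not None and key:
            ch = key[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                i += 1
                if i == len(key):
                    return node.count
                node = node.eq
        return 0


def index_ngrams(text: str, n: int) -> TernarySearchTree:
    """Index every character n-gram of text (hypothetical helper)."""
    tree = TernarySearchTree()
    for i in range(len(text) - n + 1):
        tree.insert(text[i:i + n])
    return tree
```

Each node here holds three references and a counter, so per-node overhead dominates memory use for large n-gram sets; that overhead is the kind of cost a memory-saving variant would target.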

Similar resources

Efficient In-memory Data Structures for n-grams Indexing

Indexing n-gram phrases from text has many practical applications, such as plagiarism detection, comparison of DNA sequences, and spam detection. In this paper we describe several data structures, such as the hash table and the B+ tree, that could store n-grams for searching. We perform tests that show their advantages and disadvantages. One neglected data structure for this purpose, the ternary search tree, is deep...

A succinct data structure for self-indexing ternary relations

The representation of binary relations has been intensively studied and many different theoretical and practical representations have been proposed to answer the usual queries in multiple domains. However, ternary relations have not received as much attention, even though many real-world applications require the processing of ternary relations. In this paper we present a new compressed and self...

The Treegram Index: An Efficient Technique for Retrieval in Linguistic Treebanks

In computational linguistics, large tree databases tagged with morpho-syntactic information are in need of fast retrieval of multiway tree structures. To tackle this problem, we present a generalization of the classical n-gram indexing technique called Treegram indexing. As an application of treegram indexing, we describe the Venona retrieval system, which handles the BHt treebank containing 5...

Multiway-Tree Retrieval Based on Treegrams

Large tree databases as knowledge repositories become more and more important; a prominent example are the treebanks in computational linguistics: text corpora consisting of up to five million words tagged with syntactic information. Consequently, these large amounts of structured data pose the problem of fast tree retrieval: Given a database T of labeled multiway trees and a query tree q, find...

Publication date: 2014