Using Masks, Suffix Array-Based Data Structures And Multidimensional Arrays To Compute Positional Ngram Statistics From Corpora

Authors

  • Alexandre Gil
  • Gaël Dias
Abstract

This paper describes an implementation that computes positional ngram statistics (i.e., Frequency and Mutual Expectation) using masks, suffix array-based data structures and multidimensional arrays. Positional ngrams are ordered sequences of words that represent continuous or discontinuous substrings of a corpus. In particular, the positional ngram model has been applied successfully to the extraction of discontinuous collocations from large corpora. However, it is expensive to compute. For instance, 4,299,742 positional ngrams (n=1..7) can be generated from a 100,000-word corpus with a seven-word context window. In comparison, the classical ngram model would yield only 700,000 ngrams. Processing positional ngram statistics in reasonable time and space therefore requires considerable effort. Our solution has O(h(F) N log N) time complexity, where N is the corpus size and h(F) is a function of the window context.
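To make the idea of masks over a context window more concrete, here is a minimal Python sketch. It is not the authors' implementation: the function name `positional_ngrams`, the choice to anchor the window at its first word, and the use of `None` to mark skipped positions are all assumptions made for illustration, and the sketch does not attempt to reproduce the counts quoted in the abstract.

```python
from collections import Counter
from itertools import product


def positional_ngrams(tokens, window=7):
    """Count positional ngrams anchored at each corpus position.

    Assumptions (not taken from the paper): the window starts at the
    anchor word, the anchor is always kept, and each mask is a tuple of
    0/1 flags over the remaining window slots. Skipped slots are stored
    as None so the relative positions of the kept words are preserved.
    """
    counts = Counter()
    for start in range(len(tokens)):
        span = tokens[start:start + window]
        # One mask per subset of the slots that follow the anchor word.
        for mask in product((0, 1), repeat=len(span) - 1):
            gram = [span[0]]
            gram += [word if keep else None
                     for word, keep in zip(span[1:], mask)]
            # Drop trailing gaps so the same pattern always has the same key.
            while gram[-1] is None:
                gram.pop()
            counts[tuple(gram)] += 1
    return counts


if __name__ == "__main__":
    freq = positional_ngrams("a b a b".split(), window=3)
    for gram, count in sorted(freq.items(), key=str):
        print(gram, count)
```

Under these assumptions each full window contributes 2^(window-1) patterns, which illustrates why the positional model produces far more ngrams than the classical contiguous one.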


Similar resources

The Virtual Corpus Approach to Deriving Ngram Statistics from Large Scale Corpora

This paper reports our implementation of the Virtual Corpus approach to deriving ngram statistics for ngrams of any length from large-scale corpora based on the suffix array data structure. In order to enable the VC to accommodate corpora with a vocabulary of different size, we first convert corpus tokens into integer codes. To accelerate the processing, we employ a bucket-radixsort for sorting...


Entropy-Compressed Indexes for Multidimensional Pattern Matching

In this talk, we will discuss the challenges involved in developing multidimensional generalizations of compressed text indexing structures. These structures depend on some notion of Burrows-Wheeler transform (BWT) for multiple dimensions, though naive generalizations do not enable multidimensional pattern matching. We study the 2D case to possibly highlight combinatorial properties that do n...


Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus

Bigrams and trigrams are commonly used in statistical natural language processing; this paper will describe techniques for working with much longer ngrams. Suffix arrays were first introduced to compute the frequency and location of a substring (ngram) in a sequence (corpus) of length N . To compute frequencies over all N(N+1)/2 substrings in a corpus, the substrings are grouped into a manageab...
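As a rough illustration of how a suffix array gives the frequency and location of an ngram, consider the Python sketch below. It is a simplified example and not the method of the paper summarized above: the helper names `build_suffix_array` and `ngram_occurrences` are hypothetical, construction is done by naive sorting, and each query rebuilds the list of prefixes for clarity rather than speed.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(tokens):
    """Return suffix start positions sorted by the suffix they begin.

    Naive construction for illustration only; practical systems use
    much faster suffix array construction algorithms.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def ngram_occurrences(tokens, suffix_array, ngram):
    """Return the sorted start positions of `ngram` (a tuple of tokens).

    Suffixes sharing a prefix are adjacent in the suffix array, so the
    matches form one contiguous block found by binary search.
    """
    n = len(ngram)
    # Truncating sorted suffixes to n tokens keeps them in sorted order.
    prefixes = [tuple(tokens[i:i + n]) for i in suffix_array]
    lo = bisect_left(prefixes, ngram)
    hi = bisect_right(prefixes, ngram)
    return sorted(suffix_array[lo:hi])


if __name__ == "__main__":
    corpus = "to be or not to be".split()
    sa = build_suffix_array(corpus)
    positions = ngram_occurrences(corpus, sa, ("to", "be"))
    print(len(positions), positions)
```

The frequency of an ngram is simply the size of the matching block, and the block's entries are its locations in the corpus.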


An Efficient Language Model Using Double-Array Structures

Ngram language models tend to grow as the corpus size increases, and consume considerable resources. In this paper, we propose an efficient method for implementing ngram models based on double-array structures. First, we propose a method for representing backwards suffix trees using double-array structures and demonstrate its efficiency. Next, we propose two optimization methods fo...


Parallel Suffix Arrays for Corpus Exploration

This paper describes how recently developed techniques for suffix array construction and compression can be expanded to bring a new data structure, called parallel suffix array, into existence, which is suitable as an in-memory representation of large annotated corpora, enabling complex queries and fast extractions of the context of matching substrings. It is also shown how parallel suffix arra...




Publication date: 2003