Recursive n-gram hashing is pairwise independent, at best

Authors

  • Daniel Lemire
  • Owen Kaser
Abstract

Many applications use sequences of n consecutive symbols (n-grams). Hashing these n-grams can be a performance bottleneck. For more speed, recursive hash families compute hash values by updating previous values. We prove that recursive hash families cannot be more than pairwise independent. While hashing by irreducible polynomials is pairwise independent, our implementations either run in time O(n) or use an exponential amount of memory. As a more scalable alternative, we make hashing by cyclic polynomials pairwise independent by ignoring n − 1 bits. Experimentally, we show that hashing by cyclic polynomials is twice as fast as hashing by irreducible polynomials. We also show that randomized Karp-Rabin hash families are not pairwise independent.
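To make the recursive update concrete, here is a minimal sketch of hashing by cyclic polynomials, written from the description in the abstract rather than taken from the authors' implementation. The names `L`, `rotl`, `make_symbol_table`, `direct_hash`, `rolling_hashes`, and `drop_bits` are hypothetical, as are the 32-bit hash width and the choice to discard the n − 1 highest-order bits. Python's `random` module stands in for a source of truly random per-symbol values.

```python
import random

L = 32                      # hash width in bits (an assumption for this sketch)
MASK = (1 << L) - 1


def rotl(x, r):
    """Rotate the L-bit value x left by r positions."""
    r %= L
    return ((x << r) | (x >> (L - r))) & MASK


def make_symbol_table(alphabet, seed=None):
    """Assign each symbol a pseudo-random L-bit value (stand-in for a random h1)."""
    rng = random.Random(seed)
    return {c: rng.getrandbits(L) for c in sorted(alphabet)}


def direct_hash(gram, h1):
    """Hash one n-gram from scratch in O(n): XOR of rotated symbol values."""
    x = 0
    for c in gram:
        x = rotl(x, 1) ^ h1[c]
    return x


def rolling_hashes(text, n, h1):
    """Yield the hash of every n-gram of text, updating in O(1) per step."""
    if n < 1 or len(text) < n:
        return
    x = direct_hash(text[:n], h1)
    yield x
    for i in range(n, len(text)):
        outgoing, incoming = text[i - n], text[i]
        # Rotate the previous value, cancel the outgoing symbol, add the new one.
        x = rotl(x, 1) ^ rotl(h1[outgoing], n) ^ h1[incoming]
        yield x


def drop_bits(x, n):
    """Keep the low L - n + 1 bits, i.e. ignore n - 1 bits (requires n <= L)."""
    return x & ((1 << (L - n + 1)) - 1)


if __name__ == "__main__":
    text, n = "abracadabra", 3
    h1 = make_symbol_table(set(text), seed=1)
    for i, x in enumerate(rolling_hashes(text, n, h1)):
        print(text[i:i + n], format(drop_bits(x, n), "08x"))
```

Each update costs one rotation of the running value plus two XORs, regardless of n, which is the kind of recursive computation the abstract refers to. The sketch only illustrates the mechanics; it does not by itself demonstrate the independence properties proved in the paper.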

Similar resources

Recursive Hashing and One-Pass, One-Hash n-Gram Count Estimation

Many applications use sequences of n consecutive symbols (n-grams). We review n-gram hashing and prove that recursive hash families are pairwise independent at best. We prove that hashing by irreducible polynomials is pairwise independent whereas hashing by cyclic polynomials is quasi-pairwise independent: we make it pairwise independent by discarding n − 1 bits. One application of hashing is to...

One-Pass, One-Hash n-Gram Statistics Estimation

In multimedia, text or bioinformatics databases, applications query sequences of n consecutive symbols called n-grams. Estimating the number of distinct n-grams is a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire an unassuming algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashin...

Pairwise Rotation Hashing for High-dimensional Features

Binary hashing is widely used for effective approximate nearest neighbor search. Even though various binary hashing methods have been proposed, very few methods are feasible for extremely high-dimensional features often used in visual tasks today. We propose a novel highly sparse linear hashing method based on pairwise rotations. The encoding cost of the proposed algorithm is O(n log n) for n-d...

The universality of iterated hashing over variable-length strings

Iterated hash functions process strings recursively, one character at a time. At each iteration, they compute a new hash value from the preceding hash value and the next character. We prove that iterated hashing can be pairwise independent, but never 3-wise independent. We show that it can be almost universal over strings much longer than the number of hash values; we bound the maximal string le...

On the independent spanning trees of recursive circulant graphs G(cd^m, d) with d > 2

Two spanning trees of a graph G are said to be independent if they are rooted at the same vertex r, and for each vertex v ≠ r in G, the two different paths from v to r, one path in each tree, are internally disjoint. A set of spanning trees of G is independent if they are pairwise independent. A recursive circulant graph G(N, d) has N = cd^m vertices labeled from 0 to N − 1, where d > 2, m > 1,...

Journal title:
  • Computer Speech & Language

Volume 24  Issue 

Pages  -

Publication date 2010