Recursive n-gram hashing is pairwise independent, at best
Abstract
Many applications use sequences of n consecutive symbols (n-grams). Hashing these n-grams can be a performance bottleneck. For more speed, recursive hash families compute hash values by updating previous values. We prove that recursive hash families cannot be more than pairwise independent. While hashing by irreducible polynomials is pairwise independent, our implementations either run in time O(n) or use an exponential amount of memory. As a more scalable alternative, we make hashing by cyclic polynomials pairwise independent by ignoring n − 1 bits. Experimentally, we show that hashing by cyclic polynomials is twice as fast as hashing by irreducible polynomials. We also show that randomized Karp-Rabin hash families are not pairwise independent.
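The recursive update described in the abstract can be illustrated with a minimal Python sketch of hashing by cyclic polynomials. It is an illustration under stated assumptions, not the authors' implementation: the 32-bit word size `L`, the seeded random symbol table, the helper names (`rotl`, `make_symbol_hasher`, `direct_hash`, `rolling_hashes`, `drop_bits`), and the choice to discard the low-order n − 1 bits are all hypothetical.

```python
import random

L = 32                      # word size in bits (assumed for illustration)
MASK = (1 << L) - 1


def rotl(x, r):
    """Rotate the L-bit word x left by r positions."""
    r %= L
    return ((x << r) | (x >> (L - r))) & MASK


def make_symbol_hasher(seed=0):
    """Return h1, a lazily filled table mapping each symbol to a random L-bit word."""
    rng = random.Random(seed)
    table = {}

    def h1(symbol):
        if symbol not in table:
            table[symbol] = rng.getrandbits(L)
        return table[symbol]

    return h1


def direct_hash(gram, h1):
    """Hash one n-gram from scratch in O(n): rotate, then XOR in each symbol."""
    h = 0
    for symbol in gram:
        h = rotl(h, 1) ^ h1(symbol)
    return h


def rolling_hashes(seq, n, h1):
    """Yield the hash of every n-gram of seq, updating the previous value."""
    if len(seq) < n:
        return
    h = direct_hash(seq[:n], h1)
    yield h
    for i in range(n, len(seq)):
        # Rotate the old hash, cancel the symbol leaving the window
        # (it has been rotated n times by now), and add the new symbol.
        h = rotl(h, 1) ^ rotl(h1(seq[i - n]), n) ^ h1(seq[i])
        yield h


def drop_bits(h, n):
    """Ignore n - 1 bits of the hash value (here the low-order ones; needs n <= L)."""
    return h >> (n - 1)
```

For example, `list(rolling_hashes("abracadabra", 3, make_symbol_hasher()))` produces nine values, each equal to `direct_hash` applied to the corresponding 3-gram. Each update costs a constant number of word operations regardless of n.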
Related articles
Recursive Hashing and One-Pass, One-Hash n-Gram Count Estimation
Many applications use sequences of n consecutive symbols (n-grams). We review n-gram hashing and prove that recursive hash families are pairwise independent at best. We prove that hashing by irreducible polynomials is pairwise independent whereas hashing by cyclic polynomials is quasi-pairwise independent: we make it pairwise independent by discarding n − 1 bits. One application of hashing is to...
One-Pass, One-Hash n-Gram Statistics Estimation
In multimedia, text or bioinformatics databases, applications query sequences of n consecutive symbols called n-grams. Estimating the number of distinct n-grams is a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire an unassuming algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashin...
Pairwise Rotation Hashing for High-dimensional Features
Binary hashing is widely used for effective approximate nearest neighbor search. Even though various binary hashing methods have been proposed, very few methods are feasible for the extremely high-dimensional features often used in visual tasks today. We propose a novel highly sparse linear hashing method based on pairwise rotations. The encoding cost of the proposed algorithm is O(n log n) for n-d...
The universality of iterated hashing over variable-length strings
Iterated hash functions process strings recursively, one character at a time. At each iteration, they compute a new hash value from the preceding hash value and the next character. We prove that iterated hashing can be pairwise independent, but never 3-wise independent. We show that it can be almost universal over strings much longer than the number of hash values; we bound the maximal string le...
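The iteration pattern this abstract describes (a new hash value computed from the preceding hash value and the next character) can be sketched as a fold. The polynomial step, the modulus `P`, and the names `iterated_hash` and `make_poly_step` are assumptions chosen for illustration; this is one generic instance of the pattern, not the specific family the cited article analyzes, and no independence property is claimed for it.

```python
from functools import reduce

P = (1 << 61) - 1   # a Mersenne prime, used here as an assumed modulus


def iterated_hash(string, step, initial=0):
    """Fold `step` over the string: h_i = step(h_{i-1}, c_i)."""
    return reduce(step, string, initial)


def make_poly_step(t):
    """Return a hypothetical step function h, c -> (h * t + ord(c)) mod P."""
    def step(h, ch):
        return (h * t + ord(ch)) % P
    return step
```

Because each value depends only on the previous value and the next character, the hash of a string extended by one character is `step(iterated_hash(prefix, step), ch)`, so extending a string costs one step rather than a full pass.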
On the independent spanning trees of recursive circulant graphs G(cd^m, d) with d > 2
Two spanning trees of a graph G are said to be independent if they are rooted at the same vertex r, and for each vertex v ≠ r in G, the two different paths from v to r, one path in each tree, are internally disjoint. A set of spanning trees of G is independent if they are pairwise independent. A recursive circulant graph G(N, d) has N = cd^m vertices labeled from 0 to N − 1, where d > 2, m > 1,...
Journal title:
- Computer Speech & Language
Volume 24
Publication date: 2010