One Permutation Hashing

نویسندگان

  • Ping Li
  • Art B. Owen
  • Cun-Hui Zhang
چکیده

Abstract Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear time nearneighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying (e.g.,) k = 200 to 500 permutations on the data. This paper presents a simple solution called one permutation hashing. Conceptually, given a binary data matrix, we permute the columns once and divide the permuted columns evenly into k bins; and we store, for each data vector, the smallest nonzero location in each bin. The probability analysis illustrates that this one permutation scheme should perform similarly to the original (k-permutation) minwise hashing. Our experiments with training SVM and logistic regression confirm that one permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme. See more details in arXiv:1208.1259.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search

The query complexity of locality sensitive hashing (LSH) based similarity search is dominated by the number of hash evaluations, and this number grows with the data size (Indyk & Motwani, 1998). In industrial applications such as search where the data are often high-dimensional and binary (e.g., text n-grams), minwise hashing is widely adopted, which requires applying a large number of permutat...

متن کامل

Sufficient conditions for sound hashing using a truncated permutation

In this paper we give a generic security proof for hashing modes that make use of an underlying fixed-length permutation. We formulate a set of five simple conditions, which are easy to implement and to verify, for such a hashing mode to be sound. These hashing modes include tree hashing modes and sequential hashing modes. We provide a proof that for any hashing mode satisfying the five conditi...

متن کامل

Improved Densification of One Permutation Hashing

The existing work on densification of one permutation hashing [24] reduces the query processing cost of the (K,L)-parameterized Locality Sensitive Hashing (LSH) algorithm with minwise hashing, from O(dKL) to merely O(d + KL), where d is the number of nonzeros of the data vector, K is the number of hashes in each hash table, and L is the number of hash tables. While that is a substantial improve...

متن کامل

Preimage Attack on Parallel FFT-Hashing

Parallel FFT-Hashing was designed by C. P. Schnorr and S. Vaudenay in 1993. The function is a simple and light weight hash algorithm with 128-bit digest. Its basic component is a multi-permutation which helps in proving its resistance to collision attacks. In this work we show a preimage attack on Parallel FFT-Hashing with complexity 2 + 2 and memory 2 which is less than the generic complexity ...

متن کامل

Permutation grouping: intelligent Hash function design for audio & image retrieval

The combination of MinHash-based signatures and LocalitySensitive Hashing (LSH) schemes has been effectively used for finding approximate matches in very large audio and image retrieval systems. In this study, we introduce the idea of permutation-grouping to intelligently design the hash functions that are used to index the LSH tables. This helps to overcome the inefficiencies introduced by has...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012