Robust and Efficient Locality Sensitive Hashing for Nearest Neighbor Search in Large Data Sets
Abstract
Locality sensitive hashing (LSH) has been used extensively as a basis for many data retrieval applications. However, previous approaches, such as random projection and multi-probe hashing, can exhibit query complexity as high as Θ(n) when the underlying data distribution is highly skewed. The cause is an imbalance in the number of points stored in each bucket, which slows queries on large data sets. In this paper, we introduce a distribution-free LSH algorithm that addresses this problem by maintaining a nearly uniform number of points per bucket. As a consequence, our algorithm needs fewer hash tables, and is hence memory-efficient, while still achieving high accuracy. Through extensive experiments on large data sets, we show that our algorithm retrieves nearest neighbors accurately and faster than other standard LSH algorithms, and that it maintains a nearly uniform number of points per bucket.
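The bucket imbalance described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm, which the abstract does not specify. It only contrasts fixed-width bucketing of a single random projection with a hypothetical balancing scheme that places bucket boundaries at empirical quantiles. The function names, the log-normal test data, and all parameters are assumptions chosen for illustration.

```python
import random
from bisect import bisect_right
from collections import Counter


def project(point, direction):
    """Dot product of a point with a random projection direction."""
    return sum(p * d for p, d in zip(point, direction))


def fixed_width_buckets(values, width):
    """Random-projection style bucketing: bucket id = floor(value / width)."""
    return [int(v // width) for v in values]


def quantile_buckets(values, num_buckets):
    """Hypothetical balancing scheme: boundaries at empirical quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]
    # Bucket id = number of cut points less than or equal to the value.
    return [bisect_right(cuts, v) for v in values]


def max_bucket_share(bucket_ids):
    """Fraction of all points that fall into the fullest bucket."""
    counts = Counter(bucket_ids)
    return max(counts.values()) / len(bucket_ids)


rng = random.Random(0)
dim, n, num_buckets = 8, 5000, 16

# Skewed data (an assumption for this demo): heavy-tailed log-normal coordinates.
points = [[rng.lognormvariate(0.0, 1.5) for _ in range(dim)] for _ in range(n)]
direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
projected = [project(p, direction) for p in points]

# Fixed-width buckets spanning the same range as the projected values.
width = (max(projected) - min(projected)) / num_buckets
fixed_share = max_bucket_share(fixed_width_buckets(projected, width))
quantile_share = max_bucket_share(quantile_buckets(projected, num_buckets))

print(f"fullest bucket, fixed width: {fixed_share:.3f} of all points")
print(f"fullest bucket, quantiles:   {quantile_share:.3f} of all points")
```

A query that lands in the fullest bucket must scan every point in it, so the share held by that bucket is a rough proxy for worst-case query cost. With balanced buckets that share stays close to 1/num_buckets, whereas fixed-width buckets on skewed data can leave a large fraction of the points in a single bucket.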
Similar resources
lsh, Nearest neighbor search in high dimensions
Calculating all pairwise distances is O(n²) in memory and time, and finding the nearest neighbor is O(n) in time. Tree indexing techniques such as the kd-tree [2] were developed to cope with large n; however, their performance quickly breaks down for p > 3 [3]. Locality sensitive hashing (LSH) [3] is a technique for generating hash numbers from high dimensional data, such that nearby points have identical hashe...
Locality-Sensitive Hashing for Data with Categorical and Numerical Attributes Using Dual Hashing
Locality-sensitive hashing techniques have been developed to efficiently handle nearest neighbor searches and similar pair identification problems for large volumes of high-dimensional data. This study proposes a locality-sensitive hashing method that can be applied to nearest neighbor search problems for data sets containing both numerical and categorical attributes. The proposed method makes ...
LSH At Large - Distributed KNN Search in High Dimensions
We consider K-Nearest Neighbor search for high dimensional data in large-scale structured Peer-to-Peer networks. We present an efficient mapping scheme based on p-stable Locality Sensitive Hashing to assign hash buckets to peers in a Chord-style overlay network. To minimize network traffic, we process queries in an incremental top-K fashion leveraging on a locality preserving mapping to the pee...
A Layered Locality Sensitive Hashing based Sequence Similarity Search Algorithm for Web Sessions
In this article we propose a Layered Locality Sensitive Hashing Algorithm to perform similarity search on the web log sequence data. Locality Sensitive Hashing has been found to be an efficient technique for the approximate nearest neighbor search over a large database, as it has sub-linear dependence on the data size even for high dimension. Mining the large web log data to provide customised ...
Nearest Neighbor Search in the Metric Space of a Complex Network for Community Detection
The objective of this article is to bridge the gap between two important research directions: (1) nearest neighbor search, which is a fundamental computational tool for large data analysis; and (2) complex network analysis, which deals with large real graphs but is generally studied via graph theoretic analysis or spectral analysis. In this article, we have studied the nearest neighbor search p...