Cosine Similarity Search with Multi Index Hashing
نویسندگان
چکیده
Due to rapid development of the Internet, recent years have witnessed an explosion in the rate of data generation. Dealing with data at current scales brings up unprecedented challenges. From the algorithmic view point, executing existing linear algorithms in information retrieval and machine learning on such tremendous amounts of data incur intolerable computational and storage costs. To address this issue, there is a growing interest to map data points in large-scale datasets to binary codes. This can significantly reduce the storage complexity of large-scale datasets. However, one of the most compelling reasons for using binary codes or any discrete representation is that they can be used as direct indices into a hash table. Incorporating hash table offers fast query execution; one can look up the nearby buckets in a hash table populated with binary codes to retrieve similar items. Nonetheless, if binary codes are compared in terms of the cosine similarity rather than the Hamming distance, there is no fast exact sequential procedure to find the K closest items to the query other than the exhaustive search. Given a large dataset of binary codes and a binary query, the problem that we address is to efficiently find K closest codes in the dataset that yield the largest cosine similarities to the query. To handle this issue, we first elaborate on the relation between the Hamming distance and the cosine similarity. This allows finding the sequence of buckets to check in the hash table. Having this sequence, we propose a multi-index hashing approach that can increase the search speed up to orders of magnitude in comparison to the exhaustive search and even approximation methods such as LSH. We empirically evaluate the performance of the proposed algorithm on real world datasets.
منابع مشابه
Fast Information-Theoretic Agglomerative Co-clustering
Our algorithm iteratively merges those clusters whose merge yields a lower objective cost. However, operations such as finding nearest neighbors or closest pair of clusters are expensive, especially in high dimensions. To quickly find highly similar clusters to be merged, we exploit the Locality-Sensitive Hashing (LSH) technique, which we briefly describe in this section. Simply put, LSH [2] is...
متن کاملAn Adaptive Multi-level Hashing Structure for Fast Approximate Similarity Search
Fast information retrieval is an essential task in data management, mainly due to the increasing availability of data. To address this problem, database researchers have developed indexing techniques to logically organize elements from large datasets in order to answer queries efficiently. In this context, an approximate similarity search algorithm known as Locality Sensitive Hashing (LSH) was ...
متن کاملRandom projections for large-scale speaker search
This paper describes a system for indexing acoustic feature vectors for large-scale speaker search using random projections. Given one or more target feature vectors, large-scale speaker search enables returning similar vectors (in a nearest-neighbors fashion) in sublinear time. The speaker feature space is comprised of i-vectors, derived from Gaussian Mixture Model supervectors. The index and ...
متن کاملImproved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS)
Recently we showed that the problem of Maximum Inner Product Search (MIPS) is efficient and it admits provably sub-linear hashing algorithms. In [23], we used asymmetric transformations to convert the problem of approximate MIPS into the problem of approximate near neighbor search which can be efficiently solved using L2-LSH. In this paper, we revisit the problem of MIPS and argue that the quan...
متن کاملClustering is Efficient for Approximate Maximum Inner Product Search
Efficient Maximum Inner Product Search (MIPS) is an important task that has a wide applicability in recommendation systems and classification with a large number of classes. Solutions based on locality-sensitive hashing (LSH) as well as tree-based solutions have been investigated in the recent literature, to perform approximate MIPS in sublinear time. In this paper, we compare these to another ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1610.00574 شماره
صفحات -
تاریخ انتشار 2016