Low-Quality Dimension Reduction and High-Dimensional Approximate Nearest Neighbor
Abstract
The approximate nearest neighbor problem (ε-ANN) in Euclidean settings is a fundamental question, which has been addressed by two main approaches: data-dependent space partitioning techniques perform well when the dimension is relatively low, but are affected by the curse of dimensionality. On the other hand, locality sensitive hashing has polynomial dependence in the dimension, sublinear query time with an exponent inversely proportional to (1 + ε)², and subquadratic space requirement. We generalize the Johnson-Lindenstrauss Lemma to define "low-quality" mappings to a Euclidean space of significantly lower dimension, such that they satisfy a requirement weaker than approximately preserving all distances or even preserving the nearest neighbor. This mapping guarantees, with high probability, that an approximate nearest neighbor lies among the k approximate nearest neighbors in the projected space. These can be efficiently retrieved, using only linear storage, by a data structure such as BBD-trees. Our overall algorithm, given n points in dimension d, achieves space usage in O(dn), preprocessing time in O(dn log n), and query time in O(dn^ρ log n), where ρ is proportional to 1 − 1/log log n, for fixed ε ∈ (0, 1). The dimension reduction is larger if one assumes that pointsets possess some structure, namely bounded expansion rate. We implement our method and present experimental results in up to 500 dimensions and 10^6 points, which show that the practical performance is better than predicted by the theoretical analysis. In addition, we compare our approach with E2LSH.

1998 ACM Subject Classification: F.2.2 [Analysis of algorithms and problem complexity] Geometrical problems and computations
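The pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: a Gaussian random projection stands in for the low-quality mapping, and a brute-force scan of the projected points stands in for the BBD-tree; the names `build` and `query` and all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def build(points, target_dim):
    """Project n points from dimension d down to target_dim with a
    Johnson-Lindenstrauss-style random Gaussian map (illustrative stand-in
    for the paper's low-quality mapping)."""
    d = points.shape[1]
    proj = rng.normal(size=(d, target_dim)) / np.sqrt(target_dim)
    return proj, points @ proj

def query(q, points, proj, projected, k):
    """Retrieve k candidate neighbors in the projected space (brute force
    here; a BBD-tree in the paper), then refine by returning the candidate
    closest to q in the original space."""
    qp = q @ proj
    dists = np.linalg.norm(projected - qp, axis=1)
    candidates = np.argpartition(dists, k)[:k]
    orig_dists = np.linalg.norm(points[candidates] - q, axis=1)
    return candidates[np.argmin(orig_dists)]
```

The key point the sketch illustrates is that the projection need not preserve all distances: it only has to place some approximate nearest neighbor among the k projected candidates, after which a single refinement pass in the original space recovers it.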
Similar works
Minimax rates for cost-sensitive learning on manifolds with approximate nearest neighbours
We study the approximate nearest neighbour method for cost-sensitive classification on low-dimensional manifolds embedded within a high-dimensional feature space. We determine the minimax learning rates for distributions on a smooth manifold, in a cost-sensitive setting. This generalises a classic result of Audibert and Tsybakov. Building upon recent work of Chaudhuri and Dasgupta, we prove that...
EFFECT OF THE NEXT-NEAREST NEIGHBOR INTERACTION ON THE ORDER-DISORDER PHASE TRANSITION
In this work, one and two-dimensional lattices are studied theoretically by a statistical mechanical approach. The nearest and next-nearest neighbor interactions are both taken into account, and the approximate thermodynamic properties of the lattices are calculated. The results of our calculations show that: (1) even though the next-nearest neighbor interaction may have an insignificant ef...
Fast Parallel Estimation of High Dimensional Information Theoretical Quantities with Low Dimensional Random Projection Ensembles
• Goal: estimation of high dimensional information theoretical quantities (entropy, mutual information, divergence).
• Problem: computation/estimation is quite slow.
• Consistent estimation is possible by nearest neighbor (NN) methods [1] → pairwise distances of sample points:
  – expensive in high dimensions [2],
  – approximate isometric embedding into low dimension is possible (Johnson-Lindenstrauss)...
Nearest Neighbor Search in Multidimensional Spaces Depth Oral Report
The Nearest Neighbor Search problem is defined as follows: given a set P of n points, preprocess the points so as to efficiently answer queries that require finding the closest point in P to a query point q. If we are willing to settle for a point that is almost as close as the nearest neighbor, then we can relax the problem to the approximate Nearest Neighbor Search. Nearest Neighbor Search (exact ...
The Analysis of a Probabilistic Approach to Nearest Neighbor Searching
Given a set S of n data points in some metric space and a query point q in that space, a nearest neighbor query asks for the nearest point of S to q. Throughout we will assume that the space is real d-dimensional space ℝ^d, and the metric is Euclidean distance. The goal is to preprocess S into a data structure so that such queries can be answered efficiently. Nearest neighbor searching has ap...