On Index-Free Similarity Search in Metric Spaces
نویسندگان
چکیده
Metric access methods (MAMs) serve as a tool for speeding similarity queries. However, all MAMs developed so far are index-based; they need to build an index on a given database. The indexing itself is either static (the whole database is indexed at once) or dynamic (insertions/deletions are supported), but there is always a preprocessing step needed. In this paper, we propose D-file, the first MAM that requires no indexing at all. This feature is especially beneficial in domains like data mining, streaming databases, etc., where the production of data is much more intensive than querying. Thus, in such environments the indexing is the bottleneck of the entire production/querying scheme. The idea of D-file is an extension of the trivial sequential file (an abstraction over the original database, actually) by so-called D-cache. The D-cache is a main-memory structure that keeps track of distance computations spent by processing all similarity queries so far (within a runtime session). Based on the distances stored in D-cache, the D-file can cheaply determine lower bounds of some distances while the distances alone have not to be explicitly computed, which results in faster queries. Our experimental evaluation shows that query efficiency of D-file is comparable to the index-based state-of-the-art MAMs, however, for zero indexing costs.
منابع مشابه
Similarity Join in Metric Spaces Using eD-Index
Similarity join in distance spaces constrained by the metric postulates is the necessary complement of more famous similarity range and the nearest neighbor search primitives. However, the quadratic computational complexity of similarity joins prevents from applications on large data collections. We present the eD-Index, an extension of D-index, and we study an application of the eDIndex to imp...
متن کاملAccess Structures for Advanced Similarity Search in Metric Spaces
Similarity retrieval is an important paradigm for searching in environments where exact match has little meaning. Moreover, in order to enlarge the set of data types for which the similarity search can efficiently be performed, the notion of mathematical metric space provides a useful abstraction for similarity. In this paper we consider the problem of organizing and searching large data-sets f...
متن کاملAspects of Metric Spaces in Computation
Metric spaces, which generalise the properties of commonly-encountered physical and abstract spaces into a mathematical framework, frequently occur in computer science applications. Three major kinds of questions about metric spaces are considered here: the intrinsic dimensionality of a distribution, the maximum number of distance permutations, and the difficulty of reverse similarity search. I...
متن کاملResource Description and Selection for Range Query Processing in General Metric Spaces
Similarity search in general metric spaces is a key aspect in many application fields. Metric space indexing provides a flexible indexing paradigm and is solely based on the use of a distance metric. No assumption is made about the representation of the database objects. Nowadays, ever-increasing data volumes require large-scale distributed retrieval architectures. Here, local and global indexi...
متن کاملMargin-Based Pivot Selection for Similarity Search Indexes
When developing an index for a similarity search in metric spaces, how to divide the space for effective search pruning is a fundamental issue. We present Maximal Metric Margin Partitioning (MMMP), a partitioning scheme for similarity search indexes. MMMP divides the data based on its distribution pattern, especially for the boundaries of clusters. A partitioning boundary created by MMMP is lik...
متن کاملDAHC-tree: An Effective Index for Approximate Search in High-Dimensional Metric Spaces
Similarity search in high-dimensional metric spaces is a key operation in many applications, such as multimedia databases, image retrieval, object recognition, and others. The high dimensionality of the data requires special index structures to facilitate the search. A problem regarding the creation of suitable index structures for highdimensional data is the relationship between the geometry o...
متن کامل