An approximate algorithm for top-k closest pairs join query in large high dimensional data
Abstract
In this paper we present a novel approximate algorithm to calculate the top-k closest pairs join query of two large and high dimensional data sets. The algorithm has worst case time complexity $O(dnk)$ and space complexity $O(nd)$ and guarantees a solution within an $O(d^{1+1/t})$ factor of the exact one, where $t \in \{1, 2, \ldots, \infty\}$ denotes the Minkowski metric $L_t$ of interest and $d$ the dimensionality. It makes use of the concept of space filling curve to establish an order between the points of the space and performs at most $d + 1$ sorts and scans of the two data sets. During a scan, each point from one data set is compared with its closest points, according to the space filling curve order, in the other data set, and points whose contribution to the solution has already been analyzed are detected and eliminated. Experimental results on real and synthetic data sets show that our algorithm behaves as an exact algorithm in low dimensional spaces; it is able to prune the entire (or a considerable fraction of the) data set even for high dimensions if certain separation conditions are satisfied; in any case it returns a solution within a small error to the exact one. © 2004 Elsevier B.V. All rights reserved.
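The sort-and-scan idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm. It assumes points with coordinates in [0, 1], substitutes a Z-order (Morton) key for whichever space filling curve the authors use, and omits both the pruning step and the approximation guarantee. The names `minkowski`, `morton_key`, `approx_topk_closest_pairs`, and the `window` and `bits` parameters are hypothetical.

```python
import bisect
import heapq


def minkowski(p, q, t=2.0):
    """L_t distance between two points; t=float('inf') gives the max metric."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if t == float("inf"):
        return max(diffs)
    return sum(x ** t for x in diffs) ** (1.0 / t)


def morton_key(p, bits=10):
    """Z-order key of a point in [0, 1]^d (a stand-in curve, an assumption)."""
    top = (1 << bits) - 1
    cells = [min(int(c * (1 << bits)), top) for c in p]
    key = 0
    # Interleave the bits of all coordinates, most significant bit first.
    for b in range(bits - 1, -1, -1):
        for c in cells:
            key = (key << 1) | ((c >> b) & 1)
    return key


def approx_topk_closest_pairs(A, B, k, t=2.0, window=4, bits=10):
    """Heuristic top-k closest pairs between A and B via curve-order scans.

    Performs d + 1 sort-and-scan passes, each with a different shift of the
    data, and compares every point of A with its `window` neighbours on each
    side in the curve order of B. Returns a sorted list of (distance, i, j).
    """
    d = len(A[0])
    candidates = {}  # (i, j) -> distance, deduplicated across passes
    for s in range(d + 1):
        shift = s / (d + 1)

        def key(p):
            # Shift every coordinate, then rescale back into [0, 1).
            return morton_key([(c + shift) / 2 for c in p], bits)

        order_b = sorted((key(q), j) for j, q in enumerate(B))
        keys_b = [kq for kq, _ in order_b]
        for i, p in enumerate(A):
            pos = bisect.bisect_left(keys_b, key(p))
            for _, j in order_b[max(0, pos - window):pos + window]:
                if (i, j) not in candidates:
                    candidates[(i, j)] = minkowski(p, B[j], t)
    return heapq.nsmallest(k, ((dist, i, j) for (i, j), dist in candidates.items()))
```

A call such as `approx_topk_closest_pairs(A, B, k=2)` returns up to `k` triples ordered by increasing distance. Because only curve-order neighbours are examined, the result can miss true closest pairs when `window` is small relative to the size of `B`.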
Similar resources
Supporting KDD Applications by the k-Nearest Neighbor Join
The similarity join has become an important database primitive to support similarity search and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Well-known are two types of the similarity join, the distance range join where the user defines a distance threshold for the join, and the closest point query or k-distance ...
High Performance Data Mining Using the Nearest Neighbor Join
The similarity join has become an important database primitive to support similarity search and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Well-known are two types of the similarity join, the distance range join where the user defines a distance threshold for the join, and the closest point query or k-distance ...
Approximate Algorithms for Distance-Based Queries in High-Dimensional Data Spaces Using R-Trees
In modern database applications the similarity or dissimilarity of complex objects is examined by performing distance-based queries (DBQs) on data of high dimensionality. The R-tree and its variations are commonly cited multidimensional access methods that can be used for answering such queries. Although the related algorithms work well for low-dimensional data spaces, their performance degrad...
Processing Distance Join Queries with Constraints
Distance-join queries are used in many modern applications, such as spatial databases, spatiotemporal databases, and data mining. One of the most common distance-join queries is the closest-pair query. Given two datasets DA and DB the closest-pair query (CPQ) retrieves the pair (a,b), where a ∈ DA and b ∈ DB, having the smallest distance between all pairs of objects. An extension to this proble...
Cost models for distance joins queries using R-trees
The K-Closest-Pairs Query (K-CPQ), a type of distance join in spatial databases, discovers the K pairs of objects formed from two different datasets with the K smallest distances. Recently, branch-and-bound algorithms based on R-trees have been developed in order to answer K-CPQs efficiently. For query optimization purposes, analytical models are needed to estimate the processing cost of a spec...
Journal title:
- Data Knowl. Eng.
Volume: 53
Publication date: 2005