An approximate algorithm for top-k closest pairs join query in large high dimensional data

Authors

  • Fabrizio Angiulli
  • Clara Pizzuti
Abstract

In this paper we present a novel approximate algorithm to calculate the top-k closest pairs join query of two large and high dimensional data sets. The algorithm has worst case time complexity $O(dnk)$ and space complexity $O(nd)$ and guarantees a solution within an $O(d^{1+1/t})$ factor of the exact one, where $t \in \{1, 2, \ldots, \infty\}$ denotes the Minkowski metrics $L_t$ of interest and $d$ the dimensionality. It makes use of the concept of space filling curve to establish an order between the points of the space and performs at most $d + 1$ sorts and scans of the two data sets. During a scan, each point from one data set is compared with its closest points, according to the space filling curve order, in the other data set and points whose contribution to the solution has already been analyzed are detected and eliminated. Experimental results on real and synthetic data sets show that our algorithm behaves as an exact algorithm in low dimensional spaces; it is able to prune the entire (or a considerable fraction of the) data set even for high dimensions if certain separation conditions are satisfied; in any case it returns a solution within a small error to the exact one. © 2004 Elsevier B.V. All rights reserved.
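
The sort-and-scan idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm. It uses a Z-order (Morton) curve as a simpler stand-in for whatever space filling curve the paper uses, it assumes non-negative integer coordinates below 2^bits, and it omits the pruning step and therefore carries no approximation guarantee. All names (`minkowski`, `morton_key`, `approx_top_k_closest_pairs`) and the parameters `window` and `bits` are hypothetical.

```python
import bisect
import heapq


def minkowski(p, q, t):
    """L_t distance between two points; t = float('inf') gives the max-norm."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if t == float("inf"):
        return max(diffs)
    return sum(x ** t for x in diffs) ** (1.0 / t)


def morton_key(point, bits):
    """Z-order key: interleave the bits of non-negative integer coordinates."""
    key = 0
    for b in range(bits - 1, -1, -1):
        for c in point:
            key = (key << 1) | ((c >> b) & 1)
    return key


def approx_top_k_closest_pairs(A, B, k, t=2, window=2, bits=16):
    """Return up to k tuples (distance, index_in_A, index_in_B), ascending.

    Assumption: coordinates are integers in [0, 2**bits). The data sets are
    sorted and scanned d + 1 times, each time after shifting every point by
    the same offset along the main diagonal.
    """
    d = len(A[0])
    step = (1 << bits) // (d + 1)
    candidates = {}  # (index_in_A, index_in_B) -> distance
    for j in range(d + 1):
        offset = j * step

        def key(p):
            # One extra bit so that shifted coordinates still fit.
            return morton_key([c + offset for c in p], bits + 1)

        order_b = sorted(range(len(B)), key=lambda i: key(B[i]))
        keys_b = [key(B[i]) for i in order_b]
        for ia, a in enumerate(A):
            # Locate a in B's curve order and compare it with its neighbours.
            pos = bisect.bisect_left(keys_b, key(a))
            for ib in order_b[max(0, pos - window): pos + window]:
                if (ia, ib) not in candidates:
                    candidates[(ia, ib)] = minkowski(a, B[ib], t)
    return heapq.nsmallest(
        k, ((dist, ia, ib) for (ia, ib), dist in candidates.items())
    )
```

With `window` set to at least `len(B)` the sketch compares every pair and so matches a brute-force join. Smaller windows trade accuracy for fewer distance computations.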

Similar resources

Supporting KDD Applications by the k-Nearest Neighbor Join

The similarity join has become an important database primitive to support similarity search and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Well-known are two types of the similarity join, the distance range join where the user defines a distance threshold for the join, and the closest point query or k-distance ...

High Performance Data Mining Using the Nearest Neighbor Join

The similarity join has become an important database primitive to support similarity search and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Well-known are two types of the similarity join, the distance range join where the user defines a distance threshold for the join, and the closest point query or k-distance ...

Approximate Algorithms for Distance-Based Queries in High-Dimensional Data Spaces Using R-Trees

In modern database applications the similarity or dissimilarity of complex objects is examined by performing distance-based queries (DBQs) on data of high dimensionality. The R-tree and its variations are commonly cited multidimensional access methods that can be used for answering such queries. Although the related algorithms work well for low-dimensional data spaces, their performance degrad...

Processing Distance Join Queries with Constraints

Distance-join queries are used in many modern applications, such as spatial databases, spatiotemporal databases, and data mining. One of the most common distance-join queries is the closest-pair query. Given two datasets DA and DB the closest-pair query (CPQ) retrieves the pair (a,b), where a ∈ DA and b ∈ DB, having the smallest distance between all pairs of objects. An extension to this proble...

Cost models for distance joins queries using R-trees

The K-Closest-Pairs Query (K-CPQ), a type of distance join in spatial databases, discovers the K pairs of objects formed from two different datasets with the K smallest distances. Recently, branch-and-bound algorithms based on R-trees have been developed in order to answer K-CPQs efficiently. For query optimization purposes, analytical models are needed to estimate the processing cost of a spec...

Journal title:
  • Data Knowl. Eng.

Volume 53  Issue

Pages  -

Publication date 2005