Unsupervised record matching with noisy and incomplete data
Similar resources
Unsupervised record matching with noisy and incomplete data
We consider the problem of duplicate detection: given a large data set in which each entry has multiple attributes, detect which distinct entries refer to the same real-world entity. Our method consists of three main steps: creating a similarity score between entries, grouping entries together into ‘unique entities’, and refining the groups. We compare various methods for creating similarity sc...
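The first two steps named in this abstract (scoring pairs, then grouping) can be illustrated with a minimal sketch. It is not the authors' method: token-set Jaccard similarity, the averaging over fields, the 0.5 threshold, and the helper names `jaccard`, `record_similarity`, and `group_records` are all assumptions chosen for illustration, and the refinement step is left out because the truncated abstract does not describe it.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0  # an empty value carries no evidence either way
    return len(ta & tb) / len(ta | tb)


def record_similarity(r1: dict, r2: dict, fields: list[str]) -> float:
    """Average per-field Jaccard, using only fields filled in on both records.

    Skipping missing fields is an assumption made here to cope with
    incomplete data; it is not taken from the paper.
    """
    scores = [jaccard(r1[f], r2[f]) for f in fields if r1.get(f) and r2.get(f)]
    return sum(scores) / len(scores) if scores else 0.0


def group_records(records: list[dict], fields: list[str], threshold: float = 0.5):
    """Group record indices by linking every pair scoring >= threshold.

    Groups are the connected components of the resulting graph,
    tracked with a small union-find structure.
    """
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Compare all pairs; fine for a toy example, quadratic on large data.
    for i, j in combinations(range(len(records)), 2):
        if record_similarity(records[i], records[j], fields) >= threshold:
            parent[find(i)] = find(j)

    groups: dict[int, list[int]] = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `group_records(records, ["name", "city"])` on a list of dictionaries returns lists of record indices, one list per candidate entity. Comparing every pair is quadratic in the number of records, so a large data set would need some form of candidate pruning before scoring.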
Hyperspectral Unmixing from Incomplete and Noisy Data
In hyperspectral images, once the pure spectra of the materials are known, hyperspectral unmixing seeks to find their relative abundances throughout the scene. We present a novel variational model for hyperspectral unmixing from incomplete noisy data, which combines a spatial regularity prior with the knowledge of the pure spectra. The material abundances are found by minimizing the resulting c...
Adaptive Approximate Record Matching
Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of Approximate Record Matching is to find multiple records that belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...
Unsupervised Blocking of Imbalanced Datasets for Record Matching
Record matching in data engineering refers to searching for data records originating from the same entities across different data sources. The solutions for record matching usually employ learning algorithms to train a classifier that labels record pairs as either matches or non-matches. In practice, the number of non-matches typically far exceeds the number of matches. This problem is so-called imb...
PSOM+: Parametrized Self-Organizing Maps for noisy and incomplete data
We present an extension to the Parametrized Self-Organizing Map that allows the construction of continuous manifolds from noisy, incomplete and not necessarily grid-organized training data. All three problems are tackled by minimizing the overall smoothness of a PSOM manifold. For this, we introduce a matrix which defines a metric in the space of PSOM weights, depending only on the underlying gr...
Journal
Journal title: International Journal of Data Science and Analytics
Year: 2018
ISSN: 2364-415X,2364-4168
DOI: 10.1007/s41060-018-0129-7