A New Method for Duplicate Detection Using Hierarchical Clustering of Records

Authors

Abstract:

The accuracy and validity of data are prerequisites for the correct operation of any software system. Errors in data are always possible because of human and system faults. One such error is the presence of duplicate records in a data source, that is, multiple records that refer to the same real-world entity. Each entity should appear only once in a data source, but causes such as the aggregation of data sources and human error during data entry can introduce several copies of it. Duplicates lead to errors in a system's operations or output, and they impose significant costs on the organization or business involved. As a result, data cleaning, and duplicate record detection in particular, has become one of the most important areas of computer science in recent years. Many solutions have been presented for detecting duplicates in different situations, but almost all of them are time-consuming, and because the volume of data grows every day, earlier methods no longer perform adequately. Another problem facing recent work is the incorrect detection of two distinct records as duplicates. This matters because duplicates are usually deleted, so correct data is lost. New methods are therefore needed. This paper proposes a method that reduces the amount of processing required by using hierarchical clustering with appropriate features. The method estimates the similarity between records at several levels, using a different feature at each level. As a result, the clusters created at the last level contain very similar records, and only the records within these clusters are compared to detect duplicates. The paper also proposes a relative similarity function for comparing records, which determines similarity with high precision. The evaluation results show that the proposed method detects 90% of duplicate records with 97% accuracy in less time, an improvement over previous methods.
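The general idea in the abstract (cluster level by level, then compare only within the final clusters) can be illustrated with a minimal Python sketch. The abstract does not say which features the method uses at each level or how its relative similarity function is defined, so everything specific here is an assumption made for illustration: the level keys (city, then first letter of the name), the `similarity` stand-in built on `difflib.SequenceMatcher`, the `0.85` threshold, and all function names.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def cluster_by_levels(records, level_keys):
    """Split records into progressively finer clusters, one key function per level."""
    clusters = [list(records)]
    for key in level_keys:
        next_clusters = []
        for cluster in clusters:
            groups = defaultdict(list)
            for record in cluster:
                groups[key(record)].append(record)
            next_clusters.extend(groups.values())
        clusters = next_clusters
    return clusters


def similarity(a, b):
    """Hypothetical stand-in for the paper's similarity function:
    the mean string similarity over the fields the two records share."""
    fields = [f for f in a if f in b and f != "id"]
    ratios = [
        SequenceMatcher(None, str(a[f]).lower(), str(b[f]).lower()).ratio()
        for f in fields
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0


def find_duplicates(records, level_keys, threshold=0.85):
    """Compare records pairwise, but only inside each final-level cluster."""
    pairs = []
    for cluster in cluster_by_levels(records, level_keys):
        for a, b in combinations(cluster, 2):
            if similarity(a, b) >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs


# Illustrative data and level features (assumptions, not taken from the paper).
records = [
    {"id": 1, "name": "John Smith", "city": "Boston"},
    {"id": 2, "name": "Jon Smith", "city": "Boston"},
    {"id": 3, "name": "Jane Doe", "city": "Boston"},
    {"id": 4, "name": "John Smith", "city": "Denver"},
]
level_keys = [
    lambda r: r["city"].lower(),     # level 1: group by city
    lambda r: r["name"][0].lower(),  # level 2: group by first letter of name
]
print(find_duplicates(records, level_keys))
```

With this sample data, records 1 and 2 are flagged as a duplicate pair, while record 4 is never compared with record 1 because the two fall into different clusters at the first level. That is the trade-off of any such scheme: fewer comparisons, at the risk of missing duplicates that the chosen features separate.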




Journal title

volume 18, issue 4

pages 3-22

publication date 2022-03


