Medical record linkage in health information systems by approximate string matching and clustering

Authors

  • Erik-André Sauleau
  • Jean-Philippe Paumier
  • Antoine Buemi
Abstract

BACKGROUND: Multiplication of data sources within heterogeneous healthcare information systems always results in redundant information, split among multiple databases. Our objective is to detect exact and approximate duplicates within identity records, in order to attain a better quality of information and to permit cross-linkage among stand-alone and clustered databases. Furthermore, we need to assist human decision making, by computing a value reflecting identity proximity.

METHODS: The proposed method is in three steps. The first step is to standardise and to index elementary identity fields, using blocking variables, in order to speed up information analysis. The second is to match similar pair records, relying on a global similarity value taken from the Porter-Jaro-Winkler algorithm. And the third is to create clusters of coherent related records, using graph drawing, agglomerative clustering methods and partitioning methods.

RESULTS: The batch analysis of 300,000 "supposedly" distinct identities isolates 240,000 true unique records, 24,000 duplicates (clusters composed of 2 records) and 3,000 clusters whose size is greater than or equal to 3 records.

CONCLUSION: Duplicate-free databases, used in conjunction with relevant indexes and similarity values, allow immediate (i.e. real-time) proximity detection when inserting a new identity.
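The three steps described in the abstract (standardise and block, score pairs, cluster) can be illustrated with a small Python sketch. This is not the authors' implementation: it uses the plain Jaro-Winkler similarity rather than the Porter-Jaro-Winkler variant named in the abstract, and it replaces the graph, agglomerative and partitioning methods with simple connected components computed by union-find. The cleaning rules in `standardise`, the first-letter blocking key, and the 0.9 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations
import unicodedata


def standardise(name):
    """Upper-case, strip accents, keep only A-Z (hypothetical cleaning rules)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed.upper() if "A" <= c <= "Z")


def jaro(s, t):
    """Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == c:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, scaling=0.1):
    """Jaro similarity boosted by the common prefix length (at most 4)."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def cluster(records, threshold=0.9):
    """Return lists of record indices that are linked within the same block."""
    keys = [standardise(r) for r in records]
    blocks = defaultdict(list)
    for idx, key in enumerate(keys):
        blocks[key[:1]].append(idx)  # blocking variable: first letter (assumption)

    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Only pairs inside the same block are compared.
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if jaro_winkler(keys[i], keys[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


names = ["Martha", "MARHTA", "Dupont", "Dupond", "Sauleau"]
clusters = cluster(names)
```

Blocking keeps the number of comparisons far below the full quadratic count, at the cost of missing duplicates whose blocking key differs (here, names that differ in their first letter). Connected components also link records transitively, so a production system would likely need a stricter clustering rule for groups of three or more records.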

Similar resources

Real World Performance of Approximate String Comparators for use in Patient Matching

Medical record linkage is becoming increasingly important as clinical data is distributed across independent sources. To improve linkage accuracy we studied different name comparison methods that establish agreement or disagreement between corresponding names. In addition to exact raw name matching and exact phonetic name matching, we tested three approximate string comparators. The approximate...

Embracing the Sparse, Noisy, and Interrelated Aspects of Patient Demographics for use in Clinical Medical Record Linkage

Duplicate patient records in health information systems have received increased attention in recent times due to regulatory incentives to integrate the healthcare enterprise. Historically, most patient record matching systems have been limited to simple applications of the Fellegi-Sunter theory of record linkage with edit distance based string similarity measurements. String similarity approache...

Approximate String Comparison and its Effect on an Advanced Record Linkage System

Record linkage, sometimes referred to as information retrieval (Frakes and Baeza-Yates, 1992), is needed for the creation, unduplication, and maintenance of name and address lists. This paper describes string comparators and their effect in a production matching system. Because many lists have typographical errors in more than 20 percent of first names and also in last names, effective methods f...

An Empirical Comparison of Approaches to Approximate String Matching in Private Record Linkage

Due to the frequency of spelling and typographical errors in practical applications, record linkage algorithms have to use string similarity functions. In many legal contexts, identifiers such as names have to be encrypted before a record linkage can be attempted. Therefore, algorithms for computing string similarity functions with encrypted identifiers are essential for approximating string ma...

A study on company name matching for database integration

In this report we describe an activity of information integration performed on databases with patent data and company indicators. Depending on the application area, this kind of activity is known as record linkage, duplicate detection, record matching, reference reconciliation or other domain-specific terms. In particular, we present a detailed case study on company name matching. We show how t...

Journal:
  • BMC Medical Informatics and Decision Making

Volume 5

Publication date: 2005