Profile-Based Object Matching for Information Integration

Authors

  • AnHai Doan
  • Ying Lu
  • Yoonkyong Lee
  • Jiawei Han
Abstract

Object matching plays an important role in many information management contexts, including information integration, data warehousing, information extraction, and text joins in databases. As the “Related Work” sidebar describes, researchers have proposed numerous object-matching solutions in both the AI and database communities. Virtually all these solutions assume that the target objects, or tuples, share the same set of attributes, and they match tuples by comparing attribute similarity. However, tuples often have nonoverlapping, or disjoint, attributes. This is frequently true in information integration, especially when merging tuples from different data sources. Because these data sources are typically developed independently, they often have overlapping but different schemas.

As an example, suppose a government agency is trying to merge two branch offices and their databases (S1 and S2), which both list information about people living in Champaign, Illinois. As Figure 1 shows, each tuple contains information about a single person. Because the databases cover the same geographical area, they contain many duplicate tuples. For example, S1’s tuple S. Riley, 105 Spring St, 61801, $95,000 refers to the same person as S2’s Sarah Riley, E. Spring St, 61801, 38; the shared attributes are name, street, and zip, while the disjoint attributes are income (S1) and age (S2). Existing systems can effectively match the shared attributes, despite variations and errors: the name “Sarah Riley,” for example, might be abbreviated as “S. Riley” or even mistyped as “Sarah Rilye.” However, these methods don’t exploit disjoint attributes in the matching process.

Our Profile-Based Object Matching solution, as embodied in the PROM system we’re developing, exploits disjoint attributes to maximize matching accuracy. The key to PROM is that disjoint attributes are often correlated, and we can leverage this correlation to perform a “sanity check” on object matching. So if we had the S1 tuple Mike Smith, E. White Street, 61820, $100,000 and the S2 tuple Mike Smith, E. White Street, 61820, 6, prior solutions would declare a match because the shared attributes match perfectly. However, the disjoint attributes (income in S1 and age in S2) indicate that, combined, the two tuples give us “Mike Smith,” a six-year-old with a $100,000 yearly income. This is possible but very unlikely. Thus, PROM would reject the match.
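
The sanity check described above can be illustrated with a short Python sketch. This is not PROM's implementation, which the abstract does not detail: the function names, the 0.7 similarity threshold, and the hand-written age/income rule are all assumptions chosen for illustration. The sketch first compares the shared attributes (name, street, zip) with a generic string-similarity measure from the standard library, then merges the two tuples and checks whether the combined age and income form a plausible person.

```python
from difflib import SequenceMatcher

SHARED = ("name", "street", "zip")  # attributes present in both sources


def shared_similarity(t1, t2, shared=SHARED):
    """Average string similarity over the shared attributes (0.0 to 1.0)."""
    scores = [
        SequenceMatcher(None, str(t1[a]).lower(), str(t2[a]).lower()).ratio()
        for a in shared
    ]
    return sum(scores) / len(scores)


def plausible_profile(person):
    """Hypothetical hand-written rule: a young child with a large income
    is treated as implausible. A real system would presumably learn such
    correlations from data; the cutoffs here are illustrative assumptions."""
    age = person.get("age")
    income = person.get("income")
    if age is None or income is None:
        return True  # nothing to cross-check
    return not (age < 16 and income > 20_000)


def is_match(t1, t2, threshold=0.7):
    """Declare a match only if the shared attributes are similar enough
    AND the merged tuple passes the profile sanity check."""
    if shared_similarity(t1, t2) < threshold:
        return False
    merged = {**t1, **t2}  # combine disjoint attributes into one tuple
    return plausible_profile(merged)


# The two tuple pairs from the abstract (S1 carries income, S2 carries age).
riley_s1 = {"name": "S. Riley", "street": "105 Spring St", "zip": "61801", "income": 95_000}
riley_s2 = {"name": "Sarah Riley", "street": "E. Spring St", "zip": "61801", "age": 38}
smith_s1 = {"name": "Mike Smith", "street": "E. White Street", "zip": "61820", "income": 100_000}
smith_s2 = {"name": "Mike Smith", "street": "E. White Street", "zip": "61820", "age": 6}

print(is_match(riley_s1, riley_s2))
print(is_match(smith_s1, smith_s2))
```

In this sketch the Mike Smith pair is rejected even though its shared attributes are identical, because the merged tuple describes a six-year-old with a $100,000 income. A matcher that looks only at shared attributes would accept it.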

Similar resources

3D Classification of Urban Features Based on Integration of Structural and Spectral Information from UAV Imagery

Three-dimensional classification of urban features is one of the important tools for urban management and the basis of many analyses in photogrammetry and remote sensing. It is therefore used in applications such as planning, urban management, and disaster management. In this study, dense point clouds extracted by dense image matching are used for classification in urban areas. Appl...

An Improved Semantic Schema Matching Approach

Schema matching is a critical step in many applications, such as data warehouse loading, Online Analytical Processing (OLAP), data mining, the semantic web [2], and schema integration. The task is to find the semantic correspondences between elements of two schemas. Recently, schema matching has attracted considerable interest in both research and practice. In this paper, we present a new impr...

Improvement of Navigation Accuracy using Tightly Coupled Kalman Filter

In this paper, a mechanism is designed for integrating inertial navigation system (INS) information and global positioning system (GPS) information. This type of system uses a series of mathematical and filtering algorithms with tightly coupled techniques, with several objectives such as application of integrated navigation algorithms, precise calculation of flying object position, speed and at...

Analytical Comparison of Methods for Calculating the Completeness of VGI

Spatial data, now one of the main needs of human societies, from business organizations to general users, cannot meet the needs of a wide range of users unless the conventional methods of registering and updating data on a metropolitan scale are restructured. OpenStreetMap, as one of the most successful implementations of the crowdsourcing approach to spatial data with th...

MOMA - A Mapping-based Object Matching System

Object matching or object consolidation is a crucial task for data integration and data cleaning. It addresses the problem of identifying object instances in data sources referring to the same real world entity. We propose a flexible framework called MOMA for mapping based object matching. It allows the construction of match workflows combining the results of several matcher algorithms on both ...

Journal:
  • IEEE Intelligent Systems

Volume 18  Issue 

Pages  -

Publication date 2003