Multi-schema entity resolution

  • Qiong Huang

Student thesis: Master's thesis

Abstract

Entity resolution (ER) is the problem of identifying and merging records judged to represent the same real-world entity. Most previous ER approaches assume a unified (or global) schema under which all records are compared and merged on a field-by-field basis. We consider the multi-schema ER problem, in which records come from multiple sources with different schemas. A prime example of multi-schema ER is information integration over the deep web, where the goal is to integrate data from heterogeneous sources. In this thesis, we formalize the multi-schema ER problem, investigate properties that hold in a unified-schema setting but not in a multi-schema setting, and identify the resolution conflicts that can occur when previous ER approaches are applied in a multi-schema setting. We then propose the validity-ensured and order-sensitive (VEOS) algorithm, which is free from such conflicts and, at the same time, can take advantage of order scheduling to improve accuracy. We identify schema-level and data-level criteria that distinguish the more reliable comparisons, so that performing those comparisons first yields a more accurate result. To leverage this information, we construct a confidence graph on which our scheduling algorithm is built. Our experiments, using real online shopping data, show that: (1) our scheduling algorithm is very effective in improving accuracy, and (2) VEOS with scheduling outperforms other methods in both accuracy and efficiency.
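The idea that comparison order can change the outcome may be easier to see in a toy example. The sketch below is not the VEOS algorithm or the confidence graph from the thesis; it is a minimal illustration under assumptions of our own: the attribute correspondences and their confidence values are hypothetical, the sample records are invented, and the validity rule (a merged cluster may not contain two records from the same source) is a stand-in chosen only to make conflicts visible.

```python
from itertools import combinations

# Hypothetical attribute correspondences between two source schemas, each with
# an assumed confidence that agreement on that pair indicates a true match.
CORRESPONDENCES = {
    ("upc", "barcode"): 0.95,
    ("title", "name"): 0.60,
}

# Invented sample records; "source" marks which schema a record comes from.
RECORDS = [
    {"source": "A", "upc": "111", "title": "Canon S5"},
    {"source": "B", "barcode": "111", "name": "Canon PowerShot S5"},
    {"source": "B", "barcode": "222", "name": "Canon S5"},
]


def comparison_confidence(a, b):
    """Best confidence among corresponding attributes whose values agree, else 0."""
    best = 0.0
    for (fa, fb), conf in CORRESPONDENCES.items():
        for x, y in ((a, b), (b, a)):
            if fa in x and fb in y and x[fa] == y[fb]:
                best = max(best, conf)
    return best


def resolve(records, threshold=0.5, descending=True):
    """Merge matching records, processing comparisons in confidence order."""
    scored = []
    for i, j in combinations(range(len(records)), 2):
        conf = comparison_confidence(records[i], records[j])
        if conf >= threshold:
            scored.append((conf, i, j))
    scored.sort(key=lambda t: t[0], reverse=descending)

    parent = list(range(len(records)))
    sources = [{r["source"]} for r in records]  # sources present in each cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for _, i, j in scored:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue
        # Assumed validity rule: reject a merge that would put two records
        # from the same source into one cluster.
        if sources[ri] & sources[rj]:
            continue
        parent[rj] = ri
        sources[ri] |= sources[rj]

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


high_first = resolve(RECORDS)                   # most reliable comparisons first
low_first = resolve(RECORDS, descending=False)  # least reliable comparisons first
```

With these records, processing the barcode match first groups records 0 and 1 and then rejects the weaker title match, while the reverse order groups records 0 and 2 and rejects the barcode match. The two runs therefore disagree, which is the kind of order sensitivity that motivates scheduling reliable comparisons early.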
Date of Award: 2008
Original language: English
Awarding Institution
  • The Hong Kong University of Science and Technology
