Adaptive top-k overlap set similarity joins

Zhong Yang, Bolong Zheng*, Guohui Li, Xi Zhao, Xiaofang Zhou, Christian S. Jensen

*Corresponding author for this work

Research output: Chapter in Book/Conference Proceeding/ReportConference Paper published in a bookpeer-review

Abstract

The set similarity join (SSJ) is core functionality in a range of applications, including data cleaning, near-duplicate object detection, and data integration. Threshold-based SSJ queries return all pairs of sets with similarity no smaller than a given threshold. As results, and their utility, are very sensitive to the choice of threshold value, it is a problem that it is difficult to choose such an appropriate value. Doing so requires prior knowledge of the data, which users often do not have. To avoid this problem, we propose a solution to the top-k overlap set similarity join (TkOSSJ) that returns k pairs of sets with the highest overlap similarities. The state-of-the-art solution disregards the effect of the so-called step size, which is the number of elements accessed in each iteration of the algorithm. This affects its performance negatively. To address this issue, we first propose an algorithm that uses a fixed step size, thus taking advantage of the benefits of a large step size, and then we present an adaptive step size algorithm that is capable of automatically adjusting the step size, thus reducing redundant computations. An extensive empirical study offers insight into the new algorithms and indicates that they are capable of outperforming the state-of-the-art method on real, large-scale data sets.

Original languageEnglish
Title of host publicationProceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020
PublisherIEEE Computer Society
Pages1081-1092
Number of pages12
ISBN (Electronic)9781728129037
DOIs
Publication statusPublished - Apr 2020
Externally publishedYes
Event36th IEEE International Conference on Data Engineering, ICDE 2020 - Dallas, United States
Duration: 20 Apr 202024 Apr 2020

Publication series

NameProceedings - International Conference on Data Engineering
Volume2020-April
ISSN (Print)1084-4627

Conference

Conference36th IEEE International Conference on Data Engineering, ICDE 2020
Country/TerritoryUnited States
CityDallas
Period20/04/2024/04/20

Bibliographical note

Publisher Copyright:
© 2020 IEEE.

Keywords

  • Overlap set similarity
  • Similarity join
  • Top-k join, data lake

Fingerprint

Dive into the research topics of 'Adaptive top-k overlap set similarity joins'. Together they form a unique fingerprint.

Cite this