Abstract
The set similarity join (SSJ) is core functionality in a range of applications, including data cleaning, near-duplicate object detection, and data integration. Threshold-based SSJ queries return all pairs of sets whose similarity is no smaller than a given threshold. Because the results, and their utility, are highly sensitive to the threshold value, choosing an appropriate value is difficult: doing so requires prior knowledge of the data, which users often do not have. To avoid this problem, we propose a solution to the top-k overlap set similarity join (TkOSSJ), which returns the k pairs of sets with the highest overlap similarities. The state-of-the-art solution disregards the effect of the so-called step size, which is the number of elements accessed in each iteration of the algorithm, and this affects its performance negatively. To address this issue, we first propose an algorithm that uses a fixed step size, thus taking advantage of the benefits of a large step size, and we then present an adaptive step size algorithm that automatically adjusts the step size, thus reducing redundant computations. An extensive empirical study offers insight into the new algorithms and indicates that they are capable of outperforming the state-of-the-art method on real, large-scale data sets.
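To make the query semantics concrete, the following is a minimal sketch of what a top-k overlap set similarity join returns. It is a naive all-pairs baseline, not the fixed or adaptive step size algorithms of the paper, and it does not model the step size at all. The function name `topk_overlap_join` and the sample data are hypothetical, chosen only for illustration.

```python
import heapq
from itertools import combinations


def topk_overlap_join(sets, k):
    """Return up to k tuples (overlap, i, j) with the largest |sets[i] & sets[j]|.

    Naive baseline (an assumption for illustration): every pair is considered,
    and a min-heap of size k holds the best pairs seen so far.
    """
    if k <= 0:
        return []
    heap = []  # min-heap; heap[0] is the weakest pair currently kept
    for i, j in combinations(range(len(sets)), 2):
        # The overlap can never exceed the size of the smaller set, so a pair
        # whose bound does not beat the current k-th overlap can be skipped.
        if len(heap) == k and min(len(sets[i]), len(sets[j])) <= heap[0][0]:
            continue
        overlap = len(sets[i] & sets[j])
        if len(heap) < k:
            heapq.heappush(heap, (overlap, i, j))
        elif overlap > heap[0][0]:
            heapq.heapreplace(heap, (overlap, i, j))
    # Largest overlap first.
    return sorted(heap, reverse=True)


example_sets = [{1, 2, 3, 4}, {2, 3, 4, 5}, {4, 5, 6}, {7, 8}]
top2 = topk_overlap_join(example_sets, 2)
```

With the sample data above, `top2` is `[(3, 0, 1), (2, 1, 2)]`: sets 0 and 1 share three elements, and sets 1 and 2 share two. No threshold has to be chosen in advance; the k-th best overlap found so far plays that role and tightens as the scan proceeds.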
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020 |
| Publisher | IEEE Computer Society |
| Pages | 1081-1092 |
| Number of pages | 12 |
| ISBN (Electronic) | 9781728129037 |
| DOIs | |
| Publication status | Published - Apr 2020 |
| Externally published | Yes |
| Event | 36th IEEE International Conference on Data Engineering, ICDE 2020 - Dallas, United States. Duration: 20 Apr 2020 → 24 Apr 2020 |
Publication series
| Name | Proceedings - International Conference on Data Engineering |
|---|---|
| Volume | 2020-April |
| ISSN (Print) | 1084-4627 |
Conference
| Conference | 36th IEEE International Conference on Data Engineering, ICDE 2020 |
|---|---|
| Country/Territory | United States |
| City | Dallas |
| Period | 20/04/20 → 24/04/20 |
Bibliographical note
Publisher Copyright: © 2020 IEEE.
Keywords
- Overlap set similarity
- Similarity join
- Top-k join
- Data lake