Universal Set Similarity Search via Multi-Task Representation Learning

Zhong YANG, Bolong ZHENG, Guohui LI, Xi ZHAO, Xiaofang ZHOU

Research output: Contribution to conference, Conference Paper, peer-review

Abstract

Set similarity search, as a foundational operation in data processing with diverse applications in different domains, has been extensively studied. However, in the era of big data where set sizes and quantities are rapidly increasing, set similarity search suffers from significant computational and storage overheads. Additionally, traditional approaches struggle to universally address the search problem across different similarity measures and query types. To tackle these challenges, AI techniques, with their powerful learning capabilities, may provide a viable solution. In this paper, we first propose a multi-task representation learning approach with box embeddings that accurately simulates different similarity measures simultaneously by estimating the overlap and union relationships between set pairs in latent box space. Based on the compressed representations of sets, we then introduce a universal search approach designed to answer various set similarity queries with parallel implementation. Extensive experiments conducted on real-world datasets demonstrate the universality, accuracy and efficiency of the proposed approach, showing that it outperforms competing methods. For reproducibility, we release our source code at https://github.com/yangzhong901/MTBUS.
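
The idea of deriving several similarity measures from overlap and union estimates in box space can be illustrated with a minimal sketch. This is not the authors' implementation: the box representation (a pair of lower and upper corner coordinates), the function names, the hard-coded example boxes, and the choice of Jaccard, Dice, cosine and overlap coefficient as the measures are all assumptions made for illustration. In the paper the boxes would be learned; here they are fixed by hand.

```python
from math import prod, sqrt

# Hypothetical representation: a box is a (lower, upper) pair of equal-length
# coordinate lists. Its volume stands in for the size of the set it encodes.

def volume(lower, upper):
    """Volume of an axis-aligned box; zero if any side is empty."""
    return prod(max(u - l, 0.0) for l, u in zip(lower, upper))

def overlap_volume(a, b):
    """Volume of the intersection box of a and b, a proxy for |A ∩ B|."""
    lower = [max(x, y) for x, y in zip(a[0], b[0])]
    upper = [min(x, y) for x, y in zip(a[1], b[1])]
    return volume(lower, upper)

def similarities(a, b):
    """Derive several set similarity measures from the same two estimates."""
    inter = overlap_volume(a, b)
    size_a, size_b = volume(*a), volume(*b)
    union = size_a + size_b - inter  # inclusion-exclusion, a proxy for |A ∪ B|
    smaller = min(size_a, size_b)
    return {
        "jaccard": inter / union if union else 0.0,
        "dice": 2 * inter / (size_a + size_b) if size_a + size_b else 0.0,
        "cosine": inter / sqrt(size_a * size_b) if size_a * size_b else 0.0,
        "overlap": inter / smaller if smaller else 0.0,
    }

# Two hand-picked 2-D boxes (illustrative values, not learned embeddings).
box_a = ([0.0, 0.0], [2.0, 2.0])
box_b = ([1.0, 0.0], [3.0, 2.0])
sims = similarities(box_a, box_b)
print(sims)
```

Because every measure is computed from the same intersection and size estimates, a single pair of box embeddings can serve queries under different similarity measures, which is the sense in which such a representation is "universal" in this sketch.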
Original language: English
Pages: 1483-1495
Publication status: Published - May 2025
Event: 2025 IEEE 41st International Conference on Data Engineering (ICDE)
Duration: 1 May 2025 to 1 May 2025
