Abstract
There exists considerable literature on estimating the cardinality of set intersection result. In this paper, we consider a generalized problem for integer sets where, given a gap parameter δ, two elements are deemed as matches if their numeric difference equals δ or is within δ. We call this problem the gapped set intersection size estimation (GSISE), and it can be used to model applications in database systems, data mining, and information retrieval. We first distinguish two subtypes of the estimation problem: the point gap estimation and range gap estimation. We propose optimized sketches to tackle the two problems efficiently and effectively with theoretical guarantees. We demonstrate the usage of our proposed techniques in mining top-K related keywords efficiently, by integrating with an inverted index. Finally, substantial experiments based on a large subset of the ClueWed09 dataset demonstrate the efficiency and effectiveness of the proposed methods.
| Original language | English |
|---|---|
| Title of host publication | CIKM 2015 - Proceedings of the 24th ACM International Conference on Information and Knowledge Management |
| Publisher | Association for Computing Machinery |
| Pages | 1351-1360 |
| Number of pages | 10 |
| ISBN (Electronic) | 9781450337946 |
| DOIs | |
| Publication status | Published - 17 Oct 2015 |
| Externally published | Yes |
| Event | 24th ACM International Conference on Information and Knowledge Management, CIKM 2015 - Melbourne, Australia Duration: 19 Oct 2015 → 23 Oct 2015 |
Publication series
| Name | International Conference on Information and Knowledge Management, Proceedings |
|---|---|
| Volume | 19-23-Oct-2015 |
Conference
| Conference | 24th ACM International Conference on Information and Knowledge Management, CIKM 2015 |
|---|---|
| Country/Territory | Australia |
| City | Melbourne |
| Period | 19/10/15 → 23/10/15 |
Bibliographical note
Publisher Copyright:© 2015 ACM.
Keywords
- Indexing
- Set intersection size estimation
- Top-k