Efficient Set-Correlation Operator Inside Databases

Fei Gao, Shao Xu Song*, Lei Chen, Jian Min Wang

*Corresponding author for this work

Research output: Contribution to journal › Journal Article › peer-review

1 Citation (Scopus)

Abstract

Large collections of short text records, such as news highlights, scientific paper citations, and messages posted in discussion forums, are now prevalent and are often stored as set records in hidden-Web databases. Many interesting information retrieval tasks correspondingly rely on correlation queries over these short text records, such as finding hot topics in news highlights and searching for scientific papers related to a certain topic. However, current relational database management systems (RDBMS) do not directly support set correlation queries. Thus, in this paper, we address both the effectiveness and the efficiency of set correlation queries over set records in databases. First, we present a framework for set correlation queries inside databases. To the best of our knowledge, only Pearson’s correlation can be implemented to construct token correlations using RDBMS facilities. We therefore propose a novel correlation coefficient that extends Pearson’s correlation, and provide a pure-SQL implementation inside databases. We further propose optimal strategies for setting the correlation filtering threshold, which can greatly reduce query time. Our theoretical analysis proves that, with a proper setting of the filtering threshold, query efficiency can be improved with only a small loss of effectiveness. Finally, we conduct extensive experiments to demonstrate the effectiveness and efficiency of the proposed correlation query and optimization strategies.
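
As a rough illustration of the idea of building token correlations with RDBMS facilities, the sketch below computes the standard Pearson correlation between binary token-occurrence indicators (the phi coefficient) and applies a simple cut-off. It is not the novel coefficient or the threshold strategies proposed in the paper. The table name record_token, the toy records, the helper names, and the threshold value are all assumptions made for this example. Counting is done in SQL; the final arithmetic is done in Python only to avoid depending on SQLite's optional math functions.

```python
import math
import sqlite3

# Hypothetical toy data (not from the paper): each record is a set of tokens.
records = {
    1: {"database", "query", "index"},
    2: {"database", "query"},
    3: {"database", "index"},
    4: {"news", "topic"},
    5: {"news", "topic", "query"},
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE record_token (rid INTEGER, token TEXT, PRIMARY KEY (rid, token))"
)
conn.executemany(
    "INSERT INTO record_token VALUES (?, ?)",
    [(rid, tok) for rid, toks in records.items() for tok in toks],
)

# Counting in SQL: per-token frequency, pairwise co-occurrence, record total.
# Pairs that never co-occur produce no row, so they are not scored here.
PAIR_COUNTS_SQL = """
WITH freq AS (
    SELECT token, COUNT(*) AS n FROM record_token GROUP BY token
),
co AS (
    SELECT a.token AS t1, b.token AS t2, COUNT(*) AS n12
    FROM record_token a
    JOIN record_token b ON a.rid = b.rid AND a.token < b.token
    GROUP BY a.token, b.token
)
SELECT co.t1, co.t2, co.n12, f1.n, f2.n,
       (SELECT COUNT(DISTINCT rid) FROM record_token) AS total
FROM co
JOIN freq f1 ON f1.token = co.t1
JOIN freq f2 ON f2.token = co.t2
"""


def pearson_binary(n12, n1, n2, total):
    """Pearson (phi) correlation of two binary token-occurrence indicators."""
    denom = math.sqrt(n1 * n2 * (total - n1) * (total - n2))
    if denom == 0:
        return 0.0  # a token present in every record has zero variance
    return (total * n12 - n1 * n2) / denom


def token_correlations(connection, threshold):
    """Return {(t1, t2): r} for co-occurring token pairs with r >= threshold."""
    result = {}
    for t1, t2, n12, n1, n2, total in connection.execute(PAIR_COUNTS_SQL):
        r = pearson_binary(n12, n1, n2, total)
        if r >= threshold:  # plain cut-off, assumed for illustration
            result[(t1, t2)] = r
    return result


correlations = token_correlations(conn, threshold=0.5)
for pair, r in sorted(correlations.items()):
    print(pair, round(r, 3))
```

On this toy data only the pairs (database, index) and (news, topic) pass the 0.5 cut-off. Raising the threshold prunes more pairs, which mirrors the trade-off between query time and effectiveness that the abstract describes.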

Original language: English
Pages (from-to): 683-701
Number of pages: 19
Journal: Journal of Computer Science and Technology
Volume: 31
Issue number: 4
Publication status: Published - 1 Jul 2016

Bibliographical note

Publisher Copyright:
© 2016, Springer Science+Business Media New York.

Keywords

  • correlation measure
  • correlation query
  • set record
