Probabilistic correlation-based similarity measure of unstructured records

Shaoxu Song*, Lei Chen

*Corresponding author for this work

Research output: Chapter in Book/Conference Proceeding/Report › Conference Paper published in a book › peer-review

Abstract

Computing the similarity between unstructured records is a fundamental operation in many applications. Approximate string matching and full-text retrieval techniques do not perform at their best when applied directly, because short unstructured records carry limited information. In this paper, we propose a novel probabilistic correlation-based similarity measure. Rather than relying only on exact matches between the tokens of two records, our similarity evaluation enriches the information in the records by considering the correlations between tokens. We define the probabilistic correlation between tokens as the probability that these tokens appear in the same records. We then compute the weights of tokens and discover the correlations between records based on the probabilistic correlations of their tokens. Finally, we present extensive experimental results to demonstrate the effectiveness of our approach.
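The definition of probabilistic correlation given in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, lowercasing, and a simple frequency estimate (the fraction of records that contain both tokens); the function name `token_correlations` and the sample records are hypothetical, and the paper's actual estimator, token weighting, and record-level similarity are not reproduced here.

```python
from collections import Counter
from itertools import combinations


def token_correlations(records):
    """Estimate, for each token pair, the probability that both tokens
    appear in the same record (assumed estimator: co-occurrence count
    divided by the total number of records)."""
    n = len(records)
    pair_counts = Counter()
    for text in records:
        # Assumption: tokens are lowercase whitespace-separated words,
        # and each token is counted at most once per record.
        tokens = sorted(set(text.lower().split()))
        # Sorting gives every pair a canonical (alphabetical) order.
        pair_counts.update(combinations(tokens, 2))
    return {pair: count / n for pair, count in pair_counts.items()}


# Hypothetical short records, for illustration only.
records = [
    "sony digital camera",
    "sony camera lens",
    "canon digital camera",
    "apple laptop",
]

corr = token_correlations(records)
print(corr[("camera", "digital")])  # co-occur in 2 of the 4 records
print(("apple", "camera") in corr)  # never co-occur, so no entry
```

Under this estimate, two records that share no exact token can still be related through tokens that frequently co-occur elsewhere in the collection, which is the intuition behind enriching short records with token correlations.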

Original language: English
Title of host publication: CIKM 2007 - Proceedings of the 16th ACM Conference on Information and Knowledge Management
Pages: 967-970
Number of pages: 4
Publication status: Published - 2007
Event: 16th ACM Conference on Information and Knowledge Management, CIKM 2007 - Lisboa, Portugal
Duration: 6 Nov 2007 → 9 Nov 2007

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Conference

Conference: 16th ACM Conference on Information and Knowledge Management, CIKM 2007
Country/Territory: Portugal
City: Lisboa
Period: 6/11/07 → 9/11/07

Keywords

  • Probabilistic correlation
  • Record similarity
