TY - GEN
T1 - Probabilistic correlation-based similarity measure of unstructured records
AU - Song, Shaoxu
AU - Chen, Lei
PY - 2007
Y1 - 2007
N2 - Computing the similarity between unstructured records is a fundamental function in multiple applications. Approximate string matching and full text retrieval techniques do not show the best performance when applied directly, since the information is limited in unstructured records of short record length. In this paper, we propose a novel probabilistic correlation-based similarity measure. Rather than simply matching the exact tokens of two records, our similarity evaluation enriches the information of records by considering the correlations of tokens. We define the probabilistic correlation between tokens as the probability that these tokens appear in the same records. Then we compute the weight of tokens and discover the correlations of records based on the probabilistic correlations of tokens. Finally, we present extensive experimental results to demonstrate the effectiveness of our approach.
AB - Computing the similarity between unstructured records is a fundamental function in multiple applications. Approximate string matching and full text retrieval techniques do not show the best performance when applied directly, since the information is limited in unstructured records of short record length. In this paper, we propose a novel probabilistic correlation-based similarity measure. Rather than simply matching the exact tokens of two records, our similarity evaluation enriches the information of records by considering the correlations of tokens. We define the probabilistic correlation between tokens as the probability that these tokens appear in the same records. Then we compute the weight of tokens and discover the correlations of records based on the probabilistic correlations of tokens. Finally, we present extensive experimental results to demonstrate the effectiveness of our approach.
KW - Probabilistic correlation
KW - Record similarity
UR - https://openalex.org/W1982486932
UR - https://www.scopus.com/pages/publications/63449109272
U2 - 10.1145/1321440.1321587
DO - 10.1145/1321440.1321587
M3 - Conference Paper published in a book
SN - 9781595938039
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 967
EP - 970
BT - CIKM 2007 - Proceedings of the 16th ACM Conference on Information and Knowledge Management
T2 - 16th ACM Conference on Information and Knowledge Management, CIKM 2007
Y2 - 6 November 2007 through 9 November 2007
ER -