Adding semantics to email clustering

Zheng Chen, Hua Li, Dou Shen, Qiang Yang, Benyu Zhang

Research output: Contribution to conference › Conference Paper › peer-review

19 Citations (Scopus)

Abstract

This paper presents a novel algorithm to cluster emails according to their contents and the sentence styles of their subject lines. In our algorithm, natural language processing techniques and frequent itemset mining techniques are used to automatically generate meaningful generalized sentence patterns (GSPs) from the subjects of emails. We then put forward a novel unsupervised approach which treats GSPs as pseudo class labels and conducts email clustering in a supervised manner, although no human labeling is involved. Our proposed algorithm is expected not only to improve the clustering performance but also to provide meaningful descriptions of the resulting clusters through the GSPs. Experimental results on an open dataset (the Enron email dataset) and a personal email dataset collected by ourselves demonstrate that the proposed algorithm outperforms the K-means algorithm in terms of the popular F1 measure. Furthermore, the cluster naming readability is improved by square 8.5% on the personal email dataset.
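
The pseudo-labeling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes whitespace tokenization, mines frequent word sets by brute-force counting, and omits both the natural language processing step that generalizes words into GSPs and the classifier that would later be trained on the pseudo labels. The function names `mine_patterns` and `pseudo_label`, the parameters `min_support` and `max_size`, and the sample subjects are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_patterns(subjects, min_support=2, max_size=2):
    """Return word sets that occur in at least min_support subject lines."""
    token_sets = [frozenset(s.lower().split()) for s in subjects]
    counts = Counter()
    for tokens in token_sets:
        # Count every word set of size 1..max_size once per subject line.
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(tokens), size):
                counts[frozenset(combo)] += 1
    return {p: c for p, c in counts.items() if c >= min_support}


def pseudo_label(subject, patterns):
    """Pick the largest, then most frequent, pattern contained in the subject."""
    tokens = frozenset(subject.lower().split())
    matches = [p for p in patterns if p <= tokens]
    if not matches:
        return None  # no frequent pattern covers this subject
    # Ties are broken alphabetically so the result is deterministic.
    best = max(matches, key=lambda p: (len(p), patterns[p], sorted(p)))
    return tuple(sorted(best))


# Hypothetical subject lines, used only for illustration.
subjects = [
    "meeting request for monday",
    "meeting request for friday",
    "invoice attached",
    "invoice overdue",
]
patterns = mine_patterns(subjects)
labels = [pseudo_label(s, patterns) for s in subjects]
```

In this sketch the two meeting subjects share one pseudo label and the two invoice subjects share another, so the labels could serve as training targets for a supervised classifier and as readable cluster names.
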
Original language: English
Publication status: Published - 2006
Event: 6th International Conference on Data Mining, ICDM 2006; Hong Kong; China
Duration: 1 Jan 2006 to 1 Jan 2006

ISBNs: 978-0-7695-2701-7
