Parallel spectral clustering in distributed systems

Wen Yen Chen*, Yangqiu Song, Hongjie Bai, Chih Jen Lin, Edward Y. Chang

*Corresponding author for this work

Research output: Contribution to journalJournal Articlepeer-review

565 Citations (Scopus)

Abstract

Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863, we show that our parallel algorithm can effectively handle large problems.

Original languageEnglish
Article number5444877
Pages (from-to)568-586
Number of pages19
JournalIEEE Transactions on Pattern Analysis and Machine Intelligence
Volume33
Issue number3
DOIs
Publication statusPublished - 2011
Externally publishedYes

Keywords

  • Nyström approximation
  • Parallel spectral clustering
  • distributed computing
  • nearest neighbors
  • normalized cuts

Fingerprint

Dive into the research topics of 'Parallel spectral clustering in distributed systems'. Together they form a unique fingerprint.

Cite this