Probabilistic representatives mining (PREM): A clustering method for distributional data reduction

Zhenyu Gao, Tejas G. Puranik, Dimitri N. Mavris

Research output: Contribution to journal › Journal Article › peer-review

8 Citations (Scopus)

Abstract

Complex computations and analyses on massive data sets can be impractical or infeasible. Data reduction, a crucial problem in the era of big data, seeks a reduced representation of the data set that enables more efficient yet still accurate analyses. To preserve the integrity of the original data set, a reduced representation should maintain the original data distribution as closely as possible; such a representation is referred to as a probabilistically representative subset. This paper considers the problem of reducing a large data set to a very small subset of this kind, at sizes where random sampling does not perform well enough. We propose a data mining approach called Probabilistic Representatives Mining (PREM) to tackle this challenge. PREM uses balanced clustering to prevent undersampling and oversampling issues and a multistage computing strategy to achieve better scalability and consistency. Numerical experiments on typical probability distributions and real-world data sets in the field of aeronautics and astronautics demonstrate PREM’s superiority over the baselines. An uncertainty quantification case study from aviation environmental impact modeling further shows PREM’s effectiveness and accuracy in generating probabilistically representative small samples for costly computations. Potential limitations and extensions of the method are also discussed in the paper.
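The abstract does not spell out the PREM algorithm, so the following is not the authors' method. It is a minimal one-dimensional sketch of the general idea that equal-size (balanced) groups keep any region of the distribution from being over- or under-sampled. The function names `balanced_representatives` and `ecdf_distance`, the lognormal test data, and the choice of the group median as representative are all assumptions made for illustration; the multistage strategy mentioned in the abstract is not modeled.

```python
import numpy as np


def balanced_representatives(data, k):
    """Pick k representatives from 1-D data using equal-size groups.

    Illustrative stand-in, NOT the PREM algorithm: sort the data, split it
    into k groups whose sizes differ by at most one, and return the median
    of each group.
    """
    ordered = np.sort(np.asarray(data, dtype=float))
    groups = np.array_split(ordered, k)
    return np.array([np.median(g) for g in groups])


def ecdf_distance(sample, reference):
    """Largest gap between two empirical CDFs (Kolmogorov-Smirnov statistic)."""
    sample = np.sort(np.asarray(sample, dtype=float))
    reference = np.sort(np.asarray(reference, dtype=float))
    # Both ECDFs are step functions, so it is enough to compare them at
    # every point where either one jumps.
    grid = np.concatenate([sample, reference])
    cdf_sample = np.searchsorted(sample, grid, side="right") / sample.size
    cdf_reference = np.searchsorted(reference, grid, side="right") / reference.size
    return float(np.max(np.abs(cdf_sample - cdf_reference)))


# Hypothetical test data: a skewed distribution reduced to 20 points.
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
k = 20

balanced = balanced_representatives(data, k)
random_subset = rng.choice(data, size=k, replace=False)

print("balanced groups:", ecdf_distance(balanced, data))
print("random sampling:", ecdf_distance(random_subset, data))
```

In this one-dimensional setting each balanced representative sits near the middle of its quantile band, so the gap between the reduced and original empirical CDFs stays close to 1/(2k). A random subset of the same size has no such guarantee, which is the small-sample regime the abstract describes.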

Original language: English
Pages (from-to): 2580-2596
Number of pages: 17
Journal: AIAA Journal
Volume: 60
Issue number: 4
Publication status: Published - 2022
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2022, AIAA International. All rights reserved.
