High dimensionality is one of the most challenging problems that have arisen in the past decade. Although data mining technology has developed greatly, new challenges still emerge with respect to specific data structures, and these must be overcome in order to discover previously unknown patterns and make predictions. Moreover, when interactions among explanatory variables are taken into account, the dimensionality becomes even larger. Feature selection is therefore an active topic in both supervised and unsupervised learning.

In Essay 1 of this dissertation, we consider a business data mining problem, using the Amazon employee access data set as an example, to demonstrate the proposed feature selection and classification methods. First, we apply Naive Bayes classifiers to the data set and modify them step by step with ideas from Empirical Bayes, grouping, and migration. Second, we propose a three-stage Bayesian hierarchical model tailored to the special data structure. Third, because the data are categorical, we propose a method for variable selection: the Coefficient of Dependence (CoD). Finally, ensemble learning is used to combine the classifiers; in carrying out this procedure, we apply a technique that we refer to as Stringing. The newly developed classifiers outperform most of the existing models in terms of the competition ranking.

Essay 2 presents a clustering analysis model, referred to as the Beta-binomial mixture model. The idea comes from the classic Gaussian Mixture Model (GMM), a method of distribution-based clustering, in which objects are clustered according to how likely they are to belong to the same distribution. An Expectation-Maximization (EM) algorithm is used to fit the unsupervised model.
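
To make the distribution-based clustering idea concrete, the following is a minimal sketch of the E-step of an EM algorithm for a Beta-binomial mixture. It is an illustration only, not the dissertation's implementation: the function names, the two-component setup, and all parameter values are assumptions, and the M-step (re-estimating the weights and the Beta parameters) is omitted.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b), via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_logpmf(k, n, a, b):
    """Log-probability of k successes in n trials under Beta-binomial(n, a, b)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)

def responsibilities(k, n, weights, params):
    """E-step for one observation: posterior probability of each component.

    weights: mixing proportions (assumed to sum to 1)
    params:  list of (a, b) Beta parameters, one pair per component
    """
    log_terms = [
        math.log(w) + beta_binomial_logpmf(k, n, a, b)
        for w, (a, b) in zip(weights, params)
    ]
    # Subtract the maximum before exponentiating (log-sum-exp trick).
    m = max(log_terms)
    unnorm = [math.exp(t - m) for t in log_terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical two-component mixture: one component favours low success
# rates, the other high success rates. These values are illustrative only.
weights = [0.5, 0.5]
params = [(2.0, 8.0), (8.0, 2.0)]

# Each observation is (successes, trials); made-up data for illustration.
data = [(1, 10), (9, 10), (5, 10)]
resp = [responsibilities(k, n, weights, params) for k, n in data]
```

Each entry of `resp` gives the soft cluster membership of one observation. A full EM fit would alternate this step with an M-step that updates `weights` and `params` until the likelihood stops improving.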

| Date of Award | 2020 |
|---|---|
| Original language | English |
| Awarding Institution | The Hong Kong University of Science and Technology |
| Supervisor | Inchi HU (Supervisor) |

Two essays on high-dimensional classification and clustering analysis
XIONG, J. (Author). 2020
Student thesis: Doctoral thesis