The performance of modern speech recognition systems depends heavily on the availability of sufficient training data. Although the recognition accuracy of a system trained on a large amount of data can exceed 90%, accuracy is much lower when training data is inadequate. Acquiring large amounts of manually transcribed speech is the major cost in deploying a speech recognition system for any new language. There is therefore a strong demand for techniques that enable practical speech recognition systems in languages with limited training data, with a good trade-off between computational cost and recognition accuracy. Traditionally, GMM-HMM acoustic models with diagonal or full covariance matrices have been chosen heuristically according to the amount of training data: full covariance models are seldom used when data is limited, since they tend to overfit, while diagonal covariance models simply assume feature independence, which is an oversimplification. In this dissertation, we propose regularized and sparse models to address the two problems that conventional diagonal and full covariance models face: an incorrect model assumption, and over-fitting when training data is insufficient. Three widely used regularization methods, namely ridge, lasso and elastic net regularization, are investigated in this thesis. Lasso and elastic net regularization lead to sparse models, meaning that many entries of the precision matrices are shrunk to exactly zero. We also propose weighted lasso regularization to train acoustic models with sparse banded precision matrices. The sparse banded models resulting from weighted lasso regularization subsume traditional acoustic models with diagonal or full covariance matrices as special cases.
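The banding idea above can be illustrated with a small sketch (not the thesis's exact EM-based update): a weighted-lasso penalty whose weights grow off the diagonal, applied via the soft-thresholding operator that underlies lasso sparsity. With zero weight on the diagonal, a finite weight inside the band, and an infinite weight outside it, one proximal step leaves the diagonal intact, shrinks in-band entries, and zeros everything else, recovering the diagonal model (bandwidth 0) and full model (no penalty) as special cases. All function names here are illustrative.

```python
import numpy as np

def banded_lasso_weights(dim, bandwidth, base=1.0):
    """Illustrative weight matrix for a weighted-lasso penalty on a
    precision matrix: 0 on the diagonal (never penalized), `base`
    inside the band, infinity outside (forcing those entries to 0)."""
    idx = np.arange(dim)
    dist = np.abs(idx[:, None] - idx[None, :])
    return np.where(dist == 0, 0.0,
                    np.where(dist <= bandwidth, base, np.inf))

def soft_threshold(x, t):
    """Proximal operator of the l1 penalty: shrinks toward zero and
    sets entries with |x| <= t exactly to zero -- the mechanism by
    which lasso produces sparse precision matrices."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# One soft-thresholding step on a dense symmetric positive-definite
# "precision matrix" yields a banded sparsity pattern.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
P = A @ A.T + 6 * np.eye(6)                 # dense SPD stand-in
W = banded_lasso_weights(6, bandwidth=1)    # tri-diagonal band
lam = 0.5
P_sparse = soft_threshold(P, lam * W)       # banded, diagonal untouched
```

A full estimator would alternate such shrinkage with likelihood-driven updates; the point here is only that the weight pattern, not a hard structural constraint, produces the band.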
Regularization terms are added to the traditional objective functions to penalize complex models, so that the resulting models do not suffer from serious over-fitting. We derive the training procedure within an HMM training framework by maximizing the new objective functions, and discuss other implementation issues. Both maximum likelihood training and discriminative training are investigated. Experimental results on three limited-size corpora, namely Wall Street Journal, Cantonese and Mandarin data sets, show that the proposed models significantly outperform conventional diagonal and full covariance models in terms of recognition accuracy. Based on our experimental results, lasso regularization is recommended over the other regularization schemes. We also found that sparse banded models require less computation.
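The contrast between the three penalty terms can be seen from their scalar proximal operators, a self-contained sketch rather than the thesis's HMM re-estimation formulas: ridge only rescales a parameter, lasso sets small parameters exactly to zero, and elastic net combines both effects. This is why lasso and elastic net yield sparse precision-matrix entries while ridge does not.

```python
import numpy as np

def prox_ridge(x, lam):
    # argmin_z 0.5*(z - x)**2 + lam*z**2 : shrinks every entry,
    # but never produces an exact zero.
    return x / (1.0 + 2.0 * lam)

def prox_lasso(x, lam):
    # argmin_z 0.5*(z - x)**2 + lam*|z| : soft thresholding,
    # entries with |x| <= lam become exactly zero.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_elastic_net(x, lam1, lam2):
    # l1 thresholding followed by l2 rescaling: sparse like lasso,
    # with additional ridge-style shrinkage of the survivors.
    return prox_lasso(x, lam1) / (1.0 + 2.0 * lam2)

x = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
ridge = prox_ridge(x, 0.5)          # all entries halved, none zero
lasso = prox_lasso(x, 0.5)          # -0.3 and 0.1 become exactly 0
enet = prox_elastic_net(x, 0.5, 0.5)
```

In the full acoustic-model training these operators appear inside the maximization step for the precision-matrix parameters; the scalar view above just makes the sparsity behavior concrete.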
| Date of Award | 2013 |
|---|---|
| Original language | English |
| Awarding Institution | The Hong Kong University of Science and Technology |
Regularized and sparse models for low resource speech recognition
Zhang, W. (Author). 2013
Student thesis: Doctoral thesis