Word representations obtained from large textual corpora have gained popularity in natural language processing, as they can improve performance on supervised tasks for which only comparatively little labeled training data is available (Turian, Ratinov, and Bengio 2010). Recently, a series of scalable methods beginning with Word2Vec (Mikolov, Chen, et al. 2013) has made it possible to learn from very large unlabeled corpora, yielding better representations as well as representations for more words. The long-tail nature of human language, which implies that most words are infrequent (Zipf 1949; Mandelbrot 1954), nevertheless prevents these methods from representing infrequent words well (Lowe 2001; Luong, Socher, and Manning 2013). Since words are typically formed from meaningful parts, taking their internal structure into account has been proposed as a remedy (Harris 1954; Luong, Socher, and Manning 2013). Recently, Bojanowski et al. (2017) proposed fastText, a scalable model incorporating such information: it allocates separate parameters for words and their parts, with part-specific parameters shared among all words containing the respective part. However, parameters specific to rare words and rare word-parts are still estimated from little data and can suffer from unreliability and overfitting, which degrades the resulting word representations.

This thesis therefore introduces a group lasso regularization (Yuan and Lin 2006) that jointly selects words and word-parts during training; the parameters of deselected groups are pushed to 0, preventing them from harming the resulting representations. For optimization, a scalable proximal asynchronous stochastic gradient descent (ProxASGD) optimizer is introduced. The proposed method is evaluated on a variety of tasks, and our results show that the regularization yields better representations for rare words and for morphologically complex languages such as German. Providing separate regularization hyperparameters for words and word-parts makes it possible to trade off the inclusion of semantic and syntactic information.
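To make the approach concrete, the regularized objective can be sketched as follows; the notation is illustrative rather than taken from the thesis, with \(v_w\) a word-specific vector, \(z_g\) an n-gram (word-part) vector, \(G_w\) the set of n-grams of word \(w\), \(\mathcal{L}\) the unregularized fastText loss, and \(\lambda_{\mathrm{w}}\), \(\lambda_{\mathrm{g}}\) the separate regularization strengths for words and word-parts:

```latex
% Sketch of a group-lasso-regularized fastText objective (notation illustrative):
\min_{\{v_w\},\,\{z_g\}}
  \mathcal{L}\Big(\Big\{ v_w + \sum_{g \in G_w} z_g \Big\}\Big)
  + \lambda_{\mathrm{w}} \sum_{w} \lVert v_w \rVert_2
  + \lambda_{\mathrm{g}} \sum_{g} \lVert z_g \rVert_2
```

Because the group lasso penalizes each vector's Euclidean norm as a whole, an uninformative word or word-part has its entire vector driven exactly to 0 rather than merely shrunk coordinate-wise.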
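The proximal step that produces this exact sparsity is block soft-thresholding. The following is a minimal single-threaded sketch of one proximal SGD update, assuming NumPy; the function names and signatures are hypothetical, and the thesis's ProxASGD additionally handles asynchronous updates:

```python
import numpy as np

def group_soft_threshold(vec, step_size, lam):
    """Proximal operator of lam * ||vec||_2 (block soft-thresholding).

    Shrinks the whole parameter group toward 0 and sets it exactly to 0
    when its norm falls below step_size * lam.
    """
    norm = np.linalg.norm(vec)
    if norm <= step_size * lam:
        return np.zeros_like(vec)
    return (1.0 - step_size * lam / norm) * vec

def prox_sgd_step(vec, grad, step_size, lam):
    """One proximal SGD update: a gradient step on the data loss,
    followed by the group-lasso prox on the regularizer."""
    return group_soft_threshold(vec - step_size * grad, step_size, lam)

# Example: a rare n-gram vector with little gradient support is deselected
# (pushed exactly to 0) once the penalty outweighs its norm.
z_g = np.array([0.02, -0.01, 0.015])
print(prox_sgd_step(z_g, grad=np.zeros_like(z_g), step_size=0.05, lam=1.0))
```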
| Date of Award | 2019 |
|---|---|
| Original language | English |
| Awarding Institution | The Hong Kong University of Science and Technology |
Structured sparsity for pre-training distributed word representations with subword information
LAUSEN, L. E. (Author). 2019
Student thesis: Master's thesis