We have consistently improved translation quality for challenging low-resource languages by injecting semantics-based objective functions into the training pipeline at an early (training) stage rather than a late (tuning) stage, as in previous attempts. The approaches proposed in this thesis are motivated by the fact that including semantics in late-stage tuning of machine translation models has already been shown to improve translation quality. Any shortage of parallel data is a serious obstacle for conventional machine translation training techniques because of their heavy dependence on memorization from large datasets. For low-resource languages, for which parallel corpora are scarce, it becomes imperative to make learning from small data more efficient by adding constraints that create stronger inductive biases, especially linguistically well-motivated constraints such as the shallow semantic parses of the training sentences. However, while automatic semantic parsing is readily available to produce shallow semantic parses for a high-resource output language (typically English), there are no semantic parsers for low-resource languages such as Oromo, Uyghur, and Uzbek. We propose the first methods that inject a crosslingual semantics-based objective function into the training of translation models for language pairs such as Chinese–English, where semantic parsers are available for both languages. We report promising results showing that training machine translation models in this way helps bias learning towards semantically more correct bilingual constituents. Semantic statistical machine translation for low-resource languages has remained a difficult challenge, since semantic parses are usually available only for high-resource output languages such as English, not for low-resource input languages.
We extend our bilingual approaches to the low-resource setup via new training approaches that require only the output-language semantic parse. We then thoroughly analyze the reasons behind the promising results we achieved on multiple challenging low-resource translation tasks, translating Hausa, Uzbek, Uyghur, Swahili, Oromo, and Amharic into English. Our methods rely heavily on the quality of the semantic parser. We have noted that commonly used semantic parsers completely fail to parse any sentence containing any form of the verb "to be". Ignoring sentences containing the verb "to be" means throwing away a substantial amount of valuable data. Finally, we propose a novel way to semantically parse sentences that contain the verb "to be" and re-run all previous models on this newly parsed data; we also compare against a newer semantic parser that handles the verb "to be". All our results show that building efficient MT systems for low-resource languages could be more feasible than generally assumed.
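The general idea of injecting a semantics-based objective into training can be illustrated with a minimal sketch: combine the usual translation loss with a penalty for mismatched shallow semantic parses on the output (English) side. All names, the role-overlap scoring, and the interpolation weight below are illustrative assumptions, not the thesis's actual formulation.

```python
# Hypothetical sketch of a semantics-augmented training objective.
# The semantic term rewards overlap between the (role, filler) pairs
# of the hypothesis parse and the reference parse; lam is an assumed
# interpolation weight.

def semantic_agreement(hyp_roles, ref_roles):
    """F1-style overlap between (role, filler) pairs of two shallow
    semantic parses, in [0, 1]."""
    hyp, ref = set(hyp_roles), set(ref_roles)
    if not hyp or not ref:
        return 0.0
    overlap = len(hyp & ref)
    prec = overlap / len(hyp)
    rec = overlap / len(ref)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def combined_loss(mt_loss, hyp_roles, ref_roles, lam=0.5):
    """Translation loss plus a penalty for semantic-role mismatch."""
    return mt_loss + lam * (1.0 - semantic_agreement(hyp_roles, ref_roles))

# A perfectly matching parse adds no penalty; a partial match does.
full = combined_loss(1.0, [("ARG0", "boy"), ("ARG1", "ball")],
                          [("ARG0", "boy"), ("ARG1", "ball")])
partial = combined_loss(1.0, [("ARG0", "boy")],
                             [("ARG0", "boy"), ("ARG1", "ball")])
```

Because the agreement term needs only the output-side parse, a sketch like this fits the low-resource setting described above, where no parser exists for the input language.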
| Date of Award | 2018 |
|---|---|
| Original language | English |
| Awarding Institution | The Hong Kong University of Science and Technology |
Improving semantic SMT for low resource languages
BELOUCIF, M. (Author). 2018
Student thesis: Doctoral thesis