
Cross-lingual language modeling for low-resource speech recognition

  • Ping Xu

Student thesis: Master's thesis

Abstract

We present, for the first time, an end-to-end speech transcription and translation system with cross-lingual language modeling based on weighted finite-state transducers (WFSTs). The system decodes speech in a resource-poor source language both into its transcription and into a resource-rich target language. The proposed cross-lingual language modeling approach uses phrase-level translation, which comprises phrase-level transduction and syntactic reordering. Phrase-level transduction can perform n-to-m cross-lingual transduction, whereas word-level transduction allows only n-to-n transduction. Syntactic reordering models the syntactic discrepancies between the resource-poor and resource-rich languages. We can therefore leverage the statistics of a resource-rich language to improve the language model of a resource-poor language in a truly cross-lingual language model, which simultaneously improves speech recognition performance for the resource-poor language and provides a translation from the resource-poor language into the resource-rich language. In this thesis, we focus on recognizing a non-standard Chinese language, Cantonese, which does not have a written form, and translating it into standard Chinese. The cross-lingual language model is trained on a large amount of resource-rich language (e.g. Mandarin) data, a small amount of resource-poor language (e.g. Cantonese) data, and some parallel data between the resource-poor and resource-rich languages.
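To illustrate the difference between word-level (n-to-n) and phrase-level (n-to-m) transduction described above, here is a minimal sketch using a greedy longest-match over a toy phrase table. This is purely illustrative: the thesis implements transduction via WFST composition, not greedy matching, and the Jyutping/Pinyin tokens and mappings below are hypothetical examples, not drawn from the actual system.

```python
# Hypothetical phrase table: source (Cantonese) token tuple -> target
# (Mandarin) token tuple. Because keys and values are tuples, an n-token
# source phrase may map to an m-token target phrase (n != m is allowed,
# which word-level 1-to-1 transduction cannot express).
PHRASE_TABLE = {
    ("ngo5", "dei6"): ("wo3", "men2"),   # "we"
    ("m4", "goi1"): ("xie4", "xie5"),    # "thank you"
    ("hai6",): ("shi4",),                # "is"
}

def phrase_transduce(tokens):
    """Greedy longest-match phrase transduction of a source token list.

    Stands in for the phrase transducer of the cross-lingual LM; a real
    WFST-based system would weight and compose alternatives instead.
    """
    out, i = [], 0
    max_len = max(len(k) for k in PHRASE_TABLE)
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in PHRASE_TABLE:
                out.extend(PHRASE_TABLE[phrase])
                i += n
                break
        else:
            out.append(tokens[i])  # pass unknown tokens through unchanged
            i += 1
    return out
```

For example, `phrase_transduce(["ngo5", "dei6", "hai6"])` matches the two-token phrase before falling back to single tokens, yielding `["wo3", "men2", "shi4"]`. Syntactic reordering, the second component of the approach, would additionally permute target phrases and is not modeled here.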
Evaluations on Cantonese speech recognition and Cantonese-to-standard-Mandarin-Chinese translation tasks show that the proposed cross-lingual language modeling significantly improves recognition and translation performance: up to 12.5% relative word error rate (WER) reduction over the baseline language model interpolation, and 6.6% relative WER reduction together with 18.5% relative bilingual evaluation understudy (BLEU) score improvement over the best word-level transduction approach. The model further generalizes to speech translation between any source and target language pair via the transcription and translation framework.
Date of Award: 2012
Original language: English
Awarding Institution
  • The Hong Kong University of Science and Technology
