Abstract
Machine translation is an important natural language processing application, enabling widened access to information, cultural interchange, and business opportunities in a multilingual world. Driven by research into deep neural networks, machine translation has recently made rapid advances, particularly in the fluency of the translation output. As the methods tend to be data-hungry, high-resource languages have benefited more than low-resource ones. In this work, the aim is to improve machine translation into low-resource morphologically rich languages. Rich morphology leads to a combinatorial explosion in the number of word forms, resulting in very large vocabularies containing many poorly modeled rare words. This thesis addresses these challenges with multiple approaches. The focus is on methods for segmenting words into subwords, to get more frequent and thus more easily learned representations, and to increase the symmetry between languages. It is important to exploit additional resources from related tasks, such as parallel data from related high-resource language pairs and monolingual data from both low- and high-resource languages. Useful auxiliary data sets for multimodal translation can be found in captioning and text-only translation tasks. The methods for exploiting this auxiliary data include cross-lingual learning and data augmentation, e.g. using denoising sequence autoencoders and subword regularization. Learning setups used in the thesis include using unsupervised and language-independent methods, using active learning to guide an annotation effort to produce more informative data, and using scheduled multi-task learning to improve cross-lingual transfer. Contributions of the thesis include five novel segmentation methods: Morfessor FlatCat, Omorfi-restricted Morfessor, Cognate Morfessor, Morfessor EM+Prune, and a semi-supervised neural method. An active learning strategy for Morfessor FlatCat is presented.
Evaluation of segmentation quality is performed using both intrinsic and extrinsic automatic methods. Morfessor EM+Prune finds models with both lower cost and better quality in unsupervised segmentation than Morfessor Baseline. Active learning is superior to random selection for collecting annotations. The best performance in semi-supervised segmentation is achieved when using Morfessor FlatCat segmentations as features in a conditional random field. Contributions to machine translation include a target-side multi-task learning scheme and scheduled multi-task learning with a denoising sequence autoencoder. LeBLEU, an evaluation measure suitable for morphologically rich languages, is presented. Evaluation of translation quality is performed using both automatic and human evaluation. When resources are scarce, the most important auxiliary data comes from related languages. Other types of auxiliary data, such as monolingual corpora, are also beneficial, and the gains are partly cumulative.
Translated title of the contribution | Konekäännös morfologisesti rikkaisiin resurssiniukkoihin kieliin |
---|---|
Original language | English |
Qualification | Doctor's degree |
Awarding Institution | |
Supervisors/Advisors | |
Publisher | |
Print ISBNs | 978-952-64-0168-3 |
Electronic ISBNs | 978-952-64-0169-0 |
Publication status | Published - 2020 |
MoE publication type | G5 Doctoral dissertation (article) |
Keywords
- machine translation
- morpheme segmentation
- subwords
- unsupervised learning
- semi-supervised learning
- transfer learning
- multi-task learning
- active learning