Abstract
Speech recognition specifically, and language technology more generally, have started to find everyday use. Challenging language tasks have become feasible through continued growth in data resources and compute capacity, and through neural network methods that are able to take advantage of this growth. As applications continue to integrate more deeply into our lives, it is important to understand and follow the many directions that these fields may take. At the turn of the 2020s, end-to-end models have received a lot of attention. End-to-end models hold the promise of simpler solutions, which nonetheless may scale better with data and compute. On the other hand, end-to-end models defy decomposing tasks into easier subproblems. Such decomposition allows modular designs, which permit a wider variety of data sources to be used. It remains unclear whether end-to-end models are truly an improvement over previous technologies. It is not straightforward to compare end-to-end and decomposed solutions fairly, because of their many differences. This thesis proposes a principled approach for comparing such heterogeneous solutions and applies it to speech recognition. In their default configuration, end-to-end models forgo many useful data sources and rely solely on expensive end-to-end labeled data. This thesis explores methods for leveraging additional data sources in speech recognition, canonical morpheme segmentation, and spoken language translation. Additional data sources are especially useful in low-data and under-resourced tasks. These difficult tasks often need the structure imposed by decomposed solutions. This thesis investigates end-to-end models in an under-resourced speech recognition task and a low-data canonical morpheme segmentation task. The tasks explored in this thesis are connected through a shared architecture: attention-based encoder-decoder models.
Though these attention-based models are most often outperformed by hidden Markov model speech recognition systems, they showcase remarkable flexibility. They succeed in speech recognition with as little as tens of hours and as much as thousands of hours of data. They learn to exploit auxiliary speaker and segmentation-marker inputs. They perform spoken language translation in one step. They even earn the author first place in a public benchmark competition.
Translated title of the contribution | Attentiopohjaiset kokonaismallit kieliteknologiassa |
---|---|
Original language | English |
Qualification | Doctor's degree |
Awarding Institution | |
Supervisors/Advisors | |
Publisher | |
Print ISBNs | 978-952-64-1671-7 |
Electronic ISBNs | 978-952-64-1672-4 |
Publication status | Published - 2024 |
MoE publication type | G5 Doctoral dissertation (article) |
Keywords
- speech recognition
- spoken language translation
- canonical morpheme segmentation
- end-to-end models