Morph-based speech retrieval: Indexing methods and evaluations of unsupervised morphological analysis

Ville T. Turunen

    Research output: Thesis › Doctoral Thesis › Collection of Articles

    Abstract

    Speech retrieval enables users to find information in collections of spoken material. Automatic speech recognition (ASR) is used to transform the spoken words into text, and information retrieval (IR) methods are used for searching. Traditional ASR systems have a predefined vocabulary of words, and any word that is out-of-vocabulary (OOV) cannot be recognized. Typically, rare words are excluded, which is problematic for retrieval, because query words are often rare words such as proper names. The limited vocabulary is especially problematic for languages such as Finnish that have a very large number of distinct word forms. In this thesis, morpheme-like subword units are used for speech recognition and retrieval. The subword units, referred to as morphs, are discovered using a data-driven method that learns morphological structure from text data. Using this approach, it is possible to recognize any word in speech, even a word that was not in the training data, as a sequence of morphs. A rule-based morphological analyzer could be used to find base forms of the recognized words for indexing. However, the vocabulary of the analyzer is also limited, and recognition errors cause further problems for the analyzer. Instead, in this work, morphs are used as index terms as well. In Finnish speech retrieval experiments, the morph-based approach is compared to using word-based language models in ASR, and to using base forms in retrieval. Also, morphs are compared for story segmentation of speech. The results show that morph-based language models clearly outperform word-based models in retrieval performance. As index terms, using morphs is about as efficient as using base forms, but combining the two approaches is better than either alone, especially when there is a high proportion of unseen words in the queries. The effect of suboptimal morph segmentations is reduced by using alternative morph segmentations of query words and by using latent semantic indexing.
    Even if the morph deemed most likely by the ASR is incorrect, it is possible that the correct one is among the candidates the ASR considers. Utilizing the candidates in retrieval can improve performance. In this thesis, a representation of ASR hypotheses called a confusion network is used for extracting alternative recognition results. A rank-based weighting of index terms is proposed and found to outperform weighting based on posterior probabilities. This thesis also studies evaluation metrics for unsupervised morphological analysis methods. Application evaluations such as speech retrieval are time-consuming and cannot be used during method development. Different linguistic evaluation metrics have been proposed and are compared in this thesis, for example by correlating the metrics with application performance results.
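
The difference between the two index-term weightings mentioned above can be illustrated with a small sketch. The abstract does not give the thesis's formulas, so everything here is an assumption made for illustration: the toy confusion network, its morphs and posterior values, the function names, and in particular the use of 1/rank as the rank-based weight.

```python
from collections import defaultdict

# Hypothetical confusion network: one list per time slot, each holding
# (morph, posterior probability) candidates. All values are invented.
confusion_network = [
    [("puhe", 0.6), ("pune", 0.3), ("puhu", 0.1)],
    [("tiedon", 0.7), ("tieton", 0.3)],
    [("haku", 0.9), ("haka", 0.1)],
]


def posterior_weights(network):
    """Weight each index term by the sum of its posterior probabilities."""
    weights = defaultdict(float)
    for slot in network:
        for morph, posterior in slot:
            weights[morph] += posterior
    return dict(weights)


def rank_weights(network, max_rank=3):
    """Weight each index term by its rank within a slot.

    The 1/rank form is an assumed stand-in, not the thesis's formula.
    Only the ordering of the candidates matters, not the posterior values.
    """
    weights = defaultdict(float)
    for slot in network:
        ranked = sorted(slot, key=lambda cand: cand[1], reverse=True)
        for rank, (morph, _) in enumerate(ranked[:max_rank], start=1):
            weights[morph] += 1.0 / rank
    return dict(weights)


if __name__ == "__main__":
    print(posterior_weights(confusion_network))
    print(rank_weights(confusion_network))
```

In this sketch a second-ranked candidate always receives the same weight regardless of how small its posterior estimate is, which is the property that distinguishes a rank-based scheme from a posterior-based one.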
    Translated title of the contribution: Morfeihin perustuva puhetiedonhaku: indeksointimenetelmiä sekä ohjaamattoman morfologisen analyysin evaluaatioita
    Original language: English
    Qualification: Doctor's degree
    Awarding Institution
    • Aalto University
    Supervisors/Advisors
    • Oja, Erkki, Supervising Professor
    • Kurimo, Mikko, Thesis Advisor
    Publisher
    Print ISBNs: 978-952-60-4717-1
    Electronic ISBNs: 978-952-60-4718-8
    Publication status: Published - 2012
    MoE publication type: G5 Doctoral dissertation (article)

    Keywords

    • speech retrieval
    • spoken document retrieval
    • subword indexing
    • morphemes
    • out-of-vocabulary
    • confusion networks
    • morphological analysis
