Evaluation of audio-visual alignments in visually grounded speech models

Khazar Khorrami, Okko Räsänen

    Research output: Chapter in Book/Report/Conference proceeding; Conference article in proceedings; Scientific; peer-review

    5 Citations (Scopus)


    Systems that can find correspondences between multiple modalities, such as between speech and images, have great potential to solve different recognition and data analysis tasks in an unsupervised manner. This work studies multimodal learning in the context of visually grounded speech (VGS) models, and focuses on their recently demonstrated capability to extract spatiotemporal alignments between spoken words and the corresponding visual objects without ever being explicitly trained for object localization or word recognition. As the main contributions, we formalize the alignment problem in terms of an audiovisual alignment tensor that is based on earlier VGS work, introduce systematic metrics for evaluating model performance in aligning visual objects and spoken words, and propose a new VGS model variant for the alignment task that utilizes a cross-modal attention layer. We test our model and a previously proposed model in the alignment task using SPEECH-COCO captions coupled with MSCOCO images. We compare the alignment performance measured with our proposed evaluation metrics to performance in the semantic retrieval task commonly used to evaluate VGS models. We show that the cross-modal attention layer not only helps the model to achieve higher semantic cross-modal retrieval performance, but also leads to substantial improvements in the alignment performance between image objects and spoken words.

    Original language: English
    Title of host publication: 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
    Publisher: International Speech Communication Association (ISCA)
    Number of pages: 5
    ISBN (Electronic): 9781713836902
    Publication status: Published - 2021
    MoE publication type: A4 Conference publication
    Event: Interspeech - Brno, Czech Republic
    Duration: 30 Aug 2021 to 3 Sept 2021
    Conference number: 22

    Publication series

    Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
    ISSN (Print): 2308-457X
    ISSN (Electronic): 1990-9772


    Abbreviated title: INTERSPEECH
    Country/Territory: Czech Republic


    • Audio-visual alignment
    • Cross-modal learning
    • Visual object localization
    • Word segmentation

