Comparing human and automated approaches to visual storytelling

Sabine Braun, Kim Starr, Jorma Laaksonen

Research output: Chapter in Book/Report/Conference proceeding › Chapter › Scientific › peer-review


Abstract

This chapter focuses on the recent surge of interest in automating methods for describing audiovisual content, whether for image search and retrieval, visual storytelling or in response to the rising demand for audio description following changes to regulatory frameworks. While computer vision communities have intensified research into the automatic generation of video descriptions (Bernardi et al., 2016), the automation of still image captioning remains a challenge in terms of accuracy (Husain and Bober, 2016). Moving images pose additional challenges linked to temporality, including co-referencing (Rohrbach et al., 2017) and other features of narrative continuity (Huang et al., 2016). Machine-generated descriptions are currently less sophisticated than their human equivalents, and frequently incoherent or incorrect. By contrast, human descriptions are more elaborate and reliable but are expensive to produce. Nevertheless, they offer information about visual and auditory elements in audiovisual content that can be exploited for research into machine training. Based on our research conducted in the EU-funded MeMAD project, this chapter outlines a methodological approach for a systematic comparison of human- and machine-generated video descriptions, drawing on corpus-based and discourse-based approaches, with a view to identifying key characteristics and patterns in both types of description, and exploiting human knowledge about video description for machine training.

A model for machine-generated content description is therefore likely to be a more achievable goal in the shorter term than a model for generating elaborate audio descriptions. Relevance Theory (RT) focuses on the human ability to derive meaning through inferential processes. RT asserts that these processes draw on common knowledge and cultural experience, and that they are guided by the human tendency to maximise relevance and by the assumption that speakers/storytellers normally choose the optimally relevant way of communicating their intentions. Moving beyond basic comprehension of events to interpretation and conjecture requires the viewer to employ ‘extradiegetic’ references such as social convention, cultural norms and life experience.
Original language: English
Title of host publication: Innovation in Audio Description Research
Publisher: Routledge
Chapter: 8
Number of pages: 38
ISBN (Electronic): 9781003052968
ISBN (Print): 9781138356672
DOIs
Publication status: Published - 2020
MoE publication type: A3 Part of a book or another research book

Publication series

Name: IATIS Yearbook
