Deep audio-visual saliency: Baseline model and data

Hamed Rezazadegan Tavakoli, Ali Borji, Juho Kannala, Esa Rahtu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

2 Citations (Scopus)

Abstract

This paper introduces a conceptually simple and effective Deep Audio-Visual Embedding for dynamic saliency prediction, dubbed "DAVE", in conjunction with our efforts towards building an Audio-Visual Eye-tracking corpus named "AVE". Despite the strong relation between auditory and visual cues in guiding gaze during perception, video saliency models consider only visual cues and neglect the auditory information that is ubiquitous in dynamic scenes. Here, we propose a baseline deep audio-visual saliency model for multi-modal saliency prediction in the wild. The proposed model is therefore intentionally designed to be simple. A video baseline model is also developed on the same architecture to assess the effectiveness of the audio-visual models on a fair basis. We demonstrate that the audio-visual saliency model outperforms the video saliency models. The data and code are available at https://hrtavakoli.github.io/AVE/ and https://github.com/hrtavakoli/DAVE.

Original language: English
Title of host publication: Proceedings ETRA 2020 Short Papers - ACM Symposium on Eye Tracking Research and Applications, ETRA 2020
Editors: Stephen N. Spencer
Publisher: ACM
Number of pages: 5
ISBN (Electronic): 9781450371346
Publication status: Published - 6 Feb 2020
MoE publication type: A4 Article in a conference publication
Event: ACM Symposium on Eye Tracking Research and Applications - Stuttgart, Germany
Duration: 2 Jun 2020 - 5 Jun 2020

Conference

Conference: ACM Symposium on Eye Tracking Research and Applications
Abbreviated title: ETRA
Country/Territory: Germany
City: Stuttgart
Period: 02/06/2020 - 05/06/2020

Keywords

  • Audio-Visual Saliency
  • Deep Learning
  • Dynamic Visual Attention
