Exploiting Scene Context for Image Captioning

Rakshith Shetty, Hamed Rezazadegan Tavakoli, Jorma Laaksonen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

5 Citations (Scopus)

Abstract

This paper presents a framework for image captioning that exploits the scene context. To date, most captioning models have relied on the combination of Convolutional Neural Networks (CNN) and the Long Short-Term Memory (LSTM) model, trained in an end-to-end fashion. Recently, there has been extensive research towards improving the language model and the CNN architecture, utilizing attention mechanisms, and improving the learning techniques in such systems. A less studied area is the contribution of the scene context to captioning. In this work, we study the role of the scene context, consisting of the scene type and objects. To this end, we augment the CNN features with scene context features, including scene detectors, objects and their localization, and their combinations. We use the scene context features as an initialization feature at the zeroth time step in an LSTM model with deep residual connections. In subsequent time steps, however, the model uses the original CNN features. The proposed language model, contrary to more conventional ones, thus has access to visual features through the whole process of sentence generation. We demonstrate that the scene context features affect the language formation and improve the captioning results in the proposed framework. We also report results from the Microsoft COCO benchmark, where our model achieves state-of-the-art performance on the test set.
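The input schedule described in the abstract (scene context feature at the zeroth time step, CNN feature at every later step) might be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `TinyLSTM`, `build_inputs`, and `run` are hypothetical, the cell is a plain single-layer LSTM without the deep residual connections, the weights are random, and the context and CNN features are assumed to have already been projected to the same dimension.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class TinyLSTM:
    """Plain single-layer LSTM cell; biases and residual connections omitted."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        cols = input_dim + hidden_dim
        self.hidden_dim = hidden_dim
        # One weight matrix per gate: input (i), forget (f), output (o), candidate (g).
        self.W = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                   for _ in range(hidden_dim)]
            for gate in "ifog"
        }

    def step(self, x, h, c):
        z = x + h  # concatenate the input with the previous hidden state
        i = [sigmoid(a) for a in matvec(self.W["i"], z)]
        f = [sigmoid(a) for a in matvec(self.W["f"], z)]
        o = [sigmoid(a) for a in matvec(self.W["o"], z)]
        g = [math.tanh(a) for a in matvec(self.W["g"], z)]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def build_inputs(context_feat, cnn_feat, word_embeddings):
    """Step 0 sees the scene context feature; later steps see the CNN feature."""
    word_dim = len(word_embeddings[0])
    inputs = [[0.0] * word_dim + context_feat]  # t = 0: no word generated yet
    for emb in word_embeddings:                 # t >= 1: previous word + CNN feature
        inputs.append(emb + cnn_feat)
    return inputs


def run(lstm, inputs):
    h = [0.0] * lstm.hidden_dim
    c = [0.0] * lstm.hidden_dim
    states = []
    for x in inputs:
        h, c = lstm.step(x, h, c)
        states.append(h)
    return states


# Toy, made-up values purely for illustration.
context_feat = [1.0, 0.0, 0.5]     # stand-in for scene-type and object scores
cnn_feat = [0.2, 0.7, 0.1]         # stand-in for the image CNN feature
words = [[0.1, 0.3], [0.4, 0.2]]   # embeddings of two previously generated words

inputs = build_inputs(context_feat, cnn_feat, words)
lstm = TinyLSTM(input_dim=len(inputs[0]), hidden_dim=4)
states = run(lstm, inputs)
```

In this sketch the visual slot of the input is never empty, which mirrors the abstract's point that the language model has access to visual features throughout sentence generation. In a trained model each hidden state in `states` would feed a word classifier; that part is left out here.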
Original language: English
Title of host publication: Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion
Publisher: ACM
Pages: 1-8
ISBN (Electronic): 978-1-4503-4519-4
Publication status: Published - 2016
MoE publication type: A4 Article in a conference publication
Event: ACM Multimedia - Amsterdam, Netherlands
Duration: 15 Oct 2016 to 19 Oct 2016
Conference number: 24

Conference

Conference: ACM Multimedia
Abbreviated title: ACMMM
Country: Netherlands
City: Amsterdam
Period: 15/10/2016 to 19/10/2016
