Abstract
This paper presents a framework for image captioning that exploits the scene context. To date, most captioning models have relied on the combination of Convolutional Neural Networks (CNNs) and the Long Short-Term Memory (LSTM) model, trained in an end-to-end fashion. Recently, there has been extensive research on improving the language model and the CNN architecture, utilizing attention mechanisms, and improving the learning techniques in such systems. A less studied area is the contribution of the scene context to captioning. In this work, we study the role of the scene context, consisting of the scene type and objects. To this end, we augment the CNN features with scene context features, including scene detectors, objects and their localization, and their combinations. We use the scene context features as an initialization feature at the zeroth time step in an LSTM model with deep residual connections. In subsequent time steps, however, the model uses the original CNN features. The proposed language model, contrary to more conventional ones, thus has access to visual features throughout the whole process of sentence generation. We demonstrate that the scene context features affect the language formation and improve the captioning results in the proposed framework. We also report results from the Microsoft COCO benchmark, where our model achieves state-of-the-art performance on the test set.
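The decoding scheme described in the abstract (scene context at the zeroth time step, CNN features at every later step) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a single-layer LSTM without the deep residual connections, random untrained weights, and only the Python standard library. The names `init_params`, `greedy_caption`, `P_scene`, `P_cnn`, `E`, and `W_out` are hypothetical, and fusing the word embedding with the projected CNN features by addition is an assumption made here for illustration; the abstract does not say how the two are combined.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W, b):
    """One standard LSTM step; W maps [x; h] to four stacked gate pre-activations."""
    z = [p + q for p, q in zip(matvec(W, x + h), b)]
    n = len(h)
    i, f, o, g = (z[k * n:(k + 1) * n] for k in range(4))
    c_new = [sigmoid(fk) * ck + sigmoid(ik) * math.tanh(gk)
             for ik, fk, gk, ck in zip(i, f, g, c)]
    h_new = [sigmoid(ok) * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def init_params(scene_dim, cnn_dim, embed_dim, hidden_dim, vocab, seed=0):
    """Random, untrained weights (hypothetical layout, for illustration only)."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "P_scene": mat(embed_dim, scene_dim),   # projects scene context features
        "P_cnn": mat(embed_dim, cnn_dim),       # projects CNN features
        "E": mat(vocab, embed_dim),             # word embeddings
        "W": mat(4 * hidden_dim, embed_dim + hidden_dim),
        "b": [0.0] * (4 * hidden_dim),
        "W_out": mat(vocab, hidden_dim),        # hidden state -> word scores
    }

def greedy_caption(scene_feat, cnn_feat, p, max_len=5, bos=0):
    """Greedy decoding: scene context at step 0, CNN features at later steps."""
    hidden_dim = len(p["b"]) // 4
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    # Zeroth time step: the input is the projected scene context vector.
    h, c = lstm_step(matvec(p["P_scene"], scene_feat), h, c, p["W"], p["b"])
    visual = matvec(p["P_cnn"], cnn_feat)
    tokens, word = [], bos
    for _ in range(max_len):
        # Later steps: previous word embedding plus CNN features
        # (additive fusion is an assumption of this sketch).
        x = [e + v for e, v in zip(p["E"][word], visual)]
        h, c = lstm_step(x, h, c, p["W"], p["b"])
        scores = matvec(p["W_out"], h)
        word = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(word)
    return tokens
```

Calling `greedy_caption(scene_feat, cnn_feat, init_params(...))` returns a list of word indices. With untrained weights the indices are meaningless; the sketch only shows where each feature type enters the recurrence.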
Original language | English |
---|---|
Title of host publication | Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion |
Publisher | ACM |
Pages | 1-8 |
ISBN (Electronic) | 978-1-4503-4519-4 |
DOIs | |
Publication status | Published - 2016 |
MoE publication type | A4 Conference publication |
Event | ACM Multimedia, Amsterdam, Netherlands. Duration: 15 Oct 2016 → 19 Oct 2016. Conference number: 24 |
Conference
Conference | ACM Multimedia |
---|---|
Abbreviated title | ACMMM |
Country/Territory | Netherlands |
City | Amsterdam |
Period | 15/10/2016 → 19/10/2016 |
Fingerprint
Dive into the research topics of 'Exploiting Scene Context for Image Captioning'. Together they form a unique fingerprint.
Projects
- 1 Finished
Finnish centre of excellence in computational inference research
Xu, Y., Rintanen, J., Kaski, S., Anwer, R., Parviainen, P., Soare, M., Vuollekoski, H., Rezazadegan Tavakoli, H., Peltola, T., Blomstedt, P., Puranen, S., Dutta, R., Gebser, M., Mononen, T., Bogaerts, B., Tasharrofi, S., Pesonen, H., Weinzierl, A. & Yang, Z.
01/01/2015 → 31/12/2017
Project: Academy of Finland: Other research funding