This paper presents a framework for image captioning that exploits the scene context. To date, most captioning models have relied on a combination of Convolutional Neural Networks (CNN) and the Long Short-Term Memory (LSTM) model, trained in an end-to-end fashion. Recently, there has been extensive research on improving the language model and the CNN architecture, utilizing attention mechanisms, and improving the learning techniques in such systems. A less studied area is the contribution of the scene context to captioning. In this work, we study the role of the scene context, consisting of the scene type and objects. To this end, we augment the CNN features with scene context features, including scene detectors, objects and their localization, and their combinations. We use the scene context features as an initialization feature at the zeroth time step of an LSTM model with deep residual connections. In subsequent time steps, however, the model uses the original CNN features. The proposed language model, contrary to more conventional ones, thus has access to visual features throughout the whole process of sentence generation. We demonstrate that the scene context features affect the language formation and improve the captioning results in the proposed framework. We also report results on the Microsoft COCO benchmark, where our model achieves state-of-the-art performance on the test set.
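The initialization scheme described above — scene context features at the zeroth time step, original CNN features at subsequent steps — can be illustrated with a minimal toy sketch. This is not the authors' implementation: it uses a hand-rolled numpy LSTM cell with random weights, toy dimensions, and omits the word embeddings, residual connections, and decoding logic of the full model.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W, U, b):
    """One standard LSTM step: input/forget/output gates and cell candidate."""
    z = W @ x + U @ h + b                     # stacked pre-activations, shape (4H,)
    i, f, o, g = np.split(z, 4)
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    c_new = sig(f) * c + sig(i) * np.tanh(g)  # updated cell state
    h_new = sig(o) * np.tanh(c_new)           # updated hidden state
    return h_new, c_new

D, H = 8, 4                                   # toy feature / hidden sizes (assumed)
W = rng.normal(0, 0.1, (4 * H, D))
U = rng.normal(0, 0.1, (4 * H, H))
b = np.zeros(4 * H)

scene_ctx = rng.normal(size=D)                # scene context features (t = 0 only)
cnn_feat = rng.normal(size=D)                 # original CNN features (t >= 1)

h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(scene_ctx, h, c, W, U, b)    # zeroth step: scene context init
for t in range(3):                            # later steps: CNN features (the full
    h, c = lstm_step(cnn_feat, h, c, W, U, b) # model would also feed word inputs)
```

The key design point is that the visual input changes between time step zero and the rest, so the scene context conditions the initial LSTM state while generation proceeds on the standard CNN features.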
|Title||Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion|
|DOI - permanent links|
|Status||Published - 2016|
|OKM publication type||A4 Article in conference proceedings|
|Event||ACM Multimedia - Amsterdam, Netherlands|
Duration: 15 October 2016 → 19 October 2016
|Period||15/10/2016 → 19/10/2016|