Exploiting Scene Context for Image Captioning

Research output: Article in book/conference proceedings, Conference contribution, Scientific, peer-reviewed

3 Citations (Scopus)

Abstract

This paper presents a framework for image captioning that exploits scene context. To date, most captioning models have relied on the combination of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) model, trained in an end-to-end fashion. Recently, there has been extensive research on improving the language model and the CNN architecture, utilizing attention mechanisms, and improving the learning techniques in such systems. A less studied area is the contribution of scene context to captioning. In this work, we study the role of the scene context, consisting of the scene type and objects. To this end, we augment the CNN features with scene context features, including scene detectors, objects and their localization, and their combinations. We use the scene context features as an initialization feature at the zeroth time step in an LSTM model with deep residual connections. In subsequent time steps, however, the model uses the original CNN features. The proposed language model, contrary to more conventional ones, thus has access to visual features throughout the whole process of sentence generation. We demonstrate that the scene context features affect the language formation and improve the captioning results in the proposed framework. We also report results from the Microsoft COCO benchmark, where our model achieves state-of-the-art performance on the test set.
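The input schedule described in the abstract (scene context at time step zero, CNN features at every later step) can be illustrated with a small sketch. This is not the authors' implementation: it uses a single plain LSTM layer without the deep residual connections, random weights in place of trained ones, and greedy decoding. The names `make_params`, `generate`, `W_ctx`, `W_cnn`, and the choice to add the projected CNN feature to the word embedding are all assumptions made for illustration.

```python
import math
import random


def rand_matrix(rows, cols, rng):
    """Small random weights; stand-ins for trained parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(W, x, h, c):
    """One standard LSTM step; W maps [x; h] to the stacked gates i, f, o, g."""
    n = len(h)
    z = matvec(W, x + h)  # list concatenation builds [x; h]
    i = [sigmoid(v) for v in z[:n]]
    f = [sigmoid(v) for v in z[n:2 * n]]
    o = [sigmoid(v) for v in z[2 * n:3 * n]]
    g = [math.tanh(v) for v in z[3 * n:]]
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new


def make_params(ctx_dim, cnn_dim, embed_dim, hidden, vocab, seed=0):
    """Hypothetical parameter set; all shapes are illustrative."""
    rng = random.Random(seed)
    return {
        "W_ctx": rand_matrix(embed_dim, ctx_dim, rng),   # scene context -> LSTM input
        "W_cnn": rand_matrix(embed_dim, cnn_dim, rng),   # CNN feature -> LSTM input
        "E": rand_matrix(vocab, embed_dim, rng),         # word embeddings
        "W_lstm": rand_matrix(4 * hidden, embed_dim + hidden, rng),
        "W_out": rand_matrix(vocab, hidden, rng),        # hidden state -> word scores
    }


def generate(scene_ctx, cnn_feat, params, max_len=5, start_id=0):
    """Greedy decoding with the two-phase input schedule."""
    hidden = len(params["W_out"][0])
    h, c = [0.0] * hidden, [0.0] * hidden
    # Time step 0: the scene context feature is the only input.
    h, c = lstm_step(params["W_lstm"], matvec(params["W_ctx"], scene_ctx), h, c)
    visual = matvec(params["W_cnn"], cnn_feat)
    word, caption = start_id, []
    for _ in range(max_len):
        # Time steps >= 1: previous word plus the original CNN feature
        # (combining them by addition is an assumption of this sketch).
        x = [e + v for e, v in zip(params["E"][word], visual)]
        h, c = lstm_step(params["W_lstm"], x, h, c)
        scores = matvec(params["W_out"], h)
        word = max(range(len(scores)), key=scores.__getitem__)
        caption.append(word)
    return caption


params = make_params(ctx_dim=6, cnn_dim=8, embed_dim=4, hidden=5, vocab=10)
caption_ids = generate([0.5] * 6, [0.2] * 8, params)
print(caption_ids)  # word ids; meaningless here because the weights are random
```

Because the CNN feature is fed at every step after the first, the decoder in this sketch keeps seeing visual input while it emits words, which is the property the abstract contrasts with more conventional captioning models.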
Original language: English
Title of host publication: Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion
Publisher: ACM
Pages: 1-8
ISBN (electronic): 978-1-4503-4519-4
DOI (permanent links)
Status: Published - 2016
MoE publication type: A4 Article in conference proceedings
Event: ACM Multimedia - Amsterdam, Netherlands
Duration: 15 October 2016 to 19 October 2016
Conference number: 24

Conference

Conference: ACM Multimedia
Abbreviated title: ACMMM
Country: Netherlands
City: Amsterdam
Period: 15/10/2016 to 19/10/2016

  • Projects

    • 1 Finished

    Finnish Centre of Excellence in Computational Inference Research (Suomalainen laskennallisen päättelyn huippuyksikkö)

    Xu, Y., Rezazadegan Tavakoli, H., Pesonen, H., Puranen, S., Rintanen, J., Kaski, S., Anwer, R., Parviainen, P., Soare, M., Weinzierl, A. & Vuollekoski, H.

    01/01/2015 to 28/02/2018

    Project: Academy of Finland: Other research funding

    Equipment

    Science-IT

    Mikko Hakala (Manager)

    School of Science

    Equipment/facilities: Facility

  • Cite this

    Shetty, R., Rezazadegan Tavakoli, H., & Laaksonen, J. (2016). Exploiting Scene Context for Image Captioning. In Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion (pp. 1-8). ACM. https://doi.org/10.1145/2983563.2983571