Geometry-aware relational exemplar attention for dense captioning

Tzu Jui Julius Wang, Hamed R. Tavakoli, Mats Sjöberg, Jorma Laaksonen

Research output: Article in book/conference proceedings; Conference contribution; Scientific; peer-reviewed

Abstract

Dense captioning (DC), which provides a comprehensive contextual understanding of images by describing all salient visual groundings in an image, facilitates multimodal understanding and learning. As an extension of image captioning, DC is developed to discover richer sets of visual contents and to generate captions of wider diversity and greater detail. State-of-the-art DC models consist of three stages: (1) region proposals, (2) region classification, and (3) caption generation for each proposal. They are typically built upon the following ideas: (a) guiding the caption generation with image-level features as context cues along with regional features and (b) refining the locations of region proposals with caption information. In this work, we propose (a) a joint visual-textual criterion exploited by the region classifier that further improves both region detection and caption accuracy, and (b) a Geometry-aware Relational Exemplar attention (GREatt) mechanism to relate region proposals. The former helps the model learn a region classifier by effectively exploiting both visual groundings and caption descriptions. Rather than treating each region proposal in isolation, the latter relates regions through complementary relations, i.e., contextually dependent, visually supported, and geometric relations, to enrich the context information in regional representations. We conduct an extensive set of experiments and demonstrate that our proposed model improves on the state of the art by at least 5.3% in terms of mean average precision on the Visual Genome dataset.

Original language: English
Title of host publication: MULEA 2019 - 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications, co-located with MM 2019
Publisher: ACM
Pages: 3-11
Number of pages: 9
ISBN (electronic): 9781450369183
DOI: https://doi.org/10.1145/3347450.3357656
Status: Published - 15 October 2019
MoE publication type: A4 Article in a conference publication
Event: International Workshop on Multimodal Understanding and Learning for Embodied Applications - Nice, France
Duration: 25 October 2019 to 25 October 2019
Conference number: 1

Workshop

Workshop: International Workshop on Multimodal Understanding and Learning for Embodied Applications
Abbreviated title: MULEA
Country: France
City: Nice
Period: 25/10/2019 to 25/10/2019

  • Projects

    MeMAD Laaksonen

    Laaksonen, J., Sjöberg, M., Laria Mantecon, H. & Pehlivan Tort, S.

    01/01/2018 to 31/12/2020

    Project: EU: Framework programmes funding

  • Equipment

    Science-IT

    Mikko Hakala (Manager)

    School of Science

    Equipment/facilities: Facility

  • Cite this

    Wang, T. J. J., Tavakoli, H. R., Sjöberg, M., & Laaksonen, J. (2019). Geometry-aware relational exemplar attention for dense captioning. In MULEA 2019 - 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications, co-located with MM 2019 (pp. 3-11). ACM. https://doi.org/10.1145/3347450.3357656