Global Fusion Attention for Vision and Language Understanding

Zixin Guo*, Chen Liang, Ziyu Wan, Yang Bai

*Corresponding author of this work

Research output: Article in book/conference proceedings, Abstract, Scientific, peer-reviewed

Abstract

We extend the popular transformer architecture to a multimodal model that processes both visual and textual inputs. We propose a new attention mechanism on a Transformer-based architecture for joint vision and language understanding tasks. Our model fuses multi-level comprehension between images and texts in a weighted manner, which better captures the internal relationships between the two modalities. Experiments on the benchmark VQA dataset CLEVR demonstrate the effectiveness of the proposed attention mechanism. We also observe improvements in the sample efficiency of reinforcement learning in experiments on the grounded language understanding tasks of the BabyAI platform.
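The abstract describes fusing attention over multiple levels of visual features in a weighted manner. The paper's implementation is not reproduced here; the following is only a minimal sketch of the general idea, assuming softmax-normalized per-level weights over text-to-image cross-attention outputs. All function names, shapes, and the weighting scheme are illustrative assumptions, not the authors' method.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv):
    # scaled dot-product attention: queries from one modality (text),
    # keys/values from the other (one level of image features)
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv

def fused_attention(text, image_levels, level_weights):
    # fuse text-to-image attention outputs computed against several
    # visual feature levels, combined with softmax-normalized weights
    # (a hypothetical reading of "weighted manner")
    w = softmax(np.asarray(level_weights, dtype=float))
    return sum(wi * cross_attention(text, lvl)
               for wi, lvl in zip(w, image_levels))

rng = np.random.default_rng(0)
text = rng.normal(size=(5, 16))                          # 5 text tokens, dim 16
levels = [rng.normal(size=(n, 16)) for n in (49, 196)]   # two visual feature maps
out = fused_attention(text, levels, level_weights=[0.0, 0.0])
print(out.shape)  # (5, 16): one fused visual context vector per text token
```

In a trained model the level weights would be learned parameters rather than fixed inputs; the sketch only shows the shape of the computation.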

Original language: English
Pages: 15789-15790
Number of pages: 2
Status: Published - 2021
Ministry of Education publication type: Not applicable
Event: 35th AAAI Conference on Artificial Intelligence / 33rd Conference on Innovative Applications of Artificial Intelligence / 11th Symposium on Educational Advances in Artificial Intelligence - Virtual, Online
Duration: 2 Feb 2021 - 9 Feb 2021

