Abstract
Neural-network-based image and video captioning can be substantially improved by architectures that exploit features derived from the scene context, objects, and locations. Accuracy is further improved by a novel, discriminatively trained evaluator network that selects the best caption from those produced by an ensemble of caption generator networks.
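The selection step described above (a discriminatively trained evaluator ranking candidate captions produced by an ensemble of generators) can be sketched roughly as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the class and function names, the feature dimensions, and the GRU-based caption encoder are all assumptions.

```python
import torch
import torch.nn as nn


class CaptionEvaluator(nn.Module):
    """Scores (image feature, caption) pairs; trained discriminatively
    so that better captions receive higher scores (hypothetical sketch)."""

    def __init__(self, img_dim=2048, txt_dim=512, hidden=512):
        super().__init__()
        # Encode the candidate caption's token embeddings into one vector.
        self.caption_encoder = nn.GRU(txt_dim, hidden, batch_first=True)
        # Score the fused image-caption representation.
        self.scorer = nn.Sequential(
            nn.Linear(img_dim + hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, img_feat, caption_emb):
        # img_feat: (B, img_dim); caption_emb: (B, T, txt_dim)
        _, h = self.caption_encoder(caption_emb)   # h: (1, B, hidden)
        fused = torch.cat([img_feat, h.squeeze(0)], dim=-1)
        return self.scorer(fused).squeeze(-1)      # (B,) scores


def select_best_caption(evaluator, img_feat, candidates):
    """Pick the highest-scoring caption for a single image.

    img_feat:   (1, img_dim) image feature.
    candidates: list of (caption_string, caption_emb) pairs, one per
                generator in the ensemble; caption_emb is (1, T, txt_dim).
    """
    with torch.no_grad():
        scores = torch.cat([evaluator(img_feat, emb)
                            for _, emb in candidates])  # (num_generators,)
    return candidates[scores.argmax().item()][0]
```

Per the abstract, the generators themselves would draw on different feature types (scene context, objects, locations); the sketch abstracts the ensemble away as a list of candidate captions and shows only the evaluator's argmax selection.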
Original language | English |
---|---|
Pages | 34-46 |
Number of pages | 13 |
Journal | IEEE Multimedia |
Volume | 25 |
Issue number | 2 |
DOIs - permanent links | |
Status | Published - 2018 |
OKM publication type | A1 Journal article, refereed |