Abstract
Neural-network-based image and video captioning can be substantially improved by architectures that exploit features from the scene context, objects, and locations. Accuracy is further improved by a novel discriminatively trained evaluator network that selects the best caption among those produced by an ensemble of caption-generator networks.
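The selection step described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `select_best_caption` and `toy_score` are hypothetical names, and the toy scorer simply stands in for a trained evaluator network that would assign each candidate a real-valued quality score.

```python
def select_best_caption(candidates, evaluator_score):
    """Return the candidate caption with the highest evaluator score.

    `candidates` are the outputs of an ensemble of generator networks;
    `evaluator_score` stands in for the discriminatively trained
    evaluator, mapping a caption to a scalar score.
    """
    return max(candidates, key=evaluator_score)


# Toy stand-ins: three "generator" outputs and a dummy scorer that
# prefers longer, more descriptive captions (a real evaluator would
# score captions against the image features instead).
candidates = [
    "a dog",
    "a brown dog running on grass",
    "a dog on grass",
]


def toy_score(caption):
    return len(caption.split())


best = select_best_caption(candidates, toy_score)
```

In the paper's setting, each ensemble member proposes a caption and the evaluator, trained to discriminate good captions from bad ones, ranks the pool; only the interface is shown here.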
Original language | English |
---|---|
Pages (from-to) | 34-46 |
Number of pages | 13 |
Journal | IEEE Multimedia |
Volume | 25 |
Issue number | 2 |
DOIs | |
Publication status | Published - 2018 |
MoE publication type | A1 Journal article-refereed |
Keywords
- computer vision
- applications and expert knowledge-intensive systems
- artificial intelligence
- computing
- deep learning
- image captioning
- recurrent networks