Abstract

Image Difference Captioning (IDC) aims to generate sentences that describe the differences between two similar-looking images. Conventional approaches learn an IDC model with a pre-trained and usually frozen visual feature extractor. Two major issues may therefore arise: (1) a large domain gap usually exists between the datasets used to pre-train the visual encoder and those of the downstream IDC task, and (2) the visual feature extractor, when encoding the two images separately, often does not effectively encode the visual changes between them. Motivated by the excellent zero-shot performance of the recently proposed CLIP, we propose CLIP4IDC, which transfers a CLIP model to the IDC task to address these issues. Rather than directly fine-tuning CLIP to generate sentences, we introduce an adaptation training process that adapts CLIP’s visual encoder to capture and align differences in image pairs based on their textual descriptions. Experiments on three IDC benchmark datasets, CLEVR-Change, Spot-the-Diff, and Image-Editing-Request, demonstrate the effectiveness of CLIP4IDC.
Original language: English
Title of host publication: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (AACL-IJCNLP)
Publisher: Association for Computational Linguistics
Pages: 33-42
Volume: 2
ISBN (Electronic): 978-1-955917-64-3
Publication status: Published - Nov 2022
MoE publication type: A4 Article in a conference publication
Event: 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, Virtual, Online
Duration: 20 Nov 2022 to 23 Nov 2022

Conference

Conference: 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing
Abbreviated title: AACL-IJCNLP
City: Virtual, Online
Period: 20/11/2022 to 23/11/2022

Title: CLIP4IDC: CLIP for Image Difference Captioning
