Multimodal Concept Detection and Annotation in Image and Video Collections

Satoru Ishikawa

Research output: Thesis > Doctoral Thesis > Collection of Articles

Abstract

The World Wide Web has become a common place for finding information for all kinds of purposes. The amount of data a single user deals with has become large and is continuously growing. The data relevant to users have become not only large but also diverse. Hence, searching for relevant information in such large and diverse resources is a critical task. However, users cannot always formulate appropriate queries for finding the desired resources. In order to retrieve relevant information, the semantic relationships between pieces of information in different modalities need to be known and specified. This thesis approaches the multimodal cross-domain semantic retrieval and fusion problem from the point of view of content-based visual analysis and statistical natural language analysis. It also aims at using cross-domain textual semantics to generate pseudo tags for images to improve the performance of the information retrieval task. The main focus of the thesis is on bridging the semantic gap between the textual and visual content domains. In order to combine and project the unimodal information into a multimodal space, two approaches are used: one is the Multimodal Deep Boltzmann Machine (DBM) and the other is the late fusion of unimodal Support Vector Machines (SVM). One problem of the non-linear SVM approach is its high computational cost. In this dissertation, the homogeneous kernel map method is used to improve the efficiency of the SVM. In our experiments, we adopted deep convolutional neural network features, particularly GoogLeNet features, and the retrieval results of the SVM-based approaches improved to be nearly equal to those of the Multimodal DBM approach. One drawback of the multimodal information retrieval task is the requirement to be able to perform queries in each unimodal domain. In our experiments, if the query for the image domain is missing or not appropriate, the approach reduces to ordinary text search.
Additionally, the image content and its textual description do not always match. In order to improve multimodal information retrieval, a method of pseudo tag generation is proposed in this thesis. The generation of pseudo tags is based on a text–image semantic map, which is calculated from the co-occurrence of latent topics in text and visual concepts in text–image data. In the experiments, the multimodal information retrieval results were considerably improved by using the pseudo tags.
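
As an illustration of the co-occurrence idea described in the abstract, the following is a minimal sketch and not the thesis's actual method. It assumes each text–image item has already been reduced to a set of latent topic labels and a set of detected visual concept labels. The function names (build_semantic_map, pseudo_tags), the row normalisation, and the weighted-sum scoring are hypothetical choices made for this example only.

```python
from collections import defaultdict


def build_semantic_map(pairs):
    """Count topic/concept co-occurrences and normalise per concept.

    pairs: iterable of (topics, concepts), one entry per text-image item.
    Returns {concept: {topic: weight}} where each concept's weights sum to 1.
    The normalisation is an assumption of this sketch.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for topics, concepts in pairs:
        for concept in set(concepts):
            for topic in set(topics):
                counts[concept][topic] += 1

    semantic_map = {}
    for concept, row in counts.items():
        total = sum(row.values())
        semantic_map[concept] = {t: n / total for t, n in row.items()}
    return semantic_map


def pseudo_tags(semantic_map, concept_scores, k=2):
    """Rank topics for one image and return the top k as pseudo tags.

    concept_scores: {concept: detector confidence} for the image.
    Concepts missing from the map contribute nothing.
    """
    scores = defaultdict(float)
    for concept, confidence in concept_scores.items():
        for topic, weight in semantic_map.get(concept, {}).items():
            scores[topic] += confidence * weight
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [topic for topic, _ in ranked[:k]]
```

In this sketch, an image with no usable text can still receive textual tags from its detected visual concepts, which is the situation the abstract describes when the textual description is missing or does not match the image.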
Translated title of the contribution: Multimodal Concept Detection and Annotation in Image and Video Collections
Original language: English
Qualification: Doctor's degree
Awarding Institution
  • Aalto University
Supervisors/Advisors
  • Kaski, Samuel, Supervising Professor
  • Laaksonen, Jorma, Thesis Advisor
Publisher
Print ISBNs: 978-952-60-3952-7
Electronic ISBNs: 978-952-60-3954-1
Publication status: Published - 2020
MoE publication type: G5 Doctoral dissertation (article)

Keywords

  • information retrieval
  • multimodal concepts
  • images
  • videos
