Evaluating the quality of robotic visual-language maps

Matti Pekkanen*, Tsvetomila Mihaylova, Francesco Verdoja, Ville Kyrki

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Professional

Abstract

Visual-language models (VLMs) have recently been introduced in robotic mapping by using the latent representations, i.e., embeddings, of the VLMs to represent natural language semantics in the map. The main benefit is moving beyond a small set of human-created labels toward open-vocabulary scene understanding. While there is anecdotal evidence that maps built this way support downstream tasks, such as navigation, rigorous analysis of the quality of maps built on these embeddings is lacking. In this paper, we propose a way to analyze the quality of maps created using VLMs by evaluating two critical properties: queryability and consistency. We demonstrate the proposed method by evaluating maps created by two state-of-the-art methods, VLMaps and OpenScene, with two encoders, LSeg and OpenSeg, on real-world data from the Matterport3D dataset. We find that OpenScene outperforms VLMaps with both encoders, and that LSeg outperforms OpenSeg with both methods.
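To illustrate the kind of open-vocabulary querying such maps are meant to support, the sketch below scores map cells against a natural-language query by cosine similarity between per-cell VLM embeddings and a text-query embedding. This is a minimal sketch under stated assumptions, not the evaluation protocol of the paper: the function names are hypothetical, random vectors stand in for real LSeg/OpenSeg and CLIP-style text encodings, and only NumPy is used.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Cosine similarity between each row of `a` (N, D) and vector `b` (D,).
    a_norm = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b_norm = b / (np.linalg.norm(b) + 1e-8)
    return a_norm @ b_norm

def query_map(cell_embeddings: np.ndarray, text_embedding: np.ndarray, top_k: int = 10):
    # Return indices of the map cells whose embeddings best match the text query,
    # together with the full similarity scores. `cell_embeddings` is an (N, D)
    # array of per-cell VLM embeddings stored in the map; `text_embedding` is the
    # (D,) embedding of a query such as "a chair".
    scores = cosine_similarity(cell_embeddings, text_embedding)
    return np.argsort(-scores)[:top_k], scores

if __name__ == "__main__":
    # Toy usage with random embeddings standing in for map and query encodings.
    rng = np.random.default_rng(0)
    cells = rng.normal(size=(1000, 512))   # hypothetical map with 1000 cells
    query = rng.normal(size=512)           # hypothetical text-query embedding
    top_cells, scores = query_map(cells, query)
    print("Best-matching cells:", top_cells)

A queryability check in this spirit would compare the retrieved cells against ground-truth annotations (e.g., Matterport3D labels), while a consistency check would compare scores for semantically related queries; the exact metrics used in the paper are defined there, not here.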
Original language: English
Title of host publication: Workshop on Vision-Language Models for Navigation and Manipulation
Publisher: IEEE
Number of pages: 5
Publication status: Published - 17 May 2024
MoE publication type: D3 Professional conference proceedings
Event: Workshop on Vision-Language Models for Navigation and Manipulation - Pacifico Yokohama, Yokohama, Japan
Duration: 17 May 2024 - 17 May 2024
https://vlmnm-workshop.github.io/

Workshop

Workshop: Workshop on Vision-Language Models for Navigation and Manipulation
Abbreviated title: VLMNM
Country/Territory: Japan
City: Yokohama
Period: 17/05/2024 - 17/05/2024
Internet address: https://vlmnm-workshop.github.io/
