GLaMM: Pixel Grounding Large Multimodal Model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

28 Citations (Scopus)

Abstract

Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to referring to only a single object category at a time, require users to specify the regions, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both textual and optional visual prompts (region of interest) as input. This empowers users to interact with the model at various levels of granularity, both in textual and visual domains. Due to the lack of standard benchmarks for the novel setting of visually Grounded Conversation Generation (GCG), we introduce a comprehensive evaluation protocol with our curated grounded conversations. Our proposed GCG task requires densely grounded concepts in natural scenes at large scale. To this end, we propose a densely annotated Grounding-anything Dataset (GranD) using our proposed automated annotation pipeline that encompasses 7.5M unique concepts grounded in a total of 810M regions available with segmentation masks. Besides GCG, GLaMM also performs effectively on several downstream tasks, e.g., referring expression segmentation, image and region-level captioning, and vision-language conversations.
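To make the Grounded Conversation Generation (GCG) setting concrete, the sketch below illustrates the kind of input/output structure the abstract describes: a textual response whose phrases are tied to pixel-level segmentation masks, with an optional region-of-interest prompt as visual input. All names here (GroundedPhrase, GroundedResponse, generate_grounded_response) are hypothetical and illustrative; they do not reflect GLaMM's actual API.

```python
# Illustrative sketch only: the data shapes implied by the GCG task.
# Class and function names are hypothetical, not GLaMM's real interface.
from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np


@dataclass
class GroundedPhrase:
    """A phrase in the response linked to a pixel-level segmentation mask."""
    text: str                  # e.g. "a brown dog"
    span: Tuple[int, int]      # (start, end) character offsets in the response text
    mask: np.ndarray           # boolean H x W segmentation mask for the phrase


@dataclass
class GroundedResponse:
    """Natural-language response interleaved with object segmentation masks."""
    text: str
    phrases: List[GroundedPhrase]


def generate_grounded_response(
    image: np.ndarray,
    prompt: str,
    region_prompt: Optional[np.ndarray] = None,  # optional visual prompt (region of interest)
) -> GroundedResponse:
    """Hypothetical interface: accepts an image, a text prompt, and an optional
    region-of-interest mask, and returns text with pixel-grounded phrases."""
    raise NotImplementedError("Placeholder for a GCG-capable model such as GLaMM.")
```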

Original language: English
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher: IEEE
Pages: 13009-13018
Number of pages: 10
ISBN (Electronic): 979-8-3503-5300-6
DOIs
Publication status: Published - 2024
MoE publication type: A4 Conference publication
Event: IEEE Conference on Computer Vision and Pattern Recognition - Seattle, United States
Duration: 16 Jun 2024 - 22 Jun 2024

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919

Conference

Conference: IEEE Conference on Computer Vision and Pattern Recognition
Abbreviated title: CVPR
Country/Territory: United States
City: Seattle
Period: 16/06/2024 - 22/06/2024

Keywords

  • automated dataset annotation
  • LMM
  • MLMM
  • multimodal foundation models
  • Multimodal LMM
  • vision and language
  • vision-language
  • VLM
