Palo: A Polyglot Large Multimodal Model for 5B People

Hanoona Rasheed, Muhammad Maaz, Abdelrahman Shaker, Salman Khan, Hisham Cholakal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan

Research output: Chapter in Book/Report/Conference proceeding, Conference article in proceedings, Scientific, peer-review

Abstract

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called Palo. Palo offers visual reasoning capabilities in 10 major languages (English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese) that together span a total of 5B people (65% of the world population). Our approach uses a semi-automated translation pipeline to adapt the multimodal instruction dataset from English to the target languages with a fine-tuned Large Language Model, thereby ensuring high linguistic fidelity while remaining scalable because little manual effort is required. Incorporating diverse instruction sets boosts overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. The resulting models are trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and we observe substantial improvements over strong baselines. We also propose the first multilingual multimodal benchmark, on which forthcoming approaches can evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO.

Original language: English
Title of host publication: Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025
Publisher: IEEE
Pages: 1745-1754
Number of pages: 10
ISBN (Electronic): 979-8-3315-1083-1
Publication status: Published - 2025
MoE publication type: A4 Conference publication
Event: IEEE Winter Conference on Applications of Computer Vision - Tucson, United States
Duration: 28 Feb 2025 to 4 Mar 2025

Publication series

Name: IEEE Workshop on Applications of Computer Vision
ISSN (Electronic): 2642-9381

Conference

Conference: IEEE Winter Conference on Applications of Computer Vision
Abbreviated title: WACV
Country/Territory: United States
City: Tucson
Period: 28/02/2025 to 04/03/2025
