TY - GEN
T1 - Palo: A Polyglot Large Multimodal Model for 5B People
AU - Rasheed, Hanoona
AU - Maaz, Muhammad
AU - Shaker, Abdelrahman
AU - Khan, Salman
AU - Cholakkal, Hisham
AU - Anwer, Rao M.
AU - Baldwin, Tim
AU - Felsberg, Michael
AU - Khan, Fahad S.
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called Palo. Palo offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, which together span 5B people (65% of the world population). Our approach uses a semi-automated translation pipeline that adapts the multimodal instruction dataset from English to the target languages with a fine-tuned Large Language Model, ensuring high linguistic fidelity while remaining scalable through minimal manual effort. Incorporating these diverse instruction sets boosts overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. The resulting models are trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and we observe substantial improvements over strong baselines. We also propose the first multilingual multimodal benchmark for future approaches to evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO.
AB - In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called Palo. Palo offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, which together span 5B people (65% of the world population). Our approach uses a semi-automated translation pipeline that adapts the multimodal instruction dataset from English to the target languages with a fine-tuned Large Language Model, ensuring high linguistic fidelity while remaining scalable through minimal manual effort. Incorporating these diverse instruction sets boosts overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. The resulting models are trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and we observe substantial improvements over strong baselines. We also propose the first multilingual multimodal benchmark for future approaches to evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO.
UR - http://www.scopus.com/inward/record.url?scp=105003633031&partnerID=8YFLogxK
U2 - 10.1109/WACV61041.2025.00177
DO - 10.1109/WACV61041.2025.00177
M3 - Conference article in proceedings
AN - SCOPUS:105003633031
T3 - IEEE Workshop on Applications of Computer Vision
SP - 1745
EP - 1754
BT - Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025
PB - IEEE
T2 - IEEE Winter Conference on Applications of Computer Vision
Y2 - 28 February 2025 through 4 March 2025
ER -