TY - GEN
T1 - ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations
AU - Jiang, Yue
AU - Schoop, Eldon
AU - Swearngin, Amanda
AU - Nichols, Jeffrey
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s).
PY - 2025/3/24
Y1 - 2025/3/24
AB - Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language, but many perform poorly on UI tasks due to the lack of UI training data. In this paper, we adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM). Unlike prior art, our method requires no human-provided annotations, and it can be applied to any dataset of UI screenshots. We generate a dataset of 353K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks. To assess the performance of our model, we benchmark it on UI element detection tasks, evaluate response quality, and showcase its applicability to UI verification.
KW - UI Automation
KW - User Interface
KW - Vision Language Model
UR - http://www.scopus.com/inward/record.url?scp=105001919735&partnerID=8YFLogxK
U2 - 10.1145/3708359.3712129
DO - 10.1145/3708359.3712129
M3 - Conference article in proceedings
AN - SCOPUS:105001919735
T3 - International Conference on Intelligent User Interfaces, Proceedings IUI
SP - 861
EP - 877
BT - IUI 2025 - Proceedings of the 2025 International Conference on Intelligent User Interfaces
PB - ACM
T2 - International Conference on Intelligent User Interfaces
Y2 - 24 March 2025 through 27 March 2025
ER -