Pedestrian Vision Language Model for Intentions Prediction

Research output: Contribution to journal › Article › Scientific › peer-review


Abstract

Effective modeling of human behavior is crucial for the safe and reliable coexistence of humans and autonomous vehicles. Traditional deep learning methods have limitations in capturing the complexities of pedestrian behavior, often relying on simplistic representations or indirect inference from visual cues, which hinders their explainability. To address this gap, we introduce PedVLM, a vision-language model that leverages multiple modalities (RGB images, optical flow, and text) to predict pedestrian intentions and provide explanations for pedestrian behavior. PedVLM comprises a CLIP-based vision encoder and a text-to-text transfer transformer (T5) language model, which together extract and combine visual and text embeddings to predict pedestrian actions and enhance explainability. Furthermore, to complement PedVLM and facilitate further research, we publicly release the corresponding dataset, PedPrompt, which casts pedestrian intention prediction as prompts in a question-answer (QA) template. Evaluated on the PedPrompt, JAAD, and PIE datasets, PedVLM demonstrates its efficacy compared to state-of-the-art methods. The dataset and code will be made available at https://github.com/munirfarzeen/PedVLM
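The abstract describes fusing embeddings from a CLIP-style vision encoder with a T5-style language model. A minimal conceptual sketch of such multimodal fusion is shown below; all dimensions, weights, and function names are illustrative assumptions, not the authors' actual implementation, which uses pretrained CLIP and T5 networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (assumptions): CLIP-style visual embeddings and
# T5-style token embeddings typically live in different spaces.
VIS_DIM = 512   # visual embedding size (assumed)
TXT_DIM = 768   # language-model token embedding size (assumed)

def project_visual(vis_emb, W):
    """Map a visual embedding into the language model's token space
    via a learned linear projection (here, a random placeholder)."""
    return vis_emb @ W

def fuse(vis_tokens, txt_tokens):
    """Prefix projected visual tokens to the QA-prompt token sequence,
    so the language model attends over both modalities jointly."""
    return np.concatenate([vis_tokens, txt_tokens], axis=0)

# Toy inputs: one RGB and one optical-flow embedding, plus a
# 5-token text prompt (stand-ins for real encoder outputs).
W = rng.standard_normal((VIS_DIM, TXT_DIM)) * 0.02
rgb_emb = rng.standard_normal(VIS_DIM)
flow_emb = rng.standard_normal(VIS_DIM)
prompt_tokens = rng.standard_normal((5, TXT_DIM))

vis_tokens = np.stack([project_visual(rgb_emb, W),
                       project_visual(flow_emb, W)])
sequence = fuse(vis_tokens, prompt_tokens)
print(sequence.shape)  # 2 visual tokens + 5 text tokens, each TXT_DIM wide
```

In a real system the fused sequence would be fed to the T5 encoder-decoder, which generates the answer text (e.g. a crossing/not-crossing intention plus an explanation).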
Original language: English
Pages (from-to): 393-406
Number of pages: 14
Journal: IEEE Open Journal of Intelligent Transportation Systems
Volume: 6
DOIs
Publication status: Published - 2025
MoE publication type: A1 Journal article-refereed

Keywords

  • Autonomous vehicles
  • Intelligent transportation systems
  • Large language models
  • Linguistics
  • Optical flow
  • Pedestrian intention prediction
  • Pedestrians
  • Predictive models
  • Trajectory
  • Transformers
  • Visualization
  • Prompt generation
  • vision-language models (VLMs)
