sim · medium · robotics · metric · varies

Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving

Description

Vision-Language-Action (VLA) models for autonomous driving must integrate diverse textual inputs, including navigation commands, hazard warnings, and traffic state descriptions, yet current systems often present these as disconnected fragments, forcing the model to discover on its own which environmental constraints are relevant to the current maneuver. We introduce Causal Scene Narration (CSN), which restructures VLA text inputs through intent-constraint alignment, quantitative grounding, and s

Source

http://arxiv.org/abs/2604.01723v1