sim · medium · robotics · metric · varies

KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

Description

We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialize
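The description mentions distilling a trajectory into a small set of motion-salient keyframes but does not say how saliency is computed. As a rough illustration only, the sketch below assumes a simple stand-in: score each frame by its mean absolute pixel difference from the previous frame, then greedily keep the highest-scoring frames subject to a minimum temporal gap. The names `motion_scores`, `select_keyframes`, `k`, and `min_gap` are hypothetical and are not taken from KITE.

```python
from typing import List, Sequence

# A grayscale frame as rows of pixel intensities (assumed representation).
Frame = Sequence[Sequence[float]]


def motion_scores(frames: Sequence[Frame]) -> List[float]:
    """Mean absolute pixel difference between consecutive frames.

    The first frame has no predecessor, so its score is 0.0.
    This is an assumed saliency proxy, not the paper's measure.
    """
    scores = [0.0]
    for prev, curr in zip(frames, frames[1:]):
        total = 0.0
        count = 0
        for row_prev, row_curr in zip(prev, curr):
            for p, c in zip(row_prev, row_curr):
                total += abs(c - p)
                count += 1
        scores.append(total / count if count else 0.0)
    return scores


def select_keyframes(scores: Sequence[float], k: int, min_gap: int = 1) -> List[int]:
    """Greedily pick up to k frame indices with the highest scores.

    Any two chosen indices are at least min_gap frames apart.
    Ties are broken in favor of the earlier frame.
    The result is returned in temporal order.
    """
    by_score = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    chosen: List[int] = []
    for i in by_score:
        if len(chosen) == k:
            break
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
    return sorted(chosen)


if __name__ == "__main__":
    # Five tiny 2x2 frames: a large change at index 2, a small one at index 4.
    frames = [
        [[0, 0], [0, 0]],
        [[0, 0], [0, 0]],
        [[4, 4], [4, 4]],
        [[4, 4], [4, 4]],
        [[4, 0], [4, 4]],
    ]
    scores = motion_scores(frames)
    print(select_keyframes(scores, k=2, min_gap=2))
```

The minimum-gap constraint is one simple way to keep the selected frames spread across the trajectory rather than clustered around a single burst of motion; a real pipeline would likely use a more robust motion signal such as optical flow.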

Source

http://arxiv.org/abs/2604.07034v1