sim · medium · sim-to-real · metric: varies

PEAfowl: Perception-Enhanced Multi-View Vision-Language-Action for Bimanual Manipulation

Description

Bimanual manipulation in cluttered scenes requires policies that remain stable under occlusions and under viewpoint and scene variations. Existing vision-language-action (VLA) models often fail to generalize because (i) multi-view features are fused via view-agnostic token concatenation, yielding weak 3D-consistent spatial understanding, and (ii) language is injected only as global conditioning, resulting in coarse instruction grounding. In this paper, we introduce PEAfowl, a perception-enhanced multi-view VLA policy for bimanual manipulation.
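To make the first failure mode concrete, the sketch below illustrates what "view-agnostic token concatenation" means: tokens from all camera views are flattened into one sequence with no record of which view they came from. All shapes and names here are illustrative assumptions, not PEAfowl's actual implementation.

```python
import numpy as np

# Hypothetical setup: V camera views, each encoded into T tokens of dim D.
V, T, D = 3, 4, 8
rng = np.random.default_rng(0)
view_tokens = rng.standard_normal((V, T, D))

# View-agnostic token concatenation: flatten all views into one token
# sequence, discarding which camera (and hence which 3D viewpoint) each
# token came from.
fused = view_tokens.reshape(V * T, D)

# Permuting the views leaves the *set* of tokens unchanged, so any
# permutation-insensitive downstream pooling (mean over tokens here)
# cannot distinguish camera geometry -- the weakness the paper targets.
pooled = fused.mean(axis=0)
pooled_shuffled = view_tokens[[2, 0, 1]].reshape(V * T, D).mean(axis=0)
assert np.allclose(pooled, pooled_shuffled)
print(fused.shape)  # (12, 8)
```

The assertion passing shows that this fusion is blind to view identity, which is why the description argues it yields weak 3D-consistent spatial understanding.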

Source

http://arxiv.org/abs/2601.17885v1