sim · medium · sim-to-real · metric: varies
PEAfowl: Perception-Enhanced Multi-View Vision-Language-Action for Bimanual Manipulation
Description
Bimanual manipulation in cluttered scenes requires policies that remain stable under occlusions, viewpoint changes, and scene variations. Existing vision-language-action models often fail to generalize because (i) multi-view features are fused via view-agnostic token concatenation, yielding weak 3D-consistent spatial understanding, and (ii) language is injected only as global conditioning, resulting in coarse instruction grounding. In this paper, we introduce PEAfowl, a perception-enhanced multi-view VLA policy for bimanual manipulation.
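
To make the two failure modes concrete, here is a minimal PyTorch-style sketch of the baseline fusion pattern the description critiques: multi-view tokens concatenated with no view identity, and language injected as a single global token. All names and shapes are illustrative assumptions, not PEAfowl's actual code.

```python
import torch
import torch.nn as nn

class NaiveMultiViewFusion(nn.Module):
    """View-agnostic token concatenation with global language conditioning."""

    def __init__(self, dim=256, num_heads=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.lang_proj = nn.Linear(dim, dim)

    def forward(self, view_tokens, lang_embedding):
        # view_tokens: list of (B, N, D) tensors, one per camera view.
        # Concatenating along the token axis discards which view (and hence
        # which camera pose) each token came from -- "view-agnostic" fusion,
        # so the model gets no explicit 3D-consistent spatial signal.
        tokens = torch.cat(view_tokens, dim=1)                    # (B, V*N, D)
        # Language enters only as one global token, so grounding of the
        # instruction to specific objects or regions stays coarse.
        lang_token = self.lang_proj(lang_embedding).unsqueeze(1)  # (B, 1, D)
        return self.encoder(torch.cat([lang_token, tokens], dim=1))

# Example: two 224-token camera views, batch of 2
views = [torch.randn(2, 224, 256) for _ in range(2)]
lang = torch.randn(2, 256)
fused = NaiveMultiViewFusion()(views, lang)
print(fused.shape)  # torch.Size([2, 449, 256])
```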