sim · medium · manipulation-data · metric: varies
See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations
Description
Developing robust, general-purpose manipulation policies is a fundamental objective of robotics research. While Vision-Language-Action (VLA) models have demonstrated promising capabilities for end-to-end robot control, existing approaches still generalize poorly to tasks outside their training distributions. In contrast, humans are remarkably proficient at acquiring novel skills by observing someone else perform them just once. Inspired by this capability, we propose V