sim · medium · vision-robot · metric · varies

Thinker: A vision-language foundation model for embodied intelligence

Description

When large vision-language models are applied to robotics, they make errors on problems that are simple for humans. Examples include confusing third-person and first-person perspectives and overlooking information near the end of a video during temporal reasoning. To address these challenges, we propose Thinker, a large vision-language foundation model designed for embodied intelligence. We tackle these issues from two perspectives.

Source

http://arxiv.org/abs/2601.21199v1