sim · medium · navigation · metric · varies

Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence

Description

We present Butter-Bench, a benchmark evaluating large language model (LLM) controlled robots for practical intelligence, defined as the ability to navigate the messiness of the physical world. Current state-of-the-art robotic systems use a hierarchical architecture with LLMs in charge of high-level reasoning, and a Vision Language Action (VLA) model for low-level control. Butter-Bench evaluates the LLM part in isolation from the VLA. Although LLMs have repeatedly surpassed humans in evaluations

Source

http://arxiv.org/abs/2510.21860v1