sim · medium · vision-robot · metric · varies

EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next

Description

Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require

Source

http://arxiv.org/abs/2603.12147v1