sim · medium · imitation · metric · varies

VAT: Vision Action Transformer by Unlocking Full Representation of ViT

Description

In robot learning, Vision Transformers (ViTs) are the standard backbone for visual perception, yet most methods discard valuable information by using only the final layer's features. We argue that this yields an insufficient representation and propose the Vision Action Transformer (VAT), a novel architecture that extends ViT and unlocks its full feature hierarchy. VAT processes specialized action tokens together with visual features across all transformer layers, enabling a deep and progressive fusion of
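The contrast described above can be illustrated with a toy sketch. It is one possible reading of the description, not the paper's implementation: it assumes the action tokens are appended to the patch tokens and attend jointly with them in every layer, and it replaces a real ViT block with single-head self-attention using identity projections (no learned weights, MLP, or layer norm). All function names here are hypothetical.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    # Toy stand-in for a ViT block: single-head self-attention with
    # identity projections plus a residual connection (assumption).
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        out.append([x + y for x, y in zip(q, mixed)])
    return out


def all_layer_forward(patch_tokens, action_tokens, num_layers):
    # Assumed VAT-style path: action tokens travel with the patch tokens
    # through every layer, so they see visual features at each depth.
    n_patches = len(patch_tokens)
    seq = patch_tokens + action_tokens
    for _ in range(num_layers):
        seq = self_attention(seq)
    return seq[n_patches:]  # action tokens after joint processing


def final_layer_only_forward(patch_tokens, num_layers):
    # Baseline path: run the encoder on patches alone and mean-pool only
    # the last layer's output; intermediate features are never read out.
    seq = patch_tokens
    for _ in range(num_layers):
        seq = self_attention(seq)
    dim = len(seq[0])
    return [sum(tok[d] for tok in seq) / len(seq) for d in range(dim)]


random.seed(0)
patches = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]  # 4 patches, dim 8
actions = [[0.0] * 8 for _ in range(2)]                               # 2 action tokens
action_out = all_layer_forward(patches, actions, num_layers=3)
pooled = final_layer_only_forward(patches, num_layers=3)
```

In the baseline, any downstream policy head sees only `pooled`, a summary of the last layer. In the all-layer path, each action token is updated at every depth, which is the progressive fusion the description refers to.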

Source

http://arxiv.org/abs/2512.06013v2