Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Description

Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized fo

Source

http://arxiv.org/abs/2603.06569v2