sim · medium · rl · metric · varies
LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation
Description
Unified multimodal pretraining has emerged as a promising paradigm for jointly modeling language and vision within a single foundation model. However, existing approaches largely rely on implicit or indirect alignment signals and remain suboptimal for simultaneously supporting multimodal understanding and generation, particularly in settings that require fine-grained language-visual reasoning and controllable generation. In this work, we propose LVRPO, a language-visual reinforcement-based prefe