sim · medium · rl · metric · varies
RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
Description
Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable doma