Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training.
We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven "annotation-reasoning-annotation" closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The Inspire and Strengthen stage (RISE-R1) leverages a high-quality CoT subset for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations.
Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs.
RISE operates through two stages to enhance VLM reasoning capabilities for image annotation tasks:
The first stage (RISE-CoT) generates high-quality, visually grounded Chains of Thought (CoTs) for image-annotation pairs in a self-supervised manner: the VLM proposes a CoT from the image and its original annotation, and the CoT is kept only if the annotation can be reconstructed from the image and the CoT alone, without directly leaking the answer. The process is illustrated in Figure 1 and sketched in code below the figure.
Figure 1: RISE-CoT framework
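The following is a minimal sketch of the RISE-CoT closed loop under our own assumptions; the helper names (`generate_cot`, `reconstruct_annotation`) and the exact leakage check are illustrative placeholders, not the paper's implementation. In the full framework this verification signal also serves as the reinforcement learning reward for the CoT generator; here we only show verification and filtering.

```python
# Hypothetical sketch of the RISE-CoT "annotation-reasoning-annotation" loop.
# Helper names and the leakage heuristic are illustrative assumptions,
# not the paper's actual API.

from dataclasses import dataclass


@dataclass
class CoTSample:
    image_id: str
    annotation: str
    cot: str
    reward: float


def cot_reward(vlm, image, annotation) -> CoTSample:
    # 1) Propose a chain of thought conditioned on the image AND its annotation.
    cot = vlm.generate_cot(image, annotation)

    # 2) Verification pass: reconstruct the annotation from the image and the CoT
    #    only (the original annotation is hidden from this pass).
    reconstructed = vlm.reconstruct_annotation(image, cot)

    # 3) Reward consistency, but zero out CoTs that simply copy the answer.
    consistent = (reconstructed == annotation)
    leaked = annotation.lower() in cot.lower()  # crude leakage check
    reward = 1.0 if (consistent and not leaked) else 0.0

    return CoTSample(image_id=image.id, annotation=annotation, cot=cot, reward=reward)


def build_cot_dataset(vlm, dataset):
    """Run the closed loop over a dataset; keep only verified, non-leaking CoTs."""
    samples = [cot_reward(vlm, img, ann) for img, ann in dataset]
    return [s for s in samples if s.reward > 0]
```

In the paper the reward is used to optimize the CoT generator with reinforcement learning; the surviving high-reward samples then form the supervised data for the next stage.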
The second stage (RISE-R1) trains the VLM to produce structured "think-answer" outputs: supervised fine-tuning on the high-quality CoT subset from RISE-CoT, followed by reinforcement fine-tuning to yield interpretable reasoning and accurate annotations. The process is illustrated in Figure 2 and sketched in code below the figure.
Figure 2: RISE-R1 framework
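Below is a rough sketch of the RISE-R1 stage, assuming the SFT-then-RFT recipe described above and reusing the `CoTSample` objects from the previous sketch. The trainer calls (`sft_step`, `rft_step`), the tag template, and the 0.5/0.5 format/accuracy reward split are our own placeholder choices, not the paper's implementation details.

```python
# Hypothetical sketch of RISE-R1 training; all trainer calls are placeholders.
import re

THINK_ANSWER_TEMPLATE = "<think>{cot}</think><answer>{annotation}</answer>"


def format_target(sample):
    """Turn a verified RISE-CoT sample into a structured think-answer target."""
    return THINK_ANSWER_TEMPLATE.format(cot=sample.cot, annotation=sample.annotation)


def rft_reward(output: str, gold_annotation: str) -> float:
    """Combined format + accuracy reward (a common Visual-RFT-style design)."""
    match = re.search(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", output, re.S)
    if match is None:
        return 0.0                              # malformed output: no reward
    answer = match.group(2).strip()
    format_reward = 0.5
    accuracy_reward = 0.5 if answer == gold_annotation else 0.0
    return format_reward + accuracy_reward


def train_rise_r1(vlm, verified_cots, dataset, sft_epochs=1, rft_steps=1000):
    # Phase 1 (Inspire): supervised fine-tuning on the high-quality CoT subset.
    sft_data = [(s.image_id, format_target(s)) for s in verified_cots]
    for _ in range(sft_epochs):
        for image_id, target in sft_data:
            vlm.sft_step(image_id, target)      # placeholder trainer call

    # Phase 2 (Strengthen): reinforcement fine-tuning with the combined reward.
    for _ in range(rft_steps):
        image, gold = dataset.sample()
        output = vlm.generate(image)
        vlm.rft_step(image, output, rft_reward(output, gold))  # placeholder RL update

    return vlm
```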
We evaluated RISE on four image annotation datasets of varying complexity.
Across both complex and simple tasks, RISE-trained models outperform SFT and Visual-RFT baselines.
Our ablation studies confirm the contribution of each key RISE component.
We introduced RISE, a novel two-stage framework that significantly enhances VLMs for complex image annotation tasks. RISE autonomously generates high-quality CoTs by verifying their ability to reconstruct original annotations, then uses these CoTs to train VLMs to produce accurate and interpretable "think-answer" outputs directly from images.
Through its verifiable, self-supervised CoT generation, RISE improves annotation accuracy and interpretability while uniquely enabling implicit evaluation and refinement of dataset annotation quality. This framework effectively boosts the reasoning capabilities of lower-capacity VLMs across various image annotation tasks, allowing them to perform on par with larger models.
@article{hu2024rise,
  title={RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning},
  author={Hu, Suhang and Hu, Wei and Su, Yuhang and Zhang, Fan},
  journal={arXiv preprint},
  year={2024}
}