Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training.
We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven "annotation-reasoning-annotation" closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The Inspire and Strengthen stage (RISE-R1) leverages a high-quality CoT subset for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations.
Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs.
RISE operates through two stages to enhance VLM reasoning capabilities for image annotation tasks:
The first stage (RISE-CoT) generates high-quality, visually grounded Chains of Thought (CoTs) for image-annotation pairs in a self-supervised manner: the VLM proposes a CoT from the image and its original annotation, and the CoT is kept only if the annotation can be reconstructed from the image and the CoT alone, without directly leaking the answer. The process is illustrated in Figure 1 and sketched in code below the figure.
Figure 1: RISE-CoT framework
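The following is a minimal sketch of the RISE-CoT closed loop under our own assumptions; the helper names (`generate_cot`, `reconstruct_annotation`) and the exact leakage check are illustrative placeholders, not the paper's implementation. In the full framework this verification signal also serves as the reinforcement learning reward for the CoT generator; here we only show verification and filtering.

```python
# Hypothetical sketch of the RISE-CoT "annotation-reasoning-annotation" loop.
# Helper names and the leakage heuristic are illustrative assumptions,
# not the paper's actual API.

from dataclasses import dataclass


@dataclass
class CoTSample:
    image_id: str
    annotation: str
    cot: str
    reward: float


def cot_reward(vlm, image, annotation) -> CoTSample:
    # 1) Propose a chain of thought conditioned on the image AND its annotation.
    cot = vlm.generate_cot(image, annotation)

    # 2) Verification pass: reconstruct the annotation from the image and the CoT
    #    only (the original annotation is hidden from this pass).
    reconstructed = vlm.reconstruct_annotation(image, cot)

    # 3) Reward consistency, but zero out CoTs that simply copy the answer.
    consistent = (reconstructed == annotation)
    leaked = annotation.lower() in cot.lower()  # crude leakage check
    reward = 1.0 if (consistent and not leaked) else 0.0

    return CoTSample(image_id=image.id, annotation=annotation, cot=cot, reward=reward)


def build_cot_dataset(vlm, dataset):
    """Run the closed loop over a dataset; keep only verified, non-leaking CoTs."""
    samples = [cot_reward(vlm, img, ann) for img, ann in dataset]
    return [s for s in samples if s.reward > 0]
```

In the paper the reward is used to optimize the CoT generator with reinforcement learning; the surviving high-reward samples then form the supervised data for the next stage.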
The second stage (RISE-R1) trains the VLM to produce structured "think-answer" outputs: supervised fine-tuning on the high-quality CoT subset from RISE-CoT, followed by reinforcement fine-tuning to yield interpretable reasoning and accurate annotations. The process is illustrated in Figure 2 and sketched in code below the figure.
Figure 2: RISE-R1 framework
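Below is a rough sketch of the RISE-R1 stage, assuming the SFT-then-RFT recipe described above and reusing the `CoTSample` objects from the previous sketch. The trainer calls (`sft_step`, `rft_step`), the tag template, and the 0.5/0.5 format/accuracy reward split are our own placeholder choices, not the paper's implementation details.

```python
# Hypothetical sketch of RISE-R1 training; all trainer calls are placeholders.
import re

THINK_ANSWER_TEMPLATE = "<think>{cot}</think><answer>{annotation}</answer>"


def format_target(sample):
    """Turn a verified RISE-CoT sample into a structured think-answer target."""
    return THINK_ANSWER_TEMPLATE.format(cot=sample.cot, annotation=sample.annotation)


def rft_reward(output: str, gold_annotation: str) -> float:
    """Combined format + accuracy reward (a common Visual-RFT-style design)."""
    match = re.search(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", output, re.S)
    if match is None:
        return 0.0                              # malformed output: no reward
    answer = match.group(2).strip()
    format_reward = 0.5
    accuracy_reward = 0.5 if answer == gold_annotation else 0.0
    return format_reward + accuracy_reward


def train_rise_r1(vlm, verified_cots, dataset, sft_epochs=1, rft_steps=1000):
    # Phase 1 (Inspire): supervised fine-tuning on the high-quality CoT subset.
    sft_data = [(s.image_id, format_target(s)) for s in verified_cots]
    for _ in range(sft_epochs):
        for image_id, target in sft_data:
            vlm.sft_step(image_id, target)      # placeholder trainer call

    # Phase 2 (Strengthen): reinforcement fine-tuning with the combined reward.
    for _ in range(rft_steps):
        image, gold = dataset.sample()
        output = vlm.generate(image)
        vlm.rft_step(image, output, rft_reward(output, gold))  # placeholder RL update

    return vlm
```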
We evaluated RISE on four image annotation datasets of varying complexity.
Across both complex and simple tasks, RISE-trained models outperform SFT and Visual-RFT baselines.
Our ablation studies confirm the contribution of each key RISE component.
We introduced RISE, a novel two-stage framework that significantly enhances VLMs for complex image annotation tasks. RISE autonomously generates high-quality CoTs by verifying their ability to reconstruct original annotations, then uses these CoTs to train VLMs to produce accurate and interpretable "think-answer" outputs directly from images.
Through its verifiable, self-supervised CoT generation, RISE improves annotation accuracy and interpretability while uniquely enabling implicit evaluation and refinement of dataset annotation quality. This framework effectively boosts the reasoning capabilities of lower-capacity VLMs across various image annotation tasks, allowing them to perform on par with larger models.
@article{hu2024rise,
  title={RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning},
  author={Hu, Suhang and Hu, Wei and Su, Yuhang and Zhang, Fan},
  journal={arXiv preprint},
  year={2024}
}