Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang*, Simon Shaolei Du*, Yelong Shen*
- 18/06/2025: We updated the evaluation results for DeepSeek-R1-Distill-Qwen-1.5B (see details below) at different context lengths (8k and 32k), showing consistent improvement from few-shot RLVR. Note: a summary addressing the confusion regarding the evaluation of DeepSeek-R1-Distill-Qwen-1.5B is available here. Also see our results with 32k length below.
- 17/05/2025: We release our checkpoints and dataset on Hugging Face.
- 30/04/2025: 🎉 We release our paper, code, and wandb records. See a summary of our work on X (Twitter).
Our training pipeline is adapted from verl and rllm (DeepScaleR). The installation commands that we have verified are as follows:
conda create -y -n rlvr_train python=3.10
conda activate rlvr_train
pip install -e .
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install ray vllm==0.6.3
pip install flash-attn --no-build-isolation
pip install wandb matplotlib
pip install huggingface_hub
If you are using H100 nodes and see errors like `CUDA error: device kernel image is invalid`, please refer to this issue to fix the problem.
Our evaluation pipeline for math reasoning tasks is adapted from Qwen2.5-Math. The installation commands that we have verified are as follows:
conda create -y -n rlvr_eval python=3.10
conda activate rlvr_eval
cd Qwen2.5-Eval/evaluation
cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt
pip install vllm==0.5.1 --no-build-isolation
pip install transformers==4.42.3
pip install wandb matplotlib
pip install -U transformers
pip install vllm==0.6.3
We randomly select a subset of 1209 examples from the DeepScaleR-Preview-Dataset (DSR-sub) and use it as the instance pool for data selection. The training examples used in our paper are included in `data/train/one_shot_rlvr`. For the 1(few)-shot RLVR datasets, we duplicate the data until it reaches the training batch size (128 in our experiments), as sketched below.
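A minimal sketch of this duplication step, assuming verl-style parquet train files; the file names are illustrative, not the repo's actual paths.

```python
# Duplicate a single training example until it fills one training batch.
import pandas as pd

BATCH_SIZE = 128  # training batch size used in our experiments

# Load a dataset containing one example (1-shot RLVR); path is illustrative.
one_shot = pd.read_parquet("data/train/one_shot_rlvr/pi1.parquet")

# Repeat the example to match the batch size and save it as the train file.
repeated = pd.concat([one_shot] * BATCH_SIZE, ignore_index=True)
repeated.to_parquet("data/train/one_shot_rlvr/pi1_r128.parquet")
print(len(repeated))  # 128
```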
(Optional) To select the training examples, we rank DSR-sub by the historical variance score, which computes the variance of each example's historical training accuracy (we hope this can inspire better data selection methods in the future). To obtain the top-ranked examples, change the `top_index` parameter in `data/data_selection.sh` and run `bash data_selection.sh`.
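A minimal sketch of the historical variance score, assuming the per-step training accuracy of every candidate example has been logged; the array layout is illustrative, not the repo's actual format.

```python
# Rank candidate examples by the variance of their historical accuracy.
import numpy as np

def historical_variance_scores(acc_history: np.ndarray) -> np.ndarray:
    """acc_history: (num_examples, num_steps) accuracies in [0, 1].
    Returns one variance score per example."""
    return acc_history.var(axis=1)

# Example: 3 candidates tracked over 4 training steps.
history = np.array([
    [0.00, 0.25, 0.50, 1.00],  # accuracy changes during training -> high variance
    [1.00, 1.00, 1.00, 1.00],  # always solved -> zero variance
    [0.00, 0.00, 0.00, 0.00],  # never solved -> zero variance
])
scores = historical_variance_scores(history)
ranking = np.argsort(-scores)  # `top_index` selects from this ranking
print(scores, ranking)         # [0.1367 0. 0.] [0 1 2]
```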
As a reference, we present the example used for 1-shot RLVR below.
Prompt:
"The pressure \\( P \\) exerted by wind on a sail varies jointly as the area \\( A \\) of the sail and the cube of the wind's velocity \\( V \\). When the velocity is \\( 8 \\) miles per hour, the pressure on a sail of \\( 2 \\) square feet is \\( 4 \\) pounds. Find the wind velocity when the pressure on \\( 4 \\) square feet of sail is \\( 32 \\) pounds. Let's think step by step and output the final answer within \\boxed{}."
Ground truth (label in DSR-sub): 12.8
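For reference, here is a worked solution of this example. Note that the exact answer is $8\sqrt[3]{4} \approx 12.70$, so the DSR-sub label 12.8 is only approximately correct.

$$
P = kAV^3, \qquad 4 = k \cdot 2 \cdot 8^3 \implies k = \frac{1}{256},
$$

$$
32 = \frac{1}{256} \cdot 4 \cdot V^3 \implies V^3 = 2048 \implies V = 2^{11/3} = 8\sqrt[3]{4} \approx 12.70.
$$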
Before training, set the checkpoint path and your wandb API key:
export CHECKPOINTS_DIR=./checkpoints # your checkpoint path
export WANDB_API_KEY=... # your wandb api key
To run 1-shot RLVR with the example above, run:
conda activate rlvr_train
bash scripts/train/training_1.5b_pi1_r128.sh
As a comparison, the commands for running full-set RLVR on DSR-sub are as follows:
conda activate rlvr_train
bash scripts/train/training_1.5b_dsr_sub.sh
Please change `data.train_files` and `trainer.experiment_name` in the training script when trying other training examples.
To run evaluation for 1-shot RLVR, run:
conda activate rlvr_eval
cd Qwen2.5-Eval/evaluation
bash sh/eval_one_experiment_all_ckpts.sh
Here for AIME24, AMC23, and AIME25, we evaluate the pass@8 results.
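As a reference, a minimal sketch of the pass@k metric (here k = 8), computed from per-sample correctness; the boolean-matrix layout is illustrative, not the pipeline's actual format.

```python
# pass@k: fraction of problems solved by at least one of k sampled generations.
import numpy as np

def pass_at_k(correct: np.ndarray) -> float:
    """correct: (num_problems, k) boolean matrix of per-sample correctness."""
    return float(correct.any(axis=1).mean())

# Example: 2 problems, 8 samples each.
correct = np.array([
    [False, True, False, False, False, False, False, False],   # solved once -> counts
    [False, False, False, False, False, False, False, False],  # never solved
])
print(pass_at_k(correct))  # 0.5
```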
Please adjust the experiment name in `Qwen2.5-Eval/evaluation/sh/eval_one_experiment_all_ckpts.sh` when using other training examples.
For DeepSeek-R1-Distill-Qwen-1.5B, we can also evaluate with the official rllm (DeepScaleR) repo. Following DeepSeek-R1 and DeepScaleR, we use `temperature=0.6` and `top_p=0.95` for evaluation, with avg@16 for MATH500, Minerva Math, and OlympiadBench, and avg@64 for AIME24, AIME25, and AMC23. Since our training length is 8192, we provide evaluation results for both 8k and 32k evaluation lengths. The results can be reproduced with the provided checkpoints.
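Likewise, a minimal sketch of avg@k (mean per-sample accuracy, averaged over problems), under the same illustrative layout.

```python
# avg@k: average accuracy over k sampled generations per problem.
import numpy as np

def avg_at_k(correct: np.ndarray) -> float:
    """correct: (num_problems, k) boolean matrix of per-sample correctness."""
    return float(correct.mean(axis=1).mean())

# Example: 2 problems, 16 samples each (avg@16).
correct = np.zeros((2, 16), dtype=bool)
correct[0, :8] = True  # problem 0 solved in 8/16 samples
correct[1, :4] = True  # problem 1 solved in 4/16 samples
print(avg_at_k(correct))  # (0.5 + 0.25) / 2 = 0.375
```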
Results with 8k evaluation length:

| Model | Training Length | Evaluation Length | MATH 500 (avg@16) | AIME 2024 (avg@64) | AMC 2023 (avg@64) | Minerva Math (avg@16) | Olympiad Bench (avg@16) | AIME 2025 (avg@64) | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R1-Distill-1.5B | – | 8k | 76.7 | 20.8 | 51.3 | 23.3 | 35.4 | 19.7 | 37.9 |
| 1-shot RLVR on R1-Distill-1.5B | 8k | 8k | 80.5 | 25.1 | 58.9 | 27.2 | 40.2 | 21.7 | 42.3 |
| 4-shot RLVR on R1-Distill-1.5B | 8k | 8k | 81.2 | 25.8 | 60.1 | 26.8 | 40.4 | 22.0 | 42.7 |
| 16-shot RLVR on R1-Distill-1.5B | 8k | 8k | 83.3 | 29.6 | 64.8 | 29.3 | 43.3 | 22.8 | 45.5 |
| 1.2k-shot (DSR-sub) RLVR on R1-Distill-1.5B | 8k | 8k | 84.4 | 30.2 | 68.3 | 29.2 | 45.8 | 26.7 | 47.4 |
| DeepScaleR-1.5B-Preview (40k DSR data) | 8k→16k→24k | 8k | 86.3 | 35.2 | 68.1 | 29.6 | 46.7 | 28.3 | 49.0 |
Results with 32k evaluation length:

| Model | Training Length | Evaluation Length | MATH 500 (avg@16) | AIME 2024 (avg@64) | AMC 2023 (avg@64) | Minerva Math (avg@16) | Olympiad Bench (avg@16) | AIME 2025 (avg@64) | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R1-Distill-1.5B | – | 32k | 82.9 | 29.8 | 63.2 | 26.4 | 43.1 | 23.9 | 44.9 |
| R1-Distill-1.5B (reported) | – | 32k | 83.9 | 28.9 | – | – | – | – | – |
| 1-shot RLVR on R1-Distill-1.5B | 8k | 32k | 83.9 | 31.0 | 66.1 | 28.3 | 44.6 | 24.1 | 46.3 |
| 4-shot RLVR on R1-Distill-1.5B | 8k | 32k | 84.8 | 32.2 | 66.6 | 27.7 | 45.5 | 24.8 | 46.9 |
| 16-shot RLVR on R1-Distill-1.5B | 8k | 32k | 84.5 | 34.3 | 69.0 | 30.0 | 46.9 | 25.2 | 48.3 |
| 1.2k-shot (DSR-sub) RLVR on R1-Distill-1.5B | 8k | 32k | 84.5 | 32.7 | 70.1 | 29.5 | 46.9 | 27.8 | 48.6 |
| DeepScaleR-1.5B-Preview (40k DSR data) | 8k→16k→24k | 32k | 87.6 | 41.4 | 73.2 | 30.6 | 49.6 | 31.3 | 52.3 |
| DeepScaleR-1.5B-Preview (reported) | 8k→16k→24k | 32k | 87.8 | 43.1 (avg@16) | 73.6 (avg@16) | 30.2 (avg@16) | 50.0 (avg@16) | – | – |
We have logged our experiments for the three models in this wandb project, including the results of 1(few)-shot RLVR on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and DeepSeek-R1-Distill-Qwen-1.5B. We also include the baseline of full-set RLVR with DSR-sub. Please note that the validation results displayed are calculated with the verl/rllm framework and may differ slightly from the qwen-eval results.
- Our training experiments are powered by a modified fork of rllm (DeepScaleR) and verl.
- Our evaluation experiments are based on a modified fork of Qwen2.5-Math.
- Our models are trained on top of Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Llama-3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-1.5B.
@article{wang2025reinforcement,
title={Reinforcement Learning for Reasoning in Large Language Models with One Training Example},
author={Wang, Yiping and Yang, Qing and Zeng, Zhiyuan and Ren, Liliang and Liu, Lucas and Peng, Baolin and Cheng, Hao and He, Xuehai and Wang, Kuan and Gao, Jianfeng and Chen, Weizhu and Wang, Shuohang and Du, Simon Shaolei and Shen, Yelong},
journal={arXiv preprint arXiv:2504.20571},
year={2025}
}