Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang*, Simon Shaolei Du*, Yelong Shen*
- 18/06/2025: We updated the evaluation results for DeepSeek-R1-Distill-Qwen-1.5B (see details below) at different context lengths (8k and 32k), showing consistent improvement from few-shot RLVR. Note: a summary addressing the confusion regarding the evaluation of DeepSeek-R1-Distill-Qwen-1.5B is available here. Also see our results with 32k length below.
- 17/05/2025: We release our checkpoints and dataset on Hugging Face.
- 30/04/2025: 🎉 We release our paper, code, and wandb records. See a summary of our work on X (Twitter).
Our training pipeline is adapted from verl and rllm (DeepScaleR). The installation commands that we have verified are as follows:
conda create -y -n rlvr_train python=3.10
conda activate rlvr_train
pip install -e .
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install ray vllm==0.6.3
pip install flash-attn --no-build-isolation
pip install wandb matplotlib
pip install huggingface_hub
If you are using H100 nodes and see errors like `CUDA error: device kernel image is invalid`, please refer to this issue to fix the problem.
Our evaluation pipeline for math reasoning tasks is adapted from Qwen2.5-Math. The installation commands that we have verified are as follows:
conda create -y -n rlvr_eval python=3.10
conda activate rlvr_eval
cd Qwen2.5-Eval/evaluation
cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt
pip install vllm==0.5.1 --no-build-isolation
pip install transformers==4.42.3
pip install wandb matplotlib
pip install -U transformers
pip install vllm==0.6.3
We randomly select a subset of 1209 examples from the DeepScaleR-Preview-Dataset (DSR-sub) and use it as the instance pool for data selection. The training examples used in our paper are included in `data/train/one_shot_rlvr`. For the 1(few)-shot RLVR datasets, we duplicate the data until it reaches the training batch size (128 in our experiments), as sketched below.
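A minimal sketch of this duplication step, assuming verl-style parquet train files; the file names are illustrative, not the repo's actual paths.

```python
# Duplicate a single training example until it fills one training batch.
import pandas as pd

BATCH_SIZE = 128  # training batch size used in our experiments

# Load a dataset containing one example (1-shot RLVR); path is illustrative.
one_shot = pd.read_parquet("data/train/one_shot_rlvr/pi1.parquet")

# Repeat the example to match the batch size and save it as the train file.
repeated = pd.concat([one_shot] * BATCH_SIZE, ignore_index=True)
repeated.to_parquet("data/train/one_shot_rlvr/pi1_r128.parquet")
print(len(repeated))  # 128
```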
(Optional) To select the training examples, we rank DSR-sub by the historical variance score, which computes the variance of each example's historical training accuracy (we hope this can inspire better data selection methods in the future). To obtain the top-ranked examples, change the `top_index` parameter in `data/data_selection.sh` and run `bash data_selection.sh`.
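A minimal sketch of the historical variance score, assuming the per-step training accuracy of every candidate example has been logged; the array layout is illustrative, not the repo's actual format.

```python
# Rank candidate examples by the variance of their historical accuracy.
import numpy as np

def historical_variance_scores(acc_history: np.ndarray) -> np.ndarray:
    """acc_history: (num_examples, num_steps) accuracies in [0, 1].
    Returns one variance score per example."""
    return acc_history.var(axis=1)

# Example: 3 candidates tracked over 4 training steps.
history = np.array([
    [0.00, 0.25, 0.50, 1.00],  # accuracy changes during training -> high variance
    [1.00, 1.00, 1.00, 1.00],  # always solved -> zero variance
    [0.00, 0.00, 0.00, 0.00],  # never solved -> zero variance
])
scores = historical_variance_scores(history)
ranking = np.argsort(-scores)  # `top_index` selects from this ranking
print(scores, ranking)         # [0.1367 0. 0.] [0 1 2]
```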
As a reference, we present the example used for 1-shot RLVR below.
Prompt:
"The pressure \\( P \\) exerted by wind on a sail varies jointly as the area \\( A \\) of the sail and the cube of the wind's velocity \\( V \\). When the velocity is \\( 8 \\) miles per hour, the pressure on a sail of \\( 2 \\) square feet is \\( 4 \\) pounds. Find the wind velocity when the pressure on \\( 4 \\) square feet of sail is \\( 32 \\) pounds. Let's think step by step and output the final answer within \\boxed{}."
Ground truth (label in DSR-sub): 12.8
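For reference, here is a worked solution of this example. Note that the exact answer is $8\sqrt[3]{4} \approx 12.70$, so the DSR-sub label 12.8 is only approximately correct.

$$
P = kAV^3, \qquad 4 = k \cdot 2 \cdot 8^3 \implies k = \frac{1}{256},
$$

$$
32 = \frac{1}{256} \cdot 4 \cdot V^3 \implies V^3 = 2048 \implies V = 2^{11/3} = 8\sqrt[3]{4} \approx 12.70.
$$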
Before training, set the checkpoint path and your wandb API key:
export CHECKPOINTS_DIR=./checkpoints # your checkpoint path
export WANDB_API_KEY=... # your wandb api key
To run 1-shot RLVR with the example above, run:
conda activate rlvr_train
bash scripts/train/training_1.5b_pi1_r128.sh
As a comparison, the commands for running full-set RLVR on DSR-sub are as follows:
conda activate rlvr_train
bash scripts/train/training_1.5b_dsr_sub.sh
Please change `data.train_files` and `trainer.experiment_name` in the training script when trying other training examples.
To run evaluation for 1-shot RLVR, run:
conda activate rlvr_eval
cd Qwen2.5-Eval/evaluation
bash sh/eval_one_experiment_all_ckpts.sh
Here for AIME24, AMC23, and AIME25, we evaluate the pass@8 results.
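As a reference, a minimal sketch of the pass@k metric (here k = 8), computed from per-sample correctness; the boolean-matrix layout is illustrative, not the pipeline's actual format.

```python
# pass@k: fraction of problems solved by at least one of k sampled generations.
import numpy as np

def pass_at_k(correct: np.ndarray) -> float:
    """correct: (num_problems, k) boolean matrix of per-sample correctness."""
    return float(correct.any(axis=1).mean())

# Example: 2 problems, 8 samples each.
correct = np.array([
    [False, True, False, False, False, False, False, False],   # solved once -> counts
    [False, False, False, False, False, False, False, False],  # never solved
])
print(pass_at_k(correct))  # 0.5
```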
Please adjust the experiment name in `Qwen2.5-Eval/evaluation/sh/eval_one_experiment_all_ckpts.sh` when using other training examples.
For DeepSeek-R1-Distill-Qwen-1.5B, we can also evaluate with the official rllm (DeepScaleR) repo. Following DeepSeek-R1 and DeepScaleR, we use `temperature=0.6` and `top_p=0.95` for evaluation, with avg@16 for MATH500, Minerva Math, and OlympiadBench, and avg@64 for AIME24, AIME25, and AMC23. Since our training length is 8192, we provide evaluation results for both 8k and 32k evaluation lengths. The results can be reproduced with the provided checkpoints.
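Likewise, a minimal sketch of avg@k (mean per-sample accuracy, averaged over problems), under the same illustrative layout.

```python
# avg@k: average accuracy over k sampled generations per problem.
import numpy as np

def avg_at_k(correct: np.ndarray) -> float:
    """correct: (num_problems, k) boolean matrix of per-sample correctness."""
    return float(correct.mean(axis=1).mean())

# Example: 2 problems, 16 samples each (avg@16).
correct = np.zeros((2, 16), dtype=bool)
correct[0, :8] = True  # problem 0 solved in 8/16 samples
correct[1, :4] = True  # problem 1 solved in 4/16 samples
print(avg_at_k(correct))  # (0.5 + 0.25) / 2 = 0.375
```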
Results with 8k evaluation length:

| Model | Training Length | Evaluation Length | MATH 500 (avg@16) | AIME 2024 (avg@64) | AMC 2023 (avg@64) | Minerva Math (avg@16) | Olympiad Bench (avg@16) | AIME 2025 (avg@64) | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R1-Distill-1.5B | – | 8k | 76.7 | 20.8 | 51.3 | 23.3 | 35.4 | 19.7 | 37.9 |
| 1-shot RLVR on R1-Distill-1.5B | 8k | 8k | 80.5 | 25.1 | 58.9 | 27.2 | 40.2 | 21.7 | 42.3 |
| 4-shot RLVR on R1-Distill-1.5B | 8k | 8k | 81.2 | 25.8 | 60.1 | 26.8 | 40.4 | 22.0 | 42.7 |
| 16-shot RLVR on R1-Distill-1.5B | 8k | 8k | 83.3 | 29.6 | 64.8 | 29.3 | 43.3 | 22.8 | 45.5 |
| 1.2k-shot (DSR-sub) RLVR on R1-Distill-1.5B | 8k | 8k | 84.4 | 30.2 | 68.3 | 29.2 | 45.8 | 26.7 | 47.4 |
| DeepScaleR-1.5B-Preview (40k DSR data) | 8k→16k→24k | 8k | 86.3 | 35.2 | 68.1 | 29.6 | 46.7 | 28.3 | 49.0 |
Results with 32k evaluation length:

| Model | Training Length | Evaluation Length | MATH 500 (avg@16) | AIME 2024 (avg@64) | AMC 2023 (avg@64) | Minerva Math (avg@16) | Olympiad Bench (avg@16) | AIME 2025 (avg@64) | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R1-Distill-1.5B | – | 32k | 82.9 | 29.8 | 63.2 | 26.4 | 43.1 | 23.9 | 44.9 |
| R1-Distill-1.5B (reported) | – | 32k | 83.9 | 28.9 | – | – | – | – | – |
| 1-shot RLVR on R1-Distill-1.5B | 8k | 32k | 83.9 | 31.0 | 66.1 | 28.3 | 44.6 | 24.1 | 46.3 |
| 4-shot RLVR on R1-Distill-1.5B | 8k | 32k | 84.8 | 32.2 | 66.6 | 27.7 | 45.5 | 24.8 | 46.9 |
| 16-shot RLVR on R1-Distill-1.5B | 8k | 32k | 84.5 | 34.3 | 69.0 | 30.0 | 46.9 | 25.2 | 48.3 |
| 1.2k-shot (DSR-sub) RLVR on R1-Distill-1.5B | 8k | 32k | 84.5 | 32.7 | 70.1 | 29.5 | 46.9 | 27.8 | 48.6 |
| DeepScaleR-1.5B-Preview (40k DSR data) | 8k→16k→24k | 32k | 87.6 | 41.4 | 73.2 | 30.6 | 49.6 | 31.3 | 52.3 |
| DeepScaleR-1.5B-Preview (reported) | 8k→16k→24k | 32k | 87.8 | 43.1 (avg@16) | 73.6 (avg@16) | 30.2 (avg@16) | 50.0 (avg@16) | – | – |
We have logged our experiments for the three models in this wandb project, including the results of 1(few)-shot RLVR on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and DeepSeek-R1-Distill-Qwen-1.5B. We also include the baseline of full-set RLVR with DSR-sub. Please note that the validation results displayed are calculated with the verl/rllm framework and may differ slightly from the qwen-eval results.
- Our training experiments are powered by a modified fork of rllm (DeepScaleR) and verl.
- Our evaluation experiments are based on a modified fork of Qwen2.5-Math.
- Our models are trained on top of Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Llama-3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-1.5B.
@article{wang2025reinforcement,
title={Reinforcement Learning for Reasoning in Large Language Models with One Training Example},
author={Wang, Yiping and Yang, Qing and Zeng, Zhiyuan and Ren, Liliang and Liu, Lucas and Peng, Baolin and Cheng, Hao and He, Xuehai and Wang, Kuan and Gao, Jianfeng and Chen, Weizhu and Wang, Shuohang and Du, Simon Shaolei and Shen, Yelong},
journal={arXiv preprint arXiv:2504.20571},
year={2025}
}