
Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang*, Simon Shaolei Du*, Yelong Shen*


Paper · Models/Dataset · Code · 📁 W&B Logs · X Summary

Updates

  • 18/06/2025: We updated the evaluation results for DeepSeek-R1-Distill-Qwen-1.5B (see details below) under different context lengths (8k and 32k), showing consistent improvements from few-shot RLVR. Note: a summary addressing the confusion regarding the evaluation of our DeepSeek-R1-Distill-Qwen-1.5B is available here; also see our 32k-length results below.
  • 17/05/2025: We released our checkpoints and dataset on Hugging Face.
  • 30/04/2025: 🎉 We released our paper, code, and wandb records. See the summary of our work on X (Twitter).

Setup

Training Environment

Our training pipeline is adapted from verl and rllm (DeepScaleR). The installation commands we have verified to work are as follows:

conda create -y -n rlvr_train python=3.10
conda activate rlvr_train
pip install -e .
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install ray vllm==0.6.3
pip install flash-attn --no-build-isolation
pip install wandb matplotlib
pip install huggingface_hub

If you are using H100 nodes and see errors like CUDA error: device kernel image is invalid, please refer to this issue for a fix.

Evaluation Environment

Our evaluation pipeline for math reasoning tasks is adapted from Qwen2.5-Math. The installation commands we have verified to work are as follows:

conda create -y -n rlvr_eval python=3.10
conda activate rlvr_eval
cd Qwen2.5-Eval/evaluation
cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt 
pip install vllm==0.5.1 --no-build-isolation
pip install transformers==4.42.3
pip install wandb matplotlib
pip install -U transformers
pip install vllm==0.6.3

Data

DSR-sub

We randomly select a subset of 1,209 examples from DeepScaleR-Preview-Dataset (DSR-sub) and use it as the instance pool for data selection. The training examples used in our paper are included in data/train/one_shot_rlvr. For the 1(few)-shot RLVR datasets, we duplicate the data until it reaches the training batch size (128 in our experiments).
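
As a rough illustration, here is a minimal sketch of this duplication step; the file name and record fields below are placeholders, not the repository's actual schema:

import pandas as pd

batch_size = 128  # training batch size used in our experiments
examples = [{"prompt": "...", "answer": "12.8"}]  # e.g., the single example pi_1

# Repeat the k-shot example set until it fills exactly one training batch.
repeats = -(-batch_size // len(examples))  # ceil division
duplicated = (examples * repeats)[:batch_size]

# Illustrative output path; verl-style pipelines read parquet training files.
pd.DataFrame(duplicated).to_parquet("one_shot_duplicated.parquet")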

(Optional) To obtain the training examples, we rank DSR-sub by the historical variance score, i.e., the variance of each example's historical training accuracy (we hope this inspires better data selection methods in the future). To obtain example $\pi_i$ ranked by the historical accuracy of Qwen2.5-Math-1.5B, change the top_index parameter in data/data_selection.sh to $i-1$, and then run bash data_selection.sh.
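
A minimal sketch of the ranking idea, assuming per-example accuracy histories have already been recorded (the variable names and numbers are illustrative, not the script's actual interface):

import numpy as np

# acc_history[name] = accuracy of one training example across saved checkpoints
# (placeholder numbers; the real script derives these from Qwen2.5-Math-1.5B runs).
acc_history = {
    "ex_a": [0.0, 0.25, 0.5, 0.75],  # accuracy varies a lot -> high variance score
    "ex_b": [1.0, 1.0, 1.0, 1.0],    # always solved -> variance 0
    "ex_c": [0.0, 0.0, 0.0, 0.0],    # never solved -> variance 0
}

# Rank examples by historical variance score, descending.
ranked = sorted(acc_history, key=lambda name: np.var(acc_history[name]), reverse=True)

top_index = 0                 # top_index = i - 1 selects example pi_i
print(ranked[top_index])      # -> "ex_a"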

As a reference, we present example $\pi_1$ here:

$\pi_1$:

Prompt:
"The pressure \\( P \\) exerted by wind on a sail varies jointly as the area \\( A \\) of the sail and the cube of the wind's velocity \\( V \\). When the velocity is \\( 8 \\) miles per hour, the pressure on a sail of \\( 2 \\) square feet is \\( 4 \\) pounds. Find the wind velocity when the pressure on \\( 4 \\) square feet of sail is \\( 32 \\) pounds. Let's think step by step and output the final answer within \\boxed{}."

Ground truth (label in DSR-sub):
12.8.

Training

Before training, set the checkpoint path and your wandb API key:

export CHECKPOINTS_DIR=./checkpoints # your checkpoint path
export WANDB_API_KEY=... # your wandb api key

To run 1-shot RLVR with $\pi_1$:

conda activate rlvr_train
bash scripts/train/training_1.5b_pi1_r128.sh

For comparison, the commands for running full-set RLVR on DSR-sub are as follows:

conda activate rlvr_train
bash scripts/train/training_1.5b_dsr_sub.sh 

Please change data.train_files and trainer.experiment_name in the training script when trying other training examples.

Evaluation

Eval Scripts for Qwen Models

To evaluate 1-shot RLVR with $\pi_1$ on six common math reasoning benchmarks (MATH500, AIME24, AMC23, Minerva Math, OlympiadBench, and AIME25), run:

conda activate rlvr_eval
cd Qwen2.5-Eval/evaluation
bash sh/eval_one_experiment_all_ckpts.sh

Here, for AIME24, AMC23, and AIME25, we report pass@8 results. Please adjust the experiment name in Qwen2.5-Eval/evaluation/sh/eval_one_experiment_all_ckpts.sh when using other training examples.
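
For reference, pass@8 counts a problem as solved if at least one of its 8 sampled responses is correct; a minimal sketch of that computation, with a placeholder correctness matrix:

import numpy as np

# correct[i, j] = 1 if the j-th sampled response to problem i is judged correct
# (random placeholder data; real entries come from the answer grader).
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=(30, 8))  # 30 problems, 8 samples each

# pass@8: fraction of problems with at least one correct sample.
pass_at_8 = (correct.max(axis=1) == 1).mean()
print(f"pass@8 = {pass_at_8:.3f}")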

Evaluation for DeepSeek-R1-Distill-Qwen-1.5B

For DeepSeek-R1-Distill-Qwen-1.5B, we can also evaluate using the official rllm (DeepScaleR) repo. Following DeepSeek-R1 and DeepScaleR, we use temperature=0.6 and top_p=0.95 for evaluation, with avg@16 for MATH500, Minerva Math, and OlympiadBench, and avg@64 for AIME24, AIME25, and AMC23 (avg@k averages accuracy over k sampled responses per problem). Since our training length is 8192, we provide evaluation results for both 8k and 32k evaluation lengths. The results can be reproduced with the provided checkpoints.
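
A minimal sketch of the corresponding sampling configuration in vLLM, assuming the Hugging Face model id below and a placeholder prompt (not the repo's actual evaluation driver):

from vllm import LLM, SamplingParams

# Sampling setup described above: temperature=0.6, top_p=0.95, 16 samples per
# problem, and an 8k-token generation budget.
sampling = SamplingParams(temperature=0.6, top_p=0.95, n=16, max_tokens=8192)
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
outputs = llm.generate(["<your math problem here>"], sampling)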

Evaluation length = 8192

| Model | Training Length | Evaluation Length | MATH500 (avg@16) | AIME24 (avg@64) | AMC23 (avg@64) | Minerva Math (avg@16) | OlympiadBench (avg@16) | AIME25 (avg@64) | Avg |
|---|---|---|---|---|---|---|---|---|---|
| R1-Distill-1.5B | – | 8k | 76.7 | 20.8 | 51.3 | 23.3 | 35.4 | 19.7 | 37.9 |
| 1-shot RLVR on R1-Distill-1.5B | 8k | 8k | 80.5 | 25.1 | 58.9 | 27.2 | 40.2 | 21.7 | 42.3 |
| 4-shot RLVR on R1-Distill-1.5B | 8k | 8k | 81.2 | 25.8 | 60.1 | 26.8 | 40.4 | 22.0 | 42.7 |
| 16-shot RLVR on R1-Distill-1.5B | 8k | 8k | 83.3 | 29.6 | 64.8 | 29.3 | 43.3 | 22.8 | 45.5 |
| 1.2k-shot (DSR-sub) RLVR on R1-Distill-1.5B | 8k | 8k | 84.4 | 30.2 | 68.3 | 29.2 | 45.8 | 26.7 | 47.4 |
| DeepScaleR-1.5B-Preview (40k DSR data) | 8k→16k→24k | 8k | 86.3 | 35.2 | 68.1 | 29.6 | 46.7 | 28.3 | 49.0 |

Evaluation length = 32768

| Model | Training Length | Evaluation Length | MATH500 (avg@16) | AIME24 (avg@64) | AMC23 (avg@64) | Minerva Math (avg@16) | OlympiadBench (avg@16) | AIME25 (avg@64) | Avg |
|---|---|---|---|---|---|---|---|---|---|
| R1-Distill-1.5B | – | 32k | 82.9 | 29.8 | 63.2 | 26.4 | 43.1 | 23.9 | 44.9 |
| R1-Distill-1.5B (reported) | – | 32k | 83.9 | 28.9 | – | – | – | – | – |
| 1-shot RLVR on R1-Distill-1.5B | 8k | 32k | 83.9 | 31.0 | 66.1 | 28.3 | 44.6 | 24.1 | 46.3 |
| 4-shot RLVR on R1-Distill-1.5B | 8k | 32k | 84.8 | 32.2 | 66.6 | 27.7 | 45.5 | 24.8 | 46.9 |
| 16-shot RLVR on R1-Distill-1.5B | 8k | 32k | 84.5 | 34.3 | 69.0 | 30.0 | 46.9 | 25.2 | 48.3 |
| 1.2k-shot (DSR-sub) RLVR on R1-Distill-1.5B | 8k | 32k | 84.5 | 32.7 | 70.1 | 29.5 | 46.9 | 27.8 | 48.6 |
| DeepScaleR-1.5B-Preview (40k DSR data) | 8k→16k→24k | 32k | 87.6 | 41.4 | 73.2 | 30.6 | 49.6 | 31.3 | 52.3 |
| DeepScaleR-1.5B-Preview (reported) | 8k→16k→24k | 32k | 87.8 | 43.1 (avg@16) | 73.6 (avg@16) | 30.2 (avg@16) | 50.0 (avg@16) | – | – |

W&B

We have logged our experiments for three models in this wandb project, including the results of 1(few)-shot RLVR on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and DeepSeek-R1-Distill-Qwen-1.5B, as well as the baseline of full-set RLVR on DSR-sub. Please note that the displayed validation results are computed with the verl/rllm framework and may differ slightly from the qwen-eval results.

Acknowledgements

Our training pipeline is adapted from verl and rllm (DeepScaleR), and our evaluation pipeline is adapted from Qwen2.5-Math.

Citation

@article{wang2025reinforcement,
  title={Reinforcement Learning for Reasoning in Large Language Models with One Training Example},
  author={Wang, Yiping and Yang, Qing and Zeng, Zhiyuan and Ren, Liliang and Liu, Lucas and Peng, Baolin and Cheng, Hao and He, Xuehai and Wang, Kuan and Gao, Jianfeng and Chen, Weizhu and Wang, Shuohang and Du, Simon Shaolei and Shen, Yelong},
  journal={arXiv preprint arXiv:2504.20571},
  year={2025}
}
