Qwen 3 and DeepSeek R1's Post-Training Approaches
Introduction
The Qwen 3 series (specifically Qwen3-235B-A22B and Qwen3-32B), recently announced by Alibaba Cloud, and DeepSeek R1, announced by DeepSeek AI in January of this year, have both pushed the boundary of AI reasoning with distinct post-training approaches.
In this post we take a close look at the post-training approaches of these two advanced models, review how they are converging, and note where, if at all, they diverge.
It's important to acknowledge that the detailed technical literature published by both labs has democratized this knowledge and, in turn, made this post possible.
We have come a long way since January of this year in our ability to appreciate the details of the pre-training and post-training stages, which were largely opaque before then.
We focus on the similarities in the frontier-model pipelines, not on the smaller model variants subsequently distilled from them.
Post Training Similarities
Both the DeepSeek R1 and Qwen 3 pipelines follow a multi-stage tuning approach.
Both pipelines use RL heavily, applying it across multiple post-training stages (namely RL tuning for reasoning and general RL tuning).
Both pipelines use a cold start to seed long CoT before the RL tuning.
The long-CoT cold start primarily exists to keep models from struggling with challenges like poor readability and language mixing. In fact, DeepSeek built two models, one with the cold start (DeepSeek-R1) and one without it (DeepSeek-R1-Zero), showing that pure RL can elicit strong reasoning but, without the cold start, suffers from poor readability and language mixing.
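To make the cold-start idea concrete, here is a minimal sketch, in Python, of what a long-CoT cold-start SFT example might look like. Neither lab publishes its exact data schema, so the field names, tags, and chat format below are illustrative assumptions only.

```python
# Illustrative only: a hypothetical long-CoT cold-start SFT example.
# Both pipelines seed RL with data of roughly this shape: a prompt, a
# readable reasoning trace, and a clearly separated final answer.

cold_start_example = {
    "prompt": "What is the sum of the first 10 positive integers?",
    # Reasoning trace kept readable and in a single language, addressing
    # the readability / language-mixing issues noted above.
    "response": (
        "<think>\n"
        "The sum of the first n positive integers is n * (n + 1) / 2.\n"
        "For n = 10: 10 * 11 / 2 = 55.\n"
        "</think>\n"
        "The answer is 55."
    ),
}

def to_chat_messages(example: dict) -> list[dict]:
    """Convert the example into the chat format typically used for SFT."""
    return [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["response"]},
    ]

print(to_chat_messages(cold_start_example))
```

The point is simply that the reasoning trace is readable, single-language, and clearly separated from the final answer before any RL begins.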
DeepSeek's Post-Training Pipeline
Long Chain-of-Thought (CoT) Cold Start: This stage fine-tunes the base model (DeepSeek-V3-Base) on thousands of long-CoT cold-start examples to lay a solid foundation across popular tasks such as mathematics, coding, and STEM problems.
Reasoning-Based Reinforcement Learning (RL): This stage applies large-scale RL on top of the cold-started model. A straightforward template guides the model to first produce a reasoning process and then the final answer. Training uses GRPO (Group Relative Policy Optimization) with a rule-based reward system combining format rewards and accuracy rewards as the feedback signal (a minimal sketch of such rewards follows after this pipeline overview).
Rejection Sampling Stage: This stage uses rejection sampling, where the model generates its own labeled data by sampling many completions from the checkpoint of the previous RL run and keeping only the best ones.
The resulting synthetic data was merged with supervised data from the DeepSeek-V3 pipeline in domains like writing, factual QA, and self-cognition. This step ensured the model could learn from both high-quality reasoning outputs and diverse domain-specific knowledge.
General RL: After fine-tuning with the new data, the model goes through a final RL process across diverse prompts and tasks.
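To make the reasoning-RL stage more concrete, below is a minimal sketch of the kind of rule-based rewards described above (a format reward plus an accuracy reward) and of GRPO's group-relative advantage computation. The tag names, scoring values, and helper functions are assumptions for illustration; DeepSeek describes these components conceptually but does not publish reference code.

```python
import re
import statistics

# Illustrative sketch of rule-based rewards for reasoning RL: a format
# reward (did the model wrap its reasoning in the expected tags and then
# answer?) and an accuracy reward (is the final answer correct?).
# Tag names and reward values are assumptions, not DeepSeek's actual code.

THINK_PATTERN = re.compile(r"<think>(.*?)</think>\s*(.+)", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the 'reasoning first, answer second' template."""
    return 1.0 if THINK_PATTERN.search(completion) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the text after the reasoning block contains the reference answer."""
    match = THINK_PATTERN.search(completion)
    if match is None:
        return 0.0
    return 1.0 if reference_answer in match.group(2).strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: rewards for a group of samples drawn from the
    same prompt, normalized by the group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Four sampled completions for one prompt, scored and normalized.
completions = [
    "<think>10 * 11 / 2 = 55</think> The answer is 55.",
    "<think>Adding 1..10 gives 54</think> The answer is 54.",
    "The answer is 55.",  # right answer, wrong format
    "<think>n(n+1)/2 = 55</think> The answer is 55.",
]
rewards = [format_reward(c) + accuracy_reward(c, "55") for c in completions]
print(rewards)                             # [2.0, 1.0, 0.0, 2.0]
print(group_relative_advantages(rewards))  # higher-reward samples get positive advantage
```

No learned reward model is needed here, which is exactly why this style of reward is often described as simpler than process supervision.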
Qwen 3 Post-Training Pipeline
Qwen 3's post-training is structured as a four-stage pipeline, as outlined in the release blog post (Qwen3: Think Deeper, Act Faster):
Long Chain-of-Thought (CoT) Cold Start: This stage involves fine-tuning on diverse long CoT data, covering mathematics, coding, logical reasoning, and STEM problems. This supervised fine-tuning (SFT) step seeds the model with reasoning capabilities.
Reasoning-Based Reinforcement Learning (RL): Computational resources are scaled up, and rule-based rewards are utilized to enhance exploration and exploitation, focusing on reasoning tasks.
Thinking Mode Fusion: This stage integrates non-thinking capabilities by fine-tuning on a mix of long CoT data and instruction-tuning data generated by the enhanced thinking model from stage 2, balancing deep reasoning with fast, general responses (an inference-time sketch follows after this list).
General RL: Applied across more than 20 general-domain tasks, including instruction following, format following, and agent capabilities, to broaden the model's applicability.
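The practical effect of the thinking-mode fusion stage shows up at inference time: a single checkpoint can either reason step by step or answer directly. The sketch below follows the usage pattern shown in Qwen 3's model cards; the model name and the enable_thinking flag are taken from that documentation, and exact parameters may change across releases.

```python
# Sketch of how fused thinking / non-thinking behaviour is exposed at
# inference time for Qwen 3 checkpoints, following the usage pattern in
# Qwen's model cards (flag names may vary across releases).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")

messages = [{"role": "user", "content": "How many primes are there below 20?"}]

# Thinking mode: the chat template sets the model up to emit a
# <think> ... </think> reasoning block before the final answer.
thinking_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: the same checkpoint is prompted to answer directly.
direct_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

print(thinking_prompt)
print(direct_prompt)
```

This single-checkpoint switch is the user-facing payoff of the fusion stage.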
Post Training Differences
Qwen 3's thinking mode fusion vs. DeepSeek's rejection sampling stage: while the names differ, the technical details presented so far make the two stages look largely alike, with only minor differences. More documentation would be needed to draw a sharper distinction.
Qwen 3 likely integrates a more nuanced reward-modeling strategy, particularly for complex tasks like mathematical reasoning. It employs Process Reward Models (PRMs), which are designed to evaluate and correct intermediate errors in the reasoning process, ensuring both the steps and the final answer are accurate. For specific tasks, Qwen 3 also uses outcome-based rewards, such as an accuracy verifier for math problems and a code-execution server for coding tasks, making its approach highly task-specific. DeepSeek R1, by contrast, adopts a simpler reward system focused on final-answer accuracy and adherence to a structured output format. Overall, Qwen 3's reward modeling appears more sophisticated, emphasizing process supervision rather than relying only on final-output rewards.
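To make the distinction concrete, here is a hypothetical sketch contrasting outcome supervision with process supervision. Real PRMs are learned scoring models (see the Qwen process-supervision citation below); the rule-based stand-ins here only illustrate where the reward signal attaches.

```python
# Hypothetical contrast between outcome-level and process-level rewards.
# Real PRMs are trained neural scorers; these rule-based stand-ins only
# illustrate where the supervision signal attaches.

steps = [
    "The sum of the first n integers is n * (n + 1) / 2.",  # correct step
    "For n = 10 this gives 10 * 11 / 2 = 56.",              # arithmetic slip
    "So the answer is 56.",
]
reference_answer = "55"

def outcome_reward(final_step: str, reference: str) -> float:
    """Outcome supervision: only the final answer is checked."""
    return 1.0 if reference in final_step else 0.0

def process_rewards(solution_steps: list[str]) -> list[float]:
    """Process supervision: every intermediate step gets its own score.
    A placeholder scorer flags the known bad step here; a real PRM would
    predict these scores."""
    return [0.0 if "56" in step else 1.0 for step in solution_steps]

print(outcome_reward(steps[-1], reference_answer))  # 0.0 -- whole trace penalized
print(process_rewards(steps))                       # [1.0, 0.0, 0.0] -- pinpoints the error
```

The outcome reward penalizes the entire trace for one arithmetic slip, while the process reward pinpoints the step where the solution went wrong.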
Conclusion & Implications
LLM post-training has made significant strides, with reasoning as the primary focus. Reinforcement learning appears to be the main technique, used alongside other methods such as SFT and DPO. Multi-stage pipelines that bootstrap reasoning, then generalize it, and finally combine thinking and non-thinking modes have come a long way, and they have also started to look remarkably similar across models.
More transparency in model-training disclosures from other frontier labs such as OpenAI and Anthropic would make this conclusion even clearer.
If one is reading this right, the implications of this near-standardization of post-training pipelines are many.
a) Data, and data curation/selection, rather than the mechanics of pre-training and post-training, is where the real differentiation between these models comes from.
b) Differences in reward-modeling strategy are natural and tend to evolve over time; they are likely to widen further as reward models become more task-specific.
c) The standardization of the pipeline lends itself to adoption further downstream, such as for domain- and task-specific models.
Key Citations
Towards Effective Process Supervision in Mathematical Reasoning (Qwen Blog)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek research paper)
#openai #anthropic #alibaba #deepseek #qwen #posttraining