Qwen 3 and DeepSeek R1's Post-Training Approaches
Introduction
The Qwen 3 series (specifically Qwen3-235B-A22B and Qwen3-32B), recently announced by Alibaba Cloud, and DeepSeek R1, announced by DeepSeek AI in January of this year, have both pushed the boundary of AI reasoning with distinct post-training approaches.
In this post we take a close look at the post-training approaches of these two advanced models, review how they are converging, and note where, if at all, they diverge.
It's important to acknowledge that the detailed technical literature published by both labs has democratized this knowledge and, in turn, made this post possible.
We have come a long way since January of this year in our ability to appreciate the details of the pre-training and post-training stages, which were largely opaque before then.
We focus on the similarities in the frontier-model pipelines, not on the smaller model variants subsequently distilled from them.
Post Training Similarities
Both the DeepSeek R1 and Qwen 3 pipelines follow a multi-stage tuning approach.
Both pipelines use RL heavily, applying it across multiple post-training stages (namely RL tuning for reasoning and general RL tuning).
Both pipelines use a cold start to seed long CoT before the RL tuning.
The long-CoT cold start primarily exists to keep models from struggling with challenges like poor readability and language mixing. In fact, DeepSeek built two models, one with the cold start (DeepSeek-R1) and one without it (DeepSeek-R1-Zero), showing that pure RL can elicit strong reasoning but, without the cold start, suffers from poor readability and language mixing.
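To make the cold-start idea concrete, here is a minimal sketch, in Python, of what a long-CoT cold-start SFT example might look like. Neither lab publishes its exact data schema, so the field names, tags, and chat format below are illustrative assumptions only.

```python
# Illustrative only: a hypothetical long-CoT cold-start SFT example.
# Both pipelines seed RL with data of roughly this shape: a prompt, a
# readable reasoning trace, and a clearly separated final answer.

cold_start_example = {
    "prompt": "What is the sum of the first 10 positive integers?",
    # Reasoning trace kept readable and in a single language, addressing
    # the readability / language-mixing issues noted above.
    "response": (
        "<think>\n"
        "The sum of the first n positive integers is n * (n + 1) / 2.\n"
        "For n = 10: 10 * 11 / 2 = 55.\n"
        "</think>\n"
        "The answer is 55."
    ),
}

def to_chat_messages(example: dict) -> list[dict]:
    """Convert the example into the chat format typically used for SFT."""
    return [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["response"]},
    ]

print(to_chat_messages(cold_start_example))
```

The point is simply that the reasoning trace is readable, single-language, and clearly separated from the final answer before any RL begins.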
DeepSeek's Post-Training Pipeline
Long Chain-of-Thought (CoT) Cold Start: This stage fine-tunes the base model (DeepSeek-V3-Base) on thousands of long-CoT cold-start examples to lay a solid foundation across popular tasks such as mathematics, coding, and STEM problems.
Reasoning-Based Reinforcement Learning (RL): This stage applies large-scale RL on top of the cold-started model. A straightforward template guides the model to first produce a reasoning process and then the final answer. Training uses GRPO (Group Relative Policy Optimization) with a rule-based reward system combining format rewards and accuracy rewards as the feedback signal (a minimal sketch of such rewards follows after this pipeline overview).
Rejection Sampling Stage: This stage uses rejection sampling, where the model generates its own labeled data by sampling many completions from the checkpoint of the previous RL run and keeping only the best ones.
The resulting synthetic data was merged with supervised data from the DeepSeek-V3 pipeline in domains like writing, factual QA, and self-cognition. This step ensured the model could learn from both high-quality reasoning outputs and diverse domain-specific knowledge.
General RL: After fine-tuning with the new data, the model goes through a final RL process across diverse prompts and tasks.
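To make the reasoning-RL stage more concrete, below is a minimal sketch of the kind of rule-based rewards described above (a format reward plus an accuracy reward) and of GRPO's group-relative advantage computation. The tag names, scoring values, and helper functions are assumptions for illustration; DeepSeek describes these components conceptually but does not publish reference code.

```python
import re
import statistics

# Illustrative sketch of rule-based rewards for reasoning RL: a format
# reward (did the model wrap its reasoning in the expected tags and then
# answer?) and an accuracy reward (is the final answer correct?).
# Tag names and reward values are assumptions, not DeepSeek's actual code.

THINK_PATTERN = re.compile(r"<think>(.*?)</think>\s*(.+)", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the 'reasoning first, answer second' template."""
    return 1.0 if THINK_PATTERN.search(completion) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the text after the reasoning block contains the reference answer."""
    match = THINK_PATTERN.search(completion)
    if match is None:
        return 0.0
    return 1.0 if reference_answer in match.group(2).strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: rewards for a group of samples drawn from the
    same prompt, normalized by the group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Four sampled completions for one prompt, scored and normalized.
completions = [
    "<think>10 * 11 / 2 = 55</think> The answer is 55.",
    "<think>Adding 1..10 gives 54</think> The answer is 54.",
    "The answer is 55.",  # right answer, wrong format
    "<think>n(n+1)/2 = 55</think> The answer is 55.",
]
rewards = [format_reward(c) + accuracy_reward(c, "55") for c in completions]
print(rewards)                             # [2.0, 1.0, 0.0, 2.0]
print(group_relative_advantages(rewards))  # higher-reward samples get positive advantage
```

No learned reward model is needed here, which is exactly why this style of reward is often described as simpler than process supervision.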
Qwen 3 Post-Training Pipeline
Qwen 3's post-training is structured as a four-stage pipeline, as outlined in the release blog post (Qwen3: Think Deeper, Act Faster):
Long Chain-of-Thought (CoT) Cold Start: This stage involves fine-tuning on diverse long CoT data, covering mathematics, coding, logical reasoning, and STEM problems. This supervised fine-tuning (SFT) step seeds the model with reasoning capabilities.
Reasoning-Based Reinforcement Learning (RL): Computational resources are scaled up, and rule-based rewards are utilized to enhance exploration and exploitation, focusing on reasoning tasks.
Thinking Mode Fusion: This stage integrates non-thinking capabilities by fine-tuning on a mix of long CoT data and instruction-tuning data generated by the enhanced thinking model from stage 2, balancing deep reasoning with fast, general responses (an inference-time sketch follows after this list).
General RL: Applied across more than 20 general-domain tasks, including instruction following, format following, and agent capabilities, to broaden the model's applicability.
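The practical effect of the thinking-mode fusion stage shows up at inference time: a single checkpoint can either reason step by step or answer directly. The sketch below follows the usage pattern shown in Qwen 3's model cards; the model name and the enable_thinking flag are taken from that documentation, and exact parameters may change across releases.

```python
# Sketch of how fused thinking / non-thinking behaviour is exposed at
# inference time for Qwen 3 checkpoints, following the usage pattern in
# Qwen's model cards (flag names may vary across releases).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")

messages = [{"role": "user", "content": "How many primes are there below 20?"}]

# Thinking mode: the chat template sets the model up to emit a
# <think> ... </think> reasoning block before the final answer.
thinking_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: the same checkpoint is prompted to answer directly.
direct_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

print(thinking_prompt)
print(direct_prompt)
```

This single-checkpoint switch is the user-facing payoff of the fusion stage.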
Post Training Differences
Qwen 3's thinking mode fusion vs. DeepSeek's rejection sampling stage: while the names differ, the technical details presented so far make the two stages look largely alike, with only minor differences. More documentation would be needed to draw a sharper distinction.
Qwen 3 likely integrates a more nuanced reward-modeling strategy, particularly for complex tasks like mathematical reasoning. It employs Process Reward Models (PRMs), which are designed to evaluate and correct intermediate errors in the reasoning process, ensuring both the steps and the final answer are accurate. For specific tasks, Qwen 3 also uses outcome-based rewards, such as an accuracy verifier for math problems and a code-execution server for coding tasks, making its approach highly task-specific. DeepSeek R1, by contrast, adopts a simpler reward system focused on final-answer accuracy and adherence to a structured output format. Overall, Qwen 3's reward modeling appears more sophisticated, emphasizing process supervision rather than relying only on final-output rewards.
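To make the distinction concrete, here is a hypothetical sketch contrasting outcome supervision with process supervision. Real PRMs are learned scoring models (see the Qwen process-supervision citation below); the rule-based stand-ins here only illustrate where the reward signal attaches.

```python
# Hypothetical contrast between outcome-level and process-level rewards.
# Real PRMs are trained neural scorers; these rule-based stand-ins only
# illustrate where the supervision signal attaches.

steps = [
    "The sum of the first n integers is n * (n + 1) / 2.",  # correct step
    "For n = 10 this gives 10 * 11 / 2 = 56.",              # arithmetic slip
    "So the answer is 56.",
]
reference_answer = "55"

def outcome_reward(final_step: str, reference: str) -> float:
    """Outcome supervision: only the final answer is checked."""
    return 1.0 if reference in final_step else 0.0

def process_rewards(solution_steps: list[str]) -> list[float]:
    """Process supervision: every intermediate step gets its own score.
    A placeholder scorer flags the known bad step here; a real PRM would
    predict these scores."""
    return [0.0 if "56" in step else 1.0 for step in solution_steps]

print(outcome_reward(steps[-1], reference_answer))  # 0.0 -- whole trace penalized
print(process_rewards(steps))                       # [1.0, 0.0, 0.0] -- pinpoints the error
```

The outcome reward penalizes the entire trace for one arithmetic slip, while the process reward pinpoints the step where the solution went wrong.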
Conclusion & Implications
LLM post-training has made significant strides, with reasoning as the primary focus. Reinforcement learning appears to be the main technique, used alongside other methods such as SFT and DPO. Multi-stage pipelines that bootstrap reasoning, then generalize it, and finally combine thinking and non-thinking modes have come a long way, and they have also started to look remarkably similar across models.
More transparency in model-training disclosures from other frontier labs such as OpenAI and Anthropic would make this conclusion even clearer.
If one is reading this right, the implications of this near-standardization of post-training pipelines are many.
a) Data, and data curation/selection, rather than the mechanics of pre-training and post-training, is where the real differentiation between these models comes from.
b) Differences in reward-modeling strategy are natural and tend to evolve over time; they are likely to widen further as reward models become more task-specific.
c) The standardization of the pipeline lends itself to adoption further downstream, such as for domain- and task-specific models.
Key Citations
Towards Effective Process Supervision in Mathematical Reasoning (Qwen Blog)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek research paper)
#openai #anthropic #alibaba #deepseek #qwen #posttraining