🤔 Research Problem

Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process.

⚙️ Approach

In this study, we demonstrate that the reasoning ability of LLMs can be significantly improved through large-scale RL, even without supervised fine-tuning (SFT) as a cold start. This ability can be further enhanced with a small amount of cold-start data.

To this end, we present: (1) DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data; (2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned on thousands of long Chain-of-Thought (CoT) examples; and (3) distillation of the reasoning capability from DeepSeek-R1 into small dense models.

🔍 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

DeepSeek-R1-Zero explores the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process.

🫡 GRPO: Group Relative Policy Optimization

(figure: the GRPO objective from the paper; reconstructed below)

📌 In short: clipped relative probability ratio × group-relative advantage, minus a KL-divergence penalty against the reference policy.
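
Written out, this is the GRPO objective as reconstructed from the paper (the figure above): for each question $q$, a group of $G$ outputs $\{o_i\}$ is sampled from the old policy, and each output's advantage is its reward normalized within the group, so no separate value network (critic) is needed:

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}_{\,q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
    \Bigg[ \frac{1}{G} \sum_{i=1}^{G}
      \bigg( \min\Big( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\, A_i,\;
        \mathrm{clip}\Big( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \Big) A_i \Big)
      - \beta\, \mathbb{D}_{\mathrm{KL}}\big( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big) \bigg) \Bigg],
\qquad
A_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}
```

Because the advantage is computed relative to the sampled group rather than by a learned critic, GRPO avoids training a value model of comparable size to the policy.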

Reward Modeling

We adopt a rule-based reward system that mainly consists of two types of rewards (a sketch follows below):

  1. Accuracy rewards: rule-based verification for math (check the final answer against the ground truth); a compiler with test cases for code; …
  2. Format rewards: require the model to put its thinking process between '<think>' and '</think>' tags.

📌 We do not apply outcome- or process-based neural reward models, because we find that (a) a neural reward model may suffer from reward hacking in large-scale reinforcement learning; (b) retraining the reward model requires additional resources; and (c) it complicates the whole training pipeline.
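
A minimal Python sketch of such a rule-based reward, assuming a hypothetical `<think>…</think><answer>…</answer>` response template and exact string match against the ground truth; this is an illustration of the idea, not the paper's actual implementation:

```python
import re

# Assumed response template: thinking inside <think> tags, final answer
# inside <answer> tags (tag names are illustrative assumptions).
TEMPLATE_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    # 1.0 if the completion follows the assumed template, else 0.0.
    return 1.0 if TEMPLATE_RE.fullmatch(completion.strip()) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    # Rule-based verification: extract the final answer and compare it
    # to the ground truth (exact match stands in for a math verifier
    # or a compiler with test cases).
    m = ANSWER_RE.search(completion)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == ground_truth.strip() else 0.0

def rule_based_reward(completion: str, ground_truth: str) -> float:
    # Total reward = accuracy reward + format reward.
    return accuracy_reward(completion, ground_truth) + format_reward(completion)

# Example:
out = "<think>2 + 2 equals 4.</think><answer>4</answer>"
print(rule_based_reward(out, "4"))  # -> 2.0
```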

Self-evolution Process and Aha Moment of DeepSeek-R1-Zero

Self-evolution Process of DeepSeek-R1-Zero

  1. The thinking time shows consistent improvement throughout the whole training process, demonstrating the model's ability to solve complex reasoning tasks by leveraging extended test-time computation.

    (figure: average response length of DeepSeek-R1-Zero during training, increasing steadily with more RL steps)

  2. Sophisticated behaviours emerge as test-time computation increases, such as reflection (the model revisiting and re-evaluating its previous steps) and exploration of alternative approaches.

🤯 Aha Moment of DeepSeek-R1-Zero