Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process.
In this study, we demonstrate that the reasoning ability of LLMs can be significantly improved through large-scale RL, even without using SFT as a cold start, and that this ability can be further enhanced by a small amount of cold-start data.
To this end, we present: 1) DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data; 2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned on thousands of long Chain-of-Thought (CoT) examples; and 3) distillation of the reasoning capability from DeepSeek-R1 into small dense models.

📌 GRPO objective: a constrained (clipped) relative probability ratio between the current and old policy, multiplied by a group-relative advantage computed from relative rewards, minus a KL-divergence penalty against the reference policy.
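As a reference for the shorthand above, here is a sketch of the GRPO objective used in the DeepSeek-R1 paper (group size G, clipping range ε, KL coefficient β); treat this as a paraphrase of the paper's formula rather than a verbatim copy:

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta)
= \mathbb{E}_{q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
\left[
\frac{1}{G} \sum_{i=1}^{G}
\left(
\min\!\left(
\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\;
\operatorname{clip}\!\left(
\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon
\right) A_i
\right)
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
\right)
\right]
$$

$$
A_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
$$

The group-relative advantage $A_i$ is what removes the need for a separate value (critic) model: each sampled output is scored only relative to the other outputs in its group.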
We adopt a rule-based reward system that mainly consists of two types of rewards: accuracy rewards, which check whether the final answer is correct (e.g., matching a boxed math answer or passing test cases for code), and format rewards, which enforce that the model encloses its reasoning in '<think>' and '</think>' tags. A minimal sketch of such a reward function follows the next note.
📌 We do not apply an outcome- or process-based neural reward model, because we find that a) a neural reward model may suffer from reward hacking in large-scale reinforcement learning; b) retraining the reward model requires additional training resources; and c) it complicates the whole training pipeline.
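The sketch below illustrates what a rule-based reward of this kind could look like. The tag scheme, the exact-match answer check, and the equal weighting of the two rewards are assumptions for illustration, not the paper's actual implementation:

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed <think>...</think><answer>...</answer>
    template, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the text inside <answer>...</answer> matches the reference answer
    after simple normalization, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    prediction = match.group(1).strip().lower()
    return 1.0 if prediction == ground_truth.strip().lower() else 0.0

def rule_based_reward(completion: str, ground_truth: str) -> float:
    """Total reward = accuracy reward + format reward (equal weights assumed)."""
    return accuracy_reward(completion, ground_truth) + format_reward(completion)

# Example usage
sample = "<think>2 + 2 = 4</think> <answer>4</answer>"
print(rule_based_reward(sample, "4"))  # -> 2.0
```

Because both components are deterministic rules rather than a learned model, there is nothing for the policy to exploit beyond actually producing correct, well-formatted answers, which is the point of avoiding reward hacking.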
Self-evolution Process of DeepSeek-R1-Zero
The thinking time (average response length) shows consistent improvement throughout the whole training process, demonstrating DeepSeek-R1-Zero's ability to solve complex reasoning tasks by leveraging extended test-time computation.

Sophisticated behaviours emerge as test-time computation increases, such as reflection, where the model revisits and re-evaluates its previous steps, and the exploration of alternative approaches to a problem.
🤯 Aha Moment of DeepSeek-R1-Zero