1. DeepSeek Reward System
DeepSeek scores responses with a rule-based reward system instead of a learned reward model; combined with GRPO's group-relative advantages (described below), this eliminates the need for a separate value network.
Accuracy Rewards
For deterministic problems (Math, Code):
- Math: the final answer must be correct and given in the specified format.
- Code: the program must compile and pass the test cases (e.g., LeetCode-style problems).
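As a rough illustration, an accuracy reward for math can be a pure string check on the extracted final answer. The sketch below is not DeepSeek's actual implementation, and the "Answer:" extraction convention is an assumption:

```python
import re

def accuracy_reward(response: str, gold_answer: str) -> float:
    """Rule-based accuracy reward for a math problem (illustrative sketch)."""
    # Assumed convention: the model states its result after "Answer:".
    match = re.search(r"Answer:\s*(.+)", response)
    if match is None:
        return 0.0  # answer missing or not in the expected format
    predicted = match.group(1).strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0
```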
Format Rewards
Enforces structural consistency:
- Reasoning must be inside <think> tags.
- Rewards for adherence, penalties for violation.
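A format reward can likewise be a simple regular-expression check for the <think> tags. The exact reward and penalty values below are assumptions, since DeepSeek does not publish them:

```python
import re

def format_reward(response: str) -> float:
    """Rule-based format reward (illustrative sketch; reward values are assumed)."""
    # Reward adherence to the <think> reasoning structure, penalize violations.
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        return 1.0
    return -1.0
```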
2. What is GRPO?
Definition
GRPO (Group Relative Policy Optimization) generates a group of responses for each input and evaluates them relative to each other, rather than against an absolute standard.
This stabilizes learning and removes the need for a large, separate critic model (Value Network).
GRPO Process Flow
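At a high level the loop is: sample a group of responses for the same prompt, score each one with the rule-based rewards, normalize the rewards within the group to get advantages, and update the policy with the clipped, KL-regularized objective from Section 4. A minimal sketch of that flow, where the generation, reward, and update callables are placeholders rather than real APIs:

```python
from typing import Callable, List

def grpo_step(
    prompt: str,
    generate_fn: Callable[[str], str],                         # sample one response from the current policy
    reward_fn: Callable[[str, str], float],                    # rule-based reward (accuracy + format)
    update_fn: Callable[[str, List[str], List[float]], None],  # clipped, KL-regularized policy update
    group_size: int = 4,
) -> None:
    """One GRPO iteration for a single prompt (high-level sketch)."""
    # 1. Sample a group of responses for the same prompt.
    responses = [generate_fn(prompt) for _ in range(group_size)]
    # 2. Score each response with the rule-based reward system.
    rewards = [reward_fn(prompt, r) for r in responses]
    # 3. Compute group-relative advantages: (reward - group mean) / group std.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    advantages = [(r - mean) / std for r in rewards]
    # 4. Update the policy using the objective from Section 4.
    update_fn(prompt, responses, advantages)
```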
3. Example Scenario
Problem: "John has 5 apples, gives 2 to Sarah. How many left?"
| Response | Correct? | Reward | Advantage |
|---|---|---|---|
| A: "3 apples left" | Yes | 1.0 | +High |
| B: "2 apples left" | No | 0.0 | -Low |
| C: "5 - 2 = 3" | Yes | 1.0 | +High |
| D: "John has 5" | No | 0.0 | -Low |
*Advantages are calculated relative to the group mean reward, which here is (1.0 + 0.0 + 1.0 + 0.0) / 4 = 0.5.
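Plugging the rewards from the table into the group normalization (GRPO also divides by the group standard deviation, not just the mean) gives the signed advantages; a quick check:

```python
rewards = [1.0, 0.0, 1.0, 0.0]    # responses A, B, C, D from the table

mean = sum(rewards) / len(rewards)                                    # 0.5
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5   # 0.5

advantages = [(r - mean) / std for r in rewards]
print(advantages)   # [1.0, -1.0, 1.0, -1.0] -> A and C are reinforced, B and D are discouraged
```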
4. Mathematical Formulation
GRPO Objective
The loss function extends PPO with a group-based advantage:
\[ L_{\text{GRPO}}(\theta) = \mathbb{E} \left[ \min\left( r_t^g(\theta) A_t, \; \text{clip}\left(r_t^g(\theta), 1 - \epsilon, 1 + \epsilon\right) A_t \right) \right] - \beta \, D_{KL}\left(\pi_\theta \,\|\, \pi_{ref}\right) \]
- \( r_t^g(\theta) \): relative importance ratio (new policy / old policy).
- \( A_t \): advantage computed from the group rewards.
- \( D_{KL} \): KL divergence penalty that prevents the policy from drifting too far from the reference model.
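The objective maps fairly directly to code. The sketch below assumes per-token log-probabilities are already available as PyTorch tensors, and uses a simple KL estimate and placeholder hyperparameter values (eps and beta are assumptions, not published settings):

```python
import torch

def grpo_loss(
    logp_new: torch.Tensor,   # log-probs of the response tokens under the current policy, shape (T,)
    logp_old: torch.Tensor,   # log-probs under the policy that generated the response, shape (T,)
    logp_ref: torch.Tensor,   # log-probs under the frozen reference model, shape (T,)
    advantage: float,         # group-relative advantage for this response
    eps: float = 0.2,         # clipping range (assumed value)
    beta: float = 0.04,       # KL penalty coefficient (assumed value)
) -> torch.Tensor:
    """Clipped GRPO objective for a single response (illustrative sketch)."""
    # Importance ratio r_t = pi_theta / pi_old, computed in log space.
    ratio = torch.exp(logp_new - logp_old)
    # Clipped surrogate: min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t).
    surrogate = torch.minimum(ratio * advantage,
                              torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage)
    # Simple per-token KL estimate against the reference policy.
    kl = logp_new - logp_ref
    # Negate so that minimizing this loss maximizes the GRPO objective.
    return -(surrogate - beta * kl).mean()
```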