🧠

DeepSeek & GRPO

Group Relative Policy Optimization and Rule-Based Reward Systems.

🏆

1. DeepSeek Reward System

DeepSeek avoids training a separate neural reward model for its reasoning tasks by using a rule-based reward system (the critic/value network itself is removed by GRPO, covered in the next section).

Accuracy Rewards

For deterministic problems (Math, Code):

  • Math: Correct final answer in specified format.
  • Code: Passes compiler and test cases (e.g., LeetCode).

Format Rewards

Enforces structural consistency:

  • Reasoning must be inside <think> tags.
  • Rewards for adherence, penalties for violation.
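
A minimal sketch of what such rule-based checks might look like in Python; the `\boxed{}` answer convention, the function names, and the simple sum of the two rewards are illustrative assumptions, not DeepSeek's actual implementation:

```python
import re

def accuracy_reward(response: str, reference_answer: str) -> float:
    """Reward 1.0 if the final answer (here assumed to be inside \\boxed{...})
    matches the reference answer exactly, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match and match.group(1).strip() == reference_answer.strip():
        return 1.0
    return 0.0

def format_reward(response: str) -> float:
    """Reward adherence to the required structure: reasoning wrapped in <think> tags."""
    has_think = re.search(r"<think>.*?</think>", response, flags=re.DOTALL) is not None
    return 1.0 if has_think else 0.0

# Illustrative combination: total reward as a simple sum of both rule-based checks.
response = "<think>5 - 2 = 3</think> The answer is \\boxed{3}"
total = accuracy_reward(response, "3") + format_reward(response)  # -> 2.0
```
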
👥

2. What is GRPO?

Definition

GRPO generates a group of responses for each input and evaluates them relative to each other, rather than against an absolute standard.

This stabilizes learning and removes the need for a large, separate critic model (Value Network).
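
Concretely, for a group of \( K \) responses with rewards \( r_1, \dots, r_K \), each response's advantage is its reward measured against the group's own statistics (DeepSeek normalizes by the group standard deviation as well; the simplified example in Section 3 uses the mean alone):

\[ A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_K)}{\operatorname{std}(r_1, \dots, r_K)} \]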

GRPO Process Flow

```mermaid
graph TD
    Input("Input Prompt") --> Gen("Generate Group of K Responses")
    Gen --> R1("Response 1")
    Gen --> R2("Response 2")
    Gen --> R3("Response 3")
    R1 --> Score1("Score: Reward Model")
    R2 --> Score2("Score: Reward Model")
    R3 --> Score3("Score: Reward Model")
    Score1 --> Avg("Calculate Group Average")
    Score2 --> Avg
    Score3 --> Avg
    Avg --> Adv("Compute Advantage (Score - Avg)")
    Adv --> Update("Update Policy")
    style Gen fill:#e0e7ff,stroke:#6366f1
    style Adv fill:#f3e8ff,stroke:#a855f7
```
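
The same flow as a small Python sketch; `generate` and `reward` are assumed callables (e.g., a sampler and the rule-based checks above), and this illustrates only the scoring step, not a full training loop:

```python
import statistics

def grpo_advantages(prompt, generate, reward, k=4):
    """Sample a group of k responses for one prompt, score each one,
    and return (response, advantage) pairs with advantage = score - group mean."""
    responses = [generate(prompt) for _ in range(k)]   # group of K responses
    scores = [reward(r) for r in responses]            # rule-based or model-based scores
    group_mean = statistics.mean(scores)               # group average acts as the baseline
    # Optionally divide by statistics.stdev(scores) for the normalized variant.
    return [(r, s - group_mean) for r, s in zip(responses, scores)]
```
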
🍎

3. Example Scenario

Problem: "John has 5 apples, gives 2 to Sarah. How many left?"

| Response | Correct? | Reward | Advantage |
| --- | --- | --- | --- |
| A: "3 apples left" | Yes | 1.0 | +High |
| B: "2 apples left" | No | 0.0 | -Low |
| C: "5 - 2 = 3" | Yes | 1.0 | +High |
| D: "John has 5" | No | 0.0 | -Low |

*Advantages are calculated relative to the group mean (0.5 in this case).
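
Plugging the table into that rule (a quick check, assuming advantage = reward minus group mean, with no standard-deviation normalization):

```python
rewards = {"A": 1.0, "B": 0.0, "C": 1.0, "D": 0.0}
group_mean = sum(rewards.values()) / len(rewards)             # (1 + 0 + 1 + 0) / 4 = 0.5
advantages = {k: r - group_mean for k, r in rewards.items()}
# {'A': 0.5, 'B': -0.5, 'C': 0.5, 'D': -0.5}
# Correct responses are pushed up (+0.5); incorrect ones are pushed down (-0.5).
```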

📐

4. Mathematical Formulation

GRPO Objective

The GRPO objective extends PPO's clipped surrogate with a group-based advantage and an explicit KL penalty:

\[ L_{\text{GRPO}}(\theta) = \mathbb{E} \left[ \min\left( r_t^g(\theta) A_t, \text{clip}(r_t^g(\theta), 1 - \epsilon, 1 + \epsilon) A_t \right) \right] - \beta D_{KL}(\pi_\theta || \pi_{ref}) \]

\( r_t^g(\theta) \): Relative importance ratio (New Policy / Old Policy).

\( A_t \): Advantage computed from the group rewards.

\( D_{KL} \): KL Divergence penalty to prevent drifting too far from the reference model.
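
A minimal PyTorch-style sketch of the clipped surrogate plus KL penalty, returned as a loss to minimize; the tensor shapes, the broadcast of the group advantage over tokens, and the simple KL estimate are simplifying assumptions rather than the exact DeepSeek implementation:

```python
import torch

def grpo_loss(new_logprobs, old_logprobs, ref_logprobs, advantages,
              eps=0.2, beta=0.04):
    """new/old/ref_logprobs: per-token log-probs under the current, old, and
    reference policies; advantages: group-relative advantages broadcast per token."""
    ratio = torch.exp(new_logprobs - old_logprobs)              # r_t^g(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()            # clipped objective term
    kl = (new_logprobs - ref_logprobs).mean()                   # crude KL(pi_theta || pi_ref) estimate
    return -(surrogate - beta * kl)                             # negate so it can be minimized
```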

📝

5. Test Your Knowledge

1. Does DeepSeek use a separate Value Network (Critic)?

2. What are the two main types of rewards in DeepSeek?

3. How is the "Advantage" calculated in GRPO?

4. What is the purpose of the KL Divergence term?

5. GRPO is an extension of which algorithm?
