Welcome to Module 08. This one is about knowledge distillation — one of the most practical techniques in applied deep learning. The idea is simple and powerful, and in Week 6 you're going to implement it yourself. Right now, we're just going to build the intuition so that when we get there, you already understand why it works.
Here's the situation. You've got a big model — maybe something with 395 million parameters — that's really good at the task. It's accurate. But it's expensive to run, it's slow, and maybe you can't afford to serve it in production. Then you've got a smaller model — 149 million parameters — that's fast and cheap but not as accurate. The question that distillation answers is: can the small model learn something from the big model that it couldn't learn just from the raw data? And the answer is yes, it absolutely can. The trick is in how we present the training signal.
Let's make this concrete. You have a customer complaint in your training set, and the ground truth label says "billing dispute." That's a hard label. It's a one-hot vector: 1.0 for billing dispute, 0.0 for everything else. Now, you take that same complaint and run it through your big teacher model. The teacher doesn't just say "billing dispute." It says "70% billing dispute, 20% payment issue, 5% account management, and a little bit of everything else." That's a soft label: a probability distribution across all classes. And here's the key insight: the soft label carries more information per example than the hard label. The hard label tells you the answer and nothing else. The soft label tells you the teacher's best guess and how plausible it found each of the alternatives.
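If it helps to see the two kinds of label side by side, here's a minimal sketch in plain Python. The class names and percentages are the ones from the complaint example; the helper names `one_hot` and `softmax`, and the way the teacher's logits are constructed, are illustrative assumptions and not part of any course code.

```python
import math

# Class list from the complaint example; "other" stands in for "everything else".
CLASSES = ["billing dispute", "payment issue", "account management", "other"]

def one_hot(label, classes):
    """Hard label: 1.0 for the true class, 0.0 for every other class."""
    return [1.0 if c == label else 0.0 for c in classes]

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

hard_label = one_hot("billing dispute", CLASSES)

# Hypothetical teacher logits, chosen so the softmax reproduces the
# 70% / 20% / 5% / 5% split from the example. A real teacher would
# produce these from its final layer.
teacher_logits = [math.log(p) for p in (0.70, 0.20, 0.05, 0.05)]
soft_label = softmax(teacher_logits)

for name, hard, soft in zip(CLASSES, hard_label, soft_label):
    print(f"{name:20s} hard={hard:.2f} soft={soft:.2f}")
```

Notice that the hard label has a single nonzero entry, while the soft label assigns something to every class. That extra column of numbers is what the student gets to learn from.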
Why do soft labels help? Because they encode relationships between classes that hard labels completely throw away. When the teacher says "70% billing dispute and 20% payment issue," it's telling you those two categories are related: complaints about billing disputes often look similar to complaints about payment issues. The hard label says billing dispute is right and everything else is equally wrong. That's a lot of information to discard. Geoff Hinton, who popularized this idea, called it "dark knowledge": information that's hidden in the teacher's near-misses, in the probabilities it assigns to the wrong classes. When you train the student on soft labels, it picks up on this structure. It learns not just what the right answer is, but which wrong answers are close, which categories are similar, and how the label space is organized. That's a richer training signal, and it usually leads to a better student model.
There's one more ingredient, and that's temperature. Temperature is a number T that you divide the model's logits by before applying softmax. When you compute softmax at temperature 1, the default, the output probabilities tend to be peaky. The model is confident: say, 85% billing dispute, and everything else looks like noise. But the interesting information is in those small probabilities. To bring them out, we raise the temperature. At temperature 2, the distribution spreads out. At temperature 5, it spreads out even more. You can see the class relationships more clearly. The teacher is saying "look, billing dispute and payment issue really are similar, and even account management has something in common with them." That signal is still there at temperature 1, but the probabilities are so concentrated that it barely registers in the loss. In Week 6, you'll experiment with different temperatures and see how they affect the student's learning. For now, just remember: higher temperature means softer probabilities, which makes the signal about class relationships easier for the student to pick up.
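Here's a small sketch of what raising the temperature does, again in plain Python. The logits are made-up numbers for a four-class version of the complaint example, and `softmax_with_temperature` is a hypothetical helper name; the only thing taken as given is the standard definition, which divides each logit by T before the softmax.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature. temperature=1.0 is the ordinary softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for: billing dispute, payment issue,
# account management, other. Illustrative values only.
teacher_logits = [4.0, 2.0, 1.0, 0.0]

probs_by_temperature = {
    t: softmax_with_temperature(teacher_logits, t) for t in (1.0, 2.0, 5.0)
}

for t, probs in probs_by_temperature.items():
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

When you run it, the top class keeps its lead at every temperature, but its share shrinks as T goes up and the smaller classes become easier to tell apart. The ranking of the classes never changes; only the contrast does.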
Here's how this connects to the course. In Week 6, you're going to do this for real. You'll take a teacher model — one with 395 million parameters — and run it across your entire training set. You'll collect its soft predictions for every example. Then you'll take the student model you've been training all semester — 149 million parameters — and train it on a blend of the hard labels from the dataset and the soft labels from the teacher. You'll compare that distilled student against the same model trained only on hard labels, and you'll measure the difference. The result, if everything goes well, is a student model that performs closer to the teacher without being any bigger or slower. That's the power of distillation: you're transferring knowledge from one model to another through the training signal.
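To give you a feel for what "a blend of the hard labels and the soft labels" can look like as a loss, here's one common formulation, sketched for a single example in plain Python. Treat it as an assumption about the shape of the Week 6 loss, not the assignment's actual code: the function name `distillation_loss`, the mixing weight `alpha`, the default values, and the logits are all illustrative. The soft term is scaled by T squared, a convention from the original distillation paper that keeps its gradients comparable in size to the hard term as T changes.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Blend of hard-label cross-entropy and soft-label KL divergence.

    alpha weights the hard-label term, (1 - alpha) weights the soft-label term.
    Both alpha and temperature are hypothetical hyperparameters here.
    """
    # Hard part: ordinary cross-entropy against the dataset label (T = 1).
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[true_index])

    # Soft part: KL(teacher || student), both softened at the same temperature.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(teacher_soft, student_soft)
        if p > 0.0
    )

    return alpha * hard_loss + (1.0 - alpha) * temperature ** 2 * soft_loss

# Illustrative logits; index 0 is the true class ("billing dispute").
teacher_logits = [4.0, 2.0, 1.0, 0.0]
close_student = [3.5, 2.0, 1.0, 0.2]   # roughly agrees with the teacher
far_student = [0.0, 1.0, 2.0, 4.0]     # disagrees with the teacher and the label

loss_close = distillation_loss(close_student, teacher_logits, true_index=0)
loss_far = distillation_loss(far_student, teacher_logits, true_index=0)
print(f"close student: {loss_close:.4f}")
print(f"far student:   {loss_far:.4f}")
```

The student that agrees with the teacher gets the lower loss. Setting `alpha` to 1.0 gives you the hard-label-only baseline, which is the comparison you'll be running.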
Let me leave you with the core insight. Distillation works because a good teacher model doesn't just know what the right answer is. It also knows how the wrong answers relate to it and to each other. When it says "this is probably billing dispute but it could be payment issue," that's real, useful information. And when you train a smaller model on those soft predictions, that structural knowledge transfers. The student ends up understanding the label space better than it would from hard labels alone. In Week 6, you'll implement this end to end and see the results for yourself. For now, make sure you understand the intuition: soft labels are richer than hard labels, temperature controls how visible the teacher's small probabilities are, and dark knowledge is the key to why distillation works.