Distillation

Tool	What it compresses	What it costs
LLM.int8	Memory at ~equal accuracy	Latency on T4
int4 NF4	Memory + (with right kernel) latency	Some calibration drift on tail
AWQ / GPTQ / FP8	Production-grade compression	Different hardware than you have

`T_d`	Softmax behavior	Information transferred
1	Sharp; close to argmax	Mostly the top-1 class
4	Moderately soft	Top class + relative shape of next several
8	Very soft, near-uniform	Spread across many classes
∞	Uniform	Nothing

Spec	Value
Base model	Qwen3-32B (decoder)
Adaptation	LoRA rank 16 (~3.2M trainable params)
Training data	original train + test split combined (79,278 examples)*
Calibration	post-hoc temperature scaling, `T = 1.25`
Val macro F1	0.322
Val ECE	0.021

Name	Requires	Used here?
Hinton-style KD: student matches full softmax	Logits access	No; closed APIs don't expose full logits
Top-k KD: match top-k logprobs	`logprobs` API param	Possible but slow; no allegation specifically claims it
Hard-label distillation: train on teacher's argmax / sampled output	Just API access	Yes; what every "distillation attack" actually means
Synthetic-data SFT: teacher labels/rewrites unlabeled data	API access	Yes (often blended)
CoT trace harvesting: query for reasoning traces	API + reasoning model	The Anthropic / Gemini-trace allegations


Teacher	Gemini 3.1 Flash Lite (public API)
What crosses the wire	Sampled `(query, tools, answer)` triples — no logits
Student loss	Standard cross-entropy on the triples
Post-training	2B tokens, 45 minutes
Architecture	Encoder-decoder, no FFN, gated residual, INT4 QAT

	Your lab today	Needle (Cactus, May 2026)
Teacher	Qwen3-32B, weights on Hub	Gemini 3.1 Flash Lite, behind API
Target shape	Full softmax over 113 classes	Sampled tool-call output
Loss	KL(teacher ∥ student) + CE on labels, T_d=4, α=0.7	Cross-entropy on sampled output
What transfers	Distribution + argmax	Argmax only
Busbridge prediction	ECE 0.1–0.6%	ECE 22–39%

Comparison	Δ macro F1 (median)	95% CI
Qwen3-32B vs ModernBERT-large	+0.014	[−0.008, +0.045]

Outcome	Magnitude
Tail F1 lift over plain CE	~+0.003 (within noise)
ECE improvement	~−0.024 (real)
What temperature scaling alone would buy	~the same ECE improvement

Setup	Macro F1
Week 1 baseline (train only)	0.209
Week 6 vanilla (train+test)	0.264

Wanting	Verdict	What we measured
#1: scale will fix tail	Cracked	Paired bootstrap CI [−0.008, +0.045]
#2: fancy tricks will fix tail	Cracked	cRT bought calibration, not capacity
#3: distillation will fix tail	The lab measures today	Per-tier F1 + ECE, paired bootstrap

Lift	Magnitude
Week 1 → Week 6 vanilla (data shift)	+0.055
Week 6 vanilla → Week 6 distilled (KD)	+0.015

Metric	What it measures	When it matters
ECE	Top-1 confidence calibration	Confidence-threshold routing (does 80% confidence mean 80% accuracy?)
NLL	Full 113-dim distribution calibration	Ensembling, downstream Bayesian inference, second-class predictions

Threshold T	What you trade	What you need
Lower	More volume auto-routed	Tolerance for confident-wrong errors
Higher	Higher accuracy on auto-routed	Better-calibrated confidence (or you escalate everything)

Scenario	Hard constraint	Primary metric	Likely winning recipe
A — High-throughput batch triage	<10 ms/ex on T4; 10k QPS	Macro F1 + throughput	Vanilla + post-hoc T (fastest, calibrated enough)
B — Regulated escalation review	Calibrated probabilities for human review	ECE + NLL	KD or KD + post-hoc T (full distribution matters)
C — Long-tail rare-class monitoring	Tail-class detection critical	Tail F1 + tail calibration	No recipe likely wins (data ceiling)

Welcome back. Last class of the term. You've now spent five weeks fine-tuning, comparing, diagnosing, and compressing. Today we close the term with the third compression — knowledge compression, distillation. You take a 32B-parameter teacher, and you train a 149M-parameter student to inherit something useful from it. The interesting word in that sentence is "something useful." That's what we're going to spend the next 85 minutes unpacking. By the end of class you will have specific opinions about which property a student inherits, which property it doesn't, and what cheaper alternatives exist for the property you actually want. The lab measures it. The homework defends a deployment recommendation against fixed constraints. And the news cycle, fortunately, is going to give us a free pedagogical hook on the way through.

One week ago you sat in this room and we did three things. We looked at what quantization actually does numerically. We looked at the production stack the field uses in 2026 — AWQ on vLLM and FP8 on H100s, not bitsandbytes on T4. And then you measured six configurations on your own hardware. The take-home was that quantization isn't one technique with one set of trade-offs; it's a family of tools, each best on a specific constraint, on specific hardware. The lesson today rhymes with that one. Distillation is also not one technique — and we'll see that explicitly when we look at what's been in the news.

Here is the sentence I want you to leave with today. Distillation transfers specific properties at specific costs. Cheaper alternatives often transfer the same property — but some don't have a substitute. Read that twice. Most of the time when somebody says "we distilled GPT-4," what they mean is "we trained on GPT-4's outputs." That works for some properties. It does not work for others. By the end of class you'll know precisely which is which, and you'll know what test you'd run to find out. This week's lab and homework both turn on this distinction. So does next week's news cycle, which is the convenient fact we'll exploit in Act 2.

Rhythm of the day. Lecture first — we'll build the math, look at what distillation is actually doing under the hood, then take a 5-minute detour through the news cycle, then close with what you'll measure. Lab is 80 minutes, predict-then-observe rhythm, three silent bugs in a colleague's KD loss for you to find. Homework picks up after class — and this week it's a design exercise, not an analysis exercise. You pick one of three deployment scenarios, build a candidate recipe shortlist, test two specific claims from the literature against your own numbers, defend your choice with measurements. The memo is five sections, weighted toward the deployment defense — and Prompt 5 is the capstone synthesis of the term, integrating every measurement axis from Weeks 1 through 6 into one defended deployment recommendation.

Quick orientation on the proper nouns. Distillation has flavors. Hinton-style is what we'll spend most of today on — student matches teacher's full softmax. Hard-label, output-only, Alpaca-style — those are all the same thing under different names: train the student on the teacher's sampled outputs, no soft target. We'll see in Act 2 why the distinction matters and why the news cycle calls all of these things "distillation" without specifying. The recipe today is concrete. The teacher is Qwen3-32B, fine-tuned with LoRA, frozen base. The student is ModernBERT-base, the same one you've been training since Week 1. Distillation temperature T_d = 4, α = 0.7 — those are the hyperparameters we'll defend or sweep in homework. Mechanism diagnostics — ECE, NLL, JS divergence — these are how you'll diagnose what got transferred and what didn't.

This is the term in one diagram. You've been chasing macro F1 on a 113-class long-tail problem since Week 1. You've tried scaling — bigger encoder, then a decoder, then a bigger decoder. You've tried fancy tricks — class weighting, kitchen sink recipes, two-phase training. Each time you wanted the thing to work, and each time we ran a careful enough measurement to see what it actually bought you. Wanting #1 was scale. The paired bootstrap on Qwen3-32B against ModernBERT-large with training data held constant produced a confidence interval that includes zero. No statistically significant scale advantage at this dataset's tail length. Wanting #2 was fancy tricks. Class-weighted training in particular. We measured it and found that it improved calibration but didn't lift tail F1 — a useful effect on the wrong axis. Wanting #3 is distillation. We arrive there honestly today: we've eliminated the easy answers, and now we're going to measure the harder one.

Section 1. Same playbook as Week 5. Before we run anything, look at what's happening mechanically. The KD loss has two parts, a temperature parameter, and a specific math direction that turns out to matter for reading the literature. We'll spend about 25 minutes here, then take 5 minutes on what's been in the news, then move to the term-arc framing.

Distillation is a relationship between two models. The teacher knows things; the student wants to learn them. The teacher is too expensive to deploy at scale; the student is cheap. The simplest thing you can do is feed the same inputs to the teacher and the student, take the teacher's predictions as labels, and train the student via standard cross entropy on those labels. That works. It's also the recipe most companies actually use because it requires only an API to the teacher, not the teacher's logits. We'll come back to it in Act 2. But Hinton, in 2015, observed that this throws away most of what the teacher knows. The teacher's argmax is just one number. The teacher's full softmax over the vocabulary, or in our case over the 113 classes, is a much richer signal. The relative probabilities — the fact that the teacher gives 12% to class 53 and 7% to class 89, even when class 47 is the answer — those numbers contain information about what classes are similar to each other, and that information is missing from a one-hot label.

Look at the diagram on the right. Left panel: the hard label. Class 47 is the truth, so the target distribution puts probability 1 on class 47 and 0 everywhere else. Right panel: the soft target, the teacher's softmax. Class 47 still has the highest mass, but it's only 55%. The remaining 45% is spread across the other 112 classes in a specific shape — 12% on class 12, 18% on class 53, and so on. That spread is what Hinton called the dark knowledge. It's not noise. It encodes the teacher's beliefs about which classes are similar to which. A student trained against the soft target sees that structure on every example. A student trained against the hard label sees a 1 and zeros. The mathematical detail to know — and which causes confusion when reading papers — is that softmax cross-entropy produces gradients on every logit, both with the hard target and with the soft target. The difference is the *target shape*, not gradient support. Don't let anyone sell you the "CE has gradient only on the true class" version. It's wrong. Both losses pull every logit; they pull toward different things.
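If you want to check the "every logit gets a gradient" claim yourself, here is a minimal sketch in plain Python. It uses the standard identity that the gradient of softmax cross-entropy with respect to the logits is softmax(logits) minus the target. The four-class logits and the soft target are made-up numbers for illustration, not lab data.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_logit_gradient(logits, target):
    """Gradient of cross-entropy w.r.t. the logits: softmax(logits) - target."""
    probs = softmax(logits)
    return [p - t for p, t in zip(probs, target)]

# Toy 4-class example (hypothetical numbers, not from the lab data).
student_logits = [2.0, 0.5, 0.1, -1.0]
hard_target = [1.0, 0.0, 0.0, 0.0]        # one-hot on class 0
soft_target = [0.55, 0.25, 0.15, 0.05]    # teacher-like soft target

grad_hard = ce_logit_gradient(student_logits, hard_target)
grad_soft = ce_logit_gradient(student_logits, soft_target)
```

Both gradient vectors are nonzero on all four logits. What changes between them is where each logit is being pulled, which is the target-shape point above.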

Here's the loss. Three things to note. First, the direction. KL divergence is asymmetric: KL(P || Q) is not equal to KL(Q || P). We want the student's distribution to match the teacher's, so the teacher is the reference and the student is the approximation: KL(teacher || student). What this objective does in plain terms: the teacher distribution is the target, and the loss penalizes the student for failing to put probability mass where the teacher puts mass. So it trains the student toward the teacher's full probability shape, not just the argmax. PyTorch's `F.kl_div` gives you this direction when you pass log_softmax of the student as the input and softmax of the teacher as the target. The argument order in PyTorch is the most common stumbling block here, and the lab's bug hunt features it. Second, the temperature. Dividing the logits by T_d before taking the softmax makes both distributions softer: closer to uniform when T_d is large, sharper when T_d is small. Soft distributions carry more information about non-top-1 classes than sharp ones. T_d = 4 for us today. Third, the T_d² multiplier on the KL term. When you soften logits by T_d, the gradient of the KL term with respect to the logits shrinks by roughly 1/T_d². Without the multiplier, the KD signal would arrive at the parameters with much smaller magnitude than the CE signal, and the α weighting would be misleading. Hinton put the T² in to neutralize this.
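To make the three pieces concrete, here is a minimal single-example sketch in plain Python rather than PyTorch. One assumption to flag loudly: I am putting α on the KL term and (1 − α) on the CE term. That is a common convention, but check it against the lab notebook before trusting it. The function name `kd_loss` and the toy logits are mine.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def kd_loss(student_logits, teacher_logits, label, t_d=4.0, alpha=0.7):
    """alpha * T_d^2 * KL(teacher || student) + (1 - alpha) * CE(student, label).

    ASSUMPTION: alpha weights the KL term; the lecture does not pin this down.
    """
    log_p_teacher = log_softmax(teacher_logits, t_d)
    log_q_student = log_softmax(student_logits, t_d)
    # KL(teacher || student): the expectation is taken under the teacher.
    kl = sum(math.exp(lp) * (lp - lq)
             for lp, lq in zip(log_p_teacher, log_q_student))
    # The hard-label CE term uses the unsoftened student distribution.
    ce = -log_softmax(student_logits)[label]
    return alpha * (t_d ** 2) * kl + (1.0 - alpha) * ce

# Toy 3-class example (hypothetical logits).
example_loss = kd_loss([0.2, 1.0, 2.5], [0.0, 1.5, 3.0], label=2)
```

A useful sanity check on any implementation: when student and teacher logits are identical, the KL term is exactly zero and only the CE term remains.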

The temperature parameter has a clear interpretation. T_d = 1 and the softmax is unmodified — for a confident teacher, almost all the mass sits on the top class. T_d = 4 and the distribution opens up. The top class is still highest, but the 2nd and 3rd and 4th classes get visible mass too, and the student now has gradient signal pulling it toward those relative magnitudes. T_d = 8 and you're flattening even further. T_d → ∞ and the soft target becomes uniform — every class equally likely, no information. So there's a sweet spot. Too sharp and you're effectively training on hard labels; too soft and you're training on noise. Today's recipe is T_d = 4, α = 0.7. In your homework you'll sweep 3 temperatures against 2 alphas — 6 configurations total — and you'll see whether 4 was a defensible default, or whether something else dominates for your scenario.
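You can watch the flattening happen with a few lines of plain Python. This is a sketch on made-up teacher logits for a five-class toy problem. The value 1000 stands in for "T_d very large" and is not a setting anyone would train with.

```python
import math

def softmax_with_temperature(logits, t_d):
    """Softmax of logits / t_d, computed stably."""
    scaled = [z / t_d for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits from a confident teacher on a 5-class toy problem.
teacher_logits = [6.0, 3.0, 2.0, 0.0, -1.0]

# Probability on the top class at each temperature.
top_prob_by_temperature = {
    t_d: max(softmax_with_temperature(teacher_logits, t_d))
    for t_d in (1.0, 4.0, 8.0, 1000.0)
}
```

The top-class probability falls monotonically as T_d grows and approaches 1/5, the uniform value for five classes.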

The teacher. Qwen3-32B — same family as the Qwen 0.5B/1.5B/3B you compared against the encoder in Week 3, much bigger member. The cross-regime thread of the term lands here. LoRA rank 16, 3.2M trainable parameters. The other 32B stay frozen as they came out of pretraining. So when I say "32B teacher" — 32B of frozen pretraining knowledge plus a small adapter trained on this task. About the data. We trained the teacher on train+test combined. Deliberate experimental frame, not test-set leakage. Val remains the held-out comparison set; the original test split got repurposed as extra training data so we'd have the strongest teacher possible. In production with a true held-out test set, you do not do this. Teacher's val numbers — macro F1 0.322, ECE 0.021. Treat that as the ceiling we expect KD to approach. A small student can sometimes overshoot a teacher (regularization, CE anchoring), but don't bank on it.

Engineering pattern worth knowing. A 32B-parameter model in bf16 needs about 64 GB of VRAM. Your T4 has 16. So we cannot load the teacher into memory on the hardware you have. The trick — and it's a trick the production world uses constantly — is to run inference on the teacher once, save the logits to disk, then read them per-batch during student training. 18.6 MB for all 79,000 examples in fp16. Fits in any cache. The KD loss matches the student's logits on a batch against the corresponding teacher logits from the file. The student literally never has to know that the teacher exists as a model — it sees only an array of soft targets. This pattern lets you distill from arbitrarily large teachers as long as you have one machine that can run inference once.
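Here is a minimal sketch of that cache-then-read pattern using only the Python standard library. The file layout (one row of 113 little-endian fp16 values per example), the file name, and the helper names are my assumptions for illustration. The lab's actual cache format may differ, and in practice you would more likely use a NumPy memmap.

```python
import os
import struct
import tempfile

NUM_CLASSES = 113
ROW_FORMAT = "<%de" % NUM_CLASSES          # 113 little-endian fp16 values
ROW_BYTES = struct.calcsize(ROW_FORMAT)    # 2 bytes per value

def write_teacher_logits(path, logit_rows):
    """Run once, offline: dump every teacher logit row to disk as fp16."""
    with open(path, "wb") as f:
        for row in logit_rows:
            f.write(struct.pack(ROW_FORMAT, *row))

def read_teacher_batch(path, example_indices):
    """Run per batch during student training: seek straight to the rows needed."""
    batch = []
    with open(path, "rb") as f:
        for idx in example_indices:
            f.seek(idx * ROW_BYTES)
            batch.append(list(struct.unpack(ROW_FORMAT, f.read(ROW_BYTES))))
    return batch

# Stand-in for the single offline teacher inference pass (fake logits).
fake_logits = [[float((i + j) % 7) for j in range(NUM_CLASSES)] for i in range(10)]
cache_path = os.path.join(tempfile.mkdtemp(), "teacher_logits.fp16")
write_teacher_logits(cache_path, fake_logits)

# The student training loop only ever touches the file, never the teacher model.
batch = read_teacher_batch(cache_path, [3, 7])
```

The design point is that row `i` lives at byte offset `i * ROW_BYTES`, so a shuffled batch costs a handful of seeks and the teacher never has to be loaded again.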

This is the experimental design. ModernBERT-base, fresh classifier head, the same 149M-parameter model you've been training since Week 1. Two arms — vanilla and distilled. They share everything: the data, the batch size, the learning rate, the number of epochs, the random seed. The only thing that differs is the loss function. Vanilla minimizes cross-entropy against the hard labels. Distilled minimizes the KD loss we just wrote down. Same forward pass, same backward pass, same optimizer steps, same hyperparameters everywhere. So when you compare these two students you are isolating the contribution of the loss function and nothing else. That's a careful experiment. It's also expensive — about 80 minutes of T4 time per arm — so we ran both for you. You'll load val predictions and analyze.

The lab is structured around 4 predict-then-observe moments. Each one has a specific question, a place where you write your prediction, a cell that produces the actual number, and a reveal that you don't open until you've reconciled. This rhythm is the entire pedagogical point of the lab. Anyone can run a notebook end to end and look at numbers. The skill we're training is this — commit to a prediction with reasoning, then reconcile when the data disagrees. The memo grades you on the quality of the reconciliation, not whether you guessed right. People who guess right and write nothing get less credit than people who guess wrong and write down what they learned. So write carefully. Today's first prediction — tail F1 of the teacher — is the foundation of every other claim in the week, so it's worth the extra 30 seconds.

One foreshadow before we move on. By the time the lab ends today, you will have measured two distinct properties of what KD transferred. One I'm calling capacity — the simple argmax property, gets the right answer or doesn't, expressed as F1 or accuracy. Quick terminology note — when I say capacity in this course I mean argmax decision quality, not the standard machine learning usage of capacity meaning representational power or parameter count. We use this loose meaning consistently across the lab, the homework, and the rubric, so when you read "capacity transfer is data-bounded" in the rubric, it means F1 transfer is data-bounded. The other axis I'm calling calibration — the shape of the probability distribution, expressed as ECE for top-one calibration or NLL for the full distribution. The thesis of the week is that these two properties decouple under distillation in a way that matters for engineering. I'm not going to tell you which way; the lab will. Two axes. Two measurements. They behave the same way, or they don't. Find out yourself. The data lands harder if you measure first.
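Since both calibration metrics come up all week, here is a minimal sketch of how each can be computed. I am assuming equal-width confidence bins with ten bins for ECE, which is a common choice but not necessarily the binning the lab notebook uses. The function names are mine.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """Top-1 ECE: sum over bins of (bin weight) * |accuracy - mean confidence|.

    confidences: top-1 probability per example; correct: 1 if top-1 was right, else 0.
    ASSUMPTION: equal-width bins on [0, 1].
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece

def negative_log_likelihood(prob_rows, labels):
    """Mean NLL of the true class; this looks at the full distribution, not only top-1."""
    eps = 1e-12  # guard against log(0)
    return -sum(math.log(max(row[y], eps)) for row, y in zip(prob_rows, labels)) / len(labels)
```

ECE only ever looks at the top-1 confidence, while NLL is sensitive to how much mass lands on the true class wherever it ranks. That is why the two can disagree.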

5-minute detour. We're not going to stay in the news cycle long, but we are going to land one specific technical distinction that's load-bearing for everything else in the term. The cultural moment around distillation in 2025 and 2026 is genuinely confused — sometimes deliberately, sometimes not — about what the word means. By the end of these 5 slides you'll have language to read these stories with precision.

Three documented allegations since 2024, none litigated. DeepSeek vs OpenAI, January 2025: Microsoft alleged proxy-network API exfiltration; OpenAI publicly called it "inappropriate distillation." Anthropic, February 2026: 16M exchanges across 24,000 fraudulent accounts, three Chinese labs named. OpenAI memo to the House Select Committee on the CCP, same month. One framing note before we dig in. The evidence we can evaluate is company statements, press reports, policy memos. It is not peer review, not a court filing, not a reproducible measurement. When I say "documented allegations," I mean documented in press and policy. Not adjudicated. Three questions to take into any "distillation" headline. Logits or only completions? Closed APIs almost never expose logits, so "they distilled" can't mean Hinton's recipe. Behavioral or operational evidence? Behavioral is circumstantial: model self-identifies as ChatGPT. Operational is logged proxy clusters, which is much stronger but lives inside the accusing company. Filed in court? In this whole timeline, nothing has been. ByteDance 2023 came closest: leaked internal docs, OpenAI suspended their account, no lawsuit. That's the cultural shape of the moment. Now the technical question.

This is the taxonomy that almost no news article makes explicit. Hinton-style KD — the thing we did today — needs the teacher's full softmax over the output space. Closed APIs do not expose full logits. So when news says DeepSeek distilled GPT-4, they cannot mean Hinton-style KD; the necessary information isn't on the wire. What they actually mean is one of three other things. Hard-label distillation — train the student on the teacher's sampled output as if it were a ground truth label. Synthetic-data SFT — use the teacher to generate or relabel a training corpus, then train the student on that corpus. CoT trace harvesting — for reasoning models, query for the reasoning trace, train the student to imitate it. All three of these are mathematically standard supervised learning where the labels happen to come from a teacher model. There's no KL divergence, no temperature, no soft targets. The distillation framing in the news collapses these together with Hinton-style KD and the result is a vocabulary that is unhelpfully imprecise.
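A small sketch makes the difference in targets tangible. It compares which of two candidate students each objective prefers. All the probabilities are made up, and I use the teacher's argmax as the hard label for determinism, whereas a real API would typically return a sampled output.

```python
import math

def cross_entropy(target, student_probs):
    """Cross-entropy of the student's distribution against a target distribution."""
    return -sum(t * math.log(q) for t, q in zip(target, student_probs) if t > 0.0)

def hard_label_target(teacher_probs):
    """One-hot on the teacher's top choice (ASSUMPTION: argmax stands in for a sampled output)."""
    top = max(range(len(teacher_probs)), key=teacher_probs.__getitem__)
    return [1.0 if i == top else 0.0 for i in range(len(teacher_probs))]

teacher_probs = [0.55, 0.25, 0.15, 0.05]      # hypothetical teacher softmax
student_matching = [0.55, 0.25, 0.15, 0.05]   # student that copies the teacher's shape
student_peaked = [0.97, 0.01, 0.01, 0.01]     # student confident in the top class only

hard_target = hard_label_target(teacher_probs)

# Soft-target loss for each candidate student.
soft_loss_matching = cross_entropy(teacher_probs, student_matching)
soft_loss_peaked = cross_entropy(teacher_probs, student_peaked)

# Hard-label loss for the same two students.
hard_loss_matching = cross_entropy(hard_target, student_matching)
hard_loss_peaked = cross_entropy(hard_target, student_peaked)
```

Against the soft target, the student that matches the teacher's shape has the lower loss. Against the hard label, the peaked student wins. Same teacher, same students, different target, opposite preference.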

Here's the empirical evidence that grounds the taxonomy. Busbridge et al., ICML 2025, "Distillation Scaling Laws," §E.8. They do a controlled experiment: same student architecture, same training data, same teacher model. The only thing they vary is whether the student sees the teacher's full softmax distribution as the target, or only the teacher's top-1 sampled token. With the full distribution, student ECE comes out at 0.1-0.6%. With top-1 only, student ECE is 22-39%. That is a 50-100× difference in calibration. Same architecture, same data, same compute. The only thing that varies is whether the recipe preserves the teacher's distributional shape or throws it away. So when news says "they distilled GPT-4," even if that is literally true, the resulting student is fundamentally different from the one Hinton's recipe produces, and the difference is measurable with a metric as direct as ECE. This is also what your homework Part 3 Test A is going to measure on our setup, at small scale, using JS divergence per tier.

Connecting back to today. The gap between the news cycle's vocabulary and Hinton's is real, and our homework will let you measure it on this dataset. The lab does Hinton-style KD. You see the teacher's full softmax during training. The homework's Part 3 Test A asks "how distributionally close is the student to the teacher, per tier?" You'll compute Jensen-Shannon divergence between teacher and student softmax, on head, mid, and tail, and you'll see whether distillation closes the gap or leaves a residual. The same property Busbridge measures at LLM scale, just done on our setup. If we had instead trained the student on the teacher's argmax outputs only — the news cycle recipe — JS divergence would barely move; that's the prediction. Whether it holds is your measurement. The point is that "did they really distill?" reduces to an empirical question with a number on it. Stories that assert distillation happened but produce no measurement of distributional fidelity are pre-engineering rhetoric. You can do better.
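For reference, here is a minimal sketch of Jensen-Shannon divergence and a per-tier average, in plain Python with natural logs. The helper `mean_js_by_tier` and its input layout (one tier name per example) are my assumptions, not the homework's actual interface.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded above by ln 2 in nats."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def mean_js_by_tier(teacher_probs, student_probs, tiers):
    """Average JS divergence per tier; tiers[i] names the tier of example i."""
    totals, counts = {}, {}
    for p, q, tier in zip(teacher_probs, student_probs, tiers):
        totals[tier] = totals.get(tier, 0.0) + js_divergence(p, q)
        counts[tier] = counts.get(tier, 0) + 1
    return {tier: totals[tier] / counts[tier] for tier in totals}
```

Unlike KL, JS is symmetric and stays finite even when one distribution puts zero mass where the other does not. That makes it a convenient distance for comparing teacher and student softmaxes.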

Concrete example for the row of the taxonomy that matters most. Cactus Compute released this earlier this month — twenty-six million parameter function-calling model, MIT licensed, weights and dataset-generation code public. You can clone it this afternoon. The recipe is what matters. Cactus calls Gemini 3.1 Flash Lite via the public API and asks it to invent (query, tools, answer) triples. The student trains on those triples with standard cross-entropy. No logits cross the wire. No temperature, no KL. Sampled outputs treated as ground truth. They call this distillation — and by the everyday meaning, it is. The student learned from the teacher. But it's the synthetic-data SFT row of the taxonomy you just saw, not the Hinton-KD row. Two billion tokens of synthesized data, forty-five minutes of post-training, on top of a two-hundred-billion-token pretrain. Architecture is interesting — no FFN anywhere, INT4 QAT, Muon optimizer — but a sidebar today. The point is the recipe, and the recipe is the one the closed-API constraint forces.

Side by side. On the left, what you'll do today. On the right, what Cactus did with Needle. The key difference is the constraint. Your teacher sits in a file on the Hub — the full softmax over a hundred and thirteen classes is yours for the asking. Cactus's teacher sits behind an API — the only thing the API exposes is the sampled output. They literally cannot run Hinton's recipe. Connect this back to Busbridge two slides ago. He measured exactly this difference — same student, same data, vary only the target shape. Full distribution gives ECE in the zero-point-one to zero-point-six percent range. Top-one only gives twenty-two to thirty-nine percent. Fifty to a hundred times worse calibration on the same student. So look at Needle through that lens. The framework predicts that even if its accuracy looks competitive with Gemini, its calibration will be measurably worse. That's a falsifiable claim about a real, current model — and your homework Part 3 Test A is the small-scale version of measuring it. Footnote — the Cactus README claims Needle beats four named small models. No public benchmark supports the claim. The framework you have today predicts what a calibration measurement would find.

Section 3. We close the term arc here. Three compressions you've now seen — label, weight, knowledge. They all ran into the same wall. We name it explicitly, we measure it, and then we set up what to do about it.

Walk left to right. Label compression. Week 4 and Week 5 pipeline reveal. The dataset shipped with 153 raw label values. We collapsed near-duplicates with a merge map down to 130, then dropped 17 classes that had fewer than 5 examples per class, too rare to learn from. Down to 113 canonical classes. That's a compression. It preserves something: a tractable supervised learning problem, a label space the student can actually learn. And it loses something: those 17 classes are gone forever. Weight compression. Last week. Take fp16 weights, map to int8 or int4, save memory, sometimes save latency. It preserves roughly equal accuracy on most tiers. It loses some calibration on the tail at int4. Knowledge compression is today. The big question mark, because we haven't measured it yet. Same shape as the other two. Something is preserved. Something else isn't. The lab tells you which is which.
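The label-compression step is simple enough to sketch in a few lines. The label strings and the merge map below are invented for illustration, and the function name is mine. Only the two-step shape (merge near-duplicates, then drop classes under a minimum count of 5) comes from the pipeline described above.

```python
from collections import Counter

def compress_labels(raw_labels, merge_map, min_count=5):
    """Step 1: collapse near-duplicate labels. Step 2: drop classes that are too rare."""
    merged = [merge_map.get(label, label) for label in raw_labels]
    counts = Counter(merged)
    kept_classes = {label for label, n in counts.items() if n >= min_count}
    kept_examples = [label for label in merged if label in kept_classes]
    return kept_examples, sorted(kept_classes)

# Toy data (hypothetical labels, not the course dataset).
raw_labels = ["refund"] * 4 + ["Refund"] * 2 + ["billing"] * 5 + ["rare"] * 2
merge_map = {"Refund": "refund"}

kept_examples, kept_classes = compress_labels(raw_labels, merge_map)
```

Note what is lost: every example of a dropped class disappears from the training set along with the class itself.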

Wanting #1 was scale. We tested it carefully. We trained Qwen3-32B with LoRA on this exact dataset. We compared it against ModernBERT-large, our biggest encoder, with training data held constant, both on train+test. We ran a paired bootstrap on the validation set, 6,430 examples, 1,000 resamples. The median delta in macro F1 was +0.014: Qwen barely better. The 95% confidence interval was [-0.008, +0.045]. The interval includes zero. No statistically significant scale advantage. So when somebody on Twitter tells you bigger models always win on long-tail data, you can answer that on a 113-class problem with our tail length, holding training data constant, the gap was statistically indistinguishable from zero, and you can quote the confidence interval. Scale wasn't doing the work the field was crediting it with.

Wanting #2 was fancy tricks. The specific trick was class-weighted classifier retraining — cRT — where you weight the loss higher for rare classes during a fine-tuning stage. The hope was that this would give tail classes more gradient signal and lift tail F1. We measured. Tail F1 moved by about +0.003 — within the bootstrap noise, indistinguishable from zero. So no capacity transfer. ECE moved by about -0.024 — a real improvement in calibration, well outside noise. So cRT did do something — it just did it on the wrong axis. And then the punchline. We compared cRT to post-hoc temperature scaling on the same model, and temperature scaling reproduced essentially all of the ECE improvement, for free, at zero training cost. So the engineering takeaway: cRT improved the wrong axis at the wrong cost. The cheaper alternative — fit one parameter post hoc — got us the same calibration improvement.
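"Fit one parameter post hoc" can be sketched in about twenty lines. This version fits the temperature by a log-spaced grid search on validation NLL. That is my simplification for illustration, since implementations usually use a gradient-based optimizer. The toy logits are made up: a model that is about 99% confident but right only three times out of four.

```python
import math

def nll_at_temperature(logit_rows, labels, temperature):
    """Mean negative log-likelihood after dividing every logit by the temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        scaled = [z / temperature for z in logits]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total -= scaled[y] - log_z
    return total / len(labels)

def fit_temperature(logit_rows, labels, low=0.05, high=10.0, steps=200):
    """Grid search on a log scale (ASSUMPTION: a stand-in for a proper optimizer)."""
    best_t, best_nll = None, float("inf")
    for i in range(steps + 1):
        t = low * (high / low) ** (i / steps)
        nll = nll_at_temperature(logit_rows, labels, t)
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t

# Toy overconfident model (hypothetical): logit gap of 5, but only 3 of 4 correct.
val_logits = [[5.0, 0.0] for _ in range(4)]
val_labels = [0, 0, 0, 1]
fitted_t = fit_temperature(val_logits, val_labels)
```

Dividing by a single positive scalar never changes the argmax, so this cannot move F1 in either direction. It only reshapes the confidences, which is why it is a cheap substitute on the calibration axis and no substitute at all on the capacity axis.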

Wanting #3 is what you came in for today. Two interpretations to hold in your head. The hopeful frame — maybe distillation transfers something that escapes the data ceiling. Maybe the teacher's distributional knowledge — what classes look similar, where confusion lives — is a property that can be inherited even when the student doesn't have enough data to learn it directly. That's the optimistic story. The skeptical frame — and this is also worth holding — is that the teacher itself hits the data ceiling on the tail. We know its tail F1 is around 0.20. The teacher saw the same per-class data scarcity that the student sees. What could it possibly transfer that it doesn't have? Both stories are coherent. Only one is correct on this dataset, and the lab will tell you which.

Look at the teacher's per-tier F1. Head — top 20 most frequent classes — F1 0.652. Mid — next 40 classes — 0.450. Tail — the rarest 53 classes — 0.198. Our biggest, most expensive, most carefully calibrated teacher, trained on more data than any other model in this course, gets barely 20% macro F1 on the tail. That's well below the head and mid tiers, and well below what you'd consider shipping. Why? Same reason every other model on this dataset has hit the same wall — there are too few training examples per tail class for any model, at any scale, to learn the boundaries between rare classes reliably. This is the data ceiling, made concrete. It is not a property of the model. It is a property of the data. So when you measure the distilled student's tail F1 in the lab today, hold this in mind — KD should not be expected to systematically transfer tail capacity beyond what the teacher itself has. A small student can sometimes outperform its teacher on a metric — regularization, less overfitting, lucky variance — but a large tail F1 lift from KD on a teacher this weak on tail would be a surprising result that would need careful validation. The lab measures whatever's there.
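Here is a minimal sketch of how per-tier numbers like these can be computed. Two assumptions on my part: tiers are assigned by training-set frequency rank (top 20, next 40, the rest), and a tier's F1 is the unweighted mean of per-class F1 over the classes in that tier. The lab notebook may define either differently.

```python
from collections import Counter

def assign_tiers(train_labels, n_head=20, n_mid=40):
    """Rank classes by training frequency: top n_head, next n_mid, everything else is tail."""
    ranked = [label for label, _ in Counter(train_labels).most_common()]
    tiers = {}
    for rank, label in enumerate(ranked):
        if rank < n_head:
            tiers[label] = "head"
        elif rank < n_head + n_mid:
            tiers[label] = "mid"
        else:
            tiers[label] = "tail"
    return tiers

def f1_for_class(y_true, y_pred, cls):
    """One-vs-rest F1 for a single class; 0.0 when the class never appears at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1_by_tier(y_true, y_pred, tiers):
    """Unweighted mean of per-class F1 within each tier."""
    per_tier = {}
    for cls, tier in tiers.items():
        per_tier.setdefault(tier, []).append(f1_for_class(y_true, y_pred, cls))
    return {tier: sum(scores) / len(scores) for tier, scores in per_tier.items()}
```

Because the mean is unweighted, a tail class with a handful of validation examples counts as much as any other class in its tier. That is exactly why tail F1 is so noisy.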

While we're talking about what does and doesn't lift the number — here's a finding from the homework's data confound study that's worth flagging now. The Week 1 baseline trained on the train split only — 57,846 examples. Got macro F1 0.209. The Week 6 vanilla — same model, same architecture, same recipe, but trained on train+test combined — 79,278 examples — got 0.264. +0.055 from data composition. No fancy techniques, no scale change, no calibration recipe. Just more representative training data. That number, +0.055, is bigger than what any of our scaling experiments produced and bigger than what cRT produced. We'll come back to it in a couple of slides because it changes how you read the lab's result. The thing that has consistently moved the F1 number on this dataset is data, not method.

The verdict so far. Two of the three wantings are closed. Scale was cracked weeks ago. cRT was cracked too. Both came up against the same wall — the per-class data scarcity at the tail. The third wanting is the one we're testing today. Mechanically distillation is a totally different recipe from scale and from class weighting, so a priori the wall could behave differently. Empirically the lab is going to give you specific numbers — per-tier F1 with paired bootstrap CIs, and per-tier ECE — and you will see whether the wall is something distillation also runs into, or whether it gets around it. Hold the question. Don't peek at the lab numbers in advance. The pedagogical point of today is the measurement, not the conclusion.

This is the chart that reframes the entire week. Three bars. Week 1 baseline at 0.209. Week 6 vanilla — same model, larger training set — at 0.264. Week 6 distilled — same training set, KD on top — at 0.279. The two lifts. +0.055 from seeing more data. +0.015 from also distilling from a 32B teacher. 3.6×. So if the question on the table were "what should we do to lift macro F1?" the empirically defensible answer based on this term's measurements would be — get more data, you idiot. But the question on the table isn't usually that, because in industry "get more data" is often impossible. Annotation budgets, rare events, regulatory constraints, ethical limits — there are real reasons you can't always just collect more labels. So the actual question for your memo is: given a fixed data budget — which is what your scenario will impose — what's the cheapest recipe that transfers the property your deployment depends on? KD is one answer. Temperature scaling is another. Doing nothing is a third. We're now ready to ask which.

Section 4. Quick walkthrough of what the lab actually does, then on to deployment framing. We've spent enough time on theory and term-arc. Time to get specific about today's 80 minutes.

Lab structure. Three acts, 80 minutes total. Act 1: meet the teacher, look at how much per-class data each tier had to learn from, predict the teacher's tail F1, then look at top-3 teacher probabilities on examples from each tier so you see the dark knowledge concretely. Act 2 is the load-bearing one. You implement the KD loss from scratch. Then you hunt 3 silent bugs in a colleague's KD loss. These are realistic, the kind that survive a "loss is decreasing" check and produce wrong students. Then you compare distilled and vanilla, per tier, F1 and ECE both. Act 3 makes it operational. You compute threshold coverage tables, set up a deployment scenario, and write a deployment decision artifact in class. There are 4 predict-then-observe moments distributed across the three acts. The bootstrap cell takes about 15 seconds, far less than I had originally guessed, since it's only a CPU operation.

The bug hunt is the centerpiece of Act 2. Three bugs. None of them crash. All of them silently change what the student learns. Loss still decreases when you train this — that's the trap. So how do you find them? You compare the colleague's function against your own correct implementation, line by line, on the same random batch. The values will disagree, and the disagreement tells you where the bug is. Read carefully. 2 of the 3 bugs are inside the KL term. 1 is in how the two terms are combined. I'll let you find them. The reveal in the lab walks through exactly what each bug does and what it produces in a trained student. The pedagogical point — and this is real industry muscle — is that "loss is decreasing" is not a test. You need value-comparison against a known-good reference. This is exactly what professional ML engineers do when their model is broken.
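Here is a minimal sketch of what value-comparison against a known-good reference can look like. Everything in it is an assumption for illustration: the reference uses the common Hinton-style form (α · T_d² · KL(teacher ∥ student) + (1 − α) · CE), which may not match the lab notebook's exact definition, and the bug in `kd_loss_colleague` is a hypothetical one that I made up, not necessarily one of the three you will hunt. It is plain Python so it runs anywhere; the lab itself would use tensors.

```python
import math
import random

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scaled))
    return [s - log_norm for s in scaled]

def kd_loss_reference(student_logits, teacher_logits, label, t_d=4.0, alpha=0.7):
    """ASSUMED form: alpha * T_d^2 * KL(teacher || student) + (1 - alpha) * CE."""
    log_s = log_softmax(student_logits, t_d)
    log_t = log_softmax(teacher_logits, t_d)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    ce = -log_softmax(student_logits)[label]  # hard-label term at temperature 1
    return alpha * t_d ** 2 * kl + (1.0 - alpha) * ce

def kd_loss_colleague(student_logits, teacher_logits, label, t_d=4.0, alpha=0.7):
    """HYPOTHETICAL buggy copy: the teacher is never softened by t_d."""
    log_s = log_softmax(student_logits, t_d)
    log_t = log_softmax(teacher_logits)  # bug: temperature missing here
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    ce = -log_softmax(student_logits)[label]
    return alpha * t_d ** 2 * kl + (1.0 - alpha) * ce

# Push the same random batch through both functions:
# each item is (student logits, teacher logits, true label).
rng = random.Random(0)
batch = [
    ([rng.gauss(0, 2) for _ in range(5)],
     [rng.gauss(0, 2) for _ in range(5)],
     rng.randrange(5))
    for _ in range(8)
]
ref = sum(kd_loss_reference(s, t, y) for s, t, y in batch) / len(batch)
col = sum(kd_loss_colleague(s, t, y) for s, t, y in batch) / len(batch)
print(f"reference={ref:.4f}  colleague={col:.4f}  abs diff={abs(ref - col):.4f}")
```

Both functions return a finite, non-negative loss that would go down under training, which is why a "loss is decreasing" check cannot tell them apart. Only the side-by-side values on a fixed batch do.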

Quick statistics reminder, because the bootstrap intervals you'll compute today are load-bearing for the memo. A paired bootstrap with 1,000 resamples gives you a confidence interval on each per-tier delta. If the interval excludes zero, the effect is distinguishable from zero at this sample size and you can claim it as a finding. If the interval includes zero, you cannot distinguish the effect from zero. This is not the same as "the effect is zero." It is a specific statement about your data: with this many examples, our resolution is not enough to call the effect. The honest framing in the memo is "the lift is +0.0X with CI [Y, Z], inside the noise floor on this sample size." The dishonest framing, and you will see it in industry constantly, is to look at a small point estimate and assert it's a real effect. Your tail tier has only 210 examples in val. The CI on tail F1 is going to be quite wide. Be honest about that.
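A rough sketch of a paired bootstrap on a macro F1 delta, under stated assumptions: the labels and predictions below are synthetic stand-ins (210 examples, 5 classes, made-up error rates), the interval is a plain percentile interval, and `macro_f1`, `noisy`, and `paired_bootstrap_ci` are illustrative names rather than the lab's functions.

```python
import random
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in labels or predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    classes = set(tp) | set(fp) | set(fn)
    return sum(2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in classes) / len(classes)

def paired_bootstrap_ci(y_true, pred_a, pred_b, n_resamples=1000, seed=0):
    """95% percentile CI on macro_f1(pred_b) - macro_f1(pred_a)."""
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    for _ in range(n_resamples):
        # Resample example indices once and reuse them for both arms:
        # that shared draw is what makes the bootstrap "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        deltas.append(macro_f1(y, b) - macro_f1(y, a))
    deltas.sort()
    return deltas[int(0.025 * n_resamples)], deltas[int(0.975 * n_resamples) - 1]

def noisy(labels, error_rate, seed, n_classes=5):
    """Synthetic predictions: replace a fraction of labels with a random class."""
    r = random.Random(seed)
    return [r.randrange(n_classes) if r.random() < error_rate else y for y in labels]

# Synthetic tail tier: 210 examples, two arms with slightly different error rates.
y_true = [random.Random(1).randrange(5) for _ in range(210)]
y_true = noisy(y_true, 1.0, seed=1)          # fully random labels over 5 classes
vanilla = noisy(y_true, 0.40, seed=2)
distilled = noisy(y_true, 0.37, seed=3)

lo, hi = paired_bootstrap_ci(y_true, vanilla, distilled)
verdict = "excludes zero" if lo > 0 or hi < 0 else "includes zero"
print(f"delta macro F1 95% CI: [{lo:+.3f}, {hi:+.3f}] ({verdict})")
```

Whatever the interval turns out to be on your data, the sentence in the memo is built from `lo`, `hi`, and `verdict`, not from the point estimate alone.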

Practical note. Each of these students takes about 80 minutes to train end-to-end on a T4. Two students, 160 minutes of training, plus tokenization plus evaluation, doesn't fit in your 80-minute lab. So we ran both arms for you, ahead of time, and pushed the val predictions to public Hub repos. You download two npz files, total under 10 MB, and you have everything — raw logits, argmax predictions, ground truth labels, tier assignments — to do all the analysis the lab needs. You're paying for analysis time, not training time. The trade-off is you don't experience the pain of training a student that fails because of a bug. The bug hunt is the pedagogical compensation for that. You debug a colleague's broken function instead of debugging your own.

The Act 3 artifact, at the end of lab. Pick one of three deployment scenarios. Fill in fields with specific values from tables you computed. Defend in writing. There's no single right scenario-to-recipe pairing. There IS a wrong shape of answer — anything that doesn't reference your numbers, anything that picks a recipe without naming the property of the deployment that recipe serves, anything that asserts an opinion without a constraint. The form matters. Threshold names a number. Coverage names a number. Decision rule names the recipe and the threshold together. Constraint that would flip names a specific switch point. This is the shape every deployment defense needs. We're training the form here so the homework memo inherits it. The memo grading rewards this form, not your specific recommendation.

Last act. Calibration was a Week 4 topic. We measured ECE under quantization in Week 5. Today it returns as the second axis of the dichotomy and the load-bearing axis for the deployment defense. We'll spend about 15 minutes here, then close.

ECE vs NLL. We've been computing both since Week 4. They are not the same thing. ECE — expected calibration error — bins predictions by their max probability and asks whether the accuracy in each bin matches the average confidence. So ECE is an aggregate over the top-1 prediction's confidence. It is exactly the right metric if your deployment uses confidence thresholds — auto-route at 0.85, escalate at 0.7, that kind of thing. NLL — negative log likelihood — is the cross-entropy of the model's full softmax against the true labels. It penalizes the model for putting probability mass on wrong classes, not just for being miscalibrated on the top class. So NLL is the right metric if your deployment consumes the full distribution — ensembling, Bayesian downstream steps, ranking the top 3 predictions for human review. Two metrics, two questions, two deployments. The homework's Part 3 Test B will measure both of these and you'll see they sometimes point in different directions.
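To make the difference concrete, here is a small sketch with two made-up models that have identical top-1 confidences and identical top-1 correctness, so their ECE matches, while their NLL differs because they place different mass on the true class when the top-1 is wrong. The 10 equal-width bins are an assumption; use whatever binning the course notebooks use.

```python
import math

def ece(probs, labels, n_bins=10):
    """Expected calibration error on top-1 confidence, equal-width bins."""
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for p, y in zip(probs, labels):
        conf = max(p)
        b = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the last bin
        conf_sums[b] += conf
        hits[b] += int(p.index(conf) == y)
    # |accuracy_b - confidence_b| * (count_b / n) simplifies to |hits_b - conf_sum_b| / n
    return sum(abs(h - c) for h, c in zip(hits, conf_sums)) / len(labels)

def nll(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

labels = [1, 0]
# Both models are wrong on example 0 with confidence 0.6 and right on example 1 at 0.8.
model_a = [[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]  # true class is the runner-up
model_b = [[0.6, 0.1, 0.3], [0.8, 0.1, 0.1]]  # true class is ranked last

for name, probs in (("A", model_a), ("B", model_b)):
    print(f"model {name}: ECE={ece(probs, labels):.3f}  NLL={nll(probs, labels):.3f}")
```

A confidence-threshold policy cannot tell these two models apart. A reviewer who looks at the top 3 predictions can.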

Skeptic's question. KD is a real recipe. It costs 80 minutes of T4 time per student. Post-hoc temperature scaling fits 1 parameter on a calibration set, takes 30 seconds, costs nothing in inference. If KD's main effect is calibration, and Han Guo and colleagues argued exactly this in 2021, why pay for KD when temperature scaling closes the same gap? The honest answer is: it depends on which calibration metric you care about. Top-1 calibration, which ECE measures, is exactly what temperature scaling is engineered to fix. 1 parameter rescales softmax sharpness. NLL, which scores the full distribution, is what temperature scaling cannot fix uniformly, because a single scalar stretches or shrinks every logit gap by the same factor. It can't reorder the non-top-1 classes or reshape them independently of the top-1 confidence. KD can. Your homework's Part 3 Test B sets up the three-way comparison: vanilla raw, vanilla + T, distilled. ECE and NLL won't necessarily move in the same direction. Which recipe wins which metric is the measurement you'll do. The right answer to the skeptic isn't "always pick KD" or "always pick temperature scaling". It's "name the property your deployment consumes, then measure which recipe transfers it."
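A minimal sketch of post-hoc temperature scaling, assuming the usual recipe of picking the one scalar T that minimises NLL on a held-out calibration set. The grid search, the toy logits, and the function names are illustrative assumptions; a real fit would typically use an optimizer on the course's calibration split.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature=1.0):
    """Mean negative log-probability of the true class at a given temperature."""
    return -sum(
        math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)
    ) / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Grid-search the single scalar T that minimises calibration-set NLL."""
    if grid is None:
        grid = [0.5 + 0.05 * i for i in range(71)]  # 0.50 .. 4.00
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

def ranking(probs):
    """Class indices ordered from most to least probable."""
    return sorted(range(len(probs)), key=lambda i: -probs[i])

# Toy overconfident model: always predicts class 0 with a large logit gap,
# but is right only 2 times out of 3.
cal_logits = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
cal_labels = [0, 0, 1]
t_fit = fit_temperature(cal_logits, cal_labels)

row = [3.0, 1.0, 2.0, -1.0]
before, after = softmax(row), softmax(row, t_fit)
print(f"fitted T = {t_fit:.2f}")
print(f"top-1 confidence: {max(before):.3f} -> {max(after):.3f}")
print(f"class ranking unchanged: {ranking(before) == ranking(after)}")
```

The fitted T softens the top-1 confidence, but the ranking of every class stays exactly where the original logits put it. One knob moves the whole distribution together.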

Operational lens. Threshold and coverage. Most production systems that consume model outputs do so behind a confidence gate. If the model says it's 85% sure, take its answer; if less, route the example to a human reviewer. This is everywhere — content moderation, customer support routing, medical pre-screening, document classification. The policy depends entirely on whether the model's confidence is meaningful. Lower the threshold, you keep more volume but have to tolerate more confident-wrong errors. Raise the threshold, accuracy on the auto-routed subset goes up, but you escalate more volume to human reviewers. The whole trade-off only works if confidence is calibrated. A poorly calibrated model has examples that look confident and are wrong. Threshold filtering on those doesn't help — the wrong-and-confident examples slip through and damage the system. Calibration is what makes threshold-based deployment policies work.
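A small sketch of a threshold coverage table, with made-up confidences and correctness flags. The thresholds and the `coverage_table` helper are illustrative assumptions, not the lab's exact cell.

```python
def coverage_table(confidences, correct, thresholds=(0.5, 0.7, 0.85, 0.95)):
    """For each threshold: share of examples auto-routed, and accuracy on that share."""
    rows = []
    n = len(confidences)
    for t in thresholds:
        kept = [c for conf, c in zip(confidences, correct) if conf >= t]
        coverage = len(kept) / n
        accuracy = sum(kept) / len(kept) if kept else None  # None: nothing auto-routed
        rows.append((t, coverage, accuracy))
    return rows

# Made-up top-1 confidences and whether each top-1 prediction was right (1) or wrong (0).
confidences = [0.95, 0.90, 0.88, 0.80, 0.75, 0.60, 0.55, 0.40]
correct = [1, 1, 0, 1, 1, 0, 1, 0]

rows = coverage_table(confidences, correct)
print("threshold  coverage  accuracy")
for t, cov, acc in rows:
    acc_text = "n/a" if acc is None else f"{acc:.3f}"
    print(f"{t:9.2f}  {cov:8.3f}  {acc_text:>8}")
```

In this toy data the prediction at 0.88 is confident and wrong, so raising the threshold from 0.70 to 0.85 cuts coverage without improving accuracy on the auto-routed subset. That is what a miscalibrated region looks like in the table.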

Three scenarios in your homework. Pick one, commit. The point of having three is that they have structurally different metric priorities, and the "right" recipe depends on which one you picked. Scenario A — high-throughput batch triage. Latency budget is tight. Macro F1 matters. Calibrated probabilities matter only loosely. Vanilla + a post-hoc temperature fit might dominate here because it's the cheapest recipe that meets the calibration bar at all and it has the lowest inference cost. Scenario B — regulated escalation review. A human is going to see the model's top-1 confidence and decide whether to escalate. Calibrated probabilities are the deliverable. ECE and NLL both matter. KD or KD + a post-hoc T might dominate because the full distribution shape is what the human reviewer's policy depends on. Scenario C — long-tail rare-class monitoring. Tail F1 is what matters. Spoiler — no recipe will likely win this scenario. The data ceiling holds. The homework tests whether you can defend "this is a data problem, not a recipe problem" — which is itself a defensible engineering position. Pick one. Different scenarios. Different defenses. Different winners.

The wildcard slot. Industry skill, not a measurement. In your recipe shortlist, 4 entries are configs we precomputed for you and you measure. The 5th, the wildcard, is something you propose but don't run. The point is to test whether you can predict from the mechanism what a recipe would do. Examples: T_d = 16, much heavier softening than our most extreme setting. Or α = 0.5, lower KD weight, more hard-label signal. Or distilled + post-hoc temperature scaling on top, which I'd actually predict dominates the distilled-only recipe on ECE without changing macro F1. Or vanilla + label smoothing, which is interesting because label smoothing softens the hard target, making it kind of a poor man's KD. You pick one. You predict where it lands. You justify the prediction from the mechanism, not the data. This is what senior engineers do constantly: predict before measuring. It's also what saves compute. If your prediction says the recipe would be dominated, you don't need to spend 80 minutes of T4 time training it. Your wildcard exercises this muscle.
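If you want a quick way to reason about two of those wildcards from the mechanism, a sketch like the following can help. The teacher logits are made up, and the label-smoothing convention ((1 − ε) · one-hot + ε / K) is one common choice that I am assuming here.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in probs if q > 0.0)

def smoothed_one_hot(label, n_classes, eps=0.1):
    """Label smoothing target: (1 - eps) * one_hot + eps / n_classes."""
    return [(1.0 - eps) * (i == label) + eps / n_classes for i in range(n_classes)]

teacher_logits = [6.0, 4.0, 1.0, -2.0]  # hypothetical teacher output, true class 0
temperatures = (1, 4, 8, 16)
entropies = [entropy(softmax(teacher_logits, t)) for t in temperatures]

for t, h in zip(temperatures, entropies):
    target = softmax(teacher_logits, t)
    print(f"T_d={t:>2}  entropy={h:.3f}  target={[round(q, 3) for q in target]}")
print(f"uniform entropy = {math.log(len(teacher_logits)):.3f}")

smooth = smoothed_one_hot(0, len(teacher_logits))
print(f"label smoothing target = {[round(q, 3) for q in smooth]}")
```

Two things to read off. As T_d grows, the soft target drifts toward uniform, so at T_d = 16 there is less class-specific shape left to transfer. And label smoothing gives every non-true class the same mass, while the teacher's soft target keeps them in a specific order. That ordering is the part label smoothing cannot supply.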

The capstone synthesis. Prompt 5 of your homework memo is the closing exercise of the term. It asks you to recommend a single recipe for a real deployment, defend it across all the axes we've measured this term — accuracy, calibration, latency, memory, tier-level robustness — and acknowledge the cross-week interactions you haven't measured but should. Some of those open questions are listed here. Quantized distilled student — does Week 5's int4 calibration drift survive when applied to a distilled student? We don't know. The recipe combinations multiply faster than the compute budget. Vanilla + temperature scaling + int8 — that's the cheapest possible recipe, and it might actually be the right one for a high-throughput deployment. We didn't measure that combination. Distilled with the 17 dropped classes restored — would the data ceiling shift? Don't know. Prompt 5 acknowledges what you didn't measure, defends what you did, and recommends what you'd actually ship. That's the closing exercise of the term.

Last few slides. Logistics for the lab and the homework, then we close.

Logistics. The lab is in your repo, `week6_lab.ipynb`, predict-then-observe rhythm, 4 prediction cycles, KD loss implementation, bug hunt, per-tier metrics, paired bootstrap, threshold table, deployment artifact at the end. 80 minutes, you'll be done by 5:30. Homework is `week6_homework.ipynb`. ~5 hours including memo. 4 parts — diagnose the teacher, build a recipe shortlist for your scenario, run 2 literature tests, write the memo. Note on deadlines — this is the last class meeting, so the usual "Wednesday morning before next class" rule doesn't apply. The Week 6 memo is due before the final exam. Check Moodle for the exact date. Last note — you don't need a GPU for the homework. Analysis only. Everything is precomputed and you load val predictions. Run on CPU on Kaggle, on Colab, on your laptop, doesn't matter.

Today in one sentence. Distillation transfers what's in the teacher's full distribution — the soft target, the dark knowledge, the relative probabilities across non-true classes. That's its irreplaceable property. Top-1 calibration — what most threshold-based deployments actually consume — is reproducible by post-hoc temperature scaling at zero training cost. Whether that reproduction is partial, complete, or whether it overshoots is the measurement your homework Part 3 will produce. So the right answer to "should I distill" is always: name the property your deployment depends on, then measure which recipe transfers it on your setup. If it's top-1 calibration, fit a temperature first and see if you need more. If it's full-distribution shape, distill. If it's something neither one fixes — like tail capacity at this sample size — your problem is data, not recipe. By the end of today you have measurement-grounded vocabulary for all three. By the end of homework, you have a defended position. That's the term.

The term in retrospect. Six weeks ago you trained your first ModernBERT model and looked at macro F1. Today you can do 4 things most working ML engineers cannot. 1. You can diagnose a model per tier. Most engineers in industry look at aggregate macro F1 and stop there. You know that the aggregate is a single average over classes that hides everything important about the long tail, and you have the tools to break it down. 2. You can defend a number with a confidence interval. You will see in industry papers, blog posts, leaderboards, claims that one model beats another by 0.005 F1, with no measurement of variance. You will know to ask "what's the bootstrap CI on that delta?" because you've computed it. 3. You can separate what your dataset can support from what your method should do. Most engineers conflate these. They blame their loss function for what their data was always going to do. You won't. 4. You can read papers and press releases and ask the second question. What did they actually measure. What's underspecified. What recipe is hiding in the words. That's the discipline. The specific T_d and α you'll forget. The discipline you'll keep.

That's lecture. 5-minute break, then lab. The papers in `readings/week6/` are worth the read at some point — they're the literature anchors for everything we just did. Hinton 2015, the original distillation paper, is short, accessible, and it tells you exactly what dark knowledge means in the original context. Stanton 2021, "Does Knowledge Distillation Really Work?", is the optimization-difficulty story — even with a perfect teacher, the student doesn't fully recover the distribution, and they argue why. Busbridge 2025 is the LLM-scale measurement of the full-distribution vs top-1 distinction — the 50-100× ECE difference that I cited in Act 2. Three papers, three different parts of the picture. If you only read one, read Hinton. If you read two, add Busbridge. If you have time for the third, Stanton. Anything from the term that didn't land for you, ask now or grab me after lab. Otherwise — see you in the lab.

Distillation

ECBS5200 — Week 6

Six weeks ago you trained a model. Today a 32B teacher trains it.

Where we left off

This week's thesis

Today's shape

Vocabulary you'll hear today

The wanting trilogy

Act 1: What distillation actually is

Before we measure anything, look at the operation.

A student matches a teacher's outputs

Two target shapes; same gradient on every logit

The KD loss

Temperature: what T_d actually does

The teacher we use

Why students don't load the teacher

What you'll measure today

Predict, then observe

Two transfer axes — foreshadowed

Act 2: Distillation in the news

You've read about this. Now we get specific.

Reading "distillation" stories with precision

What journalists call "distillation"

Busbridge 2025 — the load-bearing measurement

What you measured today vs the news cycle

Needle, May 2026 — synthetic-data SFT in the open

Your lab vs Needle — same word, two recipes

Act 3: Three compressions, one ceiling

The term in one slide.

Wanting #1: scale (closed earlier this term)

Wanting #2: cRT class weighting (closed)

Wanting #3: distillation (today's question)

Even the teacher hits the data ceiling on the tail

What scale does buy you (small but real)

Where each finding leaves the term

The data confound, made explicit

Act 4: What you'll measure

Predict-then-observe, four times. Then defend.

The lab in one slide

The bug hunt — what to expect

What CIs let you say (and what they don't)

Why we precomputed both arms for you

The §3d artifact: deployment decision

Act 5: The deployment lens

Calibration is the second axis. Match metric to deployment.

Recap: ECE and NLL are different lenses

The skeptic's question

Threshold and coverage — the operational story

Three deployment scenarios in the homework

The wildcard slot

Cross-week reach

Closing

What you do today, and what the term gave you.

Lab and homework — logistics

Today in one sentence

What the term gave you

Thank you

Temperature: what `T_d` actually does