ECBS5200 Week 6

Distillation

ECBS5200 — Week 6

Six weeks ago you trained a model. Today a 32B teacher trains it.

Distillation — Capacity, Calibration, and the Data Ceiling
ECBS5200 Week 6

Where we left off

Week 5 — quantization. Six configurations, five numbers each, one deployment decision.

Tool | What it compresses | What it costs
LLM.int8 | Memory at ~equal accuracy | Latency on T4
int4 NF4 | Memory + (with right kernel) latency | Some calibration drift on tail
AWQ / GPTQ / FP8 | Production-grade compression | Different hardware than you have

The take-home: quantization is a toolbox, not a technique. Match tool to constraint, measure on your hardware.

This week's thesis

Distillation transfers specific properties at specific costs. Cheaper alternatives often transfer the same property — but some don't have a substitute.

The applied ML skill is naming the property you need, picking the cheapest recipe that delivers it, and knowing when no cheaper recipe exists.

Today's shape

Lecture → Lab (80 min) → Homework + memo.

Lab: implement KD loss from scratch, hunt three silent bugs, then compare a vanilla student against a distilled student on the same data. Per-tier F1, per-tier ECE, paired bootstrap CIs.

Homework: pick a deployment scenario, build a recipe shortlist, test two literature claims against your numbers, write a 5-section memo. Prompt 5 of the memo is the capstone synthesis of the term.

Vocabulary you'll hear today

Distillation flavors: soft-target / Hinton-style, hard-label / output-only, synthetic-data SFT, black-box, Stanford Alpaca-style

Loss math: soft target, dark knowledge, temperature T_d, KL divergence on softmax

Today's specific recipe: Qwen3-32B teacher (LoRA-fine-tuned, frozen base) → ModernBERT-base student, T_d = 4, α = 0.7, train+test combined

Mechanism diagnostics: ECE, NLL, JS divergence, per-tier paired bootstrap

The wanting trilogy

This term you've been chasing one number — macro F1 on a 113-class long-tail task. You've tried three things, and you've been wanting each of them to work.

Wanting #1: scale will fix it. Cracked weeks ago.
Wanting #2: fancy tricks will fix it. Cracked too.
Wanting #3: distillation will fix it. Today's question.

Act 1: What distillation actually is

Before we measure anything, look at the operation.

A student matches a teacher's outputs

A teacher is large and slow. A student is small and fast. The student trains to reproduce the teacher's behavior on the training data.

The simple version — train the student on (input, teacher_argmax_label) pairs — works. Sort of. Industry calls this synthetic-data SFT or hard-label distillation.

But Hinton et al. 2015 noticed something better. Use the teacher's full probability distribution as the target. Not just the argmax.

Two target shapes; gradient on every logit either way

Cross-entropy with a hard label points all the gradient toward making the true class confident.

Cross-entropy with the teacher's soft distribution points the student at the teacher's relative probabilities across all 113 classes.

Both losses produce gradients on every logit. The difference is the target shape.
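
A quick numerical check of that claim, as a minimal sketch: for softmax cross-entropy, the gradient with respect to the logits is softmax(logits) minus the target, whatever shape the target has. The logits and the soft target below are made-up illustration values (5 classes instead of 113), not course data, and the helper names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce_grad_wrt_logits(logits, target):
    """Gradient of cross-entropy(target, softmax(logits)) w.r.t. the logits.

    For any target distribution that sums to 1 this is softmax(logits) - target.
    """
    return softmax(logits) - target

# Illustrative values only: 5 classes, class 2 is the true label.
student_logits = np.array([1.0, 0.5, 2.0, -1.0, 0.0])
hard_target = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
soft_target = np.array([0.12, 0.18, 0.55, 0.05, 0.10])  # hypothetical teacher softmax

grad_hard = ce_grad_wrt_logits(student_logits, hard_target)
grad_soft = ce_grad_wrt_logits(student_logits, soft_target)

# Every entry is non-zero in both cases; only the direction of the pull differs.
print("hard-label gradient:", np.round(grad_hard, 3))
print("soft-label gradient:", np.round(grad_soft, 3))
```

With the hard label, every wrong-class logit is pushed down in proportion to the student's own probability for it. With the soft target, each logit is pulled toward the teacher's probability for that class.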

The KD loss

  • t, s — teacher and student logits
  • T_d — distillation temperature (softens both distributions before KL)
  • α — weight on the soft-target term; (1−α) on hard labels
  • T_d² — cancels the 1/T_d² gradient scaling that softening introduces
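
Putting these symbols together, and matching the "Distilled student" expression used later in the deck, the loss can be written as below. The symbols σ (softmax) and y (the hard label) are notation introduced here for the sketch.

```latex
\mathcal{L}_{\mathrm{KD}}
  = \alpha \, T_d^{2} \,
    \mathrm{KL}\!\left( \sigma(t / T_d) \,\big\|\, \sigma(s / T_d) \right)
  + (1 - \alpha) \, \mathrm{CE}\!\left( \sigma(s),\, y \right)
```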

Direction: KL(teacher || student) — teacher is the target, student is the approximation. The loss penalizes the student for failing to place probability mass where the teacher does — training the student toward the teacher's full probability shape, not just the argmax.

Temperature: what T_d actually does

T_d | Softmax behavior | Information transferred
1 | Sharp; close to argmax | Mostly the top-1 class
4 | Moderately soft | Top class + relative shape of next several
8 | Very soft, near-uniform | Spread across many classes
→ ∞ | Uniform | Nothing

Today's recipe: T_d = 4. Homework sweeps T_d ∈ {1, 4, 8} against α ∈ {0.7, 0.9}.
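
To make the table concrete, here is a small sketch that softens one set of logits at several temperatures and reports how much mass stays on the top class. The logits are hypothetical (6 classes for readability), T_d = 1000 stands in for the T_d → ∞ row, and the function names are illustrative.

```python
import numpy as np

def softmax_with_temperature(logits, t_d):
    """Softmax of logits / t_d, computed stably."""
    z = np.asarray(logits, dtype=np.float64) / t_d
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute 0."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Hypothetical teacher logits for a single example.
teacher_logits = np.array([8.0, 5.0, 4.0, 1.0, 0.0, -2.0])

for t_d in (1, 4, 8, 1000):
    p = softmax_with_temperature(teacher_logits, t_d)
    print(f"T_d={t_d:>4}: top-1 mass={p.max():.3f}, entropy={entropy(p):.3f} nats")
```

Top-1 mass falls and entropy rises as T_d grows, approaching the uniform distribution's entropy (log of the class count) at very large T_d.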

The teacher we use

Qwen3-32B + LoRA + temperature scaling. Same architectural family as the Qwen 0.5B / 1.5B / 3B decoders you compared against the encoder in Week 3 — just a much bigger member of the family.

Spec | Value
Base model | Qwen3-32B (decoder)
Adaptation | LoRA rank 16 (~3.2M trainable params)
Training data | original train + test split combined (79,278 examples)*
Calibration | post-hoc temperature scaling, T = 1.25
Val macro F1 | 0.322
Val ECE | 0.021

* "test" here means the original third split we repurposed as additional labeled training data. Val remains the held-out comparison set for this experiment. Don't do this with a true final test set in the wild.

The base model has 32B parameters, all frozen. The trainable LoRA adapter is only ~0.01% of that size.

Why students don't load the teacher

A 32B model needs ~64GB of VRAM in bf16. Your T4 has 16GB.

Solution: precompute the teacher's logits once, host as a public dataset.

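from huggingface_hub import hf_hub_download

# Returns the local cache path of the downloaded file.
logits_path = hf_hub_download(repo_id="earino/ecbs5200-week6-teacher-logits",
                              repo_type="dataset",
                              filename="train_test_logits_qwen3_32b_canonical_final.npz")
# 18.6 MB. fp16. (79,278 × 113) array.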

The KD loss reads from this array per batch. The student sees the teacher's distribution without ever loading the teacher.
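
The per-batch lookup can be sketched as follows. To stay runnable without the Hub, the example writes a small stand-in file first. The array key "logits", the file name, and the helper names are assumptions for illustration; the real .npz may be organized differently.

```python
import os
import tempfile
import numpy as np

NUM_CLASSES = 113

def make_stand_in_file(path, num_examples=1000, seed=0):
    """Write a small fake logits file so the sketch runs offline."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=(num_examples, NUM_CLASSES)).astype(np.float16)
    np.savez(path, logits=logits)  # "logits" is an assumed key name

def load_teacher_logits(path, key="logits"):
    """Load the full teacher-logit array into memory once."""
    with np.load(path) as archive:
        return archive[key]

def teacher_batch(teacher_logits, example_indices):
    """Look up teacher logits by dataset row index and upcast to fp32."""
    return teacher_logits[example_indices].astype(np.float32)

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "stand_in_teacher_logits.npz")
make_stand_in_file(path)

teacher_logits = load_teacher_logits(path)
batch_indices = np.array([3, 17, 256, 999])  # row ids carried along by the data loader
batch = teacher_batch(teacher_logits, batch_indices)
print(teacher_logits.shape, teacher_logits.dtype, batch.shape, batch.dtype)
```

The design point is that each training example has to carry its row index, so a shuffled batch can still be matched to the right teacher rows.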

What you'll measure today

Same student, same data, same hyperparameters, same seed. Only the loss function differs.

  • Vanilla student: CE(student_logits, hard_labels)
  • Distilled student: α · KL(teacher || student) · T_d² + (1−α) · CE(student_logits, hard_labels)

The cleanest possible isolation of "what does distillation specifically transfer."

The lab pre-computes both for you. You analyze.
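
For reference, the two objectives above can be sketched in NumPy. This is an illustration of the formulas on this slide, not the lab's PyTorch reference implementation; the function names and the random stand-in logits are assumptions.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(student_logits, hard_labels):
    """Mean CE between student predictions (at temperature 1) and integer labels."""
    logp = log_softmax(student_logits)
    return float(-logp[np.arange(len(hard_labels)), hard_labels].mean())

def kl_teacher_student(teacher_logits, student_logits, t_d):
    """Batch-mean KL(teacher || student) on temperature-softened distributions."""
    log_p_teacher = log_softmax(teacher_logits / t_d)
    log_p_student = log_softmax(student_logits / t_d)
    p_teacher = np.exp(log_p_teacher)
    return float((p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1).mean())

def vanilla_loss(student_logits, hard_labels):
    """Vanilla arm: plain cross-entropy on hard labels."""
    return cross_entropy(student_logits, hard_labels)

def distilled_loss(student_logits, teacher_logits, hard_labels, t_d=4.0, alpha=0.7):
    """Distilled arm: alpha * T_d^2 * KL(teacher || student) + (1 - alpha) * CE."""
    kl = kl_teacher_student(teacher_logits, student_logits, t_d)
    ce = cross_entropy(student_logits, hard_labels)
    return alpha * (t_d ** 2) * kl + (1.0 - alpha) * ce

# Random stand-in logits for a batch of 8 examples over 113 classes.
rng = np.random.default_rng(0)
student = rng.normal(size=(8, 113))
teacher = rng.normal(size=(8, 113))
labels = rng.integers(0, 113, size=8)

print("vanilla  :", round(vanilla_loss(student, labels), 4))
print("distilled:", round(distilled_loss(student, teacher, labels), 4))
```

Setting alpha to 0 recovers the vanilla loss exactly, which is a handy sanity check on any KD implementation.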

Predict, then observe

Today's lab is built around four predictions:

  1. Tail F1. What macro F1 does a 32B teacher get on the rarest 53 classes?
  2. Distribution shape. Which tier has the most peaked teacher distribution?
  3. Where does KD's F1 lift land? Head, mid, tail, or evenly?
  4. Does ECE follow the same per-tier pattern as F1?

Write your prediction. Run the cell. Reconcile before opening the reveal.

The memo rewards specific corrections, not lucky guesses.

Two transfer axes — foreshadowed

By the end of the lab you'll have measured two different things the distilled student inherited from the teacher.

One is capacity — argmax decision quality on each input (F1, accuracy).*
One is calibration — the probability shape over all 113 classes (ECE, NLL).

The lab will tell you whether they decouple, and on what tiers.

* We're using "capacity" loosely to mean "argmax decision quality," not the standard ML usage of "representational power" (param count). When the rubric says "capacity transfer is data-bounded," it means F1 transfer.

Act 2: Distillation in the news

You've read about this. Now we get specific.

Reading "distillation" stories with precision

Three documented allegations since 2024 — none litigated:
DeepSeek vs OpenAI (Jan 2025) · Anthropic vs DeepSeek/Moonshot/MiniMax (Feb 2026, 16M+ exchanges alleged) · OpenAI memo to Congress (Feb 2026)

Common pattern: API queries → train on completions. Called "distillation" in every headline.

Three questions to ask of any "distillation" allegation:

  1. Logits access, or only API completions? Closed APIs don't expose full logits → not Hinton-style.
  2. Behavioral evidence or operational? "Model self-identifies as ChatGPT" vs "logged proxy clusters."
  3. Filed in court, or press / policy posture? None of these has been adjudicated.

What journalists call "distillation"

Name | Requires | Used here?
Hinton-style KD (student matches full softmax) | Logits access | No: closed APIs don't expose full logits
Top-k KD (match top-k logprobs) | logprobs API param | Possible but slow; no allegation specifically claims it
Hard-label distillation (train on teacher's argmax / sampled output) | Just API access | Yes: what every "distillation attack" actually means
Synthetic-data SFT (teacher labels/rewrites unlabeled data) | API access | Yes (often blended)
CoT trace harvesting (query for reasoning traces) | API + reasoning model | The Anthropic / Gemini-trace allegations

Busbridge 2025 — the load-bearing measurement

Busbridge et al., ICML 2025, "Distillation Scaling Laws," §E.8. Same student model, same data. Vary only soft-vs-hard target:

  • Full-distribution KD: student ECE 0.1–0.6%
  • Top-1 KD: student ECE 22–39%

That's a 50–100× difference in calibration depending on which target shape you used. The full distribution is what carries calibration. Sampled outputs throw it away.

What you measured today vs the news cycle

The news cycle is mostly about hard-label / synthetic-data SFT. Closed APIs force this, regardless of intent.

You did Hinton-style KD today. You had the teacher's full softmax.

Your homework Part 3 Test A measures what this difference buys you on this dataset. JS divergence between teacher and student, per tier. It's the small-scale version of Busbridge's ECE measurement.

The news cycle's unfalsifiable rhetoric becomes a measurement question.
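
A per-tier JS divergence between teacher and student can be sketched as below. The tier labels, random logits, and function names are placeholders for illustration; the homework notebook's actual interface may differ.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    """Row-wise Jensen-Shannon divergence in nats (bounded above by log 2)."""
    m = 0.5 * (p + q)
    kl_pm = (p * (np.log(p + eps) - np.log(m + eps))).sum(axis=-1)
    kl_qm = (q * (np.log(q + eps) - np.log(m + eps))).sum(axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def js_by_tier(teacher_logits, student_logits, tiers):
    """Mean teacher-vs-student JS divergence within each tier."""
    js = js_divergence(softmax(teacher_logits), softmax(student_logits))
    return {str(tier): float(js[tiers == tier].mean()) for tier in np.unique(tiers)}

# Synthetic stand-ins: 200 examples, 113 classes, three tiers.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(200, 113))
student = teacher + rng.normal(scale=0.5, size=(200, 113))
tiers = np.array(["head"] * 100 + ["mid"] * 60 + ["tail"] * 40)

print(js_by_tier(teacher, student, tiers))
```

JS is symmetric and bounded, which makes per-tier values comparable in a way raw KL is not.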

Needle, May 2026 — synthetic-data SFT in the open

Cactus Compute · 26M params · MIT · weights + code public. github.com/cactus-compute/needle

Teacher | Gemini 3.1 Flash Lite (public API)
What crosses the wire | Sampled (query, tools, answer) triples, no logits
Student loss | Standard cross-entropy on the triples
Post-training | 2B tokens, 45 minutes
Architecture | Encoder-decoder, no FFN, gated residual, INT4 QAT

This is the synthetic-data SFT row of the taxonomy table earlier in this act — named, dated, and open.

Your lab vs Needle — same word, two recipes

Aspect | Your lab today | Needle (Cactus, May 2026)
Teacher | Qwen3-32B, weights on Hub | Gemini 3.1 Flash Lite, behind API
Target shape | Full softmax over 113 classes | Sampled tool-call output
Loss | KL(teacher ∥ student) + CE on labels, T_d=4, α=0.7 | Cross-entropy on sampled output
What transfers | Distribution + argmax | Argmax only
Busbridge prediction | ECE 0.1–0.6% | ECE 22–39%

Same word. Different recipes. Measurable, different consequences.

Act 3: Three compressions, one ceiling

The term in one slide.

Wanting #1: scale (closed earlier this term)

We trained Qwen3-32B with LoRA on this dataset to test whether scale would break the long-tail ceiling.

Paired bootstrap on val (n=6,430), training data held constant:

Comparison | Δ macro F1 (median) | 95% CI
Qwen3-32B vs ModernBERT-large | +0.014 | [−0.008, +0.045]

CI includes 0. No statistically significant scale advantage at this dataset's tail length.

Wanting #2: cRT class weighting (closed)

Class-weighted classifier retraining (cRT) — give rare classes higher loss weight.

Outcome | Magnitude
Tail F1 lift over plain CE | ~+0.003 (within noise)
ECE improvement | ~−0.024 (real)
What temperature scaling alone would buy | ~the same ECE improvement

Class weighting bought calibration, not capacity. And post-hoc temperature scaling matched it without the recipe.

Wanting #3: distillation (today's question)

If scale didn't work and fancy losses didn't work, what about training a small student to inherit a big teacher's behavior?

The hopeful frame:

Maybe the long-tail data ceiling that bounds training a model can be sidestepped by transferring a property from one that doesn't have the ceiling.

The skeptical frame:

If the teacher itself hits the ceiling on tail, what could it possibly transfer?

The lab settles it. We measure both arms on the same val set with paired bootstrap CIs.

Even the teacher hits the data ceiling on the tail

Teacher per-tier F1: head 0.652 / mid 0.450 / tail 0.198.

A 32B-parameter, carefully-trained, post-hoc-calibrated teacher gets 0.198 macro F1 on the rarest 53 classes — far below its head and mid tiers, and well below what you'd ship.

Why this matters: KD should not be expected to systematically transfer tail capacity beyond the teacher's own weak tail performance. Any large tail gain from KD would be surprising and would need careful validation.

What more data does buy you (small but real)

Same paired bootstrap, but compare ModernBERT-base trained on train only to the same model trained on train+test:

Setup | Macro F1
Week 1 baseline (train only) | 0.209
Week 6 vanilla (train+test) | 0.264

Δ +0.055 from data composition. Not from scale, not from KD — just from showing the model more representative data.

Where each finding leaves the term

Wanting | Verdict | What we measured
#1: scale will fix tail | Cracked | Paired bootstrap CI [−0.008, +0.045]
#2: fancy tricks will fix tail | Cracked | cRT bought calibration, not capacity
#3: distillation will fix tail | The lab measures today | Per-tier F1 + ECE, paired bootstrap

Pattern across the three: every "easy" intervention has come up against the same wall — too few examples per tail class for any technique to escape.

The data confound, made explicit

Lift | Magnitude
Week 1 → Week 6 vanilla (data shift) | +0.055
Week 6 vanilla → Week 6 distilled (KD) | +0.015

Data shift bought roughly 3.7× as much macro F1 as distillation (0.055 / 0.015, using the rounded values above).

The right question for the memo isn't "did distillation help?" — it's "given a fixed data budget, what cheapest recipe transfers the property your scenario needs?"

Act 4: What you'll measure

Predict-then-observe, four times. Then defend.

The lab in one slide

80 minutes, three acts:

  • Act 1 (~20 min): meet the teacher, predict tail F1, look at probability shape
  • Act 2 (~30 min): implement KD loss, hunt three silent bugs, compare distilled vs vanilla per tier — F1 and ECE
  • Act 3 (~30 min): threshold/coverage analysis + deployment artifact

Four predict-then-observe cycles. One paired bootstrap (~15 sec on T4).

The bug hunt — what to expect

A colleague sends you their KD loss. Loss decreases during training. The student's calibration looks weird.

Three silent bugs. None crash. All three change what the student learns.

def colleague_kd_loss(student_logits, teacher_logits, hard_labels, T_d, alpha):
    """[BUGGY]"""
    kl_term = F.kl_div(
        F.softmax(student_logits / T_d, dim=-1),       # bug?
        F.softmax(teacher_logits / T_d, dim=-1),
        reduction="batchmean",
    )                                                   # bug?
    ce_term = F.cross_entropy(student_logits, hard_labels)
    return alpha * ce_term + (1 - alpha) * kl_term      # bug?

Vague hint: two of the three are in the KL term.

What CIs let you say (and what they don't)

Per-tier paired bootstrap on val (n=6,430 total; head 5,155 / mid 1,065 / tail 210).

A CI that excludes 0 → "the effect is distinguishable from zero at this sample size."
A CI that includes 0 → "we cannot distinguish this effect from zero."

NOT: "the effect is zero." Just: "we don't have the data to call it."

For your memo: "the lift is X with CI [Y, Z], inside the noise floor" is the right form.
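
A minimal sketch of a paired bootstrap on macro F1 follows, run on synthetic predictions. The function names, 10-class toy data, and 500 resamples are illustrative assumptions, not the lab's settings. Macro F1 here averages over classes that appear in the labels or predictions of each resample, which is one of several reasonable conventions.

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Macro F1 over classes present in y_true or y_pred."""
    tp = np.bincount(y_true[y_true == y_pred], minlength=num_classes)
    pred_count = np.bincount(y_pred, minlength=num_classes)   # TP + FP
    true_count = np.bincount(y_true, minlength=num_classes)   # TP + FN
    denom = pred_count + true_count                           # 2TP + FP + FN
    present = denom > 0
    return float((2.0 * tp[present] / denom[present]).mean())

def paired_bootstrap_delta(y_true, pred_a, pred_b, num_classes, n_boot=500, seed=0):
    """Median and 95% percentile CI of macro_f1(pred_b) - macro_f1(pred_a).

    "Paired" means both arms are scored on the same resampled rows each round.
    """
    rng = np.random.default_rng(seed)
    n = len(y_true)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        deltas[b] = (macro_f1(y_true[idx], pred_b[idx], num_classes)
                     - macro_f1(y_true[idx], pred_a[idx], num_classes))
    lo, med, hi = np.percentile(deltas, [2.5, 50, 97.5])
    return float(med), (float(lo), float(hi))

def noisy_predictions(y_true, keep_prob, num_classes, rng):
    """Copy y_true, replacing a random subset with random class ids."""
    out = y_true.copy()
    replace = rng.random(len(y_true)) > keep_prob
    out[replace] = rng.integers(0, num_classes, size=int(replace.sum()))
    return out

rng = np.random.default_rng(1)
y_true = rng.integers(0, 10, size=2000)
pred_a = noisy_predictions(y_true, 0.6, 10, rng)   # weaker arm
pred_b = noisy_predictions(y_true, 0.7, 10, rng)   # stronger arm

median, (ci_lo, ci_hi) = paired_bootstrap_delta(y_true, pred_a, pred_b, num_classes=10)
print(f"delta macro F1 = {median:+.3f}, 95% CI [{ci_lo:+.3f}, {ci_hi:+.3f}]")
```

Resampling the same rows for both arms removes the shared example-level noise, so the interval is tighter than one built from two independent bootstraps.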

Why we precomputed both arms for you

Training one student on T4: ~80 minutes.

Two arms × 80 min = 160 min. Doesn't fit in 80-min lab.

So you load val predictions, not weights. The Hub repos give you fp16 logits + argmax preds + tier assignments per example. Instant.

You're paying for analysis time, not training time.

The §3d artifact: deployment decision

By the end of class: pick one of three scenarios. Fill in every field with a specific value from a table you computed. Defend in writing.

The artifact has three structural rules:

  1. Every threshold and coverage value comes from your own §3c table
  2. The recipe choice references at least one number from §3a, §3b, or §3c
  3. The defense names one specific constraint that would flip your decision

The form of the answer matters more than the specific recipe.

Act 5: The deployment lens

Calibration is the second axis. Match metric to deployment.

Recap: ECE and NLL are different lenses

Metric | What it measures | When it matters
ECE | Top-1 confidence calibration | Confidence-threshold routing (does 80% confidence mean 80% accuracy?)
NLL | Full 113-dim distribution calibration | Ensembling, downstream Bayesian inference, second-class predictions

One number cannot summarize calibration. Different deployments care about different parts of the distribution.
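
The two metrics can be sketched as below, with a toy case where they disagree. The 15 equal-width bins are an assumed convention (the lab's bin count is not stated here), and the 3-class probabilities are made up for illustration.

```python
import numpy as np

def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true class."""
    return float(-np.log(probs[np.arange(len(labels)), labels] + eps).mean())

def ece(probs, labels, n_bins=15):
    """Expected calibration error on top-1 confidence, equal-width bins."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(conf, edges[1:-1], right=True)  # values in 0..n_bins-1
    total = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # weight by bin size, compare accuracy with mean confidence
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(total)

# Two toy models with identical top-1 confidence (0.8) and accuracy (4 of 5),
# but different mass on the runner-up classes.
labels = np.array([0, 0, 0, 0, 1])
probs_a = np.tile([0.80, 0.15, 0.05], (5, 1))  # some mass on the class that is sometimes true
probs_b = np.tile([0.80, 0.05, 0.15], (5, 1))  # same top-1, mass on the wrong runner-up

print("ECE a/b:", round(ece(probs_a, labels), 4), round(ece(probs_b, labels), 4))
print("NLL a/b:", round(nll(probs_a, labels), 4), round(nll(probs_b, labels), 4))
```

Both toy models get the same ECE because ECE only reads the top-1 confidence. NLL separates them because it scores the probability placed on the true class, including when that class is not the argmax.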

The skeptic's question

If KD's main effect is calibration, and post-hoc temperature scaling fits a single parameter for free, why bother with KD?

This is a real question. Guo et al. 2021 (RepL4NLP) argue that KD is "essentially a calibration regularizer" and that temperature scaling reproduces most of what KD buys.

Your homework Part 3 Test B tests this on this dataset. And the answer turns out to be metric-specific.
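
For context on what "fits a single parameter" means, here is a sketch of post-hoc temperature scaling using a grid search on validation NLL. The synthetic logits are built to be overconfident by a factor of 2, so the fitted temperature should land near 2. The grid range and function names are assumptions; in practice the scalar is often fit with a gradient-based optimizer instead.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def nll_at_temperature(logits, labels, temperature):
    """Mean NLL after dividing the logits by a scalar temperature."""
    logp = log_softmax(logits / temperature)
    return float(-logp[np.arange(len(labels)), labels].mean())

def fit_temperature(val_logits, val_labels, grid=None):
    """Pick the scalar T that minimizes validation NLL (simple grid search)."""
    if grid is None:
        grid = np.linspace(0.25, 5.0, 96)  # step of 0.05
    losses = [nll_at_temperature(val_logits, val_labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic data: labels are sampled from softmax(true_logits), but the "model"
# reports logits twice as large, i.e. it is overconfident.
rng = np.random.default_rng(0)
true_logits = rng.normal(scale=1.5, size=(5000, 10))
p = np.exp(log_softmax(true_logits))
u = rng.random((len(true_logits), 1))
labels = np.minimum((np.cumsum(p, axis=1) < u).sum(axis=1), p.shape[1] - 1)
overconfident_logits = 2.0 * true_logits

t_fit = fit_temperature(overconfident_logits, labels)
print("fitted T:", round(t_fit, 2))
print("NLL before:", round(nll_at_temperature(overconfident_logits, labels, 1.0), 4))
print("NLL after :", round(nll_at_temperature(overconfident_logits, labels, t_fit), 4))
```

Dividing by a positive scalar never changes the argmax, so this recipe can move ECE and NLL but leaves F1 exactly where it was.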

Threshold and coverage — the operational story

A common deployment pattern: auto-route only when confidence ≥ T. Below T, escalate to human review.

The policy works only if confidence is meaningful.

Threshold T | What you trade | What you need
Lower | More volume auto-routed | Tolerance for confident-wrong errors
Higher | Higher accuracy on auto-routed | Better-calibrated confidence (or you escalate everything)

A poorly-calibrated model has examples that look confident and are wrong. Threshold filtering doesn't help.
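
The threshold and coverage trade-off can be sketched with a small table builder. The two synthetic models below are illustrations only: one whose confidence equals its chance of being right, and one whose confidence carries no information about correctness. Names and numbers are assumptions, not lab outputs.

```python
import numpy as np

def coverage_table(confidence, correct, thresholds):
    """For each threshold: fraction auto-routed and accuracy on that fraction."""
    rows = []
    for t in thresholds:
        routed = confidence >= t
        coverage = float(routed.mean())
        accuracy = float(correct[routed].mean()) if routed.any() else float("nan")
        rows.append((float(t), coverage, accuracy))
    return rows

rng = np.random.default_rng(0)
n = 20000
confidence = rng.uniform(0.3, 1.0, size=n)

# Calibrated: P(correct) equals the stated confidence.
correct_calibrated = rng.random(n) < confidence
# Uninformative: same overall accuracy (~0.65), unrelated to confidence.
correct_uninformative = rng.random(n) < 0.65

thresholds = [0.5, 0.7, 0.9]
for name, correct in [("calibrated", correct_calibrated),
                      ("uninformative", correct_uninformative)]:
    for t, cov, acc in coverage_table(confidence, correct, thresholds):
        print(f"{name:>13}  T={t:.1f}  coverage={cov:.2f}  accuracy={acc:.3f}")
```

Raising the threshold cuts coverage for both models, but only the calibrated one gains accuracy on what it still auto-routes.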

Three deployment scenarios in the homework

Scenario | Hard constraint | Primary metric | Likely winning recipe
A: High-throughput batch triage | <10 ms/ex on T4; 10k QPS | Macro F1 + throughput | Vanilla + post-hoc T (fastest, calibrated enough)
B: Regulated escalation review | Calibrated probabilities for human review | ECE + NLL | KD or KD + post-hoc T (full distribution matters)
C: Long-tail rare-class monitoring | Tail-class detection critical | Tail F1 + tail calibration | No recipe likely wins (data ceiling)

The homework forces you to pick one and commit. Different scenarios pick different recipes.

The wildcard slot

In Part 2 of the homework, you list 4 measurable recipes — and one wildcard you don't run.

The wildcard is a config you propose from first principles and predict where it would land. Examples:

  • T_d = 16 (heavier softening than the grid extreme)
  • α = 0.5 (lower KD weight than 0.7)
  • Distilled + post-hoc temperature scaling
  • Vanilla + label smoothing

You justify the prediction mechanistically. You don't run it.

Cross-week reach

Prompt 5 of your homework memo is the capstone synthesis — it integrates all six weeks into one defended deployment recommendation.

Possible cross-week recipes the homework Prompt 4 asks about:

  • Quantized distilled student — does Week 5's int4 calibration drift survive KD?
  • Vanilla + temperature scaling + int8 — cheap calibration + cheap memory
  • Distilled with merged labels reverted — does adding back the dropped 17 classes change anything?

You only measured one of these. Prompt 5 defends the recipe you'd actually ship.

Closing

What you do today, and what the term gave you.

Lab and homework — logistics

Lab in class today (~80 min).

  • notebooks/week6/week6_lab.ipynb
  • Predict-then-observe rhythm, four cycles
  • Implement KD loss + bug hunt + per-tier comparison
  • §3d deployment artifact due before you leave

Homework after class (~5 hours).

  • notebooks/week6/week6_homework.ipynb
  • 4 parts: diagnose, shortlist, literature tests, memo
  • Memo + HTML upload to Moodle — see Moodle for the exact deadline. The Week 6 memo is due before the final exam.

Today in one sentence

Distillation transfers what's in the teacher's full distribution. The cheaper alternative is post-hoc temperature scaling. Match metric to deployment, and know what each recipe leaves behind.

You will leave today with a measurement-grounded answer to "what does KD actually transfer," vocabulary to read the news cycle on this with precision, and a defended engineering position from this term's work.

What the term gave you

You can now do something most working ML engineers cannot:

  • Diagnose a model per tier, not just aggregate
  • Defend a number with a bootstrap CI, not a point estimate
  • Distinguish what your dataset can support from what your method should do
  • Read a paper or a press release and ask "what did they actually measure?"

You'll forget specific (T_d, α) values. You won't forget the discipline.

Thank you

Lab starts in 5 minutes.

Questions? Anything from the term that didn't land?

If you have time after lab — readings/week6/ has 12 papers. Hinton 2015 is short and beautiful. Stanton 2021 is the optimization-difficulty story. Busbridge 2025 is the LLM-scale measurement of everything we did at small scale.

Distillation — Capacity, Calibration, and the Data Ceiling

Welcome back. Last class of the term. You've now spent five weeks fine-tuning, comparing, diagnosing, and compressing. Today we close the term with the third compression — knowledge compression, distillation. You take a 32B-parameter teacher, and you train a 149M-parameter student to inherit something useful from it. The interesting word in that sentence is "something useful." That's what we're going to spend the next 85 minutes unpacking. By the end of class you will have specific opinions about which property a student inherits, which property it doesn't, and what cheaper alternatives exist for the property you actually want. The lab measures it. The homework defends a deployment recommendation against fixed constraints. And the news cycle, fortunately, is going to give us a free pedagogical hook on the way through.

One week ago you sat in this room and we did three things. We looked at what quantization actually does numerically. We looked at the production stack the field uses in 2026 — AWQ on vLLM and FP8 on H100s, not bitsandbytes on T4. And then you measured six configurations on your own hardware. The take-home was that quantization isn't one technique with one set of trade-offs; it's a family of tools, each best on a specific constraint, on specific hardware. The lesson today rhymes with that one. Distillation is also not one technique — and we'll see that explicitly when we look at what's been in the news.

Here is the sentence I want you to leave with today. Distillation transfers specific properties at specific costs. Cheaper alternatives often transfer the same property — but some don't have a substitute. Read that twice. Most of the time when somebody says "we distilled GPT-4," what they mean is "we trained on GPT-4's outputs." That works for some properties. It does not work for others. By the end of class you'll know precisely which is which, and you'll know what test you'd run to find out. This week's lab and homework both turn on this distinction. So does next week's news cycle, which is the convenient fact we'll exploit in Act 2.

Rhythm of the day. Lecture first — we'll build the math, look at what distillation is actually doing under the hood, then take a 5-minute detour through the news cycle, then close with what you'll measure. Lab is 80 minutes, predict-then-observe rhythm, three silent bugs in a colleague's KD loss for you to find. Homework picks up after class — and this week it's a design exercise, not an analysis exercise. You pick one of three deployment scenarios, build a candidate recipe shortlist, test two specific claims from the literature against your own numbers, defend your choice with measurements. The memo is five sections, weighted toward the deployment defense — and Prompt 5 is the capstone synthesis of the term, integrating every measurement axis from Weeks 1 through 6 into one defended deployment recommendation.

Quick orientation on the proper nouns. Distillation has flavors. Hinton-style is what we'll spend most of today on — student matches teacher's full softmax. Hard-label, output-only, Alpaca-style — those are all the same thing under different names: train the student on the teacher's sampled outputs, no soft target. We'll see in Act 2 why the distinction matters and why the news cycle calls all of these things "distillation" without specifying. The recipe today is concrete. The teacher is Qwen3-32B, fine-tuned with LoRA, frozen base. The student is ModernBERT-base, the same one you've been training since Week 1. Distillation temperature T_d = 4, α = 0.7 — those are the hyperparameters we'll defend or sweep in homework. Mechanism diagnostics — ECE, NLL, JS divergence — these are how you'll diagnose what got transferred and what didn't.

This is the term in one diagram. You've been chasing macro F1 on a 113-class long-tail problem since Week 1. You've tried scaling — bigger encoder, then a decoder, then a bigger decoder. You've tried fancy tricks — class weighting, kitchen sink recipes, two-phase training. Each time you wanted the thing to work, and each time we ran a careful enough measurement to see what it actually bought you. Wanting #1 was scale. The paired bootstrap on Qwen3-32B against ModernBERT-large with training data held constant produced a confidence interval that includes zero. No statistically significant scale advantage at this dataset's tail length. Wanting #2 was fancy tricks. Class-weighted training in particular. We measured it and found that it improved calibration but didn't lift tail F1 — a useful effect on the wrong axis. Wanting #3 is distillation. We arrive there honestly today: we've eliminated the easy answers, and now we're going to measure the harder one.

Act 1. Same playbook as Week 5. Before we run anything, look at what's happening mechanically. The KD loss has two parts, a temperature parameter, and a specific KL direction that turns out to matter for reading the literature. We'll spend about 25 minutes here, then take 5 minutes on what's been in the news, then move to the term-arc framing.

Distillation is a relationship between two models. The teacher knows things; the student wants to learn them. The teacher is too expensive to deploy at scale; the student is cheap. The simplest thing you can do is feed the same inputs to the teacher and the student, take the teacher's predictions as labels, and train the student via standard cross entropy on those labels. That works. It's also the recipe most companies actually use because it requires only an API to the teacher, not the teacher's logits. We'll come back to it in Act 2. But Hinton, in 2015, observed that this throws away most of what the teacher knows. The teacher's argmax is just one number. The teacher's full softmax over the vocabulary, or in our case over the 113 classes, is a much richer signal. The relative probabilities — the fact that the teacher gives 12% to class 53 and 7% to class 89, even when class 47 is the answer — those numbers contain information about what classes are similar to each other, and that information is missing from a one-hot label.

Look at the diagram on the right. Left panel: the hard label. Class 47 is the truth, so the target distribution puts probability 1 on class 47 and 0 everywhere else. Right panel: the soft target, the teacher's softmax. Class 47 still has the highest mass, but it's only 55%. The remaining 45% is spread across the other 112 classes in a specific shape — 12% on class 12, 18% on class 53, and so on. That spread is what Hinton called the dark knowledge. It's not noise. It encodes the teacher's beliefs about which classes are similar to which. A student trained against the soft target sees that structure on every example. A student trained against the hard label sees a 1 and zeros. The mathematical detail to know — and which causes confusion when reading papers — is that softmax cross-entropy produces gradients on every logit, both with the hard target and with the soft target. The difference is the *target shape*, not gradient support. Don't let anyone sell you the "CE has gradient only on the true class" version. It's wrong. Both losses pull every logit; they pull toward different things.

Here's the loss. Three things to note. First, the direction. KL divergence is asymmetric: KL(P || Q) is not equal to KL(Q || P). We want the student's distribution to match the teacher's, so the teacher is the reference and the student is the approximation, which gives KL(teacher || student). In plain terms: the teacher distribution is the target, and the loss penalizes the student for failing to put probability mass where the teacher puts mass. So it trains the student toward the teacher's full probability shape, not just the argmax. PyTorch's `F.kl_div` gives you this direction when you pass log_softmax of the student as the input and softmax of the teacher as the target. The argument order in PyTorch is the most common stumbling block here, and the lab's bug hunt features it. Second, the temperature. Dividing the logits by T_d before taking the softmax makes both distributions softer: closer to uniform when T_d is large, sharper when T_d is small. Soft distributions carry more information about non-top-1 classes than sharp ones. T_d = 4 for us today. Third, the T_d² multiplier on the KL term. When you soften logits by T_d, the gradient of the soft-target term with respect to the student logits shrinks by roughly 1/T_d². Without the multiplier, the KD signal would arrive at the parameters with much smaller magnitude than the CE signal, and the α weighting would be misleading. Hinton put the T_d² in to neutralize this.
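
To see where the 1/T_d² comes from, here is a short derivation sketch following the argument in Hinton et al. 2015. The notation is introduced here for the sketch: p = softmax(t / T_d) for the teacher, q = softmax(s / T_d) for the student, and N is the number of classes. The approximation in the second line assumes T_d is large relative to the logits and that both logit vectors are zero-mean.

```latex
\begin{align*}
\frac{\partial}{\partial s_j}\,\mathrm{KL}\!\left(p \,\|\, q\right)
  &= \frac{1}{T_d}\left(q_j - p_j\right) \\
  &\approx \frac{1}{T_d}\left(\frac{1 + s_j / T_d}{N} - \frac{1 + t_j / T_d}{N}\right)
   = \frac{1}{N\,T_d^{2}}\left(s_j - t_j\right)
\end{align*}
```

Multiplying the KL term by T_d² therefore keeps its gradient on roughly the same scale as the hard-label CE term as T_d changes, which is what lets α mean the same thing across temperatures.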

The temperature parameter has a clear interpretation. T_d = 1 and the softmax is unmodified — for a confident teacher, almost all the mass sits on the top class. T_d = 4 and the distribution opens up. The top class is still highest, but the 2nd and 3rd and 4th classes get visible mass too, and the student now has gradient signal pulling it toward those relative magnitudes. T_d = 8 and you're flattening even further. T_d → ∞ and the soft target becomes uniform — every class equally likely, no information. So there's a sweet spot. Too sharp and you're effectively training on hard labels; too soft and you're training on noise. Today's recipe is T_d = 4, α = 0.7. In your homework you'll sweep 3 temperatures against 2 alphas — 6 configurations total — and you'll see whether 4 was a defensible default, or whether something else dominates for your scenario.
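
To see the softening concretely, here is a small sketch that applies several temperatures to one made-up set of teacher logits. The logits are invented for illustration, and T = 1000 stands in for the T_d → ∞ limit.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a confident teacher over 5 classes.
teacher_logits = [6.0, 3.0, 2.0, 0.0, -1.0]

top_prob = {}
for t in (1.0, 4.0, 8.0, 1000.0):
    probs = softmax_with_temperature(teacher_logits, t)
    top_prob[t] = probs[0]
    print(f"T_d={t:>6}: " + " ".join(f"{p:.3f}" for p in probs))
```

The top-class probability falls as the temperature rises and approaches 1/5 at the largest value, while the ordering of the classes never changes.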

The teacher. Qwen3-32B — same family as the Qwen 0.5B/1.5B/3B you compared against the encoder in Week 3, much bigger member. The cross-regime thread of the term lands here. LoRA rank 16, 3.2M trainable parameters. The other 32B stay frozen as they came out of pretraining. So when I say "32B teacher" — 32B of frozen pretraining knowledge plus a small adapter trained on this task. About the data. We trained the teacher on train+test combined. Deliberate experimental frame, not test-set leakage. Val remains the held-out comparison set; the original test split got repurposed as extra training data so we'd have the strongest teacher possible. In production with a true held-out test set, you do not do this. Teacher's val numbers — macro F1 0.322, ECE 0.021. Treat that as the ceiling we expect KD to approach. A small student can sometimes overshoot a teacher (regularization, CE anchoring), but don't bank on it.

Engineering pattern worth knowing. A 32B-parameter model in bf16 needs about 64 GB of VRAM. Your T4 has 16. So we cannot load the teacher into memory on the hardware you have. The trick — and it's a trick the production world uses constantly — is to run inference on the teacher once, save the logits to disk, then read them per-batch during student training. 18.6 MB for all 79,000 examples in fp16. Fits in any cache. The KD loss matches the student's logits on a batch against the corresponding teacher logits from the file. The student literally never has to know that the teacher exists as a model — it sees only an array of soft targets. This pattern lets you distill from arbitrarily large teachers as long as you have one machine that can run inference once.
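
The caching pattern can be sketched with nothing but the standard library. This is an illustration under stated assumptions: the file layout (one fixed-width fp16 row per example), the file name, and the helper names are hypothetical, and the course's actual cache format may differ.

```python
import os
import struct
import tempfile

NUM_CLASSES = 113                         # class count from the lecture
ROW_FORMAT = f"<{NUM_CLASSES}e"           # little-endian fp16 values
ROW_BYTES = struct.calcsize(ROW_FORMAT)   # 2 bytes per logit

def write_logit_cache(path, rows):
    """Run once, after teacher inference: one fixed-width row per example."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(ROW_FORMAT, *row))

def read_logit_rows(path, indices):
    """Run per batch during student training: seek straight to each example."""
    out = []
    with open(path, "rb") as f:
        for i in indices:
            f.seek(i * ROW_BYTES)
            out.append(list(struct.unpack(ROW_FORMAT, f.read(ROW_BYTES))))
    return out

# Stand-in "teacher logits" for 4 examples (values chosen to be exact in fp16).
rows = [[0.5 * j - i for j in range(NUM_CLASSES)] for i in range(4)]
with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "teacher_logits.fp16")
    write_logit_cache(cache_path, rows)
    batch = read_logit_rows(cache_path, [2, 0])   # a shuffled mini-batch
```

Because every row has the same width, the student's data loader only needs example indices to fetch the matching soft targets. The teacher model is never loaded during student training.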

This is the experimental design. ModernBERT-base, fresh classifier head, the same 149M-parameter model you've been training since Week 1. Two arms — vanilla and distilled. They share everything: the data, the batch size, the learning rate, the number of epochs, the random seed. The only thing that differs is the loss function. Vanilla minimizes cross-entropy against the hard labels. Distilled minimizes the KD loss we just wrote down. Same forward pass, same backward pass, same optimizer steps, same hyperparameters everywhere. So when you compare these two students you are isolating the contribution of the loss function and nothing else. That's a careful experiment. It's also expensive — about 80 minutes of T4 time per arm — so we ran both for you. You'll load val predictions and analyze.

The lab is structured around 4 predict-then-observe moments. Each one has a specific question, a place where you write your prediction, a cell that produces the actual number, and a reveal that you don't open until you've reconciled. This rhythm is the entire pedagogical point of the lab. Anyone can run a notebook end to end and look at numbers. The skill we're training is this — commit to a prediction with reasoning, then reconcile when the data disagrees. The memo grades you on the quality of the reconciliation, not whether you guessed right. People who guess right and write nothing get less credit than people who guess wrong and write down what they learned. So write carefully. Today's first prediction — tail F1 of the teacher — is the foundation of every other claim in the week, so it's worth the extra 30 seconds.

One foreshadow before we move on. By the time the lab ends today, you will have measured two distinct properties of what KD transferred. One I'm calling capacity — the simple argmax property, gets the right answer or doesn't, expressed as F1 or accuracy. Quick terminology note — when I say capacity in this course I mean argmax decision quality, not the standard machine learning usage of capacity meaning representational power or parameter count. We use this loose meaning consistently across the lab, the homework, and the rubric, so when you read "capacity transfer is data-bounded" in the rubric, it means F1 transfer is data-bounded. The other axis I'm calling calibration — the shape of the probability distribution, expressed as ECE for top-one calibration or NLL for the full distribution. The thesis of the week is that these two properties decouple under distillation in a way that matters for engineering. I'm not going to tell you which way; the lab will. Two axes. Two measurements. They behave the same way, or they don't. Find out yourself. The data lands harder if you measure first.

5-minute detour. We're not going to stay in the news cycle long, but we are going to land one specific technical distinction that's load-bearing for everything else in the term. The cultural moment around distillation in 2025 and 2026 is genuinely confused — sometimes deliberately, sometimes not — about what the word means. By the end of these 5 slides you'll have language to read these stories with precision.

Three documented allegations since 2024, none litigated. DeepSeek vs OpenAI, January 2025: Microsoft alleged proxy-network API exfiltration; OpenAI publicly called it "inappropriate distillation." Anthropic, February 2026: 16M exchanges across 24,000 fraudulent accounts, three Chinese labs named. OpenAI memo to the House Select Committee on the CCP, same month. One framing note before we dig in. The evidence we can evaluate is company statements, press reports, and policy memos: not peer review, not a court filing, not a reproducible measurement. When I say "documented allegations," I mean documented in press and policy. Not adjudicated. Three questions to take into any "distillation" headline. Logits or only completions? Closed APIs almost never expose logits, so "they distilled" can't mean Hinton's recipe. Behavioral or operational evidence? Behavioral is circumstantial: model self-identifies as ChatGPT. Operational is logged proxy clusters, which is much stronger but lives inside the accusing company. Filed in court? In this whole timeline, nothing has been. ByteDance 2023 came closest: leaked internal docs, OpenAI suspended their account, no lawsuit. That's the cultural shape of the moment. Now the technical question.

This is the taxonomy that almost no news article makes explicit. Hinton-style KD — the thing we did today — needs the teacher's full softmax over the output space. Closed APIs do not expose full logits. So when news says DeepSeek distilled GPT-4, they cannot mean Hinton-style KD; the necessary information isn't on the wire. What they actually mean is one of three other things. Hard-label distillation — train the student on the teacher's sampled output as if it were a ground truth label. Synthetic-data SFT — use the teacher to generate or relabel a training corpus, then train the student on that corpus. CoT trace harvesting — for reasoning models, query for the reasoning trace, train the student to imitate it. All three of these are mathematically standard supervised learning where the labels happen to come from a teacher model. There's no KL divergence, no temperature, no soft targets. The distillation framing in the news collapses these together with Hinton-style KD and the result is a vocabulary that is unhelpfully imprecise.

Here's the empirical evidence that grounds the taxonomy. Busbridge et al., ICML 2025, "Distillation Scaling Laws," §E.8. They do a controlled experiment: same student architecture, same training data, same teacher model. The only thing they vary is whether the student sees the teacher's full softmax distribution as the target, or only the teacher's top-1 sampled token. With the full distribution, student ECE comes out at 0.1-0.6%. With top-1 only, student ECE is 22-39%. That is a 50-100× difference in calibration. Same architecture, same data, same compute. The only thing that varies is whether the recipe preserves the teacher's distributional shape or throws it away. So when news says "they distilled GPT-4," even if that is literally true, the resulting student is fundamentally different from the one Hinton's recipe produces, and the difference is measurable with something as direct as ECE. This is also what your homework Part 3 Test A is going to measure on our setup, at small scale, using JS divergence per tier.

Connecting back to today. The gap between the news cycle's vocabulary and Hinton's is real, and our homework will let you measure it on this dataset. The lab does Hinton-style KD. You see the teacher's full softmax during training. The homework's Part 3 Test A asks "how distributionally close is the student to the teacher, per tier?" You'll compute Jensen-Shannon divergence between teacher and student softmax, on head, mid, and tail, and you'll see whether distillation closes the gap or leaves a residual. The same property Busbridge measures at LLM scale, just done on our setup. If we had instead trained the student on the teacher's argmax outputs only — the news cycle recipe — JS divergence would barely move; that's the prediction. Whether it holds is your measurement. The point is that "did they really distill?" reduces to an empirical question with a number on it. Stories that assert distillation happened but produce no measurement of distributional fidelity are pre-engineering rhetoric. You can do better.
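
For reference, Jensen-Shannon divergence is short enough to write out. This is a sketch in plain Python; the helper names, the toy distributions, and the per-tier averaging are assumptions for illustration, and the homework notebook may organize the computation differently.

```python
import math
from collections import defaultdict

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric, bounded by ln 2: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_js_by_tier(teacher_probs, student_probs, tiers):
    buckets = defaultdict(list)
    for p, q, tier in zip(teacher_probs, student_probs, tiers):
        buckets[tier].append(js_divergence(p, q))
    return {tier: sum(v) / len(v) for tier, v in buckets.items()}

# Toy distributions (not course data).
teacher = [0.55, 0.18, 0.12, 0.15]
soft_student = [0.50, 0.20, 0.14, 0.16]    # tracks the teacher's shape
sharp_student = [0.97, 0.01, 0.01, 0.01]   # same argmax, shape thrown away

print(js_divergence(teacher, soft_student))
print(js_divergence(teacher, sharp_student))
```

Both students agree with the teacher on the argmax, yet the sharp one sits much farther away in JS divergence. That gap is the distributional fidelity the paragraph above is asking you to put a number on.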

Concrete example for the row of the taxonomy that matters most. Cactus Compute released Needle earlier this month: a twenty-six million parameter function-calling model, MIT licensed, weights and dataset-generation code public. You can clone it this afternoon. The recipe is what matters. Cactus calls Gemini 3.1 Flash Lite via the public API and asks it to invent (query, tools, answer) triples. The student trains on those triples with standard cross-entropy. No logits cross the wire. No temperature, no KL. Sampled outputs treated as ground truth. They call this distillation, and by the everyday meaning, it is. The student learned from the teacher. But it's the synthetic-data SFT row of the taxonomy you just saw, not the Hinton-KD row. Two billion tokens of synthesized data, forty-five minutes of post-training, on top of a two-hundred-billion-token pretrain. The architecture is interesting (no FFN anywhere, INT4 QAT, Muon optimizer), but that's a sidebar today. The point is the recipe, and the recipe is the one the closed-API constraint forces.

Side by side. On the left, what you'll do today. On the right, what Cactus did with Needle. The key difference is the constraint. Your teacher sits in a file on the Hub, so the full softmax over a hundred and thirteen classes is yours for the asking. Cactus's teacher sits behind an API, and the only thing the API exposes is the sampled output. They literally cannot run Hinton's recipe. Connect this back to Busbridge a few slides ago. They measured exactly this difference: same student, same data, vary only the target shape. Full distribution gives ECE in the zero-point-one to zero-point-six percent range. Top-one only gives twenty-two to thirty-nine percent. Fifty to a hundred times worse calibration on the same student. So look at Needle through that lens. The framework predicts that even if its accuracy looks competitive with Gemini, its calibration will be measurably worse. That's a falsifiable claim about a real, current model, and your homework Part 3 Test A is the small-scale version of measuring it. Footnote: the Cactus README claims Needle beats four named small models. No public benchmark supports the claim. The framework you have today predicts what a calibration measurement would find.

Section 3. We close the term arc here. Three compressions you've now seen — label, weight, knowledge. They all ran into the same wall. We name it explicitly, we measure it, and then we set up what to do about it.

Walk left to right. Label compression. Week 4 and Week 5 pipeline reveal. The dataset shipped with 153 raw label values. We collapsed near-duplicates with a merge map down to 130, then dropped 17 classes that had fewer than 5 examples per class, too rare to learn from. Down to 113 canonical classes. That's a compression. It preserves something (a tractable supervised learning problem, a label space the student can actually learn) and it loses something: those 17 classes are gone forever. Weight compression. Last week. Take fp16 weights, map to int8 or int4, save memory, sometimes save latency. It preserves roughly equal accuracy on most tiers. It loses some calibration on the tail at int4. Knowledge compression is today. The big question mark, because we haven't measured it yet. Same shape as the other two. Something is preserved. Something else isn't. The lab tells you which is which.

Wanting #1 was scale. We tested it carefully. We trained Qwen3-32B with LoRA on this exact dataset. We compared it against ModernBERT-large, our biggest encoder, with training data held constant: both on train+test. We ran a paired bootstrap on the validation set, 6,430 examples, 1,000 resamples. The median delta in macro F1 was +0.014, so Qwen was barely better. The 95% confidence interval was [-0.008, +0.045]. The interval includes zero. No statistically significant scale advantage. So when somebody on Twitter tells you bigger models always win on long-tail data, you can answer that on a 113-class problem with our tail length, holding training data constant, the gap was +0.014 with a confidence interval that includes zero. Scale wasn't doing the work the field was crediting it with.

Wanting #2 was fancy tricks. The specific trick was class-weighted classifier retraining — cRT — where you weight the loss higher for rare classes during a fine-tuning stage. The hope was that this would give tail classes more gradient signal and lift tail F1. We measured. Tail F1 moved by about +0.003 — within the bootstrap noise, indistinguishable from zero. So no capacity transfer. ECE moved by about -0.024 — a real improvement in calibration, well outside noise. So cRT did do something — it just did it on the wrong axis. And then the punchline. We compared cRT to post-hoc temperature scaling on the same model, and temperature scaling reproduced essentially all of the ECE improvement, for free, at zero training cost. So the engineering takeaway: cRT improved the wrong axis at the wrong cost. The cheaper alternative — fit one parameter post hoc — got us the same calibration improvement.

Wanting #3 is what you came in for today. Two interpretations to hold in your head. The hopeful frame — maybe distillation transfers something that escapes the data ceiling. Maybe the teacher's distributional knowledge — what classes look similar, where confusion lives — is a property that can be inherited even when the student doesn't have enough data to learn it directly. That's the optimistic story. The skeptical frame — and this is also worth holding — is that the teacher itself hits the data ceiling on the tail. We know its tail F1 is around 0.20. The teacher saw the same per-class data scarcity that the student sees. What could it possibly transfer that it doesn't have? Both stories are coherent. Only one is correct on this dataset, and the lab will tell you which.

Look at the teacher's per-tier F1. Head — top 20 most frequent classes — F1 0.652. Mid — next 40 classes — 0.450. Tail — the rarest 53 classes — 0.198. Our biggest, most expensive, most carefully calibrated teacher, trained on more data than any other model in this course, gets barely 20% macro F1 on the tail. That's well below the head and mid tiers, and well below what you'd consider shipping. Why? Same reason every other model on this dataset has hit the same wall — there are too few training examples per tail class for any model, at any scale, to learn the boundaries between rare classes reliably. This is the data ceiling, made concrete. It is not a property of the model. It is a property of the data. So when you measure the distilled student's tail F1 in the lab today, hold this in mind — KD should not be expected to systematically transfer tail capacity beyond what the teacher itself has. A small student can sometimes outperform its teacher on a metric — regularization, less overfitting, lucky variance — but a large tail F1 lift from KD on a teacher this weak on tail would be a surprising result that would need careful validation. The lab measures whatever's there.
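
Here is a minimal sketch of how a per-tier macro F1 like the ones above can be computed. It assumes tiers are assigned by training-set frequency rank (top 20 head, next 40 mid, the rest tail, as described in the lecture) and that every validation class also appears in training. The function names and the tiny demo are hypothetical.

```python
from collections import Counter, defaultdict

def per_class_f1(y_true, y_pred):
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    # F1 = 2TP / (2TP + FP + FN), for every class present in y_true.
    return {c: 2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in set(y_true)}

def assign_tiers(train_labels, head_size=20, mid_size=40):
    ranked = [c for c, _ in Counter(train_labels).most_common()]
    tiers = {}
    for rank, c in enumerate(ranked):
        if rank < head_size:
            tiers[c] = "head"
        elif rank < head_size + mid_size:
            tiers[c] = "mid"
        else:
            tiers[c] = "tail"
    return tiers

def macro_f1_by_tier(y_true, y_pred, tiers):
    buckets = defaultdict(list)
    for c, score in per_class_f1(y_true, y_pred).items():
        buckets[tiers[c]].append(score)
    return {tier: sum(v) / len(v) for tier, v in buckets.items()}

# Tiny demo with 3 classes and tier sizes shrunk to 1 / 1 / rest.
train_labels = ["a"] * 5 + ["b"] * 3 + ["c"]
tiers = assign_tiers(train_labels, head_size=1, mid_size=1)
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "a", "a"]
by_tier = macro_f1_by_tier(y_true, y_pred, tiers)
print(by_tier)
```

In the demo the rare class is never predicted, so its tier scores 0 while the other two tiers look healthy. That is the pattern a single aggregate number would hide.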

While we're talking about what does and doesn't lift the number — here's a finding from the homework's data confound study that's worth flagging now. The Week 1 baseline trained on the train split only — 57,846 examples. Got macro F1 0.209. The Week 6 vanilla — same model, same architecture, same recipe, but trained on train+test combined — 79,278 examples — got 0.264. +0.055 from data composition. No fancy techniques, no scale change, no calibration recipe. Just more representative training data. That number, +0.055, is bigger than what any of our scaling experiments produced and bigger than what cRT produced. We'll come back to it in a couple of slides because it changes how you read the lab's result. The thing that has consistently moved the F1 number on this dataset is data, not method.

The verdict so far. Two of the three wantings are closed. Scale was tested and ruled out weeks ago. cRT was ruled out too. Both came up against the same wall: the per-class data scarcity at the tail. The third wanting is the one we're testing today. Mechanically distillation is a totally different recipe from scale and from class weighting, so a priori the wall could behave differently. Empirically the lab is going to give you specific numbers (per-tier F1 with paired bootstrap CIs, and per-tier ECE) and you will see whether the wall is something distillation also runs into, or whether it gets around it. Hold the question. Don't peek at the lab numbers in advance. The pedagogical point of today is the measurement, not the conclusion.

This is the chart that reframes the entire week. Three bars. Week 1 baseline at 0.209. Week 6 vanilla, same model with a larger training set, at 0.264. Week 6 distilled, same training set with KD on top, at 0.279. The two lifts. +0.055 from seeing more data. +0.015 from also distilling from a 32B teacher. The data lift is about 3.6× the distillation lift. So if the question on the table were "what should we do to lift macro F1?" the empirically defensible answer based on this term's measurements would be: get more data, you idiot. But the question on the table isn't usually that, because in industry "get more data" is often impossible. Annotation budgets, rare events, regulatory constraints, ethical limits: there are real reasons you can't always just collect more labels. So the actual question for your memo is: given a fixed data budget, which is what your scenario will impose, what's the cheapest recipe that transfers the property your deployment depends on? KD is one answer. Temperature scaling is another. Doing nothing is a third. We're now ready to ask which.

Section 4. Quick walkthrough of what the lab actually does, then on to deployment framing. We've spent enough time on theory and term-arc. Time to get specific about today's 80 minutes.

Lab structure. Three acts, 80 minutes total. Act 1 — meet the teacher, look at how much per-class data each tier had to learn from, predict the teacher's tail F1, then look at top-3 teacher probabilities on examples from each tier so you see the dark knowledge concretely. Act 2 is the load-bearing one. You implement the KD loss from scratch. Then you hunt 3 silent bugs in a colleague's KD loss — these are realistic, the kind that survive a "loss is decreasing" check and produce wrong students. Then you compare distilled and vanilla, per tier, F1 and ECE both. Act 3 brings it operational. You compute threshold coverage tables, set up a deployment scenario, and write a deployment decision artifact in class. 4 predict-then-observe moments distributed across the three acts. The bootstrap cell takes about 15 seconds — far less than I had originally guessed, since it's a CPU operation.

The bug hunt is the centerpiece of Act 2. Three bugs. None of them crash. All of them silently change what the student learns. Loss still decreases when you train this — that's the trap. So how do you find them? You compare the colleague's function against your own correct implementation, line by line, on the same random batch. The values will disagree, and the disagreement tells you where the bug is. Read carefully. 2 of the 3 bugs are inside the KL term. 1 is in how the two terms are combined. I'll let you find them. The reveal in the lab walks through exactly what each bug does and what it produces in a trained student. The pedagogical point — and this is real industry muscle — is that "loss is decreasing" is not a test. You need value-comparison against a known-good reference. This is exactly what professional ML engineers do when their model is broken.

Quick statistics reminder, because the bootstrap intervals you'll compute today are load-bearing for the memo. A paired bootstrap with 1,000 resamples gives you a confidence interval on each per-tier delta. The interval excludes zero — the effect is real at this sample size, you can claim it as a finding. The interval includes zero — you cannot distinguish the effect from zero. This is not the same as "the effect is zero." It is a specific statement about your data: with this many examples, our resolution is not enough to call the effect. The honest framing in the memo is "the lift is +0.0X with CI [Y, Z], inside the noise floor on this sample size." The dishonest framing — and you will see it in industry constantly — is to look at a small point estimate and assert it's a real effect. Your tail tier has only 210 examples in val. The CI on tail F1 is going to be quite wide. Be honest about that.
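
A minimal sketch of the paired bootstrap described above, assuming a percentile interval and using accuracy as a stand-in metric to keep the code short. The lab uses macro F1, which you can pass in through the `metric` argument. The function names and toy data are illustrative.

```python
import random

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def paired_bootstrap_ci(y_true, pred_a, pred_b, metric=accuracy,
                        n_resamples=1000, seed=0, level=0.95):
    """Percentile CI on metric(b) - metric(a), resampling examples with replacement.

    'Paired' means both models are scored on the SAME resampled indices.
    """
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        deltas.append(metric(yt, b) - metric(yt, a))
    deltas.sort()
    lo = deltas[int((1 - level) / 2 * n_resamples)]
    hi = deltas[int((1 + level) / 2 * n_resamples) - 1]
    return lo, deltas[n_resamples // 2], hi

# Toy data: model b fixes 80 of model a's 100 errors on 200 examples.
y_true = [0] * 200
pred_a = [0] * 100 + [1] * 100
pred_b = [0] * 180 + [1] * 20
lo, median, hi = paired_bootstrap_ci(y_true, pred_a, pred_b)
print(f"delta = {median:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

With only a couple of hundred examples in a tier, expect the interval to be wide. The reporting pattern is the one from the paragraph above: the point estimate together with its interval, never the point estimate alone.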

Practical note. Each of these students takes about 80 minutes to train end-to-end on a T4. Two students, 160 minutes of training, plus tokenization plus evaluation, doesn't fit in your 80-minute lab. So we ran both arms for you, ahead of time, and pushed the val predictions to public Hub repos. You download two npz files, total under 10 MB, and you have everything — raw logits, argmax predictions, ground truth labels, tier assignments — to do all the analysis the lab needs. You're paying for analysis time, not training time. The trade-off is you don't experience the pain of training a student that fails because of a bug. The bug hunt is the pedagogical compensation for that. You debug a colleague's broken function instead of debugging your own.

The Act 3 artifact, at the end of lab. Pick one of three deployment scenarios. Fill in fields with specific values from tables you computed. Defend in writing. There's no single right scenario-to-recipe pairing. There IS a wrong shape of answer — anything that doesn't reference your numbers, anything that picks a recipe without naming the property of the deployment that recipe serves, anything that asserts an opinion without a constraint. The form matters. Threshold names a number. Coverage names a number. Decision rule names the recipe and the threshold together. Constraint that would flip names a specific switch point. This is the shape every deployment defense needs. We're training the form here so the homework memo inherits it. The memo grading rewards this form, not your specific recommendation.

Last act. Calibration was a Week 4 topic. We measured ECE under quantization in Week 5. Today it returns as the second axis of the dichotomy and the load-bearing axis for the deployment defense. We'll spend about 15 minutes here, then close.

ECE vs NLL. We've been computing both since Week 4. They are not the same thing. ECE — expected calibration error — bins predictions by their max probability and asks whether the accuracy in each bin matches the average confidence. So ECE is an aggregate over the top-1 prediction's confidence. It is exactly the right metric if your deployment uses confidence thresholds — auto-route at 0.85, escalate at 0.7, that kind of thing. NLL — negative log likelihood — is the cross-entropy of the model's full softmax against the true labels. It penalizes the model for putting probability mass on wrong classes, not just for being miscalibrated on the top class. So NLL is the right metric if your deployment consumes the full distribution — ensembling, Bayesian downstream steps, ranking the top 3 predictions for human review. Two metrics, two questions, two deployments. The homework's Part 3 Test B will measure both of these and you'll see they sometimes point in different directions.
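
Both metrics fit in a few lines. This is a sketch assuming equal-width confidence bins for ECE, which is one common choice; the bin count and the toy predictions are illustrative, not the course's settings.

```python
import math

def ece(probs, labels, n_bins=10):
    """Top-1 calibration: weighted gap between accuracy and mean confidence per bin."""
    n = len(labels)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b].append((conf, pred == y))
    total = 0.0
    for items in bins:
        if items:
            avg_conf = sum(c for c, _ in items) / len(items)
            acc = sum(ok for _, ok in items) / len(items)
            total += len(items) / n * abs(acc - avg_conf)
    return total

def nll(probs, labels, eps=1e-12):
    """Full-distribution score: mean negative log probability of the true class."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)

# Two toy models, same top-1 prediction and confidence, true class is index 2.
labels = [2]
model_a = [[0.6, 0.3, 0.1]]   # little mass on the true class
model_b = [[0.6, 0.1, 0.3]]   # more mass on the true class

print("ECE:", ece(model_a, labels), ece(model_b, labels))
print("NLL:", nll(model_a, labels), nll(model_b, labels))
```

The two toy models are indistinguishable to ECE because they share the same top-1 confidence and the same top-1 error. NLL separates them because it looks at how much mass landed on the true class.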

Skeptic's question. KD is a real recipe. It costs 80 minutes of T4 time per student. Post-hoc temperature scaling fits 1 parameter on a calibration set, takes 30 seconds, costs nothing in inference. If KD's main effect is calibration, and Han Guo and colleagues argued exactly this in 2021, why pay for KD when temperature scaling closes the same gap? The honest answer is: it depends on which calibration metric you care about. Top-1 calibration, ECE, is exactly what temperature scaling is engineered to fix. 1 parameter rescales softmax sharpness. NLL, the full distribution, is what temperature scaling cannot fully fix, because a single scalar stretches or shrinks every logit gap by the same factor on every example. It can't reorder the non-top-1 classes or reshape how mass is split among them from one example to the next; KD can. Your homework's Part 3 Test B sets up the three-way comparison: vanilla raw, vanilla + T, distilled. ECE and NLL won't necessarily move in the same direction. Which recipe wins which metric is the measurement you'll do. The right answer to the skeptic isn't "always pick KD" or "always pick temperature scaling." It's "name the property your deployment consumes, then measure which recipe transfers it."
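
Post-hoc temperature scaling is small enough to sketch in full. This version assumes the temperature is chosen by minimizing NLL on a held-out calibration set and uses a simple grid search to stay inside the standard library; a gradient-based optimizer is the more usual choice in practice. The grid range and toy logits are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, temperature):
    return -sum(math.log(softmax(z, temperature)[y])
                for z, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels):
    """Pick the single scalar T that minimizes calibration-set NLL."""
    grid = [0.25 + 0.05 * k for k in range(156)]   # 0.25 to 8.0
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))

# Toy overconfident model: always very sure of class 0, right only half the time.
cal_logits = [[4.0, 0.0, 0.0]] * 4
cal_labels = [0, 0, 1, 2]

t_fit = fit_temperature(cal_logits, cal_labels)
before = softmax(cal_logits[0])
after = softmax(cal_logits[0], t_fit)
print("fitted T:", round(t_fit, 2))
print("top-1 confidence before/after:", round(before[0], 3), round(after[0], 3))
```

The fitted temperature pulls the top-1 confidence down toward the observed accuracy without changing which class is predicted. The same scalar is applied to every example, which is why it cannot supply per-example distribution shape.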

Operational lens. Threshold and coverage. Most production systems that consume model outputs do so behind a confidence gate. If the model says it's 85% sure, take its answer; if less, route the example to a human reviewer. This is everywhere — content moderation, customer support routing, medical pre-screening, document classification. The policy depends entirely on whether the model's confidence is meaningful. Lower the threshold, you keep more volume but have to tolerate more confident-wrong errors. Raise the threshold, accuracy on the auto-routed subset goes up, but you escalate more volume to human reviewers. The whole trade-off only works if confidence is calibrated. A poorly calibrated model has examples that look confident and are wrong. Threshold filtering on those doesn't help — the wrong-and-confident examples slip through and damage the system. Calibration is what makes threshold-based deployment policies work.
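
The threshold and coverage trade-off can be tabulated directly from top-1 confidences and correctness flags. A minimal sketch, with hypothetical threshold values and made-up data:

```python
def coverage_table(confidences, correct, thresholds=(0.5, 0.7, 0.85)):
    """For each threshold: share of traffic auto-routed, and accuracy on that share."""
    n = len(confidences)
    rows = []
    for t in thresholds:
        kept = [ok for c, ok in zip(confidences, correct) if c >= t]
        coverage = len(kept) / n
        acc = sum(kept) / len(kept) if kept else float("nan")
        rows.append((t, coverage, acc))
    return rows

# Made-up predictions: top-1 confidence and whether the top-1 class was right.
confidences = [0.95, 0.90, 0.80, 0.75, 0.60, 0.55]
correct = [True, True, True, False, True, False]

table = coverage_table(confidences, correct)
for t, cov, acc in table:
    print(f"threshold {t:.2f}: coverage {cov:.2f}, accuracy on covered {acc:.2f}")
```

Raising the threshold shrinks coverage and, in this toy data, raises accuracy on what remains. With a poorly calibrated model the second half of that statement is the part that stops being reliable.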

Three scenarios in your homework. Pick one, commit. The point of having three is that they have structurally different metric priorities, and the "right" recipe depends on which one you picked. Scenario A — high-throughput batch triage. Latency budget is tight. Macro F1 matters. Calibrated probabilities matter only loosely. Vanilla + a post-hoc temperature fit might dominate here because it's the cheapest recipe that meets the calibration bar at all and it has the lowest inference cost. Scenario B — regulated escalation review. A human is going to see the model's top-1 confidence and decide whether to escalate. Calibrated probabilities are the deliverable. ECE and NLL both matter. KD or KD + a post-hoc T might dominate because the full distribution shape is what the human reviewer's policy depends on. Scenario C — long-tail rare-class monitoring. Tail F1 is what matters. Spoiler — no recipe will likely win this scenario. The data ceiling holds. The homework tests whether you can defend "this is a data problem, not a recipe problem" — which is itself a defensible engineering position. Pick one. Different scenarios. Different defenses. Different winners.

The wildcard slot. Industry skill, not a measurement. In your recipe shortlist, 4 entries are configs we precomputed for you and you measure. The 5th — the wildcard — is something you propose but don't run. The point is to test whether you can predict from the mechanism what a recipe would do. Examples — T_d = 16, much heavier softening than our extreme. Or α = 0.5, lower KD weight, more hard-label signal. Or distilled + post-hoc temperature scaling on top — which I'd actually predict dominates the distilled-only recipe on ECE without changing macro F1. Or vanilla + label smoothing, which is interesting because label smoothing softens the hard target — making it kind of a poor man's KD. You pick one. You predict where it lands. You justify the prediction from the mechanism, not the data. This is what senior engineers do constantly — predict before measuring. It's also what saves compute. If your prediction says the recipe would be dominated, you don't need to spend 40 minutes training it. Your wildcard exercises this muscle.

The capstone synthesis. Prompt 5 of your homework memo is the closing exercise of the term. It asks you to recommend a single recipe for a real deployment, defend it across all the axes we've measured this term — accuracy, calibration, latency, memory, tier-level robustness — and acknowledge the cross-week interactions you haven't measured but should. Some of those open questions are listed here. Quantized distilled student — does Week 5's int4 calibration drift survive when applied to a distilled student? We don't know. The recipe combinations multiply faster than the compute budget. Vanilla + temperature scaling + int8 — that's the cheapest possible recipe, and it might actually be the right one for a high-throughput deployment. We didn't measure that combination. Distilled with the 17 dropped classes restored — would the data ceiling shift? Don't know. Prompt 5 acknowledges what you didn't measure, defends what you did, and recommends what you'd actually ship. That's the closing exercise of the term.

Last few slides. Logistics for the lab and the homework, then we close.

Logistics. The lab is in your repo, `week6_lab.ipynb`, predict-then-observe rhythm, 4 prediction cycles, KD loss implementation, bug hunt, per-tier metrics, paired bootstrap, threshold table, deployment artifact at the end. 80 minutes, you'll be done by 5:30. Homework is `week6_homework.ipynb`. ~5 hours including memo. 4 parts — diagnose the teacher, build a recipe shortlist for your scenario, run 2 literature tests, write the memo. Note on deadlines — this is the last class meeting, so the usual "Wednesday morning before next class" rule doesn't apply. The Week 6 memo is due before the final exam. Check Moodle for the exact date. Last note — you don't need a GPU for the homework. Analysis only. Everything is precomputed and you load val predictions. Run on CPU on Kaggle, on Colab, on your laptop, doesn't matter.

Today in one sentence. Distillation transfers what's in the teacher's full distribution — the soft target, the dark knowledge, the relative probabilities across non-true classes. That's its irreplaceable property. Top-1 calibration — what most threshold-based deployments actually consume — is reproducible by post-hoc temperature scaling at zero training cost. Whether that reproduction is partial, complete, or whether it overshoots is the measurement your homework Part 3 will produce. So the right answer to "should I distill" is always: name the property your deployment depends on, then measure which recipe transfers it on your setup. If it's top-1 calibration, fit a temperature first and see if you need more. If it's full-distribution shape, distill. If it's something neither one fixes — like tail capacity at this sample size — your problem is data, not recipe. By the end of today you have measurement-grounded vocabulary for all three. By the end of homework, you have a defended position. That's the term.

The term in retrospect. Six weeks ago you trained your first ModernBERT model and looked at macro F1. Today you can do 4 things most working ML engineers cannot. One: you can diagnose a model per tier. Most engineers in industry look at aggregate macro F1 and stop there. You know that aggregate is one average over every class, that it hides everything important about the long tail, and you have the tools to break it down. Two: you can defend a number with a confidence interval. You will see in industry papers, blog posts, and leaderboards claims that one model beats another by 0.005 F1, with no measurement of variance. You will know to ask "what's the bootstrap CI on that delta?" because you've computed it. Three: you can tell what your dataset can support from what your method should do. Most engineers conflate these. They blame their loss function for what their data was always going to do. You won't. Four: you can read papers and press releases and ask the second question. What did they actually measure? What's underspecified? What recipe is hiding in the words? That's the discipline. The specific T_d and α you'll forget. The discipline you'll keep.
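
For the "bootstrap CI on that delta" question, a minimal sketch of a paired bootstrap on a macro-F1 difference looks like this. The helper names (`macro_f1`, `paired_bootstrap_delta`), the percentile interval, and the toy predictions are assumptions for illustration, not the lab's code. The one detail that makes it paired is that each resample draws one set of row indices and scores both models on those same rows.

```python
import random

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1; a class with no TP, FP, or FN scores 0."""
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1
            fp[p] += 1
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

def paired_bootstrap_delta(y_true, pred_a, pred_b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for macro-F1(B) minus macro-F1(A) on shared resampled rows."""
    rng = random.Random(seed)
    n = len(y_true)
    classes = sorted(set(y_true) | set(pred_a) | set(pred_b))
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same rows for both models
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        deltas.append(macro_f1(t, b, classes) - macro_f1(t, a, classes))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    point = macro_f1(y_true, pred_b, classes) - macro_f1(y_true, pred_a, classes)
    return point, lo, hi

# Toy predictions (an assumption, not course data): model A is right about
# 70% of the time, model B about 80%, on the same 300 rows.
rng = random.Random(1)
y_true = [rng.randrange(3) for _ in range(300)]

def noisy(labels, accuracy):
    """Copy labels, replacing each with a wrong class with probability 1 - accuracy."""
    return [y if rng.random() < accuracy else (y + rng.randrange(1, 3)) % 3 for y in labels]

pred_a = noisy(y_true, 0.7)
pred_b = noisy(y_true, 0.8)
point, lo, hi = paired_bootstrap_delta(y_true, pred_a, pred_b)
print(f"delta macro-F1 = {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

If the interval straddles zero, the claimed improvement is not distinguishable from resampling noise on this dataset. To get a per-tier version, filter the rows to one tier before calling the function.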

That's lecture. 5-minute break, then lab. The papers in `readings/week6/` are worth the read at some point — they're the literature anchors for everything we just did. Hinton 2015, the original distillation paper, is short, accessible, and it tells you exactly what dark knowledge means in the original context. Stanton 2021, "Does Knowledge Distillation Really Work?", is the optimization-difficulty story — even with a perfect teacher, the student doesn't fully recover the distribution, and they argue why. Busbridge 2025 is the LLM-scale measurement of the full-distribution vs top-1 distinction — the 50-100× ECE difference that I cited in Act 2. Three papers, three different parts of the picture. If you only read one, read Hinton. If you read two, add Busbridge. If you have time for the third, Stanton. Anything from the term that didn't land for you, ask now or grab me after lab. Otherwise — see you in the lab.