ECBS5200 Week 4

Error Diagnosis

ECBS5200 — Week 4

You made a recommendation. How do you know you were right?

Error Diagnosis — Slices, Calibration, Cross-Model Analysis
ECBS5200 Week 4

Where we left off

Model | Accuracy | Macro F1
Encoder LoRA (ModernBERT 149M) | 56.6% | 0.209
Decoder LoRA (Qwen 0.5B, 494M) | 57.0% | 0.240

Decoder wins by 0.031 macro F1. ~3× slower per example.

The black box you've been calling

from utils.data_utils import load_course_data
train_ds, val_ds, test_ds = load_course_data()

Four weeks of using this. What does it do to the raw CFPB data?

Before you diagnose errors today, see the pipeline that made your dataset.

Raw: 153 Issue labels, many redundant

Label | Raw count
Incorrect information on credit report | 7,607
Incorrect information on your report | 7,208

CFPB reworded the complaint form in April 2017. Historical complaints kept old labels.

data/label_merge_mapping.json catches 36 pairs. 153 → 120.
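
A minimal sketch of what applying such a mapping could look like. It assumes the JSON file is a flat object of old label to canonical label; the file's real schema is not shown on this slide. The inline `mapping` dict stands in for the file, the direction of the merge is illustrative, and `merge_labels` is a hypothetical helper name.

```python
from collections import Counter

# Stand-in for json.load(open("data/label_merge_mapping.json")).
# Assumption: the file is a flat {old_label: canonical_label} object,
# and which wording is the canonical target is illustrative only.
mapping = {
    "Incorrect information on credit report": "Incorrect information on your report",
}

def merge_labels(labels, mapping):
    """Replace each label by its canonical form; unmapped labels pass through."""
    return [mapping.get(label, label) for label in labels]

raw = (
    ["Incorrect information on credit report"] * 3
    + ["Incorrect information on your report"] * 2
    + ["Can't repay my loan"]
)
merged = merge_labels(raw, mapping)
counts = Counter(merged)
print(len(set(raw)), "unique labels before,", len(counts), "after")
```

An identity entry such as "Can't repay my loan" → "Can't repay my loan" would pass through this function unchanged, which is why a no-op in the file is harmless but easy to miss.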

The mapping is imperfect in both directions

No-op we left in:
"Can't repay my loan" → "Can't repay my loan"

Duplicates we missed — 5 "Advertising" classes survive to the canonical 113:

Class | Train examples
Advertising | 7
Advertising and marketing | 142
Advertising and marketing, including promotional offers | 98
Advertising, marketing or disclosures | 7
Confusing or misleading advertising or marketing | 18

Over-merges we made:
"Struggling to pay your bill" → "Struggling to pay your loan"

The filter and the tail

MIN_CLASS_COUNT=5: 7 classes dropped (23 examples lost).

Stratified split, seed=42: 57,846 / 6,430 / 21,432.

17 val classes have 1 example each. One wrong prediction moves that class's F1 from 1.0 to 0.0.
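
A sketch of the two checks on this slide, under stated assumptions: the filter is taken to count total examples per class before splitting, and the helper names are hypothetical. The stratified split itself is not reproduced; `toy_val` is hand-made for illustration.

```python
from collections import Counter

MIN_CLASS_COUNT = 5  # threshold named on the slide

def filter_rare_classes(labels, min_count=MIN_CLASS_COUNT):
    """Keep examples whose class has at least min_count total examples."""
    totals = Counter(labels)
    kept = [y for y in labels if totals[y] >= min_count]
    dropped = {c: n for c, n in totals.items() if n < min_count}
    return kept, dropped

def singleton_classes(split_labels):
    """Classes with exactly one example in a given split."""
    return sorted(c for c, n in Counter(split_labels).items() if n == 1)

# Toy labels: class "c" has 3 examples and falls below the threshold.
labels = ["a"] * 40 + ["b"] * 6 + ["c"] * 3
kept, dropped = filter_rare_classes(labels)

toy_val = ["a", "a", "a", "a", "b"]  # hand-made val split, not a real stratified split
print("dropped:", dropped, "| singleton val classes:", singleton_classes(toy_val))
```

Running `singleton_classes` on the real val labels is one way to list the 17 classes whose F1 rests on a single example.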

You inherited this. Now diagnose.

Upstream schema choices, filters, and splits shape what your model CAN learn.

You did not build this pipeline.
In industry you will not build most of them.

Knowing what was decided for you is part of the job.

The question

"I made a recommendation. How do I know I was right?"

Two models. One bag of diagnostic tools. Which one is really better — and how would you know?

D'Amour et al. 2022 call this underspecification: an ML pipeline can return predictors with equivalent held-out metrics that behave very differently in deployment.

📄 readings/week4/damour2022_underspecification.pdf

From problem to approach

Underspecification is a problem statement. What's the approach?

D'Amour's answer: stress tests. Don't aggregate more metrics — design evaluations targeted at where predictors differ.

Your five diagnostic tools this week are all lightweight stress tests:

  • Slice analysis — stress on input subsets
  • Calibration — stress on model's confidence claims
  • Confusion patterns — stress on WHAT gets confused with WHAT
  • Noise floor — stress against resampling variability
  • Bug hunt — stress an allegedly-broken model

Each one isolates a failure mode the aggregate number can't see.

Three ways aggregate metrics lie

  1. Averages hide subgroups. A model that's 80% accurate might be 95% on head classes and 40% on the tail.

  2. Accuracy doesn't measure confidence. A model can be right 80% of the time while claiming 99%.

  3. Small gaps may not be real. Does 0.03 macro F1 survive resampling?

Aggregate numbers answer "how well on average." They don't answer "well enough for what I'm doing."

Today's plan

Block 1 — Lecture (after the quiz)

  • Aggregate metrics and their limits
  • Slice analysis: the where-does-it-fail tool
  • Calibration: the can-you-trust-it tool
  • Confusion matrices at scale: patterns, not cells
  • Noise floor: how much of a gap is real?
  • A broken model you'll diagnose in the lab

Block 2 — Lab

  • Load both models, six diagnostic sections
  • Predict, measure, explain. End with the bug hunt.

Section 1 — Slice Analysis

The where-does-it-fail tool

The problem with averages

Your decoder's macro F1 is 0.240.

  • Is it 0.240 on short complaints AND long complaints?
  • On complaints with redacted XXXX markers AND clean ones?
  • On complaints that start with "I" AND everything else?

An average number doesn't answer any of those.

Oakden-Rayner et al. 2020 call this hidden stratification: aggregate accuracy routinely hides >20% performance gaps on unidentified subgroups — with real clinical consequences in their medical-imaging setting.

📄 readings/week4/oakdenrayner2020_hidden_stratification.pdf

What is a slice?

A slice is a subset of the val set defined by a property of the INPUT.

Examples:

  • Complaints under 200 characters
  • Complaints with at least one XXXX redaction marker
  • Complaints that start with "I"
  • Complaints with heavy all-caps usage
  • Complaints mentioning specific legal language

For each slice, compute per-slice macro F1 separately. Compare models slice-by-slice, not just on the average.
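
A minimal sketch of per-slice macro F1, written in plain Python so it runs without dependencies. Assumptions: macro F1 is averaged over the classes present in the slice's true labels (scikit-learn's `f1_score(average="macro")` uses a slightly different default label set), and the texts, labels, and helper names below are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        scores.append(2 * tp / (2 * tp + fp + fn))  # denominator >= 1: c occurs in y_true
    return sum(scores) / len(scores)

def per_slice_macro_f1(texts, y_true, y_pred, in_slice):
    """Macro F1 on the examples where in_slice(text) is True, plus the slice size."""
    idx = [i for i, text in enumerate(texts) if in_slice(text)]
    if not idx:
        return None, 0
    return macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx]), len(idx)

def has_redaction(text):
    return "XXXX" in text

# Toy data, invented for this example.
texts = ["I was charged twice", "XXXX never replied", "I disputed it", "Fee on XXXX account"]
y_true = ["billing", "service", "billing", "fees"]
pred_a = ["billing", "billing", "billing", "fees"]
pred_b = ["billing", "service", "fees", "fees"]

for name, pred in [("model A", pred_a), ("model B", pred_b)]:
    f1, n = per_slice_macro_f1(texts, y_true, pred, has_redaction)
    print(name, "n =", n, "macro F1 =", f1)
```

Always report the slice size next to the score: a per-slice macro F1 on a few dozen examples is much noisier than the full-val number.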

The six axes you'll test

In the lab, you'll use two axes. In the homework, all six:

Axis | Split
Character length | Quartile buckets (q1 short → q4 long)
Redaction | XXXX marker present or not
Token length | 1–30 / 31–80 / 81–127 / truncated at 128
Numeric content | Dollar sign or 4+ digit number
Opener | Starts with "I" or not
All-caps usage | Above or below median rate

These are one-liners in Python. The signal they produce isn't.
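
A sketch of five of the six axes as string predicates. These are one plausible reading of the table, not the lab's actual definitions: the function names are hypothetical, the all-caps rule ignores one-letter words so that "I" does not count (an assumption), and token length is left out because it needs the model's tokenizer.

```python
import re
import statistics

def has_redaction(text):
    return "XXXX" in text

def has_numeric(text):
    # A dollar sign, or a run of four or more digits.
    return bool(re.search(r"\$|\d{4,}", text))

def starts_with_i(text):
    # First whitespace-delimited token is exactly "I" (so "I'm" does not match).
    return text.split()[:1] == ["I"]

def caps_rate(text):
    # Fraction of words that are fully uppercase and longer than one letter.
    # Note that XXXX redaction markers count as uppercase under this rule.
    words = text.split()
    return sum(w.isupper() and len(w) > 1 for w in words) / len(words) if words else 0.0

def length_quartile(text, cuts):
    # cuts: the three quartile cut points of character length on the val set.
    return "q" + str(1 + sum(len(text) > c for c in cuts))

texts = ["I paid $50", "THEY IGNORED ME", "Account 12345 closed by XXXX", "no reply"]
cuts = statistics.quantiles([len(t) for t in texts], n=4)
for t in texts:
    print(length_quartile(t, cuts), has_redaction(t), has_numeric(t), starts_with_i(t), caps_rate(t))
```

For the all-caps axis, the median split would then compare each complaint's `caps_rate` against `statistics.median` of the rates over the val set.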

Per-slice F1 example (illustrative)

Slice | n | Model A | Model B | Diff
slice X | 1,500 | 0.20 | 0.26 | +0.06
slice Y | 1,500 | 0.22 | 0.24 | +0.02
slice Z | 1,500 | 0.28 | 0.28 | +0.00

Not all slices are the same. Where the gap disappears is often more interesting than where it's biggest.

Null-result axes

Sometimes an axis shows no signal — encoder and decoder behave identically across every slice on that axis.

This is evidence, not absence of evidence:

  • The axis doesn't distinguish the models
  • Whatever feature that axis captures, both models handle it the same way
  • You learn what ISN'T driving the gap

Report null-result axes. Don't quietly drop them.

Important caveat: these are exploratory diagnostics, not confirmatory hypothesis tests. You're characterizing where models differ, not rejecting null hypotheses with multiple-comparison correction. With six axes tested, some apparent "signal" slices might be noise — cross-check with the bootstrap CI.

When you don't know what axis to try

The six axes you'll test are hand-picked. In practice you'd want to:

  • Test dozens of axes, not six
  • Discover axes you didn't think of
  • Automate the coherence check

Automated slice discovery is an active area:

  • Chung et al. 2020 (TKDE) — SliceFinder: automated slice search over predicate combinations
  • Eyuboglu et al. 2022 (ICLR) — Domino: embedding-based slice discovery for hidden failure modes
  • Yu et al. 2026 (AAAI) — slice coherence without predefined labels

You'll see the hand-picked version in lab. These are reading-list papers if you want to go further.

📄 readings/week4/chung2020_slicefinder.pdf
📄 readings/week4/yu2026_manifold_slicing.pdf

Section 2 — Calibration

Can you trust what the model says about itself?

The calibration question

When the model says "I'm 80% confident in this prediction" —

is it right 80% of the time?

A calibrated model's confidence matches its accuracy.

An overconfident one says 95%, is right 70%. (Common for networks fine-tuned with cross-entropy on imbalanced classification — but the effect is measured, not assumed.)

An underconfident one says 60%, is right 85%. (Rare.)

ECE: Expected Calibration Error

Pre-work module 5 covered this. Today you measure it. Guo et al. 2017 popularized ECE as the standard single-number calibration metric.

ECE = weighted average of per-bin gaps. Each bin's gap contributes proportional to its sample count (bottom panel).
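
The weighted average above can be sketched in a few lines. Assumptions: equal-width confidence bins, 15 of them, with each bin closed on the left; `confidences` holds the model's top-class probability per example and `correct` holds 1 where the top class was right. The numbers are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece

# Four predictions at 0.95 confidence (three right), two at 0.55 (one right).
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 4))
```

Here the 0.95 bin has a gap of 0.20 with weight 4/6 and the 0.55 bin has a gap of 0.05 with weight 2/6, so the larger bin dominates the total.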

ECE works, with caveats

ECE is the standard. It's also imperfect.

Chidambaram et al. 2024 showed ECE is discontinuous in predictor space — where you put the 15 bin boundaries can shift ECE without the underlying calibration changing.

They propose Logit-Smoothed ECE (LS-ECE) to fix this.

Empirical finding: on pretrained image classifiers, binned ECE tracks LS-ECE closely anyway. The theoretical pathology rarely bites in practice.

Takeaway: teach ECE. Measure ECE. Don't worship ECE.

📄 readings/week4/chidambaram2024_ece_flawed.pdf

The reliability diagram

For each confidence bin:

  • X = mean confidence the model claimed
  • Y = empirical accuracy on those examples

Perfect calibration = the diagonal.

Below the diagonal = overconfident.
Above the diagonal = underconfident.

In the lab you'll plot this for both models.

An intuition check

You have two models:

  • Encoder — 149M parameters, ~2T pretraining tokens (English web)
  • Decoder — 494M parameters, ~18T pretraining tokens (multilingual)

Both fine-tuned with cross-entropy for 3 epochs on the same data.

Before measuring: do you expect their post-fine-tuning ECE to be similar or meaningfully different? If different, in which direction, and by how much?

Guo 2017 already showed larger models can be WORSE calibrated than smaller ones — don't treat "bigger = better" as the prior. Fine-tuning dynamics can shift calibration in either direction. Write down your prediction; measure it in the lab.

Temperature scaling — a one-parameter fix

Fit a single scalar T > 0 on held-out calibration data. Then at inference, replace softmax(logits) with softmax(logits / T):

  • T > 1 → softer distribution (fixes overconfidence)
  • T < 1 → sharper distribution (fixes underconfidence)
  • T = 1 → no change

Critical property (for post-hoc scalar T applied to fixed logits at inference): T is a positive scalar → dividing every logit by T preserves their order → argmax unchanged → macro F1 unchanged.

This does NOT generalize to all calibration methods. Per-class scaling, Platt scaling, or retraining-based approaches can change argmax.

📄 readings/week4/guo2017_calibration.pdf
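
A small sketch of the critical property, using invented logits for one example. `softmax_with_temperature` is a hypothetical helper name. The top class stays the same for every positive T while the top probability moves.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """softmax(logits / T) for one example; T must be positive."""
    if T <= 0:
        raise ValueError("T must be positive")
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

logits = [4.0, 1.0, 0.5]  # invented logits for a 3-class example
results = {}
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    results[T] = (argmax(probs), max(probs))
    print("T =", T, "top class =", results[T][0], "top prob =", round(results[T][1], 3))
```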

Homework temperature-scaling exercise

In the homework you'll:

  1. Split val 50/50 — calibration fold and eval fold
  2. Fit T on calibration fold (minimize NLL on softmax(logits / T))
  3. Apply T to eval fold, compute ECE before and after
  4. Also compute macro F1 before and after — should be identical
  5. Report: does scaling change which model is better-calibrated?

The last question is the one that matters.
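
Step 2 can be sketched as a one-dimensional search. This is not the homework's reference solution: a plain grid search stands in for whatever optimizer you choose, the grid range is an assumption, and the toy logits are invented. The 50/50 split and the ECE comparison are left out.

```python
import math

def nll(logits_batch, labels, T):
    """Mean negative log-likelihood of the true labels under softmax(logits / T)."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

def fit_temperature(logits_batch, labels, grid=None):
    """Pick the T on a grid that minimizes NLL on the calibration fold."""
    if grid is None:
        grid = [0.05 * k for k in range(2, 201)]  # 0.10 to 10.00, an assumed range
    return min(grid, key=lambda T: nll(logits_batch, labels, T))

# Toy calibration fold: very confident logits, but one of four predictions is wrong.
cal_logits = [[6.0, 0.0], [6.0, 0.0], [0.0, 6.0], [6.0, 0.0]]
cal_labels = [0, 0, 1, 1]
T = fit_temperature(cal_logits, cal_labels)
print("fitted T:", round(T, 2))
```

The fitted T comes out well above 1 here because the toy model is overconfident: it claims near-certainty but is right only three times out of four.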

Section 3 — Confusion Matrix Patterns

At scale, patterns matter. Individual cells do not.

Pre-work 04: the 5×5 matrix

You drew a 5×5 confusion matrix in pre-work module 4. 25 cells. You could read each one.

Today you have 113×113 = 12,769 cells.

You cannot read 12,769 cells. You have to look for patterns.

Frequency tiers

Group the 113 classes by training frequency into three tiers:

Tier | Count | Training frequency range
Head | 20 | Top 20 (most common)
Mid | 40 | Next 40
Tail | 53 | Bottom 53 (least common)

Then ask questions at the TIER level, not the class level.

Jin et al. 2017 call this the confusion community view: at scale, identify groups of classes that systematically confuse each other, then analyze at the group level rather than cell-by-cell.

📄 readings/week4/jin2017_confusion_graph.pdf

The headline question

When a tail-class complaint is misclassified, the model's wrong prediction lands on:

  • (A) A head class — the model defaults to something common
  • (B) A mid-tier class — confused with medium-frequency classes
  • (C) Another tail class — confused with a similar rare class

In the lab, you'll predict the percentage breakdown. Then measure.
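
A sketch of how the A/B/C breakdown could be measured. The helper names are hypothetical and the toy data uses one head and one mid class so the example stays readable; the course tiers are 20 / 40 / 53. It also assumes every predicted class appears in the training labels.

```python
from collections import Counter

def assign_tiers(train_labels, n_head=20, n_mid=40):
    """Rank classes by training frequency: top n_head are head, next n_mid are mid."""
    ranked = [c for c, _ in Counter(train_labels).most_common()]
    tiers = {}
    for rank, c in enumerate(ranked):
        if rank < n_head:
            tiers[c] = "head"
        elif rank < n_head + n_mid:
            tiers[c] = "mid"
        else:
            tiers[c] = "tail"
    return tiers

def tail_error_destinations(y_true, y_pred, tiers):
    """Among misclassified tail-class examples, share of wrong predictions per tier."""
    landed = Counter(
        tiers[p] for t, p in zip(y_true, y_pred) if tiers[t] == "tail" and p != t
    )
    total = sum(landed.values())
    if total == 0:
        return {}
    return {tier: landed[tier] / total for tier in ("head", "mid", "tail")}

# Toy data: "a" is head, "b" is mid, "c" and "d" are tail.
train = ["a"] * 10 + ["b"] * 5 + ["c"] * 2 + ["d"]
tiers = assign_tiers(train, n_head=1, n_mid=1)
y_true = ["c", "c", "d", "d", "a"]
y_pred = ["a", "c", "a", "c", "b"]
shares = tail_error_destinations(y_true, y_pred, tiers)
print(shares)
```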

Three candidate stories

Each pattern implies a different fix. A → class weighting. B → better features for semantic neighbors. C → more data for those specific rare classes.

Same metric, three diagnoses. The data will tell you which.

Section 4 — Val-Set Reliability

Not every F1 number is equally trustworthy.

The problem

113 classes. 6,430 val examples.

If examples were evenly distributed, that'd be ~57 per class.

They aren't. Some classes have 100+ val examples. Some have a handful.

Some have exactly 1.

What a 1-example class does to your F1

Class has 1 val example. Model gets it right and predicts that class nowhere else → F1 = 1.0. Wrong → F1 = 0.

That one example decides recall outright. A single prediction can flip the class's F1 by 1.0.

Macro F1 is the average of all 113 per-class F1s.

Every 1-example class is a coin flip contributing equally to that average.

Why this matters for model comparison

Your encoder-vs-decoder macro F1 gap is ~0.03.

If 15% of your macro F1 is noisy single-example classes, what fraction of that 0.03 gap could just be noise from those specific classes?

You can't answer that from point estimates. You need confidence intervals.
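
A back-of-envelope check using only numbers from the slides. It assumes a singleton class's F1 flips fully between 0 and 1, which holds when the model makes no false-positive predictions for that class.

```python
n_classes = 113        # canonical classes after filtering
n_singletons = 17      # val classes with exactly one example
gap = 0.240 - 0.209    # decoder minus encoder macro F1

weight_per_class = 1 / n_classes                 # each class's share of macro F1
singleton_share = n_singletons * weight_per_class
flips_to_match_gap = gap / weight_per_class      # singleton flips that add up to the gap

print("weight per class:", round(weight_per_class, 4))
print("share from singleton classes:", round(singleton_share, 3))
print("singleton flips equal to the gap:", round(flips_to_match_gap, 1))
```

Under that assumption, a net swing of about four single-example classes is already as large as the whole encoder-versus-decoder gap.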

Bootstrap: measure the noise floor

Efron 1979 introduced the bootstrap for exactly this kind of problem — what's the sampling variability of a statistic computed from finite data?

Method: Resample the val set with replacement N times. Recompute macro F1 each time.

Now you have a distribution of macro F1 values — not a point estimate.

For two models:

  • Resample encoder val → distribution of encoder macro F1
  • Resample decoder val → distribution of decoder macro F1
  • Compute the difference distribution

If the CI for the difference excludes 0, the gap is detectable above the data-side resampling noise you've measured. Model-side noise is a separate question (next slide).

📄 readings/week4/efron1979_bootstrap.pdf
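
A minimal sketch of a bootstrap CI for the macro F1 difference. Assumptions: a paired resample, where each draw uses the same val rows for both models (the slide does not say whether the resamples are paired), a percentile interval, and macro F1 averaged over the classes present in each resample. It is plain Python for portability; on the real val set a vectorized version would be much faster.

```python
import random

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)

def bootstrap_diff_ci(y_true, pred_a, pred_b, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile CI for macro_f1(pred_b) - macro_f1(pred_a) under paired resampling."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # same rows for both models
        yt = [y_true[i] for i in idx]
        f1_a = macro_f1(yt, [pred_a[i] for i in idx])
        f1_b = macro_f1(yt, [pred_b[i] for i in idx])
        diffs.append(f1_b - f1_a)
    diffs.sort()
    k = int(n_resamples * alpha / 2)  # number of draws trimmed from each end
    return diffs[k], diffs[n_resamples - 1 - k]

# Toy data: model B is perfect, model A always predicts "a".
y_true = ["a", "b", "c"] * 20
pred_a = ["a"] * 60
pred_b = list(y_true)
lo, hi = bootstrap_diff_ci(y_true, pred_a, pred_b, n_resamples=200)
print("95% CI for the difference:", round(lo, 3), round(hi, 3))
```

Pairing matters because both models are scored on the same examples: a resample that happens to be easy is easy for both, so the difference is less noisy than two independent resamples would suggest.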

Reading a bootstrap CI

Two overlapping histograms: encoder and decoder macro F1 across 1,000 resamples.

Questions to ask:

  • How much do they overlap?
  • Where's the 95% CI of the difference?
  • Does the difference CI contain 0?

If yes → the gap is within resampling noise; you can't tell the two models apart.
If no → the gap is larger than resampling noise.

What wide CIs look like in practice

A 95% CI of [+0.010, +0.048] says:

  • The interval excludes 0, so the decoder's edge over the encoder IS larger than resampling noise
  • BUT the actual advantage is somewhere between tiny and substantial
  • Don't over-interpret the specific point estimate. It's one draw from this distribution.

In the homework you compute this CI. Write it in your memo. Defend the interpretation.

The noise you haven't measured

Bootstrap quantifies data-side variability — uncertainty from which val examples you happened to sample.

There's a second source: model-side variability — uncertainty from which random seed / checkpoint / training run you happened to pick.

Sälevä et al. 2025 show that accounting for only one source substantially underestimates real replication variability.

For the memo: if you retrained the decoder with a different seed, how would macro F1 move? You don't know. That's noise you haven't measured.

Practical takeaway: bootstrap CIs are a lower bound on your true uncertainty. The real "is the gap real?" range is at least this wide, probably wider.

📄 readings/week4/saleva2025_uncertainty_nlp.pdf

Section 5 — Bug Hunt Preview

Given a broken model, diagnose it from symptoms alone.

The scenario

Someone trained a decoder on this same data. Same base model. Same LoRA config. Same 3 epochs. Pushed it to HuggingFace Hub:

earino/ecbs5200-week4-flawed-checkpoint

It's broken. You don't know how.

You get to load it, run inference, and diagnose from symptoms alone.

This is a deliberately clean instructional case — real production failures are messier (distribution drift, partial OOV input, gradient updates gone wrong during continuous training) and rarely preserve everything else while scrambling one thing. The clean version teaches the diagnostic move. The messy version is what you face in a job.

Failure modes that all look "random" at aggregate

When accuracy and macro F1 are near random, several very different bugs fit:

Bug | Behavior
Never trained | All predictions are argmax of random init
Collapsed to majority | Always predicts the most common class
Output space scrambled | Coherent predictions, wrong targets
Wrong features | Responds to noise instead of signal

You distinguish these from confusion-matrix structure, not from accuracy.
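
One way to read confusion-matrix structure without plotting 113×113 cells is to reduce it to "most-predicted class per true class". The sketch below does that on toy data. The `describe` categories are rough heuristics invented for this example, not a classifier of bugs, and nothing here is specific to the lab checkpoint.

```python
from collections import Counter, defaultdict

def most_predicted_per_true(y_true, y_pred):
    """For each true class, the class the model predicts most often."""
    by_true = defaultdict(Counter)
    for t, p in zip(y_true, y_pred):
        by_true[t][p] += 1
    return {t: counts.most_common(1)[0][0] for t, counts in by_true.items()}

def describe(mapping):
    """Rough reading of the true-class → top-prediction map (heuristic only)."""
    targets = set(mapping.values())
    if all(t == p for t, p in mapping.items()):
        return "diagonal: predictions track the true class"
    if len(targets) == 1:
        return "collapsed: every true class maps to one prediction"
    if len(targets) == len(mapping):
        return "one-to-one but off-diagonal: consistent but wrong targets"
    return "mixed: no single clean pattern"

y_true = [0, 1, 2, 0, 1, 2]
collapsed = [0] * 6                        # always the same class
shifted = [(t + 1) % 3 for t in y_true]    # coherent, but pointing at wrong targets

for name, pred in [("healthy", y_true), ("collapsed", collapsed), ("shifted", shifted)]:
    print(name, "→", describe(most_predicted_per_true(y_true, pred)))
```

Both toy failures score zero or near-zero on the diagonal, yet their maps look nothing alike, which is the point of the table above.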

Diagnosing models you haven't inspected yet

In lab you diagnose ONE specific broken model. Someone told you it was broken.

In production: you don't know which models are broken. You need continuous monitoring that fires BEFORE you know to diagnose.

Nguyen et al. 2025 (NeurIPS): D3M — Disagreement-Driven Deterioration Monitoring.

  • Train multiple models (or keep snapshots at different checkpoints)
  • In deployment, track how often they disagree
  • Disagreement-rate spike → alert → investigate
  • No ground-truth labels required.

Today's lab is the diagnostic side. Production ML systems need detection — the alert that says something is wrong before you know what.

Week 6 preview: that disagreement signal is also the input to distillation.

📄 readings/week4/nguyen2025_d3m.pdf
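
A toy illustration of the bullet list above: track a disagreement rate per window and flag spikes. This is not the D3M algorithm from the paper. The function names, the fixed multiplier, and the window values are all hypothetical.

```python
def disagreement_rate(preds_a, preds_b):
    """Fraction of inputs on which two models predict different classes."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def windows_to_alert(window_rates, baseline, factor=2.0):
    """Indices of windows whose disagreement rate exceeds factor * baseline."""
    return [i for i, rate in enumerate(window_rates) if rate > factor * baseline]

# Baseline measured on a reference window; no ground-truth labels are needed.
reference = disagreement_rate(["a", "b", "a", "c"], ["a", "b", "b", "c"])

# Invented per-window rates observed later in deployment.
window_rates = [0.2, 0.3, 0.7]
alerts = windows_to_alert(window_rates, baseline=reference)
print("baseline:", reference, "| windows flagged:", alerts)
```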

What you'll do in lab

  1. Load the flawed checkpoint, run inference
  2. See the metrics (near-random)
  3. Build a diagnostic plot — "most-predicted class per true class"
  4. Recognize the pattern
  5. Apply a one-line transformation that recovers performance

If you pick the right transformation, the model's accuracy jumps from near-random back to ~57%.

Section 6 — Wrap + Week 5 Preview

Today's toolkit, restated

  1. Slice analysis — averages hide subgroups
  2. Calibration — accuracy ≠ trustworthiness
  3. Confusion-matrix patterns — wrong how, not just wrong
  4. Val-set reliability — small val = noisy F1
  5. Bug hunt — given a broken model, narrow the failure mode

Five tools. They compound. Each rules out a different class of failure.

What Week 5 covers

Val reliability made you look at how classes are DISTRIBUTED.

Next week: how the data pipeline PRODUCES that distribution.

  • Why 113 classes and not 153?
  • Why does a cluster of classes have so few val examples?
  • What decisions upstream of training created these shapes?

We'll open up the pipeline that made your dataset.

What Week 6 covers

You'll pull examples where encoder and decoder disagree.

Those disagreement examples have structure — some patterns favor the encoder, some the decoder.

Week 6: distillation. Use the decoder to generate labels for the disagreement set, train the encoder to match.

The diagnostic work you do this week is the INPUT to that.

Homework arc

Section | Work | Points
Slice analysis | All 6 axes; interpret 2 | 20
Calibration | Temperature scaling experiment | 20
Confusion matrix | Per-class drill-down on worst 3 classes/model | 25
Bootstrap CI | 1000 resamples, read the CIs | 20
Synthesis | Your Week 3 recommendation: has your confidence changed? | 15

~4.5 hours. Memo is embedded. Due: Wednesday morning.

Today — in one sentence

You entered with two models and a recommendation.

You leave with the tools to defend or update that recommendation — with evidence.

Welcome back. Three weeks in, and you've now fine-tuned two architectures, compared them on macro F1, and made a deployment recommendation. This week we ask a harder question: how do you know the recommendation was right? What if the aggregate numbers you based it on hid something important? By the end of today you'll have five new diagnostic tools. In the lab you'll apply them to the two models you already know. At least one of the findings is going to contradict a piece of folk wisdom you probably believe.

Here's where we ended Week 3. The encoder is 56.6 percent accurate, macro F1 0.209. The decoder — a 494 million parameter model trained with LoRA on the same data — is 57 percent accurate, macro F1 0.240. Decoder wins by about 0.03 macro F1 points. It's also about three times slower per example at inference, so there's a quality-versus-speed tradeoff. Your homework at the end of Week 3 was to use those numbers to make a deployment recommendation. Today's question is about that recommendation, not about the numbers that led to it.

For four weeks you have imported this function and moved on. Today, before you diagnose a model's errors, you need to see what the pipeline does to the raw CFPB data. Every number in your memo this week describes a model trained on the output of this function. Its decisions shape what your model could have learned. When we finish this section you will not be surprised by the confusion patterns in your lab. You will recognize some of them as ours, not the model's.

The raw dataset has one hundred fifty-three unique Issue strings. That number is too high. Same concept appears twice because the CFPB reworded its form in April twenty-seventeen. "Incorrect information on credit report" has seventy-six hundred seven examples. "Incorrect information on your report" has seventy-two hundred eight. Same meaning, new wording. The data captured both because complaints from before the change kept their original labels. There is a mapping file at data slash label underscore merge underscore mapping dot json that collapses thirty-six pairs. One hundred fifty-three raw labels down to one hundred twenty canonical ones after the merge. Three lines of code. Not a research project.

The mapping is real infrastructure. Real means imperfect. Three fingerprints. First. One line in the mapping file maps the label "Can't repay my loan" to itself. Identity. No-op. It sits there doing nothing. Someone added an entry, did not read the target they were pasting, and committed. Normal. Second. Open label underscore list dot json in the data directory. Five classes contain the word Advertising. "Advertising" with seven training examples. "Advertising and marketing" with one hundred forty-two. "Advertising and marketing, including promotional offers" with ninety-eight. These are pre- and post-twenty-seventeen form revision duplicates and we did not catch them. Our mapping is incomplete. Third. We merged "Struggling to pay your bill" into "Struggling to pay your loan." Bills and loans are different complaints. That was an over-merge. Some of the confusion you see in the mid tier today is going to be our fault, not the model's. When you look at confusion patterns in the lab, you will see these. Keep this slide in mind.

Two more pipeline decisions. First, we drop every class with fewer than five total examples. Seven classes go. Twenty-three examples lost. A threshold of five is a choice. It could have been three. It could have been ten. Nothing natural about it. One hundred twenty classes become one hundred thirteen. Second, we stratified-split with seed forty-two. That gives us the three split sizes you have been reading for four weeks: fifty-seven thousand eight hundred forty-six train, six thousand four hundred thirty val, twenty-one thousand four hundred thirty-two test. Look at the histogram on the right. Seventeen classes land at n equals one in the validation set. These are the support boundary of our filter. A single wrong prediction on one of these classes moves its F1 from one to zero. When you read your confusion matrix in the lab, the seventeen-class zone is the visible edge of our tail.

In industry you will inherit data pipelines like this one. Someone else will have decided which labels to merge, which to drop, and how to split. Your job is to reason about models trained on that data. A diagnostic question you cannot answer without pipeline knowledge is this: "is this confusion pattern real or is it ours?" You can only answer it if you know what the pipeline did. That is what the last few minutes were about. Now, with that in hand, the rest of today is about diagnosing models against this dataset.

You wrote a memo that said "deploy the decoder" or "deploy the encoder" or "it depends." Whatever you wrote, you had a reason. Now I want you to imagine someone at your company reads your memo and says "OK, but how confident are you?" And you don't get to just say "very." You need evidence. What evidence would you produce? That's the question we answer this week. D'Amour and 38 coauthors published a paper in JMLR that gives this failure mode a name: underspecification. The core empirical claim in their paper: an ML pipeline can return predictors with equivalent held-out metrics that behave very differently in deployment. That's almost a verbatim description of the problem you face. The paper is on your reading list.

D'Amour's paper doesn't just diagnose the problem of underspecification — it proposes the general solution. Stress testing. Design evaluations that are TARGETED at surfacing the places two predictors differ, not just generic aggregate metrics on more held-out data. This is useful conceptual framing for the week. Every diagnostic tool I'm about to teach you is a lightweight stress test — a way to probe a model in a more targeted way than aggregate accuracy. Slice analysis stresses on input subsets. Calibration stresses on the model's own confidence claims. Confusion patterns stress on the structure of mistakes. Noise floor stresses on resampling. And the bug hunt in the lab stresses on an already-broken model. When you finish this week, you have five tests you can run on any deployed model. That's the conceptual glue for all five tools.

Three ways that aggregate metrics lie to you. First, averages hide subgroups — a model that scores 80 percent on average might be 95 percent on head classes and 40 percent on tail classes. The average is useless for deciding whether to deploy. Second, accuracy doesn't measure confidence. A model can be right 80 percent of the time while acting like it's 99 percent sure. When the model says "probability 0.95," you should be able to trust that number — if you can't, downstream decisions are unreliable. Third, small gaps might not be real. When your decoder beats your encoder by 0.03 macro F1, is that an effect of the model, or did you just get lucky on the random split? Aggregate numbers are the START of diagnostic work, not the end.

Here's the plan. In the lecture I'll cover the conceptual tools — slice analysis, calibration, confusion-matrix patterns, noise floor. I'll also set up the bug hunt. Then in the lab, you apply everything to the two models you already know, and the lab ends with a broken model sitting on the Hub that you diagnose from its symptoms. The homework extends each tool with a quantitative deep-dive and builds up to a 5-section memo.

First tool. Slice analysis. The idea is simple. The execution matters.

Consider the decoder's 0.240 macro F1. That's computed over all 6,430 val examples. It tells you the model does reasonably well in aggregate. It doesn't tell you whether it does equally well on short complaints versus long ones, on redacted text versus clean text, on different product categories. An average is one dimension of evidence. You typically need 5 or 6 before you have a useful picture of model behavior. Oakden-Rayner and colleagues at Stanford gave this a name, hidden stratification, and studied it empirically in medical imaging. On several diagnostic tasks they found relative performance differences of over 20 percent on clinically important subgroups, gaps that were invisible in aggregate accuracy. It's the kind of gap you don't want to find in production.

A slice is just a subset of the validation set, defined by some property of the input. It's not a label group or a model behavior — it's a purely input-side partition. Short versus long complaints. Redacted versus clean. Emotional versus measured. You choose axes based on what you think might matter. Then for each slice, you compute macro F1 separately, and you look at the numbers side by side. The interesting findings are usually in the gaps — where the two models disagree the most, and where they disagree the least.

Six axes. Two in the lab for time budget, all six in the homework. Character length — bucket into four quartiles of increasing length. Redaction — binary, does the complaint contain "XXXX" markers. Token length — bucket by actual tokenizer output. Numeric content — dollar amounts and four-plus digit numbers. Opener — does the complaint literally start with the word "I." All-caps — fraction of words that are fully uppercase, then median-split. Each is a one-line regex or Python expression. The signal you get from them is not one-line. Some of them reveal significant performance differences between the two models. Some of them don't. And "some don't" is still evidence.

Here's what a slice table looks like. Three slices, two models, the gap between them on each slice. In this illustrative example, slice X shows a 0.06 gap — big. Slice Y shows a small gap. Slice Z shows no gap at all — the two models are tied. That "tied" slice is often where the real story is. It tells you: whatever advantage Model B has over Model A, it vanishes on this subset. That's a deployment constraint. If your production traffic looks like slice Z, the model choice doesn't matter. If it looks like slice X, it does. In the lab, you'll find one of your six axes has exactly this pattern — a slice where encoder and decoder are essentially tied.

A null result is still a result. When an axis shows no differential signal between two models, you learn that whatever that axis captures — say, personal framing like starting with "I" — isn't what's driving the encoder-decoder performance gap. That's useful. In the homework, one of your six axes is almost certainly going to be null, and I want you to report it in your memo, not drop it. When engineers silently drop null-result experiments, they cherry-pick. When they report them, they show what they ruled out. That's the sign of a trustworthy diagnostic write-up.

One honest acknowledgment before we move on. The six slice axes you're about to try in lab — they're hand-picked. I thought about the dataset and picked six properties I expected would show signal. For pedagogy that's the right move. In production you'd go broader. You'd test dozens of axes. More importantly, you'd want to discover axes you didn't think of — latent failure modes hidden in the embedding space, combinations of features that individually look fine. There's a whole automated slice discovery literature for exactly this. Chung and colleagues introduced SliceFinder and extended it to TKDE in 2020 — it searches predicate combinations automatically. Eyuboglu's Domino paper uses cross-modal embeddings to find underperforming slices without predefined labels. And Yu et al. at AAAI 2026 proposed a slice coherence metric that doesn't need predefined categories at all. For today's lecture and lab we're hand-picking because it's concrete and teachable. If you want to see the automation, these are your reading-list papers.

Second tool. Calibration. The question it answers matters in every deployment where confidence is used as a gate.

Calibration is about whether you can trust the number the model assigns to its own prediction. Suppose the model says "class X, probability 0.8." Across all the predictions where the model said 0.8, is it right about 80 percent of the time? If yes, calibrated. If the actual right-rate is 70 percent, overconfident — the model claims more certainty than it has. If it's 85 percent, underconfident. Overconfidence is empirically common for networks fine-tuned with cross-entropy loss on imbalanced classification — Guo 2017 documented this on several image benchmarks, and Chidambaram 2024 extended the analysis — but this is an observation, not a law. Different training regimes, label-smoothing, and model families can land in different calibration regimes. The right way to know is to measure on your specific model. Don't assume from training recipe.

ECE is the standard single-number calibration metric. You bin your predictions by the model's confidence: 15 equal-width bins by convention, each spanning about 0.067 in confidence. For each bin, you compute two things: the mean confidence the model claimed, and the empirical accuracy on those examples. If those match, the bin is calibrated. If they differ, there's a gap. ECE is the average of the absolute gaps across bins, with each bin weighted by the share of examples that landed in it. A well-calibrated model has ECE near zero. A typical fine-tuned neural network might have ECE around 0.05, meaning 5 percentage points off on average. If ECE hits 0.10, that's meaningful miscalibration. The model's claims about its confidence are off by 10 percentage points on average.
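The binning procedure just described fits in one short function. This is a minimal sketch, assuming `confidences` holds each example's top-class probability and `correct` holds whether the argmax prediction was right. The function name and the half-open bin convention are assumptions. The per-bin pairs it returns are the same points a reliability diagram plots.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin.

    Returns (ece, points), where points is a list of
    (mean_confidence, accuracy, count) for each non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        b = min(n_bins - 1, int(conf * n_bins))
        bins[b].append((conf, ok))
    n = len(confidences)
    ece, points = 0.0, []
    for members in bins:
        if not members:
            continue  # empty bins carry zero weight
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
        points.append((mean_conf, accuracy, len(members)))
    return ece, points
```

A bin whose accuracy sits below its mean confidence is an overconfident bin; above it, underconfident. ECE takes the absolute value, so the single number hides the direction and you need the per-bin points to see it.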

One caveat on ECE before we draw reliability diagrams. It's the standard metric, we'll use it all week, but graduate students should know its known issues. Chidambaram and colleagues published an ICML 2024 paper analyzing exactly how flawed ECE is. The theoretical issue: ECE is discontinuous in predictor space. Small changes to where you put the 15 bin boundaries can shift ECE meaningfully without the underlying calibration actually changing. That's a problem on paper. Their proposed fix is Logit-Smoothed ECE, which avoids the binning step. But here's the interesting empirical finding from their paper — on real pretrained image classifiers, binned ECE and LS-ECE track each other closely. The theoretical pathology is usually not the bottleneck in practice. So the operating lesson: use ECE, it's fine, don't treat it as ground truth. If you find two systems where ECE differs by a tiny amount, remember it's a noisy binned estimator — don't over-interpret. Chidambaram 2024 is the reading-list paper on this.

The reliability diagram is the visualization. For each of the 15 confidence bins, you plot mean confidence on the x-axis, empirical accuracy on the y-axis. Perfect calibration would put every point exactly on the diagonal: at 80 percent confidence, 80 percent accuracy. If the line drops below the diagonal, the model is overconfident, claiming more than it can deliver. If it rises above, it's underconfident. In the lab, you'll plot this diagram for both the encoder and the decoder, overlaid on the same axes. Before you look at the numbers, you'll write down a prediction about how the two curves compare. Many of those predictions will be wrong.

Here's the setup for the lab's calibration exercise. You have two models — the encoder at 149M params pretrained on 2T tokens, the decoder at 494M params pretrained on ~9x more data. I'm deliberately NOT asking you to "predict which is better-calibrated" as if there's folk wisdom to debunk. The Guo et al. 2017 paper we're already citing settled the scale-vs-calibration question in the OPPOSITE direction — they showed larger ResNets were worse calibrated than smaller ones on ImageNet. So there's no clean "bigger = better" prior for you to overturn. What I actually want you to predict is more open: do you expect these two specific models, after fine-tuning on this specific dataset, to end up with similar ECE, or meaningfully different ECE? If different — which direction? How much? And how much will temperature scaling help each? Those are genuinely open questions. Whatever you predict, write it down, measure it, compare.

Temperature scaling is a classic one-parameter fix for miscalibration. You fit a single scalar T on held-out data: specifically, the T that minimizes negative log-likelihood when you divide logits by it before softmax. T greater than 1 softens the distribution, which reduces confidence and fixes overconfidence. T less than 1 sharpens it, which fixes underconfidence. A fitted T equal to 1 means rescaling has nothing to improve. The critical property: because T is one positive scalar applied to every logit, dividing by it doesn't change which class has the largest logit. So argmax predictions are identical before and after scaling, which means accuracy and macro F1 are also identical. Temperature scaling improves CALIBRATION without changing CLASSIFICATION PERFORMANCE. The Guo et al. 2017 paper in your readings is the foundational reference, and it's worth reading the first three pages to see how they originally formulated this.
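Here is a minimal sketch of fitting T by minimizing negative log-likelihood. It assumes `logits` is a list of per-example logit rows and `labels` a list of integer class indices. It uses a golden-section search over log T instead of the gradient-based optimizer you might see elsewhere. The search bounds and the function names are arbitrary choices for illustration.

```python
import math

def nll(logits, labels, T):
    """Mean negative log-likelihood of the true labels with logits divided by T."""
    total = 0.0
    for row, y in zip(logits, labels):
        scaled = [z / T for z in row]
        m = max(scaled)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_z - scaled[y]
    return total / len(labels)

def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=60):
    """Golden-section search over log T for the T in [lo, hi] that minimizes NLL."""
    a, b = math.log(lo), math.log(hi)
    ratio = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if nll(logits, labels, math.exp(c)) < nll(logits, labels, math.exp(d)):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return math.exp((a + b) / 2)
```

The search relies on NLL having a single minimum in T, which holds because NLL is convex in 1/T. If the fitted value lands on `lo` or `hi`, widen the bounds before trusting it. Fit T on one half of val and apply it to the other half; fitting and evaluating on the same examples would flatter the result.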

In the homework, you run a controlled temperature-scaling experiment. You split val 50/50, fit T on one half, apply it to the other half, and measure ECE before and after. Standard protocol. You'll find that T scaling reduces ECE substantially — typically 70 to 80 percent on this task. You'll also confirm that macro F1 is unchanged post-scaling, which is a useful sanity check — if it DID change, you did something wrong. But the question I want you to answer in your memo is the last one. If the encoder was better-calibrated than the decoder before scaling, is that still true after scaling? Does the ranking change? That answer determines how you should rank models on calibration in production.

Third tool. Confusion-matrix reading. The shift from "what's in this specific cell" to "what pattern do the cells form."

In pre-work module 4, you built a 5-by-5 confusion matrix by hand. Each cell had a number, you could read all 25 of them, and you could draw conclusions about individual class confusions. With 113 classes, you have 12,769 cells. Most of them are zero or one. You can't read 12,769 cells, and even if you could, most of them would be noise. At this scale you have to look at STRUCTURE — patterns across groups of cells, not individual cell values.

The structure we care about for long-tail classification is class frequency. Group the 113 classes into three tiers by training frequency. The 20 most common classes are head — these are things like "incorrect information on your credit report" with over 5,000 training examples each. The next 40 are mid — still substantial, hundreds of examples. The bottom 53 are tail — some with as few as four training examples. Now you can ask questions at the tier level. Jin and colleagues published a paper at IJCAI in 2017 that made this formal — they call it "confusion community" detection, borrowing the community-detection metaphor from social networks. The key insight: in large confusion matrices, cells aren't independent — they cluster into groups of classes that systematically confuse each other. You look at group-level patterns, not individual cells. That's exactly the move from 5-by-5 in pre-work to 113-by-113 today. When a tail-class complaint is misclassified, where does the prediction land?
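The tiering step can be sketched as a rank-and-cut over training counts. The function name and the alphabetical tie-break are assumptions; the 20 and 40 cutoffs are the ones stated above.

```python
from collections import Counter

def assign_tiers(train_labels, n_head=20, n_mid=40):
    """Map each class to 'head', 'mid', or 'tail' by training-frequency rank."""
    counts = Counter(train_labels)
    # Most frequent first; ties broken by label text so the result is deterministic.
    ranked = sorted(counts, key=lambda c: (-counts[c], str(c)))
    tiers = {}
    for rank, cls in enumerate(ranked):
        if rank < n_head:
            tiers[cls] = "head"
        elif rank < n_head + n_mid:
            tiers[cls] = "mid"
        else:
            tiers[cls] = "tail"
    return tiers
```

Tiers are defined on training counts, not val counts, because the question is how much the model saw of each class while learning.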

Here's the headline question for the confusion-matrix work. When a tail-class complaint — one of your 53 rarest classes — is misclassified, where does the wrong prediction land? There are three possibilities. A: the model defaults to a head class, one of the top 20 most common labels. B: it goes to a mid-tier class. C: it gets confused with another tail class — a similar rare class. In the lab, you'll write down your prediction for the percentage breakdown before you compute anything, then you'll measure. I'm not going to tell you the answer. I'll tell you this: most people get one of the three percentages meaningfully wrong.

Here's what each of the three candidate stories would look like as a chart. Same x-axis — where tail-class errors land. Different patterns. Story A has a tall head bar: rare classes get swallowed by the majority. Story B has a tall mid bar: rare classes confused with semantic neighbors of different frequency. Story C has a tall tail bar: rare confused with other rare. The reason to predict before measuring is that each of the three answers implies a different FIX. If A dominates, class weighting helps — it directly counters majority-class dominance. If B dominates, the fix is better features that distinguish semantically-adjacent classes. If C dominates, you need better data for those specific classes. Different stories, different interventions. Knowing which story is true changes what engineering effort you spend. The measurement is diagnostic in the clinical sense — you don't prescribe until you know what you're treating.
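The measurement behind those three bars is a filtered count. This sketch assumes you already have a `tiers` dictionary mapping each class to "head", "mid", or "tail"; the function name is hypothetical.

```python
from collections import Counter

def tail_error_landing(y_true, y_pred, tiers):
    """Share of misclassified tail-class examples whose prediction lands in each tier."""
    landing = Counter()
    for t, p in zip(y_true, y_pred):
        # Keep only errors on examples whose TRUE class is in the tail.
        if tiers.get(t) == "tail" and t != p:
            landing[tiers.get(p, "unknown")] += 1
    total = sum(landing.values())
    return {tier: n / total for tier, n in landing.items()} if total else {}
```

Report the raw error count alongside the shares. A breakdown computed from a few dozen errors is itself noisy, which is the same concern as the val-set reliability discussion.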

Fourth tool. Val-set reliability — a specific kind of noise awareness.

113 classes, 6,430 val examples. If examples were evenly split across classes, each would have about 57 val examples, which is plenty for a stable F1 estimate. But class frequency is never evenly split. Some classes have hundreds of val examples. Some have a handful. And some have exactly one — where the per-class F1 is either 1.0 if that one example was predicted right, or 0 if it wasn't. A coin flip. That's a coin flip contributing to the macro F1 you're comparing across models.

Let me make this concrete. Consider a class with exactly one val example. If your model predicts that example correctly, and doesn't wrongly assign any other example to that class, per-class F1 for that class is 1.0, perfect. If your model predicts it wrong, per-class F1 is 0. There's no smooth gradient in between, because it all rides on one prediction. Now remember: macro F1 is the average of all 113 per-class F1s, weighted equally. So every single-example class is a coin flip that contributes 1/113th of your macro F1. If 10 classes are coin flips, about 9 percent of your macro F1 is noise. In the lab, you'll count exactly how many classes fall into each val-count bucket. I'll tell you now: the number of 1-example classes is bigger than most people predict.
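Counting classes per val-count bucket takes only a `Counter`. The bucket boundaries and function names below are assumptions for illustration; the lab may cut them differently.

```python
from collections import Counter

def support_bucket(n):
    """Label a class by how many val examples it has (boundaries are illustrative)."""
    if n == 1:
        return "1"
    if n <= 5:
        return "2-5"
    if n <= 20:
        return "6-20"
    return ">20"

def val_support_summary(val_labels):
    """Return ({bucket: number of classes}, share of classes with exactly one example)."""
    support = Counter(val_labels)  # class -> number of val examples
    if not support:
        return {}, 0.0
    buckets = Counter(support_bucket(n) for n in support.values())
    single_share = buckets["1"] / len(support)
    return dict(buckets), single_share
```

Because macro F1 weights every class equally, `single_share` is the fraction of the macro average that rides on one-example classes. The arithmetic in the text is 10/113, about 0.088; with the 17 single-example val classes listed on the pipeline slide earlier, it would be 17/113, about 0.15.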

Now connect this back to the 0.03 macro F1 gap between your encoder and decoder. That gap is the difference between two averaged-over-113-classes numbers. If a meaningful fraction of those 113 contributing scores are essentially coin flips, then some of that 0.03 gap might just be noise. How much? You can't tell from the point estimates alone. You need to quantify the noise floor. The tool for that is bootstrap confidence intervals — which is the next thing.

Bootstrap is beautifully simple. Efron introduced it in 1979 in the Annals of Statistics — nearly 50 years ago — and it's still the standard answer to "how uncertain is this estimate?" You treat your val set as a proxy for the distribution of possible val sets you could have drawn. You resample it with replacement — a new 6,430-example set drawn from your original 6,430 — and compute macro F1 on that resampled set. Do this a thousand times. Now you have a distribution of possible macro F1 values. For two models, you can compute the distribution of the DIFFERENCE — decoder macro F1 minus encoder macro F1 — across the same bootstrap iterations. If that difference distribution sits entirely above zero, the difference is robust to resampling noise. If it straddles zero, you can't distinguish the decoder's advantage from chance variation in which examples happened to be in val.

Here's what the bootstrap output looks like conceptually. Two histograms, one per model, each showing 1,000 resampled macro F1 values. The key reading question: do the histograms overlap substantially? If yes, the models are within noise of each other on this dataset. If the distributions are clearly separated, the gap is real. Even more useful: compute the distribution of the DIFFERENCE and the 95 percent confidence interval of that difference. If the CI is, say, [+0.01, +0.05] — cleanly above zero — you have evidence the decoder is actually better. If the CI is [-0.01, +0.07], the difference might be real, might be noise, you can't tell. The width of the CI tells you how confident you should be in a specific point estimate.
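The resampling procedure and the interval on the difference can be sketched as below. Assumptions: a hand-rolled `macro_f1` that averages over classes present in each resample, a percentile interval, and hypothetical function names. The key detail is that both models are scored on the same resampled indices in each iteration, so the difference is paired.

```python
import random
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in labels or predictions."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    classes = set(y_true) | set(y_pred)
    scores = [2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in classes]
    return sum(scores) / len(scores) if scores else 0.0

def bootstrap_gap_ci(y_true, pred_a, pred_b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for macro_f1(pred_b) - macro_f1(pred_a) under paired resampling."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        # Draw n indices with replacement and reuse them for both models.
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        f_a = macro_f1(yt, [pred_a[i] for i in idx])
        f_b = macro_f1(yt, [pred_b[i] for i in idx])
        diffs.append(f_b - f_a)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi, diffs
```

Pure Python is slow at 1,000 resamples of 6,430 examples; a vectorized version would be the practical choice, and this one is written for readability. If `lo` is above zero the gap survives resampling noise in val; if the interval straddles zero it does not.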

Wide confidence intervals are common in long-tail problems with noisy per-class F1. A CI like [+0.010, +0.048] is genuinely informative — it says the decoder IS better, the gap excludes zero, but the range of plausible advantages spans from barely-detectable to substantial. That's important for what you tell a stakeholder. You don't say "the decoder is 0.03 better." You say "the decoder is better — but the range of plausible advantages is pretty wide, so for deployment decisions we should also consider other factors." Your homework asks you to compute this CI on your specific val set and defend what conclusions you can and cannot draw. Being honest about uncertainty is a professional skill. Memo section 4 specifically grades this.

One more piece before we move off the noise floor. The bootstrap teaches you about ONE source of noise: data-side noise. It resamples the val set, which tells you what happens if you'd had a different draw of validation examples. There's a second source you haven't measured: model-side noise. If you retrained the decoder with a different random seed, a different initial learning rate, or one more epoch, the model would have somewhat different outputs and somewhat different per-class F1. That's uncertainty from the training process, not the data. Sälevä and colleagues at Brandeis published this at IJCNLP-AACL 2025. Their finding: accounting for only data-side noise, which is what bootstrap alone gives you, substantially underestimates total replication variability, sometimes by a lot. So for your Week 3 memo where you wrote "decoder is 0.03 better than encoder": if both had been trained with different seeds, would that 0.03 hold? You don't know. The width of your bootstrap CI is a LOWER bound on the true uncertainty. For the homework's memo section 4, the honest position is: bootstrap gives us data-side variability, model-side is out of scope for this week, but the true uncertainty is at least this wide and plausibly wider.

Fifth tool. Bug hunt. This is the applied exercise where everything else you learned comes together.

Here's the lab's final exercise. Someone — not you — trained a decoder using the same recipe you know. Same base model, same LoRA config, same 3 epochs, same data. They pushed it to HuggingFace at that repo name. Except it's broken. Your aggregate metrics will tell you something is very wrong. They won't tell you WHAT is wrong. Your job is to use the other four diagnostic tools to figure out what happened. Not fix it, necessarily — just diagnose.

The aggregate metric will tell you something is broken. It won't tell you what. All of these bugs (never-trained, collapsed-to-majority, scrambled output space, wrong features) produce macro F1 near the floor, around 0.003. Most also produce accuracy at or below chance, 1 over 113, roughly 0.9 percent; a collapsed-to-majority model is the exception on accuracy, because it scores whatever share of val the majority class holds. At the aggregate level, they're close to indistinguishable. But they have very different confusion-matrix SIGNATURES. A never-trained model produces noise. A collapsed model produces a single column. A scrambled output space produces a coherent but offset pattern. Different bug, different visual. Your job in the lab is to look at the confusion matrix, recognize the signature, and work out which failure mode this is.
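If you want numbers to go with the visual, a few summary statistics separate these signatures. This is a hypothetical helper, not part of the lab: the statistic names are invented here, and the thresholds for "high" and "low" are left to your judgment.

```python
from collections import Counter, defaultdict

def signature_stats(y_true, y_pred):
    """Return (accuracy, top_column_share, mean_row_purity) for a set of predictions."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Collapse check: share of ALL predictions that fall in the busiest column.
    top_column_share = Counter(y_pred).most_common(1)[0][1] / n
    # Coherence check: within each true class, how concentrated are the predictions?
    rows = defaultdict(Counter)
    for t, p in zip(y_true, y_pred):
        rows[t][p] += 1
    purity = [row.most_common(1)[0][1] / sum(row.values()) for row in rows.values()]
    mean_row_purity = sum(purity) / len(purity)
    return accuracy, top_column_share, mean_row_purity
```

Roughly: a top-column share near 1 looks like a single column; low accuracy with high row purity and a low top-column share looks like a coherent pattern in the wrong place; low purity everywhere looks like noise. Row purity is inflated for classes with very few val examples, since a one-example class always has purity 1, so read it alongside the plot.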

One last thing before we get to the lab setup. In the lab you'll diagnose a specific broken model — you know it's broken because the repo is literally named "flawed-checkpoint." Real production systems don't announce they're broken. They fail silently — gradually as the input distribution drifts, or suddenly when an upstream dependency changes. You need continuous monitoring that can detect deterioration without ground-truth labels, because by the time labels arrive the damage is done. Nguyen and colleagues published this at NeurIPS 2025. Their method is D3M — Disagreement-Driven Deterioration Monitoring. The idea: train multiple models or keep multiple checkpoints of the same model, deploy them in parallel, and monitor how often they disagree with each other. Disagreement rate rises when input distribution shifts away from training. When it spikes, something upstream changed — you alert and investigate. The beautiful part: no labels required. This is the production-monitoring companion to the diagnostic toolkit you're learning today. And the disagreement signal is the same thing we'll use for distillation in Week 6 — where you use disagreements between a big model and a small model as a training signal. Same primitive, two different uses. Keep it in the back of your head.
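The label-free signal itself is easy to compute. The sketch below is only the bare disagreement-rate idea over consecutive windows of traffic, not the D3M method from the paper; the window size, the baseline, and the alert factor are all made-up parameters.

```python
def disagreement_rates(preds_a, preds_b, window=500):
    """Fraction of examples on which two models disagree, per consecutive window."""
    rates = []
    for start in range(0, len(preds_a), window):
        a = preds_a[start:start + window]
        b = preds_b[start:start + window]
        rates.append(sum(x != y for x, y in zip(a, b)) / len(a))
    return rates

def alert_windows(rates, baseline, factor=2.0):
    """Indices of windows whose disagreement rate exceeds factor times the baseline."""
    return [i for i, r in enumerate(rates) if r > factor * baseline]
```

A sensible baseline is the disagreement rate between the same two models on your val set, where you know the input distribution matches training. Nothing here uses ground-truth labels, which is the point.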

The bug-hunt exercise walks you through a standard diagnostic progression. Load, observe metrics, build a diagnostic plot, recognize the pattern, apply a fix. The fix itself is a single line of Python. I'm not going to tell you what the line is. I'm not going to tell you what the bug is. The pattern you see in the diagnostic plot will point you to a specific one-line transformation on the predictions. If you pick the right one, accuracy jumps from 0.14 percent back to 57 percent — essentially recovering the full decoder. That gap — the difference between the broken output and the recovered output — is the bug. And the lesson: the model wasn't "broken" in the weights. It was broken in how its outputs were INTERPRETED.

Wrapping up.

To wrap up the lecture: five tools, five distinct failure-mode detectors. Slice analysis catches subgroup issues that averages hide. Calibration catches unreliable confidence. Confusion-matrix patterns catch structural biases the accuracy number can't reveal. Val-set reliability catches noise in your evaluation itself. Bug hunt composes all four to diagnose a specific broken model. When you're debugging a deployed ML system — and you will — these are the tools that produce evidence-based answers instead of hand-waving.

Quick preview of Week 5. This week you encountered the downstream effects of data-pipeline decisions — specifically, the val-count distribution. Next week we open up the pipeline itself. Why 113 classes? Your raw CFPB data had 153 unique "Issue" labels. Why did we merge them down? Why the specific minimum-count filter? Why is the val split stratified the way it is? Most of what feels like "the model is behaving oddly" actually traces to choices made before any training ran. We'll rebuild the pipeline end to end — and you'll see exactly where the n-equals-one classes you count in the lab came from.

And Week 6 preview: distillation. This week in the homework, you pull disagreement examples — cases where one model is right and the other is wrong. You tag them, look for patterns. That disagreement set is exactly the dataset distillation operates on. In Week 6 you'll use the better-performing decoder as a "teacher" to label the disagreement set, then fine-tune the smaller encoder to match. This is how you get decoder-quality decisions with encoder-speed inference at deploy time. The diagnostic analysis you do this week is the INPUT to that. If your tagging work identifies real patterns, Week 6 has a clean signal to work from. If your tagging is sloppy, distillation has nothing to learn.
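Pulling the disagreement examples is a single pass over the predictions. A minimal sketch, assuming three aligned lists of labels and predictions; the function name is hypothetical.

```python
def disagreement_set(y_true, pred_enc, pred_dec):
    """Indices where exactly one model is correct, split by which model won."""
    only_encoder, only_decoder = [], []
    for i, (t, e, d) in enumerate(zip(y_true, pred_enc, pred_dec)):
        if e == t and d != t:
            only_encoder.append(i)  # encoder right, decoder wrong
        elif d == t and e != t:
            only_decoder.append(i)  # decoder right, encoder wrong
    return only_encoder, only_decoder
```

Cases where both models are wrong are deliberately left out. They can disagree with each other too, but neither prediction is a usable teaching signal for the other.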

The homework has five sections mapped to the memo rubric's five criteria. Section 1 extends your slice analysis to all six axes and asks you to interpret two of them in writing. Section 2 runs the full temperature-scaling experiment. Section 3 is the per-class drill-down — for the worst three classes of each model, what are they getting confused with? Section 4 is the bootstrap CI. Section 5 is the synthesis question — given everything you now know, is your Week 3 recommendation still defensible? Has your confidence gone up, down, or shifted? That section carries 15 points, the smallest individual weight, but it's the one that distinguishes "did the diagnostics" from "understood the diagnostics." Plan about four and a half hours total. Memo sections are embedded in the homework notebook.

Last slide. You came in today with two models and a recommendation. By the end of the lab and homework, you will either have evidence that defends the recommendation you made, or evidence that updates it. Either way, the answer you write in memo section 5 is based on diagnostic work, not intuition. That's the whole point of the week. Let's go do the quiz, then we'll see you in the lab at 3:30.