Error Diagnosis

Model	Accuracy	Macro F1
Encoder LoRA (ModernBERT 149M)	56.6%	0.209
Decoder LoRA (Qwen 0.5B, 494M)	57.0%	0.240

Label	Raw count
Incorrect information on credit report	7,607
Incorrect information on your report	7,208

Class	Train
Advertising	7
Advertising and marketing	142
Advertising and marketing, including promotional offers	98
Advertising, marketing or disclosures	7
Confusing or misleading advertising or marketing	18

Axis	Split
Character length	Quartile buckets (q1 short → q4 long)
Redaction	XXXX marker present or not
Token length	1–30 / 31–80 / 81–127 / truncated at 128
Numeric content	Dollar sign or 4+ digit number
Opener	Starts with "I" or not
All-caps usage	Above or below median rate

Slice	n	Model A macro F1	Model B macro F1	Diff (B - A)
slice X	1,500	0.20	0.26	+0.06
slice Y	1,500	0.22	0.24	+0.02
slice Z	1,500	0.28	0.28	+0.00

Tier	Count	Training frequency rank
Head	20	Top 20 (most common)
Mid	40	Next 40
Tail	53	Bottom 53 (least common)

Bug	Behavior
Never trained	All predictions are argmax of random init
Collapsed to majority	Always predicts the most common class
Output space scrambled	Coherent predictions, wrong targets
Wrong features	Responds to noise instead of signal

Section	Work	Points
Slice analysis	All 6 axes; interpret 2	20
Calibration	Temperature scaling experiment	20
Confusion matrix	Per-class drill-down on worst 3 classes/model	25
Bootstrap CI	1000 resamples, read the CIs	20
Synthesis	Your Week 3 recommendation — has your confidence changed?	15

Welcome back. Three weeks in, and you've now fine-tuned two architectures, compared them on macro F1, and made a deployment recommendation. This week we ask a harder question: how do you know the recommendation was right? What if the aggregate numbers you based it on hid something important? By the end of today you'll have five new diagnostic tools. In the lab you'll apply them to the two models you already know. At least one of the findings is going to contradict a piece of folk wisdom you probably believe.

Here's where we ended Week 3. The encoder is 56.6 percent accurate, macro F1 0.209. The decoder — a 494 million parameter model trained with LoRA on the same data — is 57 percent accurate, macro F1 0.240. Decoder wins by about 0.03 macro F1 points. It's also about three times slower per example at inference, so there's a quality-versus-speed tradeoff. Your homework at the end of Week 3 was to use those numbers to make a deployment recommendation. Today's question is about that recommendation, not about the numbers that led to it.

For four weeks you have imported this function and moved on. Today, before you diagnose a model's errors, you need to see what the pipeline does to the raw CFPB data. Every number in your memo this week describes a model trained on the output of this function. Its decisions shape what your model could have learned. When we finish this section you will not be surprised by the confusion patterns in your lab. You will recognize some of them as ours, not the model's.

The raw dataset has one hundred fifty-three unique Issue strings. That number is too high. Same concept appears twice because the CFPB reworded its form in April twenty-seventeen. "Incorrect information on credit report" has seventy-six hundred seven examples. "Incorrect information on your report" has seventy-two hundred eight. Same meaning, new wording. The data captured both because complaints from before the change kept their original labels. There is a mapping file at data slash label underscore merge underscore mapping dot json that collapses thirty-six pairs. One hundred fifty-three raw labels down to one hundred twenty canonical ones after the merge. Three lines of code. Not a research project.
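A minimal sketch of that merge step. The layout of the mapping file is not shown in the lecture, so the dict below is a hypothetical stand-in that assumes a flat old-label to canonical-label object; which wording counts as canonical is also an assumption.

```python
# Hypothetical stand-in for data/label_merge_mapping.json, assumed to be a
# flat {old_label: canonical_label} object. The real file may differ.
merge_mapping = {
    "Incorrect information on credit report": "Incorrect information on your report",
}

def canonicalize(label, mapping):
    """Return the canonical label; labels without an entry pass through unchanged."""
    return mapping.get(label, label)

raw_labels = [
    "Incorrect information on credit report",
    "Incorrect information on your report",
    "Advertising",
]
canonical = [canonicalize(lbl, merge_mapping) for lbl in raw_labels]
print(len(set(raw_labels)), "raw labels ->", len(set(canonical)), "canonical labels")
```

With the real file you would load the dict with json.load instead of writing it inline; the lookup itself stays the same.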

The mapping is real infrastructure. Real means imperfect. Three fingerprints. First. One line in the mapping file maps the label "Can't repay my loan" to itself. Identity. No-op. It sits there doing nothing. Someone added an entry, did not read the target they were pasting, and committed. Normal. Second. Open label underscore list dot json in the data directory. Five classes contain the word Advertising. "Advertising" with seven training examples. "Advertising and marketing" with one hundred forty-two. "Advertising and marketing, including promotional offers" with ninety-eight. These are pre- and post-twenty-seventeen form revision duplicates and we did not catch them. Our mapping is incomplete. Third. We merged "Struggling to pay your bill" into "Struggling to pay your loan." Bills and loans are different complaints. That was an over-merge. Some of the confusion you see in the mid tier today is going to be our fault, not the model's. When you look at confusion patterns in the lab, you will see these. Keep this slide in mind.
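Two of those fingerprints can be found mechanically. The sketch below assumes the same flat mapping layout as before and uses a hypothetical helper name, audit_mapping; it lists identity entries and labels that share a keyword. The over-merge, by contrast, needs a human to read the pair.

```python
def audit_mapping(mapping, all_labels, keyword):
    """Return (identity_entries, labels_sharing_keyword) for manual review."""
    # An entry that maps a label to itself does nothing.
    identity = [src for src, dst in mapping.items() if src == dst]
    # Labels sharing a keyword are candidates for a missed merge, not proof of one.
    sharing = sorted(lbl for lbl in all_labels if keyword.lower() in lbl.lower())
    return identity, sharing

mapping = {
    "Can't repay my loan": "Can't repay my loan",  # the no-op entry
    "Struggling to pay your bill": "Struggling to pay your loan",
}
labels = [
    "Advertising",
    "Advertising and marketing",
    "Advertising and marketing, including promotional offers",
    "Advertising, marketing or disclosures",
    "Confusing or misleading advertising or marketing",
    "Struggling to pay your loan",
]
identity, sharing = audit_mapping(mapping, labels, "advertising")
print("identity entries:", identity)
print("labels sharing the keyword:", len(sharing))
```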

Two more pipeline decisions. First, we drop every class with fewer than five total examples. Seven classes go. Twenty-three examples lost. A threshold of five is a choice. It could have been three. It could have been ten. Nothing natural about it. One hundred twenty classes become one hundred thirteen. Second, we stratified-split with seed forty-two. That gives us the three split sizes you have been reading for four weeks: fifty-seven thousand eight hundred forty-six train, six thousand four hundred thirty val, twenty-one thousand four hundred thirty-two test. Look at the histogram on the right. Seventeen classes land at n equals one in the validation set. These are the support boundary of our filter. A single wrong prediction on one of these classes moves its F1 from one to zero. When you read your confusion matrix in the lab, the seventeen-class zone is the visible edge of our tail.
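Both decisions are easy to reproduce on toy data. This is a sketch under stated assumptions: the function names are hypothetical, the threshold is the five from above, and the split itself is not reproduced (the toy validation list is written by hand).

```python
from collections import Counter

def drop_rare_classes(labels, min_count=5):
    """Keep only examples whose class has at least min_count total examples."""
    counts = Counter(labels)
    return [lbl for lbl in labels if counts[lbl] >= min_count]

def support_histogram(val_labels):
    """Map support size -> number of classes with that many validation examples."""
    per_class = Counter(val_labels)
    return Counter(per_class.values())

# Toy data: class "c" has only 4 examples, so the filter removes it.
toy = ["a"] * 10 + ["b"] * 5 + ["c"] * 4
kept = drop_rare_classes(toy, min_count=5)

toy_val = ["a", "a", "b"]  # hand-written stand-in for a validation split
hist = support_histogram(toy_val)
print(sorted(set(kept)), dict(hist))
```

On the real validation labels, hist[1] is the count of classes sitting at n equals one.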

In industry you will inherit data pipelines like this one. Someone else will have decided which labels to merge, which to drop, and how to split. Your job is to reason about models trained on that data. A diagnostic question you cannot answer without pipeline knowledge is this: "is this confusion pattern real or is it ours?" You can only answer it if you know what the pipeline did. That is what the last few minutes were about. Now, with that in hand, the rest of today is about diagnosing models against this dataset.

You wrote a memo that said "deploy the decoder" or "deploy the encoder" or "it depends." Whatever you wrote, you had a reason. Now I want you to imagine someone at your company reads your memo and says "OK, but how confident are you?" And you don't get to just say "very." You need evidence. What evidence would you produce? That's the question we answer this week. D'Amour and 38 coauthors published a paper in JMLR that gives this failure mode a name: underspecification. The core empirical claim in their paper: an ML pipeline can return predictors with equivalent held-out metrics that behave very differently in deployment. That's almost a verbatim description of the problem you face. The paper is on your reading list.

D'Amour's paper doesn't just diagnose the problem of underspecification — it proposes the general solution. Stress testing. Design evaluations that are TARGETED at surfacing the places two predictors differ, not just generic aggregate metrics on more held-out data. This is useful conceptual framing for the week. Every diagnostic tool I'm about to teach you is a lightweight stress test — a way to probe a model in a more targeted way than aggregate accuracy. Slice analysis stresses on input subsets. Calibration stresses on the model's own confidence claims. Confusion patterns stress on the structure of mistakes. Noise floor stresses on resampling. And the bug hunt in the lab stresses on an already-broken model. When you finish this week, you have five tests you can run on any deployed model. That's the conceptual glue for all five tools.

Three ways that aggregate metrics lie to you. First, averages hide subgroups — a model that scores 80 percent on average might be 95 percent on head classes and 40 percent on tail classes. The average is useless for deciding whether to deploy. Second, accuracy doesn't measure confidence. A model can be right 80 percent of the time while acting like it's 99 percent sure. When the model says "probability 0.95," you should be able to trust that number — if you can't, downstream decisions are unreliable. Third, small gaps might not be real. When your decoder beats your encoder by 0.03 macro F1, is that an effect of the model, or did you just get lucky on the random split? Aggregate numbers are the START of diagnostic work, not the end.

Here's the plan. In the lecture I'll cover the conceptual tools — slice analysis, calibration, confusion-matrix patterns, noise floor. I'll also set up the bug hunt. Then in the lab, you apply everything to the two models you already know, and the lab ends with a broken model sitting on the Hub that you diagnose from its symptoms. The homework extends each tool with a quantitative deep-dive and builds up to a 5-section memo.

First tool. Slice analysis. The idea is simple. The execution matters.

Consider the decoder's 0.240 macro F1. That's averaged over all 6,430 val examples. It tells you the model does reasonably well in aggregate. It doesn't tell you whether it does equally well on short complaints versus long ones, on redacted text versus clean text, on different product categories. An average is one dimension of evidence. You typically need 5 or 6 before you have a useful picture of model behavior. Oakden-Rayner and colleagues at Stanford gave this a name — hidden stratification — and studied it empirically in medical imaging. On several diagnostic tasks they found subgroup performance gaps exceeding 20 percentage points that were invisible in aggregate accuracy. It's the kind of gap you don't want to find in production.

A slice is just a subset of the validation set, defined by some property of the input. It's not a label group or a model behavior — it's a purely input-side partition. Short versus long complaints. Redacted versus clean. Emotional versus measured. You choose axes based on what you think might matter. Then for each slice, you compute macro F1 separately, and you look at the numbers side by side. The interesting findings are usually in the gaps — where the two models disagree the most, and where they disagree the least.

Six axes. Two in the lab for time budget, all six in the homework. Character length — bucket into four quartiles of increasing length. Redaction — binary, does the complaint contain "XXXX" markers. Token length — bucket by actual tokenizer output. Numeric content — dollar amounts and four-plus digit numbers. Opener — does the complaint literally start with the word "I." All-caps — fraction of words that are fully uppercase, then median-split. Each is a one-line regex or Python expression. The signal you get from them is not one-line. Some of them reveal significant performance differences between the two models. Some of them don't. And "some don't" is still evidence.
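Here is one way five of the six axes could be written. Treat it as a sketch: the exact regexes used in the lab notebook are not shown here, so these are illustrative choices. Token length is left out because it needs the model's own tokenizer.

```python
import re
import statistics

def has_redaction(text):
    return "XXXX" in text

def has_numeric(text):
    # A dollar sign, or a run of four or more digits.
    return bool(re.search(r"\$|\d{4,}", text))

def starts_with_i(text):
    # The standalone word "I" (also "I'm", "I've"), but not "It" or "In".
    return bool(re.match(r"\s*I\b", text))

def caps_rate(text):
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    # Require 2+ letters so the pronoun "I" is not counted as shouting.
    # Note: the redaction marker XXXX does count as an all-caps word here.
    return sum(len(w) > 1 and w.isupper() for w in words) / len(words)

def length_quartile(text, cuts):
    """cuts holds the three quartile cut points of character length."""
    return "q" + str(1 + sum(len(text) > c for c in cuts))

texts = [
    "I was charged $35 twice.",
    "XXXX bank REFUSED to respond to my dispute about account 123456.",
    "They never answered.",
    "It took months and I am STILL waiting for a corrected report from the bureau.",
]
cuts = statistics.quantiles([len(t) for t in texts], n=4)
median_caps = statistics.median(caps_rate(t) for t in texts)
for t in texts:
    print(length_quartile(t, cuts), has_redaction(t), has_numeric(t),
          starts_with_i(t), caps_rate(t) > median_caps)
```

The quartile cuts and the caps median are computed from the data you are slicing, so compute them once on the full validation set and reuse them for both models.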

Here's what a slice table looks like. Three slices, two models, the gap between them on each slice. In this illustrative example, slice X shows a 0.06 gap — big. Slice Y shows a small gap. Slice Z shows no gap at all — the two models are tied. That "tied" slice is often where the real story is. It tells you: whatever advantage Model B has over Model A, it vanishes on this subset. That's a deployment constraint. If your production traffic looks like slice Z, the model choice doesn't matter. If it looks like slice X, it does. In the lab, you'll find one of your six axes has exactly this pattern — a slice where encoder and decoder are essentially tied.
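A slice table like that takes only a few lines to build. The sketch below uses a hand-rolled macro F1 that averages over the classes present in each slice's true labels; that is an assumption, and a library implementation may average over a different class set. The toy labels are invented to produce one gap and one tie.

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    total = 0.0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom) if denom else 0.0
    return total / len(classes)

def slice_table(y_true, pred_a, pred_b, slice_keys):
    """Map slice name -> (n, macro F1 of A, macro F1 of B, B minus A)."""
    groups = defaultdict(list)
    for i, key in enumerate(slice_keys):
        groups[key].append(i)
    table = {}
    for key, idx in groups.items():
        t = [y_true[i] for i in idx]
        f1_a = macro_f1(t, [pred_a[i] for i in idx])
        f1_b = macro_f1(t, [pred_b[i] for i in idx])
        table[key] = (len(idx), f1_a, f1_b, f1_b - f1_a)
    return table

# Toy data: model B beats model A on "short" and ties it on "long".
y_true = [0, 0, 1, 1, 0, 1]
pred_a = [0, 1, 1, 1, 0, 0]
pred_b = [0, 0, 1, 1, 0, 0]
keys = ["short", "short", "short", "long", "long", "long"]
table = slice_table(y_true, pred_a, pred_b, keys)
for name, (n, f1_a, f1_b, diff) in table.items():
    print(f"{name}\t{n}\t{f1_a:.2f}\t{f1_b:.2f}\t{diff:+.2f}")
```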

A null result is still a result. When an axis shows no differential signal between two models, you learn that whatever that axis captures — say, personal framing like starting with "I" — isn't what's driving the encoder-decoder performance gap. That's useful. In the homework, one of your six axes is almost certainly going to be null, and I want you to report it in your memo, not drop it. When engineers silently drop null-result experiments, they cherry-pick. When they report them, they show what they ruled out. That's the sign of a trustworthy diagnostic write-up.

One honest acknowledgment before we move on. The six slice axes you're about to try in lab — they're hand-picked. I thought about the dataset and picked six properties I expected would show signal. For pedagogy that's the right move. In production you'd go broader. You'd test dozens of axes. More importantly, you'd want to discover axes you didn't think of — latent failure modes hidden in the embedding space, combinations of features that individually look fine. There's a whole automated slice discovery literature for exactly this. Chung and colleagues introduced SliceFinder and extended it to TKDE in 2020 — it searches predicate combinations automatically. Eyuboglu's Domino paper uses cross-modal embeddings to find underperforming slices without predefined labels. And Yu et al. at AAAI 2026 proposed a slice coherence metric that doesn't need predefined categories at all. For today's lecture and lab we're hand-picking because it's concrete and teachable. If you want to see the automation, these are your reading-list papers.

Second tool. Calibration. The question it answers matters in every deployment where confidence is used as a gate.

Calibration is about whether you can trust the number the model assigns to its own prediction. Suppose the model says "class X, probability 0.8." Across all the predictions where the model said 0.8, is it right about 80 percent of the time? If yes, calibrated. If the actual right-rate is 70 percent, overconfident — the model claims more certainty than it has. If it's 85 percent, underconfident. Overconfidence is empirically common for networks fine-tuned with cross-entropy loss on imbalanced classification — Guo 2017 documented this on several image benchmarks, and Chidambaram 2024 extended the analysis — but this is an observation, not a law. Different training regimes, label-smoothing, and model families can land in different calibration regimes. The right way to know is to measure on your specific model. Don't assume from training recipe.

ECE is the standard single-number calibration metric. You bin your predictions by the model's confidence — 15 bins by convention, each spanning 0.067 in confidence. For each bin, you compute two things: the mean confidence the model claimed, and the empirical accuracy on those examples. If those match, the bin is calibrated. If they differ, there's a gap. The weighted average of gaps across all bins is ECE. A well-calibrated model has ECE near zero. A typical fine-tuned neural network might have ECE around 0.05 — 5 percentage points off on average. If ECE hits 0.10, that's meaningful miscalibration. The model's claims about its confidence are off by 10 percentage points.
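The computation just described fits in one short function. This is a minimal sketch with equal-width bins; the bin edges here are closed on the left, which is one convention among several.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bin k covers [k/n_bins, (k+1)/n_bins); confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry zero weight
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece

# Ten predictions at confidence 0.9, of which seven are right: a 0.2 gap.
overconfident = expected_calibration_error([0.9] * 10, [1] * 7 + [0] * 3)
# Ten predictions at confidence 0.8, of which eight are right: no gap.
calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
print(round(overconfident, 3), round(calibrated, 3))
```

The per-bin mean_conf and accuracy pairs are exactly the points a reliability diagram plots.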

One caveat on ECE before we draw reliability diagrams. It's the standard metric, we'll use it all week, but graduate students should know its known issues. Chidambaram and colleagues published an ICML 2024 paper analyzing exactly how flawed ECE is. The theoretical issue: ECE is discontinuous in predictor space. Small changes to where you put the 15 bin boundaries can shift ECE meaningfully without the underlying calibration actually changing. That's a problem on paper. Their proposed fix is Logit-Smoothed ECE, which avoids the binning step. But here's the interesting empirical finding from their paper — on real pretrained image classifiers, binned ECE and LS-ECE track each other closely. The theoretical pathology is usually not the bottleneck in practice. So the operating lesson: use ECE, it's fine, don't treat it as ground truth. If you find two systems where ECE differs by a tiny amount, remember it's a noisy binned estimator — don't over-interpret. Chidambaram 2024 is the reading-list paper on this.

The reliability diagram is the visualization. For each of the 15 confidence bins, you plot mean confidence on the x-axis and empirical accuracy on the y-axis. Perfect calibration would put every point exactly on the diagonal: at 80 percent confidence, 80 percent accuracy. If the line drops below the diagonal, the model is overconfident; it's claiming more than it can deliver. If it rises above, it's underconfident. In the lab, you'll plot this diagram for both the encoder and the decoder, overlaid on the same axes. Before you look at the numbers, you'll write down a prediction about how the two models' calibration compares, then check it against the measurement.

Here's the setup for the lab's calibration exercise. You have two models: the encoder at 149M params pretrained on 2T tokens, and the decoder at 494M params pretrained on ~9x more data. I'm deliberately NOT asking you to "predict which is better-calibrated" as if there's folk wisdom to debunk. The Guo et al. 2017 paper we're already citing looked at the scale-vs-calibration question and found the OPPOSITE direction: in their CIFAR-100 experiments, deeper and wider ResNets were worse calibrated than smaller ones. So there's no clean "bigger = better" prior for you to overturn. What I actually want you to predict is more open: do you expect these two specific models, after fine-tuning on this specific dataset, to end up with similar ECE, or meaningfully different ECE? If different, in which direction? By how much? And how much will temperature scaling help each? Those are genuinely open questions. Whatever you predict, write it down, measure it, compare.

Temperature scaling is a classic one-parameter fix for miscalibration. You fit a single scalar T on held-out data: specifically, the T that minimizes negative log-likelihood when you divide logits by it before softmax. T greater than 1 softens the distribution, reducing confidence, which fixes overconfidence. T less than 1 sharpens it, which fixes underconfidence. A fitted T close to 1 means a single temperature has nothing to correct. The critical property: because T is one positive scalar applied to every logit, dividing by it doesn't change which class has the largest logit. So argmax predictions are identical before and after scaling, which means accuracy and macro F1 are also identical. Temperature scaling improves CALIBRATION without changing CLASSIFICATION PERFORMANCE. The Guo et al. 2017 paper in your readings is the foundational reference; it's worth reading the first three pages to see how they originally formulated this.
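To see both properties at once, here is a small sketch on invented logits. A coarse grid search stands in for a proper optimizer, and the grid range of 0.05 to 10 is an arbitrary assumption; the point is only that the fitted T lands above 1 for an overconfident model and that the argmax does not move.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[y])
    return total / len(labels)

def fit_temperature(logit_rows, labels):
    candidates = [0.05 * k for k in range(1, 201)]  # 0.05 .. 10.0
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

# Invented held-out logits: about 96 percent confident, right 3 times out of 4.
logit_rows = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
labels = [0, 0, 1, 1]

T = fit_temperature(logit_rows, labels)
before = [argmax(softmax(row)) for row in logit_rows]
after = [argmax(softmax(row, T)) for row in logit_rows]
print("fitted T:", round(T, 2), "| predictions unchanged:", before == after)
```

In the homework protocol you would fit T on one half of val and evaluate on the other half, rather than fitting and checking on the same rows as this toy does.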

In the homework, you run a controlled temperature-scaling experiment. You split val 50/50, fit T on one half, apply it to the other half, and measure ECE before and after. Standard protocol. You'll find that T scaling reduces ECE substantially — typically 70 to 80 percent on this task. You'll also confirm that macro F1 is unchanged post-scaling, which is a useful sanity check — if it DID change, you did something wrong. But the question I want you to answer in your memo is the last one. If the encoder was better-calibrated than the decoder before scaling, is that still true after scaling? Does the ranking change? That answer determines how you should rank models on calibration in production.

Third tool. Confusion-matrix reading. The shift from "what's in this specific cell" to "what pattern do the cells form."

In pre-work module 4, you built a 5-by-5 confusion matrix by hand. Each cell had a number, you could read all 25 of them, and you could draw conclusions about individual class confusions. With 113 classes, you have 12,769 cells. Most of them are zero or one. You can't read 12,769 cells, and even if you could, most of them would be noise. At this scale you have to look at STRUCTURE — patterns across groups of cells, not individual cell values.

The structure we care about for long-tail classification is class frequency. Group the 113 classes into three tiers by training frequency. The 20 most common classes are head — these are things like "incorrect information on your credit report" with over 5,000 training examples each. The next 40 are mid — still substantial, hundreds of examples. The bottom 53 are tail — some with as few as four training examples. Now you can ask questions at the tier level. Jin and colleagues published a paper at IJCAI in 2017 that made this formal — they call it "confusion community" detection, borrowing the community-detection metaphor from social networks. The key insight: in large confusion matrices, cells aren't independent — they cluster into groups of classes that systematically confuse each other. You look at group-level patterns, not individual cells. That's exactly the move from 5-by-5 in pre-work to 113-by-113 today. When a tail-class complaint is misclassified, where does the prediction land?

Here's the headline question for the confusion-matrix work. When a tail-class complaint — one of your 53 rarest classes — is misclassified, where does the wrong prediction land? There are three possibilities. A: the model defaults to a head class, one of the top 20 most common labels. B: it goes to a mid-tier class. C: it gets confused with another tail class — a similar rare class. In the lab, you'll write down your prediction for the percentage breakdown before you compute anything, then you'll measure. I'm not going to tell you the answer. I'll tell you this: most people get one of the three percentages meaningfully wrong.

Here's what each of the three candidate stories would look like as a chart. Same x-axis — where tail-class errors land. Different patterns. Story A has a tall head bar: rare classes get swallowed by the majority. Story B has a tall mid bar: rare classes confused with semantic neighbors of different frequency. Story C has a tall tail bar: rare confused with other rare. The reason to predict before measuring is that each of the three answers implies a different FIX. If A dominates, class weighting helps — it directly counters majority-class dominance. If B dominates, the fix is better features that distinguish semantically-adjacent classes. If C dominates, you need better data for those specific classes. Different stories, different interventions. Knowing which story is true changes what engineering effort you spend. The measurement is diagnostic in the clinical sense — you don't prescribe until you know what you're treating.
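The measurement behind those three bars is a tier lookup and a count. A sketch, assuming tiers are assigned by rank of training frequency with the 20 and 40 cutoffs from the tier table; the function names are hypothetical and the toy uses smaller cutoffs so it fits on a slide.

```python
from collections import Counter

def assign_tiers(train_labels, n_head=20, n_mid=40):
    """Rank classes by training frequency: first n_head are head, next n_mid are mid."""
    ranked = [c for c, _ in Counter(train_labels).most_common()]
    tiers = {}
    for rank, c in enumerate(ranked):
        if rank < n_head:
            tiers[c] = "head"
        elif rank < n_head + n_mid:
            tiers[c] = "mid"
        else:
            tiers[c] = "tail"
    return tiers

def tail_error_destinations(y_true, y_pred, tiers):
    """Among misclassified tail examples, the share landing in each tier."""
    landed = Counter()
    for t, p in zip(y_true, y_pred):
        if tiers.get(t) == "tail" and p != t:
            landed[tiers.get(p, "unknown")] += 1
    total = sum(landed.values())
    if total == 0:
        return {}
    return {tier: landed[tier] / total for tier in ("head", "mid", "tail")}

# Toy setup: one head class, one mid class, two tail classes.
train = ["a"] * 50 + ["b"] * 20 + ["c"] * 5 + ["d"] * 3
tiers = assign_tiers(train, n_head=1, n_mid=1)
y_true = ["c", "c", "d", "d", "c"]
y_pred = ["a", "a", "c", "b", "c"]
shares = tail_error_destinations(y_true, y_pred, tiers)
print(shares)
```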

Fourth tool. Val-set reliability — a specific kind of noise awareness.

113 classes, 6,430 val examples. If examples were evenly split across classes, each would have about 57 val examples, which is plenty for a stable F1 estimate. But class frequency is never evenly split. Some classes have hundreds of val examples. Some have a handful. And some have exactly one — where the per-class F1 is either 1.0 if that one example was predicted right, or 0 if it wasn't. A coin flip. That's a coin flip contributing to the macro F1 you're comparing across models.

Let me make this concrete. Consider a class with exactly one val example. If your model predicts that example correctly, and no other example is wrongly assigned to that class, per-class F1 for that class is 1.0. Perfect. If your model predicts it wrong, per-class F1 is 0. There's no smooth gradient in between; it's one prediction. Now remember: macro F1 is the average of all 113 per-class F1s, weighted equally. So every single-example class is a coin flip that contributes 1/113th of your macro F1. If 10 classes are coin flips, about 9 percent of your macro F1 is noise. In the lab, you'll count exactly how many classes fall into each val-count bucket. I'll tell you now: the number of 1-example classes is bigger than most people predict.

Now connect this back to the 0.03 macro F1 gap between your encoder and decoder. That gap is the difference between two averaged-over-113-classes numbers. If a meaningful fraction of those 113 contributing scores are essentially coin flips, then some of that 0.03 gap might just be noise. How much? You can't tell from the point estimates alone. You need to quantify the noise floor. The tool for that is bootstrap confidence intervals — which is the next thing.
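A back-of-envelope version of that worry, using only numbers already on the slides. It ignores the small side effect a wrong prediction has on the F1 of the class it lands in, so read it as a rough scale, not a measurement.

```python
N_CLASSES = 113

def macro_f1_shift(n_flipped, n_classes=N_CLASSES):
    """Approximate macro F1 change when n_flipped single-example classes
    flip between per-class F1 of 0 and 1 (side effects on other classes ignored)."""
    return n_flipped / n_classes

observed_gap = 0.240 - 0.209            # decoder minus encoder, from Week 3
flips_equal_to_gap = observed_gap * N_CLASSES

print(f"one flipped class moves macro F1 by about {macro_f1_shift(1):.4f}")
print(f"the observed gap equals about {flips_equal_to_gap:.1f} flipped classes")
```

So a handful of single-example classes going the other way would be enough to erase the gap, which is why the point estimates alone cannot settle it.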

Bootstrap is beautifully simple. Efron introduced it in 1979 in the Annals of Statistics — nearly 50 years ago — and it's still the standard answer to "how uncertain is this estimate?" You treat your val set as a proxy for the distribution of possible val sets you could have drawn. You resample it with replacement — a new 6,430-example set drawn from your original 6,430 — and compute macro F1 on that resampled set. Do this a thousand times. Now you have a distribution of possible macro F1 values. For two models, you can compute the distribution of the DIFFERENCE — decoder macro F1 minus encoder macro F1 — across the same bootstrap iterations. If that difference distribution sits entirely above zero, the difference is robust to resampling noise. If it straddles zero, you can't distinguish the decoder's advantage from chance variation in which examples happened to be in val.

Here's what the bootstrap output looks like conceptually. Two histograms, one per model, each showing 1,000 resampled macro F1 values. The first reading question: do the histograms overlap substantially? Heavy overlap is a warning sign that the models may be within noise of each other on this dataset; clear separation suggests the gap is real. But both models are scored on the same resamples, so their scores move together, and the overlap picture alone can mislead. The more useful reading is the distribution of the DIFFERENCE and the 95 percent confidence interval of that difference. If the CI is, say, [+0.01, +0.05], cleanly above zero, you have evidence the decoder is actually better. If the CI is [-0.01, +0.07], the difference might be real, might be noise, you can't tell. The width of the CI tells you how confident you should be in a specific point estimate.
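A sketch of the paired version, where both models are scored on the same resampled indices in every iteration. It uses a percentile interval and a hand-rolled macro F1 that averages over the classes present in each resample; both are assumptions, and the toy predictions are invented so that the interval clearly excludes zero.

```python
import random

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    total = 0.0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom) if denom else 0.0
    return total / len(classes)

def paired_bootstrap_diff(y_true, pred_a, pred_b, n_resamples=1000, seed=0):
    """95 percent percentile interval for macro F1 of B minus macro F1 of A."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]                # same indices for both models
        b = [pred_b[i] for i in idx]
        diffs.append(macro_f1(t, b) - macro_f1(t, a))
    diffs.sort()
    lo = diffs[int(round(0.025 * n_resamples))]
    hi = diffs[int(round(0.975 * n_resamples)) - 1]
    return lo, hi

# Toy data: model A never predicts class 2, model B is perfect.
y_true = [0] * 30 + [1] * 20 + [2] * 10
pred_a = [0] * 30 + [1] * 20 + [0] * 10
pred_b = list(y_true)
lo, hi = paired_bootstrap_diff(y_true, pred_a, pred_b)
print(f"95% CI for the difference: [{lo:+.3f}, {hi:+.3f}]")
```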

Wide confidence intervals are common in long-tail problems with noisy per-class F1. A CI like [+0.010, +0.048] is genuinely informative — it says the decoder IS better, the gap excludes zero, but the range of plausible advantages spans from barely-detectable to substantial. That's important for what you tell a stakeholder. You don't say "the decoder is 0.03 better." You say "the decoder is better — but the range of plausible advantages is pretty wide, so for deployment decisions we should also consider other factors." Your homework asks you to compute this CI on your specific val set and defend what conclusions you can and cannot draw. Being honest about uncertainty is a professional skill. Memo section 4 specifically grades this.

One more piece before we move off the noise floor. The bootstrap teaches you about ONE source of noise: data-side noise. It resamples the val set, which tells you what would have happened with a different draw of validation examples. There's a second source you haven't measured: model-side noise. If you retrained the decoder with a different random seed, a different initial learning rate, or one more epoch, the model would have somewhat different outputs and somewhat different per-class F1. That's uncertainty from the training process, not the data. Sälevä and colleagues at Brandeis published this at IJCNLP-AACL 2025. Their finding: accounting for only data-side noise, which is what bootstrap alone gives you, substantially underestimates total replication variability, sometimes by a lot. So for your Week 3 memo where you wrote "decoder is 0.03 better than encoder": if both had been trained with different seeds, would that 0.03 hold? You don't know. The width of your bootstrap CI is a LOWER bound on the true uncertainty. For the homework's memo section 4, the honest position is: bootstrap gives us data-side variability, model-side is out of scope for this week, but the true uncertainty is at least this wide and plausibly wider.

Fifth tool. Bug hunt. This is the applied exercise where everything else you learned comes together.

Here's the lab's final exercise. Someone, not you, trained a decoder using the same recipe you know. Same base model, same LoRA config, same 3 epochs, same data. They pushed it to the Hugging Face Hub under the repo name on the slide. Except it's broken. The aggregate metrics will tell you something is very wrong. They won't tell you WHAT is wrong. Your job is to use the other four diagnostic tools to figure out what happened. Not fix it, necessarily. Just diagnose.

The aggregate metric will tell you something is broken. It won't tell you what. All of these bugs — never-trained, collapsed-to-majority, scrambled output space, wrong features — produce accuracy near 1 over 113, roughly 0.9 percent. They produce macro F1 near 0.003. At the aggregate level, they're indistinguishable. But they have very different confusion-matrix SIGNATURES. A never-trained model produces noise. A collapsed model produces a single column. A scrambled output space produces a coherent but offset pattern. Different bug, different visual. Your job in the lab is to look at the confusion matrix, recognize the signature, and work out which failure mode this is.
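Before you draw the full matrix, a few summary numbers can already separate some of these signatures. The statistics and their names below are illustrative assumptions, not the lab's procedure, and the two toy prediction sets are invented.

```python
from collections import Counter, defaultdict

def prediction_signature(y_true, y_pred):
    """Summary statistics that hint at the shape of the confusion matrix."""
    pred_counts = Counter(y_pred)
    by_true = defaultdict(Counter)
    for t, p in zip(y_true, y_pred):
        by_true[t][p] += 1
    # Share of examples that receive their true class's most frequent prediction.
    consistent = sum(row.most_common(1)[0][1] for row in by_true.values())
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "distinct_predicted": len(pred_counts),
        "top_prediction_share": pred_counts.most_common(1)[0][1] / len(y_pred),
        "row_consistency": consistent / len(y_true),
    }

y_true = [0, 1, 2] * 4
collapsed = prediction_signature(y_true, [0] * 12)                      # one column
shifted = prediction_signature(y_true, [(t + 1) % 3 for t in y_true])   # consistent, wrong

print("collapsed:", collapsed)
print("shifted:  ", shifted)
```

A single predicted class with a top share of 1.0 is the one-column picture. Many predicted classes with high row consistency and near-zero accuracy means the model is systematic but its outputs do not line up with the targets.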

One last thing before we get to the lab setup. In the lab you'll diagnose a specific broken model — you know it's broken because the repo is literally named "flawed-checkpoint." Real production systems don't announce they're broken. They fail silently — gradually as the input distribution drifts, or suddenly when an upstream dependency changes. You need continuous monitoring that can detect deterioration without ground-truth labels, because by the time labels arrive the damage is done. Nguyen and colleagues published this at NeurIPS 2025. Their method is D3M — Disagreement-Driven Deterioration Monitoring. The idea: train multiple models or keep multiple checkpoints of the same model, deploy them in parallel, and monitor how often they disagree with each other. Disagreement rate rises when input distribution shifts away from training. When it spikes, something upstream changed — you alert and investigate. The beautiful part: no labels required. This is the production-monitoring companion to the diagnostic toolkit you're learning today. And the disagreement signal is the same thing we'll use for distillation in Week 6 — where you use disagreements between a big model and a small model as a training signal. Same primitive, two different uses. Keep it in the back of your head.
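The primitive itself is tiny. What follows is a much-simplified illustration of tracking a disagreement rate over windows of traffic, not the D3M method; the window size, baseline and tolerance are hypothetical values you would have to choose for your own system.

```python
def windowed_disagreement(preds_a, preds_b, window):
    """Disagreement rate between two models over consecutive full windows."""
    rates = []
    for start in range(0, len(preds_a) - window + 1, window):
        a = preds_a[start:start + window]
        b = preds_b[start:start + window]
        rates.append(sum(x != y for x, y in zip(a, b)) / window)
    return rates

def alert_windows(rates, baseline, tolerance):
    """Indices of windows whose rate exceeds baseline + tolerance."""
    return [i for i, r in enumerate(rates) if r > baseline + tolerance]

# Toy stream: the two models mostly agree, then start to disagree.
preds_a = [1, 1, 2, 2, 3, 3, 1, 1]
preds_b = [1, 1, 2, 3, 1, 2, 1, 2]
rates = windowed_disagreement(preds_a, preds_b, window=4)
alerts = alert_windows(rates, baseline=0.25, tolerance=0.2)
print(rates, alerts)
```

Notice that nothing here needs a ground-truth label, only the two streams of predictions.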

The bug-hunt exercise walks you through a standard diagnostic progression. Load, observe metrics, build a diagnostic plot, recognize the pattern, apply a fix. The fix itself is a single line of Python. I'm not going to tell you what the line is. I'm not going to tell you what the bug is. The pattern you see in the diagnostic plot will point you to a specific one-line transformation on the predictions. If you pick the right one, accuracy jumps from 0.14 percent back to 57 percent — essentially recovering the full decoder. That gap — the difference between the broken output and the recovered output — is the bug. And the lesson: the model wasn't "broken" in the weights. It was broken in how its outputs were INTERPRETED.

To wrap up the lecture: five tools, five distinct failure-mode detectors. Slice analysis catches subgroup issues that averages hide. Calibration catches unreliable confidence. Confusion-matrix patterns catch structural biases the accuracy number can't reveal. Val-set reliability catches noise in your evaluation itself. Bug hunt composes all four to diagnose a specific broken model. When you're debugging a deployed ML system — and you will — these are the tools that produce evidence-based answers instead of hand-waving.

Quick preview of Week 5. This week you encountered the downstream effects of data-pipeline decisions — specifically, the val-count distribution. Next week we open up the pipeline itself. Why 113 classes? Your raw CFPB data had 153 unique "Issue" labels. Why did we merge them down? Why the specific minimum-count filter? Why is the val split stratified the way it is? Most of what feels like "the model is behaving oddly" actually traces to choices made before any training ran. We'll rebuild the pipeline end to end — and you'll see exactly where the n-equals-one classes you count in the lab came from.

And Week 6 preview: distillation. This week in the homework, you pull disagreement examples — cases where one model is right and the other is wrong. You tag them, look for patterns. That disagreement set is exactly the dataset distillation operates on. In Week 6 you'll use the better-performing decoder as a "teacher" to label the disagreement set, then fine-tune the smaller encoder to match. This is how you get decoder-quality decisions with encoder-speed inference at deploy time. The diagnostic analysis you do this week is the INPUT to that. If your tagging work identifies real patterns, Week 6 has a clean signal to work from. If your tagging is sloppy, distillation has nothing to learn.

The homework has five sections mapped to the memo rubric's five criteria. Section 1 extends your slice analysis to all six axes and asks you to interpret two of them in writing. Section 2 runs the full temperature-scaling experiment. Section 3 is the per-class drill-down — for the worst three classes of each model, what are they getting confused with? Section 4 is the bootstrap CI. Section 5 is the synthesis question — given everything you now know, is your Week 3 recommendation still defensible? Has your confidence gone up, down, or shifted? That section carries 15 points, the smallest individual weight, but it's the one that distinguishes "did the diagnostics" from "understood the diagnostics." Plan about four and a half hours total. Memo sections are embedded in the homework notebook.

Last slide. You came in today with two models and a recommendation. By the end of the lab and homework, you will either have evidence that defends the recommendation you made, or evidence that updates it. Either way, the answer you write in memo section 5 is based on diagnostic work, not intuition. That's the whole point of the week. Let's go do the quiz, then we'll see you in the lab at 3:30.

Error Diagnosis

ECBS5200 — Week 4

You made a recommendation. How do you know you were right?

Where we left off

The black box you've been calling

Raw: 153 Issue labels, many redundant

The mapping is imperfect in both directions

The filter and the tail

You inherited this. Now diagnose.

The question

From problem to approach

Three ways aggregate metrics lie

Today's plan

Section 1 — Slice Analysis

The where-does-it-fail tool

The problem with averages

What is a slice?

The six axes you'll test

Per-slice F1 example (illustrative)

Null-result axes

When you don't know what axis to try

Section 2 — Calibration

Can you trust what the model says about itself?

The calibration question

ECE: Expected Calibration Error

ECE works, with caveats

The reliability diagram

An intuition check

Temperature scaling — a one-parameter fix

Homework temperature-scaling exercise

Section 3 — Confusion Matrix Patterns

At scale, patterns matter. Individual cells do not.

Pre-work 04: the 5×5 matrix

Frequency tiers

The headline question

Three candidate stories

Section 4 — Val-Set Reliability

Not every F1 number is equally trustworthy.

The problem

What a 1-example class does to your F1

Why this matters for model comparison

Bootstrap: measure the noise floor

Reading a bootstrap CI

What wide CIs look like in practice

The noise you haven't measured

Section 5 — Bug Hunt Preview

Given a broken model, diagnose it from symptoms alone.

The scenario

Failure modes that all look "random" at aggregate

Diagnosing models you haven't inspected yet

What you'll do in lab

Section 6 — Wrap + Week 5 Preview

Today's toolkit, restated

What Week 5 covers

What Week 6 covers

Homework arc

Today — in one sentence