ECBS5200 Week 2

Controlled Improvement and Error Analysis

ECBS5200 — Week 2

A model is only as good as your ability to understand why it works.

Where we left off

Model | Accuracy | Macro F1 | Zero-F1 classes
Majority class | 23.0% | ~0.003 | 112
TF-IDF + LogReg | 54.2% | 0.132 | 70
Your fine-tuned encoder | ~56% | ~0.20 | ~47

Your model rescued ~23 classes from zero.

But 47 classes still get F1 = 0. Can we do better?

This week

Two skills:

  1. Diagnostic reasoning — given a model's metrics and training curves, figure out what happened during training

  2. Controlled improvement — change one thing at a time and measure the effect

The lecture covers both. The lab tests the first. The homework tests both.

Today's plan

Block 1 — Lecture (after the quiz)

  • Training curves: how to read them
  • Class imbalance: why your model ignores 47 classes
  • Class weighting: the biggest lever for rare classes
  • Error analysis: where does the model fail?
  • Controlled experiments: changing things systematically
  • When to trust your numbers

Block 2 — Lab

  • You receive 4 pre-trained models
  • Your job: figure out what produced each one

Training Curves

The single most diagnostic artifact you can look at.

What is a training curve?

Two lines plotted over epochs:

  • Train loss (blue): what the optimizer directly minimizes
  • Val loss (red): performance on data the model never trained on

They start high. They should both decrease. What happens next is the diagnostic.

Three regimes

Undertrained — both curves still dropping when training stopped.
The model wasn't done learning. Train more.

Well-trained — both curves have flattened. Val loss is at its minimum.
This is the sweet spot.

Overtrained / data-starved — train loss near zero, val loss rising.
The model memorized its training data. On small datasets, predictions degrade.
On large datasets, the model may become overconfident without losing accuracy.

Nakkiran et al. (2019) showed this story is incomplete — look up "deep double descent" in the readings.

Reading the curves

Look for:

  1. Is train loss still dropping? → Undertrained
  2. Have both lines flattened? → Well-trained
  3. Train loss near zero, val loss rising? → Overtrained or data-starved

In the lab, the training curve is your primary diagnostic tool.
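
The three checks above can be sketched as a small heuristic. This is only an illustration: the function name `diagnose`, the tolerance `tol`, and the example logs are assumptions made for this sketch, not part of the lab materials, and a real diagnosis still needs the data size.

```python
def diagnose(train_loss, val_loss, tol=0.02):
    """Heuristically label a run from per-epoch losses; returns (regime, best_epoch)."""
    if len(val_loss) < 2:
        return "too few epochs to judge", 1
    # Index of the lowest validation loss (first one if there are ties).
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    train_dropping = train_loss[-2] - train_loss[-1] > tol
    val_dropping = val_loss[-2] - val_loss[-1] > tol
    if train_dropping and val_dropping:
        regime = "undertrained"
    elif val_loss[-1] - val_loss[best] > tol:
        regime = "overtrained or data-starved"
    else:
        regime = "well-trained"
    return regime, best + 1  # epochs reported 1-indexed, as on the slides


# Hypothetical logs, not taken from the lab models.
print(diagnose([2.8, 2.0, 1.5], [2.9, 2.2, 1.8]))
print(diagnose([2.8, 1.4, 1.0, 0.99], [2.9, 1.6, 1.5, 1.5]))
print(diagnose([2.5, 1.2, 0.5, 0.1], [2.6, 1.4, 1.7, 2.1]))
```

The last case cannot separate "overtrained" from "data-starved"; the curve alone does not contain that information.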

Data-starved vs overtrained: a crucial distinction

Both produce rising val loss. But the causes — and fixes — are different.

Aspect | Overtrained | Data-starved
Data size | Enough data, too many epochs | Not enough data, any epoch count
Fix | Early stopping | More data
Prediction quality | May or may not degrade | Hits a low ceiling

A 149M-parameter model on 5,000 examples hits a low ceiling, but not because it "overfitted" — it simply never had enough signal.

Rising val loss means worse calibration. Not necessarily worse predictions.

Mallinar et al. (2022) call this "tempered overfitting" — the model gets overconfident, not wrong.

Worked example: read this curve

Suppose you see:

  • Epoch 1: Train loss 2.8, val loss 2.9
  • Epoch 2: Train loss 1.4, val loss 1.6
  • Epoch 3: Train loss 0.9, val loss 1.5
  • Epoch 4: Train loss 0.6, val loss 1.55
  • Epoch 5: Train loss 0.4, val loss 1.6

What regime? What's the best checkpoint? What would you try next?

The paper: Zhang et al. 2017

"Understanding Deep Learning Requires Rethinking Generalization" (ICLR 2017)

The experiment that broke classical intuition:

  • Take a standard network. Train it on real labels → generalizes.
  • Same network, same data, random labels → memorizes perfectly. Train loss → 0.

If the model can memorize random noise, why doesn't it memorize real data and fail to generalize?

Standard capacity measures (VC dimension, Rademacher complexity) are too loose to explain this. They predict these models should always overfit. They don't.

📄 readings/week2/zhang2017_rethinking_generalization.pdf

The Class Imbalance Problem

Before we fix it, let's see how bad it really is.

Your class distribution

The largest class: 13,333 training examples.
The smallest class: 5 training examples.

That's a 2,666:1 ratio.

67 of your 113 classes have fewer than 100 training examples. These are the "tail."

This isn't unusual — it's the norm. Customer complaints, medical diagnoses, fraud detection, rare species identification. Real-world classification is almost always long-tailed.

What cross-entropy actually does

Each example contributes equally. But classes contribute proportionally to their frequency.

Class | Examples | Share of total loss
"Incorrect info on your report" | 13,333 | ~23%
"Lost or stolen money order" | 5 | ~0.009%

A rare class accounts for about 0.009% of the loss, so even getting it completely right barely moves the total. The optimizer's rational response: ignore it entirely.
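
A quick arithmetic check of those shares. Two assumptions are made for this sketch: the training-set size is inferred from the head class being about 23% of the data, and every example is treated as having the same loss, so a class's share of the loss equals its share of the examples.

```python
head_count = 13_333
tail_count = 5

# Assumption: total size inferred from the head class holding ~23% of the data.
n_train = round(head_count / 0.23)


def loss_share(class_count, total):
    """Share of the summed loss a class contributes if all examples had equal loss."""
    return class_count / total


print(f"head class: {loss_share(head_count, n_train):.1%}")
print(f"tail class: {loss_share(tail_count, n_train):.4%}")
```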

Class weighting: make the loss care

Multiply each example's loss by a weight based on class frequency.

Rare classes → higher weight → bigger gradient → model pays attention.

But how much weight? This is where people go wrong.

Approach | Max weight | Result
No weighting | 1.0 | 47 classes at F1 = 0
Sqrt-inverse | 3.6 | Works: rescues 10 classes
sklearn balanced | 128 | Crashes training
Raw inverse | 2,666 | Don't even try

Focal loss (Lin et al. 2017) attacks the same problem differently — reshaping the loss to down-weight easy examples rather than up-weighting rare ones. Both approaches are in the readings.
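
A minimal sketch of the three weighting schemes in the table, in plain Python. The `sklearn_balanced` function reproduces the formula scikit-learn documents for `class_weight="balanced"`. The reference count `ref` in `sqrt_inverse` is a hypothetical value picked so the sketch lands near the weights quoted in this deck; the course's actual normalization is not stated here. The three-class `counts` dictionary is a toy subset.

```python
import math


def raw_inverse(counts):
    """Weight relative to the largest class: max_count / count."""
    top = max(counts.values())
    return {c: top / n for c, n in counts.items()}


def sklearn_balanced(counts):
    """Formula behind sklearn's class_weight='balanced': n_samples / (n_classes * count)."""
    total, k = sum(counts.values()), len(counts)
    return {c: total / (k * n) for c, n in counts.items()}


def sqrt_inverse(counts, ref):
    """Square-root dampening: sqrt(ref / count). `ref` is an assumed reference count."""
    return {c: math.sqrt(ref / n) for c, n in counts.items()}


# Toy subset of classes; the real task has 113.
counts = {"head": 13_333, "mid": 3_905, "tail": 5}

print(raw_inverse(counts)["tail"])           # largest-to-smallest count ratio
print(sqrt_inverse(counts, ref=52)["tail"])  # ref=52 is hypothetical
print(sqrt_inverse(counts, ref=52)["head"])
```

Whatever the normalization, the ratio between the rarest and the most common weight under sqrt-inverse is the square root of the count ratio. The resulting weights, ordered by class index, can be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`.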

The paper: Cui et al. 2019 — Effective Number of Samples

"Class-Balanced Loss Based on Effective Number of Samples" (CVPR 2019)

Key idea: the effective number of samples for a class is less than the actual count.

As you add more examples for a class, each new one overlaps more in feature space. Diminishing returns.

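β controls assumed overlap through the effective number E_n = (1 − β^n) / (1 − β), with 0 ≤ β < 1: β → 1 means every sample counts fully (E_n → n); β = 0 means total overlap (E_n = 1 for every class). Weighting each class by 1 / E_n gives a principled spectrum from no reweighting (β = 0) to full inverse frequency (β → 1).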

Our sqrt-inverse weighting is a related heuristic — same intuition (diminishing returns), simpler formula.

📄 readings/week2/cui2019_class_balanced_loss.pdf
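
A small sketch of the effective-number formula from Cui et al., E_n = (1 − β^n) / (1 − β) with 0 ≤ β < 1, and the class weight 1 / E_n it implies. The β values and class counts below are chosen only for illustration.

```python
def effective_number(n, beta):
    """E_n = (1 - beta**n) / (1 - beta), for n >= 1 and 0 <= beta < 1."""
    return (1.0 - beta ** n) / (1.0 - beta)


def class_balanced_weight(n, beta):
    """Unnormalized class-balanced weight: inverse of the effective number."""
    return 1.0 / effective_number(n, beta)


for beta in (0.0, 0.99, 0.999):
    e_tail = effective_number(5, beta)
    e_head = effective_number(13_333, beta)
    # Ratio of tail weight to head weight under this beta.
    print(beta, round(e_tail, 3), round(e_head, 1), round(e_head / e_tail, 1))
```

At β = 0 both classes have effective number 1, so nothing is reweighted. As β approaches 1 the effective number approaches the raw count and the weights approach inverse frequency.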

What the weights look like

Class | Examples | Weight
"Incorrect info on your report" | 13,333 | 0.06
"Problem with a purchase on your statement" | 3,905 | 0.12
"Lost or stolen money order" | 5 | 3.23
"Shopping for a line of credit" | 4 | 3.61

The rarest class gets 60x the weight of the most common.

Not 2,000x. That would break training.

The accuracy-F1 paradox

Setting | Accuracy | Macro F1 | Zero-F1 classes
Without weighting | 55.4% | 0.199 | 49
With weighting | 50.2% | 0.216 | 39

Accuracy drops 5 points. Macro F1 rises. 10 classes rescued from zero.

This is not a failure. This is a trade-off.

Why the paradox happens

Without weighting: Big spike at F1 = 0. Head classes at 0.5–0.7.

With weighting: Spike shrinks. Head classes drop slightly. New bars in 0.1–0.3 range.

The model doesn't make every class better. It makes some worse and others possible.

Kang et al. (2020) showed this cost may come from distorting the representation to help the classifier. See the readings.

Metrics disagreement: which number do you trust?

Accuracy = "what fraction did I get right?" Dominated by head classes.

Macro F1 = "average F1 across all classes." Every class counts equally.

They optimize for different objectives. Accuracy rewards predicting common classes. Macro F1 rewards covering all classes.

When they disagree, the question isn't "which is right?" — it's "what do you care about?"

A spam filter needs accuracy. A medical triage system needs macro F1.
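
To see the two metrics pull apart, here is a toy sketch with both computed from scratch. The data are invented: one head class with 90 examples and two rare classes with 5 each. The "reweighted" predictions are a made-up stand-in for a model that gives up some head-class accuracy to cover the rare classes.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: 90 head, 5 rare1, 5 rare2.
y_true = ["head"] * 90 + ["rare1"] * 5 + ["rare2"] * 5
always_head = ["head"] * 100
reweighted = (["head"] * 75 + ["rare1"] * 8 + ["rare2"] * 7   # predictions on head examples
              + ["rare1"] * 4 + ["head"] * 1                   # predictions on rare1 examples
              + ["rare2"] * 3 + ["head"] * 2)                  # predictions on rare2 examples

for name, pred in [("always head", always_head), ("reweighted", reweighted)]:
    print(name, round(accuracy(y_true, pred), 3), round(macro_f1(y_true, pred), 3))
```

Accuracy falls for the second set of predictions while macro F1 rises, the same direction of trade-off as on the previous slides.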

The paper: Menon et al. 2021 — Logit Adjustment

"Long-Tail Learning via Logit Adjustment" (ICLR 2021)

Shows that logit adjustment by log class priors is Fisher consistent for balanced error — and unifies several prior imbalance-handling heuristics.

Applied post-hoc or during training. The key insight: the optimal classifier for balanced error is different from the optimal classifier for accuracy.

You literally cannot maximize both.

📄 readings/week2/menon2021_logit_adjustment.pdf
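
A minimal sketch of the post-hoc variant: subtract τ · log(prior) from each class logit before taking the argmax. The logits, priors, and helper names below are hypothetical; τ = 1 is the plain version of the adjustment.

```python
import math


def adjust_logits(logits, priors, tau=1.0):
    """Post-hoc logit adjustment: logit_y - tau * log(prior_y) for each class y."""
    return [z - tau * math.log(p) for z, p in zip(logits, priors)]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


logits = [2.0, 1.5]    # hypothetical model outputs for one example
priors = [0.95, 0.05]  # hypothetical class frequencies (head, tail)

print(argmax(logits))                         # plain argmax
print(argmax(adjust_logits(logits, priors)))  # argmax after adjustment
```

The same example is assigned to different classes depending on which objective the decision rule targets, which is the point of the slide: the two optimal classifiers are different functions.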

Error Analysis

Where exactly does the model fail — and why?

Per-class F1 by frequency tier

Tier | Training examples | Typical F1 range
Head (6 classes) | ≥ 2,000 | 0.5 – 0.7
Mid (~40 classes) | 100 – 1,999 | 0.2 – 0.4
Tail (~67 classes) | < 100 | Many at 0.0

Performance tracks class frequency. Class weighting narrows the gap but can't close it.

5 training examples is not enough to learn a decision boundary. Fang et al. (2021) showed that beyond a critical imbalance ratio, minority-class classifiers collapse toward each other in the final layer — the model cannot tell them apart, not because it hasn't learned features, but because the decision boundaries become geometrically indistinguishable.
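
One way to produce the tier table from per-class results, sketched in plain Python. The thresholds follow the table above; the class names, counts, and F1 values are hypothetical placeholders.

```python
from statistics import mean


def tier(n_train):
    """Frequency tier from the number of training examples."""
    if n_train >= 2000:
        return "head"
    if n_train >= 100:
        return "mid"
    return "tail"


def summarize_by_tier(train_counts, f1_by_class):
    """Return {tier: (n_classes, mean_f1, n_zero_f1_classes)}."""
    groups = {}
    for cls, n in train_counts.items():
        groups.setdefault(tier(n), []).append(f1_by_class[cls])
    return {t: (len(v), mean(v), sum(f == 0.0 for f in v)) for t, v in groups.items()}


# Hypothetical per-class results.
train_counts = {"A": 13_333, "B": 3_905, "C": 450, "D": 120, "E": 40, "F": 5}
f1_by_class = {"A": 0.7, "B": 0.6, "C": 0.4, "D": 0.2, "E": 0.1, "F": 0.0}

for name, (n, avg, zeros) in summarize_by_tier(train_counts, f1_by_class).items():
    print(f"{name}: {n} classes, mean F1 {avg:.2f}, {zeros} at zero")
```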

Confusion matrix: where do errors land?

When a tail class gets misclassified, where does the error usually go?

(a) Head classes — rare examples get swamped by the majority

(b) Semantically similar mid-tier classes — the model can't distinguish fine-grained categories

(c) Other tail classes

You'll answer this empirically in the homework.

Hard-example inspection

The 5–10 most confident wrong predictions tell you more than any aggregate metric.

What to look for:

  • Semantic overlap between true and predicted class
  • Ambiguous text that could genuinely go either way
  • Systematic patterns (always confuses X with Y)

Some "errors" are actually label ambiguity, not model failure.
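
Pulling out the most confident wrong predictions takes only a few lines. This sketch assumes you already have per-example class probabilities and true labels as plain lists; the function name and the tiny example are hypothetical.

```python
def most_confident_errors(probs, labels, k=10):
    """Return up to k tuples (confidence, example_index, true_class, predicted_class)."""
    errors = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        pred = max(range(len(p)), key=p.__getitem__)
        if pred != y:
            errors.append((p[pred], i, y, pred))
    errors.sort(reverse=True)  # highest confidence first
    return errors[:k]


# Hypothetical probabilities for four examples over two classes.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.05, 0.95]]
labels = [0, 0, 1, 0]

for conf, idx, true, pred in most_confident_errors(probs, labels, k=2):
    print(f"example {idx}: predicted {pred} with p={conf:.2f}, true class {true}")
```

Read the text of each returned example next to its true and predicted label; that is where label ambiguity shows up.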

The Diagnostic Mindset

Given metrics + training curves, you should be able to answer:

  • Did the model converge? → training curve
  • Is it overfitting? → train-val loss divergence (but check: tempered or catastrophic?)
  • Does it handle rare classes? → per-class F1, zero-F1 count
  • Is the accuracy-F1 gap explained? → class weighting? (Menon: different optimal classifiers)
  • Are improvements real? → compare to noise floor

Numbers don't speak for themselves.

The Lab: Diagnostic Forensics

4 pre-trained ModernBERT models. Same architecture. Same data. Different configs.

For each model you get:

  • Model weights (you evaluate on the val set)
  • Training log (per-epoch train loss, val loss, accuracy, F1)

You don't know what config produced each model.

Your job: evaluate, examine, hypothesize, then bring your evidence to me.

Controlled Experiments

Changing things systematically

The principle

Change one variable. Hold everything else constant. Measure the effect.

Most ML work in practice: change three things at once, number goes up, declare victory.

You will not do that.

The mixing knobs warning

Real example from building this course:

We changed the learning rate AND the scheduler in the same run. F1 improved by 2 points.

Was it the learning rate? The scheduler? Both?

We couldn't tell. We had to throw away the result and rerun two separate experiments.

Two separate experiments cost 2x the compute. But the confounded experiment cost 3x — the original run plus both reruns.

Confounded experiments don't save time. They waste it.

The experiment template

Field | What to write
Variable changed | Exactly ONE thing
Held constant | List everything else
Prediction | What you expect (write this BEFORE running)
Result | What actually happened
Meaningful? | Is this difference larger than noise?

Write the prediction before you run. That's the field that makes you think.
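
One possible way to make the template enforce itself is to have the experiment record refuse any run that changes more than one config entry. The class, field names, and config keys below are hypothetical, not part of the homework notebook.

```python
from dataclasses import dataclass


def changed_keys(baseline, candidate):
    """Names of config entries whose values differ between two runs."""
    keys = set(baseline) | set(candidate)
    return sorted(k for k in keys if baseline.get(k) != candidate.get(k))


@dataclass
class Experiment:
    baseline: dict
    candidate: dict
    prediction: str   # written BEFORE running
    result: str = ""  # filled in afterwards

    def __post_init__(self):
        diff = changed_keys(self.baseline, self.candidate)
        if len(diff) != 1:
            raise ValueError(f"expected exactly one changed variable, got {diff}")
        self.variable_changed = diff[0]


base = {"lr": 2e-5, "scheduler": "linear", "batch_size": 32}
exp = Experiment(base, dict(base, lr=5e-5), prediction="macro F1 changes by < 0.01")
print(exp.variable_changed)
```

Changing both `lr` and `scheduler` in the candidate config raises a `ValueError`, which is the "mixing knobs" mistake caught before any compute is spent.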

Learning rate and schedulers

LR recap: 2e-5 for fine-tuning. Too high → catastrophic forgetting. Too low → barely adapts.

An interesting finding: ModernBERT handled lr=1e-3 without collapsing. The safe range is wider than the textbooks suggest.

Schedulers: We tested cosine vs linear on this task. The difference was less than 0.5 F1 points — within noise. Not every knob matters.

Batch size

Smaller batches = more optimizer steps per epoch.

With 113 classes and batch = 32, most batches contain zero examples of any given tail class.

In our experiments: the effect was modest and task-dependent.

This is an empirical question, not a settled principle. Test it on your data.
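
The claim about empty batches can be checked with a short calculation. Two assumptions are made for this sketch: the training-set size is taken as 1,808 batches per epoch times batch size 32 (the full-data figure from the lab logs), and batch members are treated as independent draws, which slightly simplifies sampling without replacement.

```python
def prob_batch_misses_class(class_count, n_train, batch_size):
    """Chance a random batch holds no example of a class (independent-draw approximation)."""
    return (1 - class_count / n_train) ** batch_size


# Assumption: roughly 1,808 batches per epoch at batch size 32.
n_train = 1808 * 32

for name, count in [("5-example tail class", 5), ("13,333-example head class", 13_333)]:
    p = prob_batch_misses_class(count, n_train, batch_size=32)
    print(f"{name}: {p:.4%} of batches contain none of it")
```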

Early stopping

Train for more epochs than you think you need. Stop when the target metric stops improving.

The tension on long-tail tasks:

  • Val loss may start rising (model less calibrated)
  • Macro F1 may still be improving (model better at rare classes)

Use macro F1 for stopping — it's what you care about. But don't over-trust small movements.

Patience: 2–3 epochs. Best checkpoint ≠ last checkpoint.
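
A minimal early-stopping sketch on a higher-is-better metric such as macro F1. The class name, the `min_delta` parameter, and the example scores are assumptions for illustration; in a real loop you would also save the checkpoint whenever `best_epoch` updates.

```python
class EarlyStopping:
    """Stop when the tracked metric fails to improve for `patience` epochs in a row."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, metric):
        """Record one epoch; return True if training should stop."""
        if metric > self.best + self.min_delta:
            self.best, self.best_epoch, self.bad_epochs = metric, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical per-epoch macro F1 on the val set.
scores = [0.15, 0.19, 0.21, 0.205, 0.208, 0.22]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, f1 in enumerate(scores, start=1):
    if stopper.step(epoch, f1):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_epoch, stopper.best)
```

In this made-up run, training stops before the later, higher score is ever seen. That is the cost of short patience on a noisy metric, and the reason not to over-trust small movements.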

Common training bugs

Before you go into the lab, one more diagnostic skill: recognizing when code is broken.

The softmax-before-CrossEntropyLoss bug:

import torch.nn.functional as F

# WRONG: softmax is applied twice, loss stays stuck at ~4.5, model never learns
loss = F.cross_entropy(F.softmax(logits, dim=-1), labels)

# CORRECT: cross_entropy applies log-softmax internally, so pass raw logits
loss = F.cross_entropy(logits, labels)

cross_entropy expects raw logits. Applying softmax first squeezes its input into [0, 1], so no class can stand out after the internal softmax and the gradients shrink to near zero. Loss gets stuck. Model doesn't learn.

You'll encounter this in the lab. The training curve is the tell.
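
Why the loss gets stuck can be shown without a GPU. The sketch below reimplements softmax and cross-entropy in plain Python (it is not the PyTorch code path) and feeds in a hypothetical, very confident prediction for a 113-class problem. With the double softmax, the loss cannot fall below ln(e + 112) − 1 ≈ 3.74, and a uniform prediction gives ln(113) ≈ 4.73, so the curve sits in a narrow band no matter how good the model is.

```python
import math


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def cross_entropy(logits, label):
    """Cross-entropy on raw logits: -log softmax(logits)[label]."""
    return -math.log(softmax(logits)[label])


n_classes = 113
confident = [20.0 if i == 0 else 0.0 for i in range(n_classes)]  # strongly favours class 0

good = cross_entropy(confident, 0)            # correct usage
buggy = cross_entropy(softmax(confident), 0)  # softmax applied twice
floor = math.log(math.e + n_classes - 1) - 1  # lowest loss the buggy version can reach

print(good, buggy, floor)
```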

Val set reliability

Your val set: 6,430 examples across 113 classes.

Some classes have hundreds of val examples. Some have very few.

  • ~20 classes have ≤ 5 val examples (F1 is essentially random)
  • ~10 classes have exactly 1 val example (F1 is a literal coin flip)

That means ~18% of your macro F1 average is noise. A 1-point improvement could be real — or it could be coin flips landing differently.

This doesn't mean metrics are useless. It means you need to read them carefully.
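
A rough sense of scale, as arithmetic. With 113 classes, each class contributes 1/113 of macro F1. The sketch assumes the extreme case where a class's F1 moves all the way between 0 and 1, which for a one-example class also requires no false positives, so treat the numbers as an upper bound.

```python
n_classes = 113


def macro_f1_shift(n_flipped, delta_f1=1.0):
    """Change in macro F1 when `n_flipped` classes each move by `delta_f1`."""
    return n_flipped * delta_f1 / n_classes


print(f"one singleton class flips:  {macro_f1_shift(1) * 100:.2f} F1 points")
print(f"ten singleton classes flip: {macro_f1_shift(10) * 100:.2f} F1 points")
```

A single val example landing differently can account for most of a 1-point change in macro F1.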

Seeds and reproducibility

What changes with a different random seed:

  1. Classification head initialization
  2. Data shuffle order
  3. Dropout masks

Same config, different seed → different number. That's normal.

Explore with 1 seed. Confirm with 3. Report mean ± std.
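
Reporting mean ± std over three seeds needs only the standard library. The scores below are hypothetical, and `clearly_different` is a crude rule of thumb invented for this sketch, not a statistical test.

```python
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return mean(scores), stdev(scores)


def clearly_different(a, b):
    """Crude check: is the gap between means larger than the larger seed-to-seed std?"""
    (ma, sa), (mb, sb) = summarize(a), summarize(b)
    return abs(ma - mb) > max(sa, sb)


# Hypothetical macro F1 from 3 seeds per config.
baseline = [0.199, 0.205, 0.193]
weighted = [0.216, 0.221, 0.210]
rerun = [0.201, 0.196, 0.207]  # same config as baseline, different seeds

for name, scores in [("baseline", baseline), ("weighted", weighted), ("rerun", rerun)]:
    m, s = summarize(scores)
    print(f"{name}: {m:.3f} ± {s:.3f}")
print(clearly_different(baseline, weighted), clearly_different(baseline, rerun))
```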

Your homework this week

You train your own models.

  • Reproduce the class weighting trade-off on full data
  • Test whether batch size matters
  • Implement early stopping on macro F1
  • Error analysis: confusion matrix, per-class deep dive
  • Discover how noisy your validation metrics really are
  • Write your memo (5 sections, prompts in the notebook)

Due: Wednesday morning before Week 3 class. HTML via Moodle.

Week 2 Reading List

Paper | Year | Key idea
Zhang et al., Rethinking Generalization | 2017 | DNNs memorize yet generalize
Lin et al., Focal Loss | 2017 | Reshape loss to focus on hard examples
Cui et al., Class-Balanced Loss | 2019 | Effective number of samples
Nakkiran et al., Deep Double Descent | 2019 | The U-curve has a second descent
Kang et al., Decoupling | 2020 | Representations are fine; classifiers are biased
Menon et al., Logit Adjustment | 2021 | Accuracy and balanced error have different optimal classifiers
Fang et al., Minority Collapse | 2021 | Minority-class classifiers collapse in geometry
Mallinar et al., Tempered Overfitting | 2022 | Benign vs tempered vs catastrophic

All PDFs in readings/week2/. Bold = covered in lecture. Pick 2–3 others that connect to your memo.

Next week

We try a completely different approach to this same task.

⬇ REVEAL BELOW ⬇

Do not advance past this slide until students have written their hypotheses.

Lab Reveal

Model | Configuration | Key evidence
Charlie | 1 epoch, full data (undertrained) | Only 1 epoch in logs, both curves still descending
Bravo | 15 epochs, 5K subset (data-starved) | Most epochs but worst metrics. 157 batches/epoch vs 1808
Delta | 3 epochs, full data (vanilla baseline) | Highest accuracy, matches Week 1 numbers
Alpha | 3 epochs, full data, class weighting | Lower accuracy, higher F1, fewer zero-F1 classes

Welcome back. Last week you built your first neural model. You fine-tuned ModernBERT on consumer complaints and got numbers meaningfully better than TF-IDF. This week we ask two questions. Can you make the model better? And when a number moves, how do you know it's real? Those are the two skills: controlled improvement and diagnostic reasoning.

Here's where we are. The majority baseline gets 23 percent. TF-IDF gets 54 percent accuracy but only 0.132 macro F1 — 70 classes completely ignored. Your fine-tuned model improved both: roughly 56 percent accuracy, 0.20 macro F1, about 47 classes still at zero. You rescued about 23 classes. Real progress. But 47 classes still get nothing. The question: can we fix that? And just as important: if we change something and the metric moves by a point or two, is that real or noise?

This week is about two skills. First, diagnostic reasoning: if I hand you a trained model, can you look at the metrics and training curves and figure out what happened? Did it converge? Did it overfit? Was a specific intervention applied? Second, controlled improvement: when you make a change to the training pipeline, can you attribute the result to the change you made? The lecture covers both. In the lab, you'll practice diagnostic reasoning — figuring out what happened to models someone else trained. In the homework, you'll practice controlled improvement — training your own models and measuring the effects of specific changes.

Here's the plan. In the lecture, we'll cover the diagnostic tools — how to read training curves, what class weighting does, how to examine errors — and then the methodology of running controlled experiments. In the lab, you'll receive four pre-trained models with their training curves. All four are ModernBERT trained on the same data, but with different configurations. Some are good. Some have problems. You figure out which is which. Then in the homework, you train your own models.

Let's start with the most important diagnostic tool. If someone hands you a trained model and you can only look at one thing, look at the training curve. It tells you more about what happened during training than any single metric.

A training curve plots two values over the course of training. Train loss — the cross-entropy loss on the training batches — is what the optimizer directly minimizes. Val loss — the same metric on held-out data — tells you how the model performs on examples it has never seen. At the start, both are high. As training progresses, both should decrease. The diagnostic information is in what happens after that initial decrease. Do they both flatten? Does one keep going down while the other goes up? Does one stop at epoch 1? These patterns tell you different stories about what happened during training.

There are three regimes. Undertrained means you stopped too early. Both curves were still decreasing — the model had more to learn. The fix is simple: train for more epochs. Well-trained means you stopped at the right time. Training loss has stabilized, validation loss is at its lowest point. Overtrained means you went too far. Training loss has been driven to near zero — the model has memorized the training data — but validation loss is climbing. Now, I should flag: Nakkiran and colleagues showed in 2019 that this classical story is incomplete. If you keep training past the overfitting bump, test error can sometimes drop again — "deep double descent." The paper is in your readings. For today, the three regimes are your primary diagnostic tool, but keep in mind the picture is richer than this.

Here's how to read the curves in practice. On the right you see three example training runs. The first model — both curves still dropping. Undertrained. The second — both curves stable, val loss at its minimum. Well-trained. The third — train loss driven to near zero while val loss climbs steadily after epoch 3. Overtrained. The gap between the blue and red lines in that third panel is the overfitting gap. In the lab today, you'll see training curves from four different models and you'll use exactly this pattern recognition to diagnose them.

Here's a distinction that trips up even experienced practitioners. Both overtrained and data-starved models show rising validation loss. But the stories are completely different. An overtrained model had enough data but trained too long — early stopping fixes it. A data-starved model never had enough data to begin with — no amount of early stopping helps, because the model hit its ceiling on epoch 2 and everything after that is just noise. We saw this in our own experiments: a 149-million parameter model trained on just 5,000 examples learned what it could in the first few epochs, then the training loss kept dropping as it memorized, but its actual prediction quality plateaued. The val loss rose because the model got more confident, not because its predictions got worse. Mallinar and Nakkiran — yes, the same Nakkiran — published a taxonomy of overfitting in 2022. They call this regime "tempered" overfitting: some degradation in calibration, but much less than classical theory predicts. Not benign, not catastrophic, somewhere in between. Both papers are in your readings.

Let's work through one. Train loss drops steadily from 2.8 to 0.4 across 5 epochs. Val loss drops sharply through epoch 2, reaches its minimum of 1.5 at epoch 3, then starts creeping up. By epoch 5 it's back to 1.6, above the minimum. This is a classic overtraining pattern. The best checkpoint is epoch 3, when val loss was lowest. What would you try next? Save that epoch 3 checkpoint and try early stopping. Or: ask whether you have enough data. If this is a small subset, maybe the issue isn't too many epochs but too few examples. You can't tell from the curve alone. You need to know the data size. This is the kind of reasoning I want you doing in the lab.

This is one of the most important papers in modern deep learning theory. Zhang and colleagues took a standard neural network and trained it on CIFAR-10 with the correct labels. It generalized fine — standard result. Then they replaced all the labels with random noise. The same network memorized the entire dataset perfectly — training loss went to zero. But of course it couldn't generalize, because there was nothing to generalize. Here's the puzzle: the network has enough capacity to memorize 50,000 random labels. So when you train it on real labels, why doesn't it just memorize those too? Why does it learn patterns instead of memorizing examples? The common distribution-free capacity measures — VC dimension, Rademacher complexity — are too loose to distinguish these cases. They predict that any model with enough capacity to fit random data should also overfit real data. But it doesn't. The paper's point isn't that all of statistical learning theory is dead — it's that the standard explanations based on explicit regularization and older capacity intuitions are inadequate for modern deep networks. This paper is the reason you should never trust a simple story about overfitting. The PDF is in your readings folder.

Now let's talk about the biggest problem with our dataset and the biggest lever for improving the model. Class imbalance. You've seen the numbers — 47 classes at zero F1. But numbers in a table are abstract. Let's make the problem concrete.

Let's look at the actual distribution. Your largest class — "incorrect information on your report" — has over 13,000 training examples. Your smallest classes have 5 examples each. That's a 2,666 to 1 ratio. And 67 of your 113 classes — more than half — have fewer than 100 examples. This is the "tail." And this isn't some pathological dataset I constructed to make a point. This IS the real distribution of consumer complaints. Real-world classification is almost always like this. Medical diagnoses: a few common conditions dominate, hundreds of rare ones in the tail. Fraud detection: 99.9% of transactions are legitimate. Species identification: a few common species make up most observations. If you only ever work on balanced benchmarks like CIFAR-10, you'll never see this. But the moment you touch real data, you're in long-tail territory.

Let's be precise about the mechanism. Cross-entropy loss averages over all N training examples. Every example gets equal weight. But that means each class's contribution to the total loss is proportional to how many examples it has. The largest class contributes 23 percent of the total loss. The smallest contributes nine thousandths of a percent. Think about what that means for the optimizer. The combined gradient from all the examples of a rare class is negligible. The combined gradient from a common class is enormous. The optimizer follows the gradient. And the gradient says: common classes matter, rare classes don't. Those 47 classes with F1 of zero? That's not a bug. That's the optimizer doing exactly what you told it to do.

The fix for class imbalance in cross-entropy is conceptually simple: multiply each example's loss by a weight that depends on its class frequency. Rare classes get higher weights, producing larger gradients, so the optimizer pays attention. But the critical question is: how much weight? And this is where people get into trouble. No weighting gives you 47 classes at zero. Sqrt-inverse, the square root of the frequency ratio, gives a maximum weight of about 3.6x. That's enough to rescue 10 classes from zero without destabilizing training. Sklearn's "balanced" mode uses inverse frequency normalized by the number of classes, n_samples / (n_classes × class count), which gives weights up to 128x. We tried it. Training collapsed: the model oscillated wildly and never converged. Raw inverse frequency, measured against the largest class, gives 2,666x for the rarest class. That would be catastrophic. I should also mention focal loss (Lin et al. 2017). Instead of reweighting by class frequency, focal loss down-weights well-classified examples regardless of class. Same problem, different angle. The paper is in your readings. We use class weighting in this course because it's simpler and directly addresses the class imbalance, but you should know focal loss exists.

Cui and colleagues at Cornell and Google gave this problem a theoretical foundation. Their key insight: the effective number of samples for a class is less than the raw count. Why? Because as you add more examples, each new example increasingly overlaps with existing ones in feature space. The 100th example of "incorrect information on your report" tells the model less than the first example did. They formalize this with a single parameter, beta, that controls how fast diminishing returns kick in. When beta is near zero, every example counts equally — no reweighting. When beta is near one, you get full inverse-frequency weighting. The sweet spot is somewhere in between. Our sqrt-inverse weighting is a related heuristic — it shares the same intuition of diminishing returns with class size, but it's not a special case of their formula. Their weighting is based on the inverse of the effective number, which for beta near 1 behaves like inverse frequency, not inverse square root. Sqrt-inverse is a simpler, more conservative compromise that works well in practice. The paper gives you the theoretical framework for thinking about WHY raw inverse over-corrects: it treats every rare example as maximally informative, but in reality, even 5 examples have some overlap.

Here are the actual weights from our dataset. The most common class — incorrect information on your report, with over 13,000 examples — gets a weight of 0.06. That means its loss contribution is reduced. The rarest class — shopping for a line of credit, with 4 examples — gets a weight of 3.61. That's a 60x ratio. The model will care 60 times more about getting a rare-class example right than a common-class example. That's significant but not insane. Full inverse-frequency weighting would give a ratio of over 2,000x. We tried that. Training collapsed. Sqrt-inverse is the practical middle ground.

Here's what happens when you apply class weighting. Look at these numbers from our experiments. Accuracy drops from 55.4 to 50.2 percent — a 5 point hit. But macro F1 improves from 0.199 to 0.216. And the number of zero-F1 classes drops from 49 to 39 — ten classes that the model previously ignored now get at least some correct predictions. This looks like a contradiction. How can the model get worse and better at the same time?

This chart makes the trade-off concrete. On the left, without class weighting, there's a big spike at F1 equals zero — those 49 classes the model ignores. The head classes cluster around 0.5 to 0.7. On the right, with class weighting, the zero spike shrinks. Ten classes that were at zero now have F1 between 0.1 and 0.3. They're not great, but they exist. The head classes drop slightly — they give up a few F1 points. The net effect on macro F1 is positive because rescuing a class from zero to 0.1 adds more to the average than losing a class from 0.7 to 0.65. But accuracy drops because the model now sometimes predicts a rare class where it used to correctly predict a common one. Kang and colleagues at Facebook AI showed in 2020 that on their benchmarks, representations learned on imbalanced data are largely fine — the problem is in the classifier layer. When we reweight the whole model, we may be distorting useful representations to help the classifier. That paper is in your readings and it's a good one.

This isn't just a quirk of our dataset. Accuracy and macro F1 are fundamentally different objectives. Accuracy asks: what fraction of examples did you classify correctly? Since most examples come from common classes, accuracy rewards getting common classes right. Macro F1 asks: what's the average per-class F1? Every class counts equally regardless of size. So macro F1 rewards covering rare classes even at the expense of common ones. When they disagree — as they do here — neither metric is "wrong." They're measuring different things. The question you have to answer is: what does your application need? A spam filter? You care about accuracy — most emails are legitimate, and you need to classify the common case correctly. A medical triage system? You care about macro F1 — missing a rare condition could be fatal. This is a design decision, not a statistical one.

Menon and colleagues at Google Research proved something that practitioners had intuited but never formalized. If you want to optimize for balanced error — treating every class equally regardless of size — the optimal classifier isn't the standard argmax of the logits. You need to adjust the logits by log class priors. Common classes get penalized; rare classes get boosted. This can be applied post-hoc at inference time or baked into the training loss. The paper shows this is Fisher consistent for balanced error and unifies several prior approaches to handling imbalance. But the deeper insight is the one that matters for us: the optimal classifier for accuracy is a DIFFERENT function from the optimal classifier for balanced error. You cannot maximize both simultaneously. So when you see accuracy go down and macro F1 go up with class weighting, that's not a bug in your training — it's a mathematical necessity. You're moving from one optimal frontier to another.
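Here is a minimal sketch of the post-hoc version of that idea: subtract the log of each class prior from its logit before taking the argmax. The priors, logits, and the tau scaling knob are illustrative assumptions, not values from the paper or from our experiments.

```python
import math

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def logit_adjusted_prediction(logits, class_priors, tau=1.0):
    """Subtract tau * log(prior) from each logit, then take the argmax.

    Common classes (large prior) are penalized; rare classes are boosted.
    tau is a hypothetical tuning knob; tau=1.0 is the plain log-prior shift.
    """
    adjusted = [z - tau * math.log(p) for z, p in zip(logits, class_priors)]
    return argmax(adjusted)

# Hypothetical three-class problem: one head, one mid, one tail class.
priors = [0.90, 0.09, 0.01]
logits = [2.0, 1.0, 0.5]

plain_choice = argmax(logits)
adjusted_choice = logit_adjusted_prediction(logits, priors)
print(plain_choice, adjusted_choice)
```

The standard argmax picks the head class; after the prior correction the tail class wins. Same logits, different decision rule, which is why the two objectives cannot share one optimal classifier.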

Class weighting helps rare classes. But which ones? And where does the model still fail? Error analysis tells you. This is the skill that separates someone who trains models from someone who understands them.

When you compute F1 for each class and group by frequency, the pattern is stark. Head classes — the 6 classes with 2,000 or more training examples — get F1 between 0.5 and 0.7. Mid-tier classes get 0.2 to 0.4. Tail classes — the 67 classes with fewer than 100 examples — many are at zero. Performance tracks class frequency almost perfectly. Class weighting helps the tail — it rescues some classes from zero. But it can't overcome data scarcity. A class with 5 training examples simply doesn't have enough signal for the model to learn a reliable decision boundary. No amount of loss weighting fixes that. Fang and colleagues published a paper in PNAS in 2021 showing that beyond a critical imbalance ratio, something they call "minority collapse" occurs: the last-layer geometry skews so that minority-class classifiers converge toward each other. The model can't tell those classes apart no matter what you do. Separately — and this is our observation, not the paper's claim — we've found that scaling model size from 395 million to 8 billion parameters doesn't help on our task. The paper is in your readings.
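The grouping described above can be sketched in a few lines. The tier thresholds (2,000 or more for head, fewer than 100 for tail) come from this slide; the class names, counts, and F1 values below are invented for illustration.

```python
from statistics import mean

def frequency_tier(train_count):
    """Assign a class to a tier using the thresholds from the lecture."""
    if train_count >= 2000:
        return "head"
    if train_count < 100:
        return "tail"
    return "mid"

def f1_by_tier(train_counts, class_f1):
    """Return {tier: (mean F1, number of zero-F1 classes)}."""
    tiers = {"head": [], "mid": [], "tail": []}
    for cls, count in train_counts.items():
        tiers[frequency_tier(count)].append(class_f1[cls])
    return {
        tier: (mean(scores) if scores else None, sum(1 for s in scores if s == 0.0))
        for tier, scores in tiers.items()
    }

# Hypothetical per-class training counts and F1 scores.
train_counts = {"A": 5000, "B": 2500, "C": 800, "D": 150, "E": 40, "F": 5}
class_f1 = {"A": 0.68, "B": 0.55, "C": 0.35, "D": 0.22, "E": 0.10, "F": 0.0}

summary = f1_by_tier(train_counts, class_f1)
for tier, (avg, zeros) in summary.items():
    print(tier, round(avg, 3), zeros)
```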

Here's a question to think about. When the model misclassifies a tail-class example, where does the prediction land? There are three stories. Maybe the model defaults to predicting head classes because they dominate training. Maybe the model predicts a semantically similar class — it knows the general topic but can't distinguish subcategories. Or maybe errors are scattered randomly among tail classes. I want you to form a prediction now. You'll test it empirically in the homework by building a confusion matrix and categorizing where the tail-class errors actually go. The answer tells you something fundamental about the model's failure mode.
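One possible way to tally the three stories is sketched below. It assumes you have a hand-built mapping from each class to a coarse topic (topic_of), which is a hypothetical helper, not something the dataset provides. The class names and predictions are made up.

```python
from collections import Counter

def categorize_tail_errors(y_true, y_pred, head_classes, tail_classes, topic_of):
    """Bucket every misclassified tail-class example by where the prediction landed.

    Assumption: "semantically similar" is approximated by sharing a topic in
    the hand-built topic_of mapping. That check runs first, so a head class in
    the same topic counts as "similar_class".
    """
    buckets = Counter()
    for true, pred in zip(y_true, y_pred):
        if true not in tail_classes or true == pred:
            continue  # only tail-class errors are of interest
        if topic_of.get(true) is not None and topic_of.get(true) == topic_of.get(pred):
            buckets["similar_class"] += 1
        elif pred in head_classes:
            buckets["head_class"] += 1
        else:
            buckets["elsewhere"] += 1
    return buckets

# Hypothetical classes and predictions.
head_classes = {"credit_report", "debt_collection"}
tail_classes = {"card_fee", "loan_fee", "wire_delay"}
topic_of = {
    "card_fee": "fees", "loan_fee": "fees", "wire_delay": "transfers",
    "credit_report": "reporting", "debt_collection": "collection",
}
y_true = ["card_fee", "card_fee", "loan_fee", "wire_delay", "wire_delay", "credit_report"]
y_pred = ["loan_fee", "credit_report", "loan_fee", "debt_collection", "card_fee", "debt_collection"]

buckets = categorize_tail_errors(y_true, y_pred, head_classes, tail_classes, topic_of)
print(dict(buckets))
```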

After the confusion matrix, look at individual examples. Find the 5 to 10 predictions where the model was most confident and most wrong. These are gold. They tell you exactly what the model misunderstands. Common patterns you'll see: the true label and predicted label share vocabulary, or the complaint describes a situation that genuinely could be classified either way. Those aren't model failures — they're labeling ambiguities. Distinguishing between model errors and label ambiguity is a real skill, and one you'll practice in the homework.
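Pulling out those examples is a filter and a sort. A minimal sketch, assuming each prediction is stored as a (text, true label, predicted label, confidence) tuple; that record layout and the toy rows are assumptions.

```python
def most_confident_errors(records, k=5):
    """Return the k wrong predictions with the highest confidence.

    records: iterable of (text, true_label, pred_label, confidence) tuples.
    """
    wrong = [r for r in records if r[1] != r[2]]
    return sorted(wrong, key=lambda r: r[3], reverse=True)[:k]

# Hypothetical predictions.
records = [
    ("complaint 1", "card_fee", "loan_fee", 0.97),
    ("complaint 2", "credit_report", "credit_report", 0.99),
    ("complaint 3", "wire_delay", "debt_collection", 0.55),
    ("complaint 4", "loan_fee", "card_fee", 0.91),
]

top_errors = most_confident_errors(records, k=2)
for text, true, pred, conf in top_errors:
    print(f"{conf:.2f}  true={true}  pred={pred}  {text}")
```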

Let me pull this together. Everything we've covered — training curves, class weighting, per-class metrics, error analysis — is part of a diagnostic mindset. A trained model is a body of evidence. The training curve tells you about optimization. Per-class F1 tells you about failure patterns. The confusion matrix tells you about error structure. Your job is to read this evidence and understand what happened. Not just "the number went up" — but why, and whether you can trust it.

Now the lab. This is different from last week. You'll receive four models. All four are ModernBERT-base fine-tuned on the same consumer complaints dataset. But each was trained differently. Each comes with its training log — per-epoch train loss, validation loss, accuracy, and macro F1. What you don't know is what training configuration produced each model. Your job: evaluate them on the validation set, build a comparison table, plot the training curves, form a hypothesis about what each config was, and bring your hypothesis to me with evidence. Not to the notebook — to me, in person. I'll confirm or correct one model at a time.

Now the second half. You've learned how to diagnose what happened to a model. In the homework, you'll make changes yourself. Let's talk about how to do that systematically.

The principle is dead simple. Change one variable. Hold everything else constant. Measure the effect. This is basic experimental design. And yet in practice, most ML work violates it. People change the learning rate and the scheduler and add class weighting in the same run, the number goes up, and they declare victory. But they don't know which change helped. Maybe two of the three made things worse. In the homework, every experiment you run uses a template that forces you to name exactly one variable you changed and list everything you held constant.

I want to share a real mistake we made while building this course's materials. We ran an experiment that changed both the learning rate and the learning rate scheduler at the same time. Macro F1 improved by 2 points. Great result, right? But then a reviewer asked: which change helped? And we couldn't answer. Maybe the new learning rate was better. Maybe the new scheduler was better. Maybe one helped and the other hurt, and the net effect happened to be positive. We had to throw the result away and rerun two separate experiments: one changing only the learning rate, one changing only the scheduler. The confounded experiment wasn't free. It cost GPU time and wall-clock time, and then we paid again for the two reruns. Controlled experiments feel slower in the moment. They're faster in total because you never have to redo them.
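A cheap guard against this mistake is to diff the two configs before launching a run. The sketch below assumes configs are plain dictionaries; the keys and values are hypothetical, not the course's actual settings.

```python
def changed_keys(baseline, variant):
    """Names of every setting that differs between two config dicts."""
    keys = set(baseline) | set(variant)
    return sorted(k for k in keys if baseline.get(k) != variant.get(k))

def assert_single_change(baseline, variant):
    """Raise if the variant does not differ from the baseline in exactly one key."""
    diff = changed_keys(baseline, variant)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed variable, got {diff}")
    return diff[0]

# Hypothetical configs.
baseline = {"learning_rate": 2e-5, "scheduler": "linear", "batch_size": 32}
confounded = {**baseline, "learning_rate": 5e-5, "scheduler": "cosine"}
controlled = {**baseline, "scheduler": "cosine"}

print(changed_keys(baseline, confounded))
print(assert_single_change(baseline, controlled))
```

Calling assert_single_change on the confounded config raises before any GPU time is spent.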

Here's the template. Five fields. Variable changed — exactly one thing. Held constant — list everything else explicitly. Prediction — what you expect to happen and why, and you write this before you run the experiment. Result — what actually happened. And meaningful — is the difference you observed large enough to trust, or could it be noise? That last field is the one most people skip. We'll talk about why it matters in a few slides.
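If you prefer to keep the template in code rather than in a notebook cell, one possible shape is a small record type with the same five fields. The class name, the is_complete helper, and the example values are assumptions for illustration; the homework notebook may lay this out differently.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ExperimentRecord:
    variable_changed: str      # exactly one thing
    held_constant: List[str]   # everything else, listed explicitly
    prediction: str            # written BEFORE the run
    result: str = ""           # filled in after the run
    meaningful: str = ""       # is the difference larger than the noise?

    def is_complete(self):
        """True only once both post-run fields have been filled in."""
        return bool(self.result and self.meaningful)

# Hypothetical entry, written before running anything.
record = ExperimentRecord(
    variable_changed="batch_size: 32 -> 16",
    held_constant=["learning_rate", "scheduler", "epochs", "seed", "class_weighting"],
    prediction="Macro F1 rises slightly; accuracy stays roughly flat.",
)
print(record.is_complete())
```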

Quick notes on two knobs. Learning rate — you covered this last week. 2e-5 is the standard for fine-tuning. But we found that ModernBERT at 1e-3 — 50x higher — still produced a working model. Modern architectures with AdamW are more robust than the original BERT guidance suggests. Schedulers — we tested cosine versus linear at the same learning rate. The difference was less than half a macro F1 point. On this task, with this training length, the scheduler barely matters. That's a real finding. Not every knob is worth turning. The skill is figuring out which ones actually move the needle and spending your time there.

Batch size is another knob. Smaller batches mean more optimizer steps per epoch, which means more chances for rare classes to show up in a batch and influence the gradient. We tested batch 16 versus 32. The effect was modest and inconsistent: sometimes it helped, sometimes the difference was within noise. The mechanism is debated. What matters: it's an empirical question. You'll test it in the homework and decide for yourself whether it helps on this task.

Early stopping. Instead of training for a fixed number of epochs, monitor a metric after each epoch and stop when it stops improving. The tension on long-tail tasks: validation loss and macro F1 can disagree. Val loss might start climbing because the model is becoming overconfident on common classes, while macro F1 is still improving because the model is getting better at rare classes. Use macro F1 for your stopping criterion — it's aligned with what you care about. But be cautious: macro F1 is noisier than loss, so small epoch-to-epoch movements might be noise. Use a patience of at least 2 epochs. And always save checkpoints so you can go back to the best one. Best checkpoint is rarely the last checkpoint.
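The stopping rule itself is only a few lines. This is a minimal sketch that replays a list of per-epoch macro F1 values; the function name, the min_delta parameter, and the history are hypothetical, and a real training loop would also save a checkpoint whenever the best score improves.

```python
def best_epoch_with_patience(macro_f1_per_epoch, patience=2, min_delta=0.0):
    """Replay a metric history and return (best_epoch, stopped_epoch).

    Training stops once the metric has failed to improve by more than
    min_delta for `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, score in enumerate(macro_f1_per_epoch):
        if score > best_score + min_delta:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(macro_f1_per_epoch) - 1

# Hypothetical, deliberately noisy validation macro F1 per epoch.
history = [0.12, 0.17, 0.20, 0.19, 0.21, 0.20, 0.20, 0.22]

print(best_epoch_with_patience(history, patience=1))
print(best_epoch_with_patience(history, patience=2))
```

With a patience of 1, the single dip at epoch 3 ends training and the better epoch 4 is never reached. With a patience of 2, training survives the dip. That is the reason for the "at least 2" advice on a noisy metric.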

One more thing before the lab. I want you to be able to recognize when training code is broken, not just when hyperparameters are suboptimal. Here's a real bug. PyTorch's cross-entropy loss expects raw logits, the unnormalized outputs of the model. Internally, it applies log-softmax for numerical stability. But if someone applies softmax to the logits before passing them to cross-entropy, the function sees values between 0 and 1 instead of values that could be any real number. The gradients become tiny because softmax compresses the range. The loss gets stuck around 4.5, just under negative log of 1/113 (about 4.73), which is the loss of a uniform guess over 113 classes. The model effectively never learns. In the lab, you'll train a small model on a data subset and you'll see this bug in action. The training curve is unmistakable: the loss flatlines. Once you've seen it, you'll never miss it again.
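You can see why the loss cannot move much without training anything. The sketch below reimplements softmax and a logits-based cross-entropy in plain Python (it is not PyTorch code) and feeds it a hypothetical, perfectly confident prediction over 113 classes, once correctly and once with the extra softmax.

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """-log softmax(logits)[target], i.e. what a logits-based loss computes."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

NUM_CLASSES = 113
target = 0
# Hypothetical logits from a very confident and correct model.
logits = [20.0] + [0.0] * (NUM_CLASSES - 1)

correct_loss = cross_entropy_from_logits(logits, target)
# The bug: softmax is applied first, then the loss applies it again.
buggy_loss = cross_entropy_from_logits(softmax(logits), target)
uniform_loss = math.log(NUM_CLASSES)

print(round(correct_loss, 4), round(buggy_loss, 4), round(uniform_loss, 4))
```

Even with a perfect prediction, the double-softmax loss stays above roughly 3.7, and a uniform guess gives about 4.73. The buggy loss is squeezed into that narrow band no matter what the model does, which is why the curve looks flat.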

Everything we've talked about — metrics, comparisons, experiments — assumes the numbers mean something. But do they? Your validation set has 6,430 examples across 113 classes. That averages to about 57 per class. But the distribution follows the same long tail as the training set. Some classes have hundreds of validation examples. Some have very few. About 20 classes have 5 or fewer validation examples — for these classes, F1 is essentially random. About 10 classes have exactly 1 validation example — a literal coin flip. That's roughly 18 percent of the classes in your macro F1 average contributing noise, not signal. So when you see macro F1 move by one point — say, from 0.199 to 0.210 — you need to ask: did 93 classes get slightly better, or did a handful of coin flips land differently? The homework has you build a histogram of validation examples per class so you can see this for yourself. The takeaway isn't that metrics are useless. It's that you need to interpret them with awareness of their noise floor.
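Before you plot the histogram, a quick count already tells you most of the story. This is a minimal sketch on made-up labels; the function name and the threshold default are assumptions. Note that classes with zero validation examples do not show up in the count at all.

```python
from collections import Counter

def support_summary(val_labels, low_support=5):
    """Summarize how many validation examples back each class's F1."""
    counts = Counter(val_labels)
    n_classes = len(counts)
    low = sorted(c for c, k in counts.items() if k <= low_support)
    singletons = sorted(c for c, k in counts.items() if k == 1)
    return {
        "n_classes": n_classes,
        "mean_support": len(val_labels) / n_classes,
        "low_support_classes": low,
        "singleton_classes": singletons,
        "low_support_fraction": len(low) / n_classes,
    }

# Hypothetical validation labels with a long tail.
val_labels = ["a"] * 50 + ["b"] * 20 + ["c"] * 5 + ["d"] * 1
summary = support_summary(val_labels)
print(summary)

# The same arithmetic for the numbers quoted in the lecture.
print(round(6430 / 113, 1), round(20 / 113, 3))
```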

Last concept before we wrap up. Reproducibility. When you train the same model with the same config but a different random seed, three things change: the classification head's initialization, the order data is batched, and the dropout masks. Same config, different seed, different number. A 1-2 point variation in macro F1 between seeds is completely normal. This means a single run is anecdotal. The practical workflow given free T4 constraints: explore with one seed — try many ideas quickly. When you find something promising, confirm with three seeds. Report the mean and standard deviation. If your improvement disappears across seeds, it wasn't real.
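Reporting across seeds takes two library calls. The sketch below also includes a deliberately crude check, "is the gap between means bigger than the larger standard deviation?", which is an assumed rule of thumb and not a formal statistical test. All scores are invented.

```python
from statistics import mean, stdev

def summarize_seeds(scores):
    """Mean and sample standard deviation across seeds."""
    return mean(scores), stdev(scores)

def improvement_exceeds_noise(baseline_scores, variant_scores):
    """Crude rule of thumb (an assumption, not a significance test):
    the mean gain must exceed the larger of the two standard deviations."""
    b_mean, b_std = summarize_seeds(baseline_scores)
    v_mean, v_std = summarize_seeds(variant_scores)
    return (v_mean - b_mean) > max(b_std, v_std)

# Hypothetical macro F1 from three seeds per configuration.
baseline = [0.20, 0.21, 0.19]
clear_win = [0.22, 0.21, 0.23]
noisy_win = [0.21, 0.19, 0.215]

print(summarize_seeds(baseline))
print(improvement_exceeds_noise(baseline, clear_win))
print(improvement_exceeds_noise(baseline, noisy_win))
```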

After the lab, the homework is where you train your own models. You'll reproduce the class weighting result on full data — you've seen what it does in the lab, now verify it yourself. You'll test batch size. You'll implement early stopping. You'll do a real error analysis — confusion matrix, per-class F1 by tier, hard-example inspection. And you'll discover how noisy your validation metrics actually are. The memo has five sections with prompts embedded in the notebook. Due Wednesday morning, HTML via Moodle. Plan for five to six hours.

Here's your reading list. Eight papers spanning 2017 to 2022 that cover the theoretical foundations of everything we discussed today. The three in bold — Zhang, Cui, and Menon — we covered in the lecture. The other five were referenced on individual slides. You are not expected to read all eight — that would be unreasonable for a one-week turnaround. Pick two or three that connect to whatever you find most interesting in your memo. If you write about the accuracy-F1 trade-off, read Menon and Kang. If you write about why the tail is hard, read Fang. If the training curve behavior surprised you, read Nakkiran and Mallinar. Engaging with the literature is what separates a graduate-level memo from a homework report.

One sentence. Next week we try something completely different. See you in the lab.

STOP HERE. Everything after this slide is the lab reveal. Do not advance until every student has submitted their hypotheses in the notebook. When ready, go one model at a time.

Show this slide when everyone has written their hypotheses. Go through one at a time. Charlie is the easiest — everyone should get "undertrained." Bravo is the dramatic one — 15 epochs but worst performance because of data scarcity. Delta is the control — matches what they built in Week 1. Alpha is the deepest lesson — the accuracy-F1 paradox from class weighting. Pause on Alpha and ask: why would you accept lower accuracy? When is that trade-off worth it?