Macro-F1 = compute F1 for each class, then take the unweighted average.
Every class counts equally, whether it has 10,000 examples or 10.
This is our primary evaluation metric for the entire course.
Why? Because we want our model to work well on all complaint types — not just the popular ones.
| Averaging | How it works | Good when... |
|---|---|---|
| Macro | Average F1 across classes equally | You care about rare classes as much as common ones |
| Micro | Pool all TP/FP/FN, compute one F1 | You care about overall correctness (≈ accuracy) |
| Weighted | Average F1, weighted by class size | You want a middle ground |
For this course: macro.
In a single-label multiclass problem, micro-F1 equals accuracy, so it lets common classes dominate. Weighted F1 is a compromise, but it still under-weights rare classes. Macro is the fairest to all classes.
Our dataset has 153 complaint categories in the raw data.
A majority-class predictor on this data: accuracy 11.8%, macro-F1 ≈ 0.001.
Accuracy says "not bad." Macro-F1 says "you learned almost nothing." Macro-F1 is telling the truth.
Welcome to Module 04. This one is about how we measure whether our model is actually doing a good job. And I'll warn you up front — your intuition about what "good" means is probably wrong. Accuracy, the metric everyone reaches for first, can actively lie to you. Today we're going to understand why, and we're going to learn the metric we'll actually use all semester: macro-F1.
Let me give you a scenario that will bother you. Suppose you have a hundred complaint categories. One category, say "billing disputes," makes up 40 percent of the data. The other 99 categories share the remaining 60 percent. Now imagine I build a model that always predicts "billing dispute" no matter what you give it. That model gets 40 percent accuracy. It has learned absolutely nothing. It can't classify a single rare complaint correctly. But 40 percent accuracy? That doesn't sound catastrophic. And here's the really insidious part: if a few classes dominate the data, a model that only ever predicts those few classes can hit 50, 60, even 70 percent accuracy. Meanwhile it's completely useless for every rare class. Accuracy is dominated by whatever classes have the most examples.
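Here's a minimal sketch of that scenario so you can check the arithmetic yourself. The 1,000-example dataset and the `other_1` through `other_99` category names are invented for illustration; only the 40/60 split comes from the scenario.

```python
# Toy data: 40% "billing dispute", the remaining 60% spread over 99 other categories.
# The dataset size and the other_* names are made up for illustration.
y_true = ["billing dispute"] * 400
for i in range(600):
    y_true.append(f"other_{i % 99 + 1}")

# A "model" that ignores its input and always predicts the majority class.
y_pred = ["billing dispute"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# How many of the non-billing complaints did it get right?
rare_correct = sum(t == p for t, p in zip(y_true, y_pred) if t != "billing dispute")

print(f"accuracy: {accuracy:.0%}")
print(f"rare complaints classified correctly: {rare_correct}")
```

The accuracy comes out at 40 percent even though not one of the 600 non-billing complaints is classified correctly.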
So how do we get past accuracy and actually understand what the model is doing? The confusion matrix. It's a table where the rows are the true labels and the columns are the predicted labels. The diagonal — top-left to bottom-right — shows the counts where the model got it right. Everything off the diagonal is a mistake. Each cell tells you: "for examples that were truly class X, my model predicted class Y this many times." It's a complete accounting of every prediction your model made. No hiding behind a single number.
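If you want to see how that table is built, here's a small sketch in plain Python. The helper name `confusion_matrix` and the toy labels are my own choices for illustration, not the course's code.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Invented toy labels, purely for illustration.
labels = ["A", "B", "C"]
y_true = ["A", "A", "A", "B", "B", "B", "C", "C"]
y_pred = ["A", "A", "B", "B", "C", "B", "C", "B"]

cm = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, cm):
    print(label, row)

correct = sum(cm[i][i] for i in range(len(labels)))  # the diagonal
mistakes = len(y_true) - correct                     # everything off the diagonal
print(f"correct: {correct}, mistakes: {mistakes}")
```

sklearn's `confusion_matrix` in `sklearn.metrics` uses the same orientation: true labels on the rows, predicted labels on the columns.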
Let's read this matrix. Look at row B. Out of 50 class-B examples, the model only got 30 right. It called 8 of them A, and it called 12 of them C. So class B is the hardest class for our model. Now look at the relationship between B and C — the model confuses them in both directions. That's a signal. Maybe B and C have similar text. Maybe the labels are ambiguous. The confusion matrix doesn't just tell you "the model is 79 percent accurate." It tells you where it's confused and how badly. That's actionable information. When you're doing error analysis in this course, this is where you'll start.
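Here's a stand-in matrix you can practice reading. Only row B (8, 30, 12) and the 79 percent accuracy come from the walkthrough; rows A and C are hypothetical values I picked to be consistent with those figures.

```python
labels = ["A", "B", "C"]
# Rows are true labels, columns are predicted labels.
cm = [
    [90, 5, 5],    # true A (hypothetical)
    [8, 30, 12],   # true B (counts quoted in the walkthrough)
    [3, 9, 38],    # true C (hypothetical)
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total

# Per-class recall: the diagonal cell divided by its row total.
recall = {labels[i]: cm[i][i] / sum(cm[i]) for i in range(3)}
hardest = min(recall, key=recall.get)

# Confusion between B and C, in each direction.
b_as_c = cm[1][2]
c_as_b = cm[2][1]

print(f"accuracy: {accuracy:.0%}")
print(f"recall per class: {recall}")
print(f"hardest class: {hardest}")
print(f"B predicted as C: {b_as_c}, C predicted as B: {c_as_b}")
```

The single accuracy number hides what the rows show: class B has the lowest recall, and most of its errors land in C.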
Now let's turn the confusion matrix into numbers we can track. For each class, we compute three things. Precision answers: "when my model says this is class X, how often is it actually class X?" It's about the model's predictions being trustworthy. Recall answers: "of all the examples that truly are class X, how many did my model find?" It's about coverage — not missing things. F1 is the harmonic mean of precision and recall. Why harmonic mean and not regular average? Because the harmonic mean punishes you hard if either precision or recall is low. You can't get a high F1 by having great precision but terrible recall, or vice versa. You need both to be decent.
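To see why the harmonic mean matters, here's a short sketch. The counts are hypothetical, and returning 0.0 when a denominator is zero is a convention I'm assuming here, not something the definitions force.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention: 0.0 if undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts: the model is almost always right when it says "X"
# (high precision) but finds only a few of the true X examples (low recall).
p, r, f1 = precision_recall_f1(tp=10, fp=1, fn=90)
arithmetic_mean = (p + r) / 2

print(f"precision={p:.2f} recall={r:.2f}")
print(f"arithmetic mean={arithmetic_mean:.2f} F1={f1:.2f}")
```

With precision around 0.91 and recall at 0.10, the regular average is about 0.50, which sounds passable. F1 is about 0.18, which is much closer to how the model actually behaves on this class.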
Here's the metric that matters for us: macro-F1. The idea is simple. You compute F1 for each of your K classes independently. Then you average them. Unweighted. A class with 10 examples and a class with 10,000 examples contribute equally to the final number. This is a deliberate choice. We're saying: "I care about getting the rare classes right just as much as the common ones." And for our use case — classifying consumer complaints — that makes sense. A customer with a rare complaint type deserves a correct classification just as much as someone with a common one. This is the number we'll track all semester. When I say "your model's F1," I mean macro-F1.
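Here's a minimal sketch of that computation in plain Python. The function names and the two-class toy data are invented for illustration, and scoring a class as 0.0 when it has no true or predicted examples is an assumed convention.

```python
def f1_for_class(y_true, y_pred, label):
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention for empty classes

def macro_f1(y_true, y_pred):
    # Every label seen in either list gets one equal vote.
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)

# Invented toy data: "common" has 90 examples, "rare" has 10.
y_true = ["common"] * 90 + ["rare"] * 10
# The model gets every common example right and every rare example wrong.
y_pred = ["common"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy: {accuracy:.2f}")
print(f"macro-F1: {score:.2f}")
```

Accuracy is 0.90, but macro-F1 is about 0.47, because the rare class contributes an F1 of zero and counts for half of the average. In practice, `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` does this for you.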
Let me quickly distinguish the three averaging strategies you'll see in sklearn. Macro averages F1 across classes with equal weight; that's what we use. Micro pools all the true positives, false positives, and false negatives across all classes and computes one global F1. For single-label multiclass problems like ours, where every example gets exactly one predicted class, micro-F1 is mathematically equal to accuracy, because every mistake counts once as a false positive for the predicted class and once as a false negative for the true class. So it has the same problem: common classes dominate. Weighted F1 averages the per-class F1 scores in proportion to class size, which is a middle ground, but it still under-represents rare classes. We're going with macro because we explicitly want to treat every complaint type as equally important. If your model ignores a class with only 50 examples, macro-F1 will punish you for it. That's the behavior we want.
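Here's a sketch that computes all three averages side by side on the same predictions. The helper names and the toy data are invented for illustration; this is plain Python written for clarity, not sklearn's implementation.

```python
from collections import Counter

def per_class_counts(y_true, y_pred, label):
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention: 0.0 if undefined

def averaged_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    counts = {c: per_class_counts(y_true, y_pred, c) for c in labels}
    support = Counter(y_true)  # number of true examples per class
    macro = sum(f1(*counts[c]) for c in labels) / len(labels)
    weighted = sum(f1(*counts[c]) * support[c] for c in labels) / len(y_true)
    # Micro: pool TP/FP/FN over all classes, then compute a single F1.
    micro = f1(*(sum(col) for col in zip(*counts.values())))
    return {"macro": macro, "micro": micro, "weighted": weighted}

# Invented toy data: two common classes and one rare class the model ignores.
y_true = ["a"] * 50 + ["b"] * 45 + ["rare"] * 5
y_pred = ["a"] * 50 + ["b"] * 45 + ["a"] * 5

scores = averaged_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print({k: round(v, 2) for k, v in scores.items()}, f"accuracy={accuracy:.2f}")
```

On this toy data micro-F1 and accuracy are both 0.95, weighted F1 is about 0.93, and macro-F1 drops to about 0.65. Only macro makes the ignored rare class visible.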
Let's bring this home to our actual dataset. The raw data has 153 complaint categories. The class distribution is wildly imbalanced. The biggest class has about 7,600 examples — 11.8 percent of the data. The smallest classes have literally one or two examples. You'll also notice some suspiciously similar class names — like "Incorrect information on credit report" and "Incorrect information on your report." We'll deal with that label cleanup in Week 1. For now, the point is the imbalance. If I build a model that just predicts the single most common class every time, it gets 11.8 percent accuracy. That might not sound impressive, but random guessing across 153 classes would give you about 0.65 percent. So accuracy says this dumb model is 18 times better than random. Meanwhile, macro-F1 is essentially zero — 0.001 — because it gets zero F1 on 152 out of 153 classes. Macro-F1 is telling you the truth: a model that only knows one class has learned almost nothing useful.
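You can reproduce those numbers with a few lines of arithmetic. This sketch uses only the figures quoted here (153 classes, 11.8 percent in the largest class) and assumes "random guessing" means picking uniformly among the classes.

```python
n_classes = 153
majority_share = 0.118  # the largest class is 11.8% of the data

# A model that always predicts the majority class:
accuracy = majority_share   # right on every majority example, wrong on everything else
precision = majority_share  # every prediction is "majority"; 11.8% of them are correct
recall = 1.0                # every true majority example is found
f1_majority = 2 * precision * recall / (precision + recall)

# The other 152 classes are never predicted, so each has F1 = 0.
macro_f1 = (f1_majority + 0.0 * (n_classes - 1)) / n_classes

random_accuracy = 1 / n_classes  # assumption: uniform random guessing

print(f"accuracy: {accuracy:.1%}")
print(f"random-guess accuracy: {random_accuracy:.2%}")
print(f"accuracy vs random: {accuracy / random_accuracy:.0f}x")
print(f"majority-class F1: {f1_majority:.3f}")
print(f"macro-F1: {macro_f1:.4f}")
```

The majority class on its own scores an F1 of about 0.21. Averaged with 152 zeros, that gives a macro-F1 of roughly 0.001.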
Let me leave you with the key points. First, accuracy lies when classes are imbalanced — and our classes are very imbalanced. Second, confusion matrices are your best friend for understanding model behavior — they show you exactly what's going wrong. Third, precision and recall measure different things, and F1 balances them. Fourth, macro-F1 is our metric because it treats every class equally, which is what we want when we have over a hundred classes of very different sizes. In the notebook for this module, you'll build confusion matrices, compute these metrics, and see firsthand how accuracy and macro-F1 diverge on our real data. Go do the notebook — it'll make all of this concrete.