Calibration Basics

Model says (predicted confidence)	Actually correct (observed accuracy)
95%	~70%
80%	~55%
60%	~45%

Pattern	Meaning
Points on the diagonal	Well-calibrated
Points below diagonal (high conf.)	Overconfident
Points above diagonal (low conf.)	Underconfident
Histogram shows most predictions at extremes	Model is very "decisive"

Welcome to Module 05. Today we're going to talk about calibration — one of those concepts that separates people who train models from people who deploy them. If you've ever looked at a model's output probability and thought "great, the model is 92% confident," this module is going to make you question whether that number means what you think it means.

So let's start with a simple question. Your model says it's 90% confident about a prediction. What should that mean? If you collect every prediction where the model said 90%, and you check how many it actually got right, the answer should be about 90 out of 100. That's calibration. A well-calibrated model's probabilities actually mean what they say. When it says 80%, it's right about 80% of the time. When it says 50%, it's a coin flip. The predicted probability matches the observed frequency of being correct. Simple concept, but it turns out most models fail at this.
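To put that definition in symbols: for every confidence level p, the share of correct predictions among those made at confidence p should be about p. Here's a minimal sketch of that check in plain Python. The function name, the tolerance window, and the toy data are assumptions made up for illustration, not anything from the course notebook.

```python
def accuracy_at_confidence(confidences, correct, level, tolerance=0.05):
    """Observed accuracy among predictions whose confidence is near `level`.

    confidences: predicted probabilities for the chosen class (0.0 to 1.0)
    correct:     1 if that prediction was right, 0 if it was wrong
    tolerance:   hypothetical window around `level`; 0.05 is an arbitrary choice
    """
    hits = [c for p, c in zip(confidences, correct) if abs(p - level) <= tolerance]
    if not hits:
        return None  # no predictions near this confidence level
    return sum(hits) / len(hits)


# Toy data: ten predictions made at 90% confidence, nine of them right.
confidences = [0.9] * 10
correct = [1] * 9 + [0]
print(accuracy_at_confidence(confidences, correct, level=0.9))  # 0.9
```

On this toy data the model said 90% and was right 9 times out of 10, so at that confidence level it is calibrated.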

Here's the problem. Most neural networks — and this includes the transformer-based classifiers we'll be working with — are systematically overconfident. When a typical fine-tuned model says it's 95% confident, the real accuracy for those predictions might be only 70%. When it says 80%, maybe it's right 55% of the time. This isn't a bug in one particular model. It's a well-documented, systematic property of how neural networks are trained. The combination of overparameterization and the way cross-entropy loss works tends to push models toward producing overly sharp probability distributions. They learn to be way too sure of themselves.

Now here's why this matters in practice. Let's say you deploy a complaint classifier and the business rule is: only automatically route complaints to the right department when the model is at least 80% confident. If the model is well-calibrated, that's a sensible policy. Everything at or above the 80% line is right at least 80% of the time, so you'll route most things correctly and flag the uncertain ones for human review. But if the model is overconfident, that threshold stops meaning what you think it means. The model is saying 80% for predictions it gets wrong nearly half the time. You'd be auto-routing tons of complaints to the wrong department, and you'd have no idea because the confidence numbers look great. Miscalibration turns every probability threshold into a lie. And in production systems, people set thresholds constantly: for routing, for escalation, for alerting. If those probabilities aren't calibrated, none of those thresholds do what you think they do.
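Here's a rough sketch of what that routing rule does when the model is overconfident. Everything in it is hypothetical: the function name, the report fields, and the toy data, which just reuses the illustrative figures from earlier (says 95%, right 70%; says 80%, right 55%; says 60%, right 45%).

```python
def auto_route_report(confidences, correct, threshold=0.80):
    """Summarize a 'route automatically if confidence >= threshold' policy.

    confidences: predicted probability of the chosen department (0.0 to 1.0)
    correct:     1 if the chosen department was right, 0 otherwise
    """
    routed = [c for p, c in zip(confidences, correct) if p >= threshold]
    n_routed = len(routed)
    return {
        "routed": n_routed,
        "held_for_review": len(confidences) - n_routed,
        # Accuracy among auto-routed complaints only; None if nothing was routed.
        "routed_accuracy": sum(routed) / n_routed if n_routed else None,
    }


# Toy overconfident model: 100 complaints at each reported confidence level.
confidences = [0.95] * 100 + [0.80] * 100 + [0.60] * 100
correct = ([1] * 70 + [0] * 30) + ([1] * 55 + [0] * 45) + ([1] * 45 + [0] * 55)

print(auto_route_report(confidences, correct))
```

With these made-up numbers, 200 complaints clear the 80% threshold, but only 125 of them (62.5%) land in the right department, even though every one of them was reported at 80% confidence or higher.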

How do we actually see whether a model is calibrated? The standard visualization is called a reliability diagram, sometimes called a calibration plot. Here's how it works. You take all your predictions and group them into bins by predicted confidence. So one bin holds all the predictions where the model reported a confidence between 0 and 10%, another holds 10 to 20%, and so on. For each bin, you compute the actual accuracy: what fraction of those predictions were actually correct. Then you plot predicted confidence on the x-axis and actual accuracy on the y-axis. If your model is perfectly calibrated, you get the diagonal line, where every bin's actual accuracy matches the predicted confidence. In practice, an overconfident model will show a curve that falls below the diagonal at high confidence values. For example, the model is saying 90% but the actual accuracy is only 65%. You'll build one of these in the notebook exercise.
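The binning step can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's code: the function name, the choice of ten equal-width bins, and the toy data (again built from the lecture's illustrative figures) are all assumptions. It returns one point per non-empty bin, which you could then hand to any plotting library.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns a list of (mean_confidence, accuracy, count) tuples, one per
    non-empty bin, ordered from the lowest-confidence bin to the highest.
    """
    # Per bin: [sum of confidences, number correct, number of predictions]
    totals = [[0.0, 0, 0] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        totals[idx][0] += p
        totals[idx][1] += c
        totals[idx][2] += 1
    return [(s / n, k / n, n) for s, k, n in totals if n > 0]


# Toy overconfident model: 20 predictions at each reported confidence level.
confidences = [0.95] * 20 + [0.80] * 20 + [0.60] * 20
correct = ([1] * 14 + [0] * 6) + ([1] * 11 + [0] * 9) + ([1] * 9 + [0] * 11)

for mean_conf, accuracy, count in reliability_bins(confidences, correct):
    # x-axis value, y-axis value, and how many predictions back the point
    print(f"confidence {mean_conf:.2f}  accuracy {accuracy:.2f}  n={count}")
```

Each printed row is one point on the diagram. Using the mean confidence inside the bin as the x value, rather than the bin midpoint, is one common choice; either works for reading the overall shape.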

Let me walk you through how to read these diagrams because you'll be building them in the notebook. If the points sit on the diagonal, you're in good shape — the model is well-calibrated. If the points fall below the diagonal at high confidence levels, the model is overconfident. It's claiming high confidence but delivering lower accuracy. If points are above the diagonal at low confidence, the model is underconfident — it's actually better than it thinks. Usually you'll also see a histogram at the bottom showing where most predictions fall. Overconfident models tend to cluster predictions at the extremes, near 0 and near 1, because they're always very sure of themselves. In the notebook you'll build reliability diagrams for two synthetic models and see these patterns firsthand.
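Those reading rules boil down to the sign of the gap between accuracy and confidence at each point. Here's a small sketch, assuming a hypothetical helper named read_point and an arbitrary tolerance of 5 percentage points for what counts as "on the diagonal". The example points are the lecture's illustrative figures.

```python
def read_point(confidence, accuracy, tolerance=0.05):
    """Label one reliability-diagram point relative to the diagonal.

    tolerance is a made-up cutoff for 'close enough to the diagonal'.
    """
    gap = accuracy - confidence
    if abs(gap) <= tolerance:
        return "well-calibrated"
    # Below the diagonal: accuracy is lower than the confidence claimed.
    return "overconfident" if gap < 0 else "underconfident"


# (predicted confidence, observed accuracy) pairs
points = [(0.95, 0.70), (0.80, 0.55), (0.60, 0.45)]
for conf, acc in points:
    print(f"says {conf:.0%}, right {acc:.0%}: {read_point(conf, acc)}")
```

All three example points sit well below the diagonal, so each one comes out as overconfident.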

I want to be clear about scope. We are not going to fix calibration in this pre-work module. There are well-known techniques — temperature scaling, Platt scaling — that can recalibrate a model's outputs after training. We'll cover those in Week 4 when we get into post-training techniques. For now, the only goal is awareness. I want you to know that model probabilities are often unreliable. I want you to understand why that's a problem for any system that uses confidence thresholds — which is basically every production system. And I want you to be able to read a reliability diagram. That's it. If you walk away from this module a little suspicious of model confidence scores, I've done my job.

Let's recap. Calibration means your model's predicted probabilities actually correspond to how often it's correct. Most neural networks are systematically overconfident — they produce inflated confidence scores. This is a real operational problem because any production system that uses probability thresholds for routing, escalation, or filtering will make bad decisions if those probabilities are miscalibrated. Reliability diagrams are how you visualize and diagnose calibration — predicted confidence on the x-axis, actual accuracy on the y-axis, and you want the diagonal. We'll learn how to fix calibration in Week 4. For now, the important thing is that you know this problem exists and you can recognize it. Go do the notebook exercise and build some reliability diagrams yourself. Then move on to Module 06 where we'll start looking at parameter-efficient fine-tuning.

Pre-Work Module 05

ECBS5200 — Practical Deep Learning Engineering

What does "90% confident" mean?

The overconfidence problem

Why calibration matters in production

Reliability diagrams

Reliability diagrams: reading them

What we won't do yet

Key takeaways

Next up: Module 06 — LoRA & PEFT Basics