ECBS5200 Pre-Work

Calibration Basics

Pre-Work Module 05

ECBS5200 — Practical Deep Learning Engineering

Module 05: Calibration Basics

What does "90% confident" mean?

Suppose your model predicts 90% confidence on 100 different complaints.

If the model is well-calibrated, about 90 of those 100 should actually be correct.

That's it. That's calibration.

  • Calibrated: predicted probability matches observed frequency
  • If the model says 80%, it should be right ~80% of the time
  • If the model says 50%, it should be right ~50% of the time
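To make the definition concrete, here is a minimal sketch that checks observed frequency against predicted confidence. The data is simulated so that the model is calibrated by construction; the helper name `observed_accuracy` and the tolerance `tol` are assumptions for illustration, not part of the course notebook.

```python
import random

def observed_accuracy(confidences, correct, level, tol=0.05):
    """Fraction correct among predictions whose confidence is within tol of level."""
    hits = [c for p, c in zip(confidences, correct) if abs(p - level) <= tol]
    return sum(hits) / len(hits) if hits else None

# Simulated (not real) data: each prediction is right with probability equal
# to its stated confidence, so this "model" is calibrated by construction.
rng = random.Random(0)
confidences = [rng.choice([0.5, 0.8, 0.9]) for _ in range(30_000)]
correct = [rng.random() < p for p in confidences]

for level in (0.5, 0.8, 0.9):
    print(level, round(observed_accuracy(confidences, correct, level), 2))
```

For a calibrated model, each printed accuracy should land close to its confidence level.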

The overconfidence problem

Most neural networks are overconfident.

A model can report 95% confidence while being right only about 70% of the time. Illustrative numbers:

Model says...    Actually correct...
95%              ~70%
80%              ~55%
60%              ~45%

This isn't a bug in one model — it's a systematic property of how neural networks are trained.

Modern deep networks tend to produce overly sharp probability distributions.
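One way to quantify the table on this slide is the gap between stated confidence and actual accuracy. The sketch below uses the slide's illustrative pairs, not measured results; `confidence_gap` is a hypothetical helper name.

```python
# Illustrative (confidence, accuracy) pairs copied from the slide; not measurements.
slide_rows = [(0.95, 0.70), (0.80, 0.55), (0.60, 0.45)]

def confidence_gap(confidence, accuracy):
    """Positive gap = overconfident, negative gap = underconfident."""
    return round(confidence - accuracy, 2)

gaps = [confidence_gap(c, a) for c, a in slide_rows]
print(gaps)  # [0.25, 0.25, 0.15]
```

Every gap is positive, which is what "systematically overconfident" looks like in numbers.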

Why calibration matters in production

Imagine you deploy a complaint classifier with a business rule:

"Only auto-route complaints where the model is >80% confident"

If the model is well-calibrated, this threshold is meaningful.
Complaints above 80% are routed correctly at least ~80% of the time.

If the model is overconfident, the threshold is meaningless.
The model says 80% for predictions it gets wrong nearly half the time.

Miscalibration turns every probability threshold into a lie.
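A rough sketch of how you might audit such a rule: take the predictions the rule would auto-route and compare their average confidence with their actual accuracy. The ten data points are made up for illustration, and `auto_route_accuracy` is a hypothetical helper, not part of any library.

```python
def auto_route_accuracy(confidences, correct, threshold=0.80):
    """Return (share auto-routed, accuracy among auto-routed) for confidence > threshold."""
    routed = [c for p, c in zip(confidences, correct) if p > threshold]
    if not routed:
        return 0.0, None
    return len(routed) / len(confidences), sum(routed) / len(routed)

# Made-up predictions from an overconfident classifier (illustration only).
confidences = [0.95, 0.92, 0.90, 0.88, 0.85, 0.83, 0.81, 0.75, 0.60, 0.55]
correct = [True, True, False, True, False, True, False, True, False, True]

share, accuracy = auto_route_accuracy(confidences, correct)
routed_conf = [p for p in confidences if p > 0.80]
print(f"auto-routed share:         {share:.0%}")
print(f"mean confidence (routed):  {sum(routed_conf) / len(routed_conf):.2f}")
print(f"actual accuracy (routed):  {accuracy:.2f}")
```

In this toy data the routed predictions average about 0.88 confidence but only 4 of 7 are correct, so the ">80%" rule promises far more than it delivers.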

Reliability diagrams

The standard tool for visualizing calibration is the reliability diagram.

  • X-axis: predicted confidence (binned, e.g., 0.0–0.1, 0.1–0.2, ...)
  • Y-axis: actual accuracy within that bin
  • Perfect calibration = the diagonal line

A curve below the diagonal at high confidence = overconfidence.
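The numbers behind a reliability diagram can be sketched in a few lines. This is a minimal version assuming equal-width bins; the function name `reliability_bins` and the toy data are assumptions, and the notebook may compute this differently.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns one (mean_confidence, accuracy, count) tuple per non-empty bin,
    ordered from lowest to highest confidence.
    """
    sums = [[0.0, 0, 0] for _ in range(n_bins)]  # confidence sum, hits, count
    for p, c in zip(confidences, correct):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[i][0] += p
        sums[i][1] += int(c)
        sums[i][2] += 1
    return [(s / n, h / n, n) for s, h, n in sums if n > 0]

# Toy data (illustration only).
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65, 0.35, 0.35]
correct = [True, True, True, False, True, False, False, True]

for mean_conf, acc, count in reliability_bins(confidences, correct):
    print(f"confidence {mean_conf:.2f}  accuracy {acc:.2f}  n={count}")
```

Plotting mean confidence (x) against accuracy (y) for each bin gives the diagram; the 0.95 bin here sits below the diagonal at 0.75 accuracy.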

Reliability diagrams: reading them

Pattern                                         Meaning
Points on the diagonal                          Well-calibrated
Points below diagonal (high conf.)              Overconfident
Points above diagonal (low conf.)               Underconfident
Histogram shows most predictions at extremes    Model is very "decisive"
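The reading rules in this table can be written down as a small function. This is a sketch only: the tolerance `tol` for "on the diagonal" is an arbitrary assumption, and `read_bin` is a hypothetical name.

```python
def read_bin(mean_confidence, accuracy, tol=0.05):
    """Label one reliability-diagram point relative to the diagonal."""
    gap = accuracy - mean_confidence
    if abs(gap) <= tol:
        return "well-calibrated"      # point is on (or near) the diagonal
    return "underconfident" if gap > 0 else "overconfident"

# Hypothetical (mean confidence, accuracy) points, for illustration.
points = [(0.90, 0.65), (0.30, 0.45), (0.70, 0.72)]
for conf, acc in points:
    print(f"confidence {conf:.2f}, accuracy {acc:.2f}: {read_bin(conf, acc)}")
```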

What we won't do yet

We are not going to fix calibration in this module.

Techniques like temperature scaling and Platt scaling exist to recalibrate models after training. We'll cover those in Week 4 of the course.

For now, the goal is:

  1. Know that model probabilities are often not trustworthy
  2. Understand why this matters for any system that uses probability thresholds
  3. Recognize a reliability diagram when you see one

That's sufficient for pre-work. The fix comes later.

Key takeaways

  1. Calibration = predicted probabilities match observed frequencies
  2. Most neural networks are systematically overconfident
  3. Overconfidence makes probability thresholds unreliable in production
  4. Reliability diagrams are the standard visualization tool
  5. We'll learn to fix calibration in Week 4 — for now, just know it matters

Next up: Module 06 — LoRA & PEFT Basics

Module 05: Calibration Basics

Welcome to Module 05. Today we're going to talk about calibration — one of those concepts that separates people who train models from people who deploy them. If you've ever looked at a model's output probability and thought "great, the model is 92% confident," this module is going to make you question whether that number means what you think it means.

So let's start with a simple question. Your model says it's 90% confident about a prediction. What should that mean? If you collect every prediction where the model said 90%, and you check how many it actually got right, the answer should be about 90 out of 100. That's calibration. A well-calibrated model's probabilities actually mean what they say. When it says 80%, it's right about 80% of the time. When it says 50%, it's a coin flip. The predicted probability matches the observed frequency of being correct. Simple concept, but it turns out most models fail at this.

Here's the problem. Most neural networks — and this includes the transformer-based classifiers we'll be working with — are systematically overconfident. When a typical fine-tuned model says it's 95% confident, the real accuracy for those predictions might be only 70%. When it says 80%, maybe it's right 55% of the time. This isn't a bug in one particular model. It's a well-documented, systematic property of how neural networks are trained. The combination of overparameterization and the way cross-entropy loss works tends to push models toward producing overly sharp probability distributions. They learn to be way too sure of themselves.

Now here's why this matters in practice. Let's say you deploy a complaint classifier and the business rule is: only automatically route complaints to the right department when the model is at least 80% confident. If the model is well-calibrated, that's a sensible policy. 80% confidence means 80% accuracy — you'll route most things correctly and flag the uncertain ones for human review. But if the model is overconfident, that threshold is completely meaningless. The model is saying 80% for predictions it gets wrong nearly half the time. You'd be auto-routing tons of complaints to the wrong department, and you'd have no idea because the confidence numbers look great. Miscalibration turns every probability threshold into a lie. And in production systems, people set thresholds constantly — for routing, for escalation, for alerting. If those probabilities aren't calibrated, none of those thresholds do what you think they do.

How do we actually see whether a model is calibrated? The standard visualization is called a reliability diagram — sometimes called a calibration plot. Here's how it works. You take all your predictions and group them into bins by predicted confidence. So one bin has all predictions where the model said between 0 and 10% confident, another bin for 10 to 20%, and so on. For each bin, you compute the actual accuracy — what fraction of those predictions were actually correct. Then you plot predicted confidence on the x-axis and actual accuracy on the y-axis. If your model is perfectly calibrated, you get the diagonal line — every bin's actual accuracy matches the predicted confidence. In practice, an overconfident model will show a curve that falls below the diagonal at high confidence values. The model is saying 90% but the actual accuracy is only 65%. You'll build one of these in the notebook exercise.

Let me walk you through how to read these diagrams because you'll be building them in the notebook. If the points sit on the diagonal, you're in good shape — the model is well-calibrated. If the points fall below the diagonal at high confidence levels, the model is overconfident. It's claiming high confidence but delivering lower accuracy. If points are above the diagonal at low confidence, the model is underconfident — it's actually better than it thinks. Usually you'll also see a histogram at the bottom showing where most predictions fall. Overconfident models tend to cluster predictions at the extremes, near 0 and near 1, because they're always very sure of themselves. In the notebook you'll build reliability diagrams for two synthetic models and see these patterns firsthand.
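If you want a feel for that histogram before opening the notebook, here is a rough text-only sketch. The confidence values are invented to mimic a very "decisive" model, and both function names are hypothetical; the notebook will likely use a plotting library instead.

```python
def confidence_histogram(confidences, n_bins=10):
    """Count predictions per equal-width confidence bin."""
    counts = [0] * n_bins
    for p in confidences:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts

def render(counts, width=40):
    """Draw the counts as horizontal text bars, one line per bin."""
    peak = max(counts) or 1
    lines = []
    for i, n in enumerate(counts):
        lo, hi = i / len(counts), (i + 1) / len(counts)
        bar = "#" * round(width * n / peak)
        lines.append(f"{lo:.1f}-{hi:.1f} | {bar} {n}")
    return "\n".join(lines)

# Invented confidences that cluster near 0 and near 1 (illustration only).
confidences = [0.99] * 6 + [0.97] * 4 + [0.75] * 2 + [0.55] + [0.02] * 5
counts = confidence_histogram(confidences)
print(render(counts))
```

Most of the bars pile up in the first and last bins, which is the "clustered at the extremes" pattern described above.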

I want to be clear about scope. We are not going to fix calibration in this pre-work module. There are well-known techniques — temperature scaling, Platt scaling — that can recalibrate a model's outputs after training. We'll cover those in Week 4 when we get into post-training techniques. For now, the only goal is awareness. I want you to know that model probabilities are often unreliable. I want you to understand why that's a problem for any system that uses confidence thresholds — which is basically every production system. And I want you to be able to read a reliability diagram. That's it. If you walk away from this module a little suspicious of model confidence scores, I've done my job.

Let's recap. Calibration means your model's predicted probabilities actually correspond to how often it's correct. Most neural networks are systematically overconfident — they produce inflated confidence scores. This is a real operational problem because any production system that uses probability thresholds for routing, escalation, or filtering will make bad decisions if those probabilities are miscalibrated. Reliability diagrams are how you visualize and diagnose calibration — predicted confidence on the x-axis, actual accuracy on the y-axis, and you want the diagonal. We'll learn how to fix calibration in Week 4. For now, the important thing is that you know this problem exists and you can recognize it. Go do the notebook exercise and build some reliability diagrams yourself. Then move on to Module 06 where we'll start looking at parameter-efficient fine-tuning.