ECBS5200 Pre-Work

Train vs Eval Mode

Pre-Work Module 03

ECBS5200 — Practical Deep Learning Engineering

Module 03: Train vs Eval Mode

The problem: same model, two behaviors

A PyTorch model can behave differently depending on whether it's in train mode or eval mode.

model.train()   # Training behavior
model.eval()    # Inference behavior

Same weights. Same architecture. Potentially different outputs.

If you don't switch modes correctly, your metrics can be wrong and your deployed model unreliable.
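
A quick way to see which mode a model is in is the `training` flag that every `nn.Module` carries. The sketch below uses a hypothetical toy model (not the course model) to show that `train()` and `eval()` flip that flag on the model and on all of its submodules.

```python
import torch
from torch import nn

# Hypothetical toy model, used only to inspect the mode flag.
model = nn.Sequential(nn.Linear(4, 8), nn.Dropout(p=0.5), nn.Linear(8, 2))

# A freshly constructed module starts in train mode.
print("fresh model in train mode:", model.training)

# eval() sets training=False on the model and every submodule.
model.eval()
eval_flags = [m.training for m in model.modules()]

# train() sets training=True everywhere again.
model.train()
train_flags = [m.training for m in model.modules()]

print("flags after eval(): ", eval_flags)
print("flags after train():", train_flags)
```

Checking `model.training` right before a validation run is a cheap sanity check when results look noisy.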

What model.train() does

Activates training-time behaviors:

1. Dropout is ON (when configured with dropout > 0)

  • Randomly sets a fraction p of activations to zero on each forward pass and scales the survivors by 1/(1 - p)
  • Forces the network to not rely on any single neuron, a form of regularization
  • Each forward pass can produce a different output, even for the same input

2. BatchNorm updates running statistics

  • Normalizes each batch with that batch's own mean and variance
  • Updates running estimates of the mean and variance, which are used later during inference
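
Both train-mode behaviors can be observed on standalone layers. This is a minimal sketch with made-up tensor shapes and a fixed seed; it uses `nn.Dropout` and `nn.BatchNorm1d` directly rather than any course model.

```python
import torch
from torch import nn

torch.manual_seed(0)

# --- Dropout in train mode ---
drop = nn.Dropout(p=0.5)
drop.train()
x = torch.ones(1000)
out_a = drop(x)
out_b = drop(x)

# Dropped entries become 0; survivors are scaled by 1 / (1 - p) = 2.
unique_values = set(out_a.tolist())
# Two passes over the same input use different random masks.
differs = not torch.equal(out_a, out_b)

# --- BatchNorm in train mode ---
bn = nn.BatchNorm1d(3)
bn.train()
before = bn.running_mean.clone()      # starts at zeros
bn(torch.randn(16, 3) + 5.0)          # one forward pass on data centered near 5
after = bn.running_mean.clone()       # nudged toward the batch mean

print("values seen after dropout:", unique_values)
print("two passes differ:", differs)
print("running mean before:", before)
print("running mean after: ", after)
```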

What model.eval() does

Activates inference-time behaviors:

1. Dropout is OFF

  • All neurons are active — no randomness
  • Output is deterministic for the same input

2. BatchNorm uses frozen running statistics

  • Uses the mean/variance accumulated during training
  • Does not update them with new data

model.eval()  # "I'm done training, give me consistent predictions"
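
The eval-mode behavior can be checked the same way. In this sketch (a hypothetical toy model with one BatchNorm layer and one Dropout layer, running on CPU), two forward passes on the same input match exactly and the BatchNorm running statistics stay untouched.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy model containing both mode-sensitive layer types.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.BatchNorm1d(8),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(8, 2),
)
model.eval()

x = torch.randn(6, 4)
bn = model[1]
stats_before = bn.running_mean.clone()

first = model(x)
second = model(x)

# Dropout is off, so repeated passes give the same result.
same_output = torch.equal(first, second)
# BatchNorm reads its running statistics but does not update them.
stats_frozen = torch.equal(stats_before, bn.running_mean)

print("same output twice:", same_output)
print("running stats unchanged:", stats_frozen)
```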

A real-world nuance: ModernBERT defaults to dropout = 0.0

Our course model, ModernBERT-base, ships with dropout disabled (p=0.0).

That means out of the box, train mode and eval mode produce identical outputs.

So why does this module exist? Because:

  1. You may add dropout during fine-tuning — it's common to set 0.1 or 0.2 as regularization
  2. Other models use dropout by default — BERT, DeBERTa, GPT-2 all ship with non-zero dropout
  3. It's a professional habit — always call model.eval() for inference, no exceptions

In the notebook, we'll set dropout to 0.1 so you can see the effect firsthand.
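
To illustrate the p=0.0 case without downloading a model, here is a sketch on a hypothetical toy network. With dropout at 0.0, train and eval outputs match; after raising `p` on each `nn.Dropout` module, train mode becomes random. Changing the `p` attribute in place is just one generic way to do this and is an assumption here; the notebook may configure dropout differently.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy model whose dropout is disabled, like a p=0.0 default.
model = nn.Sequential(nn.Linear(4, 16), nn.Dropout(p=0.0), nn.Linear(16, 2))
x = torch.randn(32, 4)

model.train()
train_out = model(x)
model.eval()
eval_out = model(x)
# With p=0.0 dropout is a no-op, so the two modes agree.
identical_at_zero = torch.equal(train_out, eval_out)

# Raise the dropout probability on every nn.Dropout module in the model.
for module in model.modules():
    if isinstance(module, nn.Dropout):
        module.p = 0.1

model.train()
noisy_a = model(x)
noisy_b = model(x)
# Now train mode applies a fresh random mask on each pass.
differs_after_change = not torch.equal(noisy_a, noisy_b)

print("identical with p=0.0:", identical_at_zero)
print("train passes differ with p=0.1:", differs_after_change)
```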

Why forgetting eval() can burn you

During validation (when dropout > 0):

  • Dropout is still randomly zeroing neurons
  • Your validation accuracy fluctuates randomly between runs
  • You can't tell if a change improved the model or if you just got lucky

During deployment:

  • The same customer complaint gets a different prediction each time
  • Your model appears non-deterministic to users
  • You'll spend hours debugging something that "works sometimes"

The rule: always call model.eval() for inference. No exceptions.
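
The validation problem is easy to reproduce. This sketch scores a hypothetical untrained toy classifier on random stand-in data five times in each mode. The absolute accuracy is meaningless here; the point is only whether the number is stable from run to run.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy classifier and random stand-in "validation" data.
model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(32, 2)
)
features = torch.randn(200, 10)
labels = torch.randint(0, 2, (200,))


def accuracy(net):
    """Fraction of correct predictions on the stand-in validation set."""
    with torch.no_grad():
        preds = net(features).argmax(dim=1)
    return (preds == labels).float().mean().item()


# The bug: validating while the model is still in train mode.
model.train()
train_mode_scores = [accuracy(model) for _ in range(5)]

# The fix: switch to eval mode first.
model.eval()
eval_mode_scores = [accuracy(model) for _ in range(5)]

print("train mode:", train_mode_scores)
print("eval mode: ", eval_mode_scores)
```

Note that `accuracy` already uses torch.no_grad(), and the train-mode scores can still move around: no_grad() does not turn dropout off.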

The companion: torch.no_grad()

model.eval() changes model behavior. torch.no_grad() turns off gradient tracking, which reduces memory usage but leaves the outputs unchanged.

with torch.no_grad():
    outputs = model(**inputs)

                   model.eval()                          torch.no_grad()
What it does       Disables dropout, freezes batchnorm   Disables gradient computation
Why you need it    Correct predictions                   Save memory
Affects output?    Can change the result                 No, same result either way
Forgetting it      Potentially wrong metrics             OOM errors on large batches

They do different things. You need both during inference.
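
The split between the two can be checked directly. In this sketch (hypothetical toy model, CPU), no_grad() leaves the eval-mode output unchanged and only removes the autograd graph, while no_grad() on its own does nothing to stop dropout in train mode.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy model with dropout enabled.
model = nn.Sequential(nn.Linear(4, 16), nn.Dropout(p=0.5), nn.Linear(16, 2))
x = torch.randn(64, 4)

# eval mode, with and without gradient tracking.
model.eval()
with_graph = model(x)
with torch.no_grad():
    without_graph = model(x)

# Same numbers either way; only the bookkeeping differs.
same_values = torch.equal(with_graph, without_graph)
tracks_grad = (with_graph.requires_grad, without_graph.requires_grad)

# no_grad() alone does not disable dropout: train mode is still random.
model.train()
with torch.no_grad():
    still_random = not torch.equal(model(x), model(x))

print("same values:", same_values)
print("requires_grad (tracked, no_grad):", tracks_grad)
print("dropout still active under no_grad in train mode:", still_random)
```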

The pattern you'll use every week

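# === TRAINING LOOP ===
model.train()                    # enable dropout and batchnorm updates
for batch in train_loader:
    outputs = model(**batch)     # forward pass
    loss = outputs.loss          # populated when the batch includes labels
    loss.backward()              # compute gradients
    optimizer.step()             # update weights
    optimizer.zero_grad()        # clear gradients before the next batch

# === EVALUATION LOOP ===
model.eval()                     # disable dropout, use frozen batchnorm statistics
with torch.no_grad():            # skip gradient tracking to save memory
    for batch in eval_loader:
        outputs = model(**batch)
        # compute metrics; no loss.backward() here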

Every week. Every experiment. This pattern.

Train loop: model.train() — Eval loop: model.eval() + torch.no_grad()
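
For a version of this pattern that runs end to end without any downloads, here is a sketch on synthetic data. Everything in it is a stand-in chosen for illustration: the toy classifier, the random features, the SGD optimizer, the batch size, and the `evaluate` helper. One detail worth noticing when evaluation happens every epoch: `evaluate` leaves the model in eval mode, so model.train() is called again at the start of each epoch.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: the label is 1 when the first feature is positive.
features = torch.randn(256, 8)
labels = (features[:, 0] > 0).long()
train_loader = DataLoader(
    TensorDataset(features[:192], labels[:192]), batch_size=32, shuffle=True
)
eval_loader = DataLoader(TensorDataset(features[192:], labels[192:]), batch_size=32)

# Hypothetical toy classifier with dropout, standing in for a real model.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()


def evaluate(net, loader):
    """Return accuracy on loader, using eval mode and no gradient tracking."""
    net.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, targets in loader:
            preds = net(inputs).argmax(dim=1)
            correct += (preds == targets).sum().item()
            total += targets.numel()
    return correct / total


for epoch in range(3):
    model.train()  # evaluate() switched to eval mode, so switch back
    for inputs, targets in train_loader:
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    accuracy = evaluate(model, eval_loader)
    print(f"epoch {epoch}: eval accuracy = {accuracy:.3f}")
```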

Key takeaways

  1. model.train() enables dropout and batchnorm updates — use it for training
  2. model.eval() disables dropout and freezes batchnorm — use it for inference
  3. Not all models have the same defaults — ModernBERT ships with dropout=0.0, but many don't
  4. Always call model.eval() anyway — it's a professional habit, not an optional step
  5. torch.no_grad() saves memory by skipping gradient tracking — different from eval()
  6. The pattern: train loop = model.train(), eval loop = model.eval() + torch.no_grad()

Next: try it yourself in the notebook!

Module 03: Train vs Eval Mode

Welcome to Module 3. This one is short but extremely important. We're going to talk about something that trips up almost every beginner in deep learning — and honestly, it still bites experienced practitioners too. The issue is that a PyTorch model has two modes: training mode and evaluation mode. If you don't switch between them at the right times, your model can silently give you wrong answers. No error message, no crash — just subtly incorrect results. Let's understand why.

Here's the core issue. A PyTorch model is not always a pure function. Depending on its configuration, the same model with the exact same weights can give you different outputs depending on whether it's in train mode or eval mode. These are two explicit method calls — model.train() and model.eval() — and you, the engineer, are responsible for calling them at the right time. If you forget, PyTorch will not warn you. It will just silently do the wrong thing.

Let's start with model.train(). When you call this, two important things happen. First, dropout layers become active, provided the model has dropout configured above zero. Dropout randomly zeros out a fraction of the neurons on every forward pass and scales up the ones that remain to compensate. The idea is to prevent the model from relying too heavily on any single neuron; it's a regularization technique. The key consequence for us is that every forward pass gives you a slightly different output, even for the exact same input. Second, batch normalization layers normalize each batch using that batch's own mean and variance, and they update the running statistics that will be used later at inference time. ModernBERT doesn't use batch norm, but you'll encounter it in other architectures, so it's worth knowing about.

Now model.eval(). This flips both switches. Dropout turns off — every neuron stays active, no randomness. If you pass the same input twice, you get the exact same output. Batch normalization stops updating its running statistics and instead uses the ones it accumulated during training. Think of model.eval() as telling the model: "I'm done learning, just give me your best, consistent prediction." This is what you want when you're evaluating on a validation set, running a test set, or deploying to production. Consistent, deterministic outputs.

Now here's a nuance that's actually an important lesson. ModernBERT, the model we're using this semester, ships with dropout set to zero. That means if you load it and run the same input in train mode five times, you'll get the same answer every time. So why am I teaching you this? Three reasons. First, during fine-tuning — which you'll start doing in Week 1 — you may add dropout as regularization. Many practitioners set it to 0.1 or 0.2, and you might experiment with that in Week 2. Second, most other models you'll encounter in your career have non-zero dropout by default. BERT, DeBERTa, GPT-2 — they all ship with dropout turned on. The habit of calling model.eval() needs to be automatic regardless of which model you're using. Third, it's just professional discipline. In the notebook, we'll explicitly set dropout to 0.1 so you can actually see the difference. This way you understand the mechanism, not just the rule.

So what happens if you forget to call model.eval() and your model has dropout enabled? During validation, your metrics bounce around randomly between runs. You think the model improved, but actually dropout just happened to leave different neurons active this time. You make decisions based on noise. In deployment, it's even worse — the same complaint gets a different prediction each time you run it. A customer submits the same complaint twice and gets two different answers. You'll spend hours trying to figure out why your model is "flaky" when the fix is literally one line of code: model.eval(). Even if your current model has zero dropout, build the habit now. The one time you forget on a model that does have dropout, it'll cost you hours.

There's a companion to model.eval() that you'll always see alongside it: torch.no_grad(). These two are often confused, so let me be very clear about what each does. model.eval() changes the model's behavior: it turns off dropout and freezes batch norm. It can change the numbers that come out, depending on your model's configuration. torch.no_grad() does not change the output at all. What it does is tell PyTorch to stop tracking gradients, meaning it no longer records the operations and intermediate values that backpropagation would need. During inference, you're not going to call loss.backward(), so there's no reason to store any of that. Skipping it saves a significant amount of memory, which matters when you're processing large batches. They solve different problems: eval() gives you correct outputs, no_grad() helps prevent out-of-memory errors. You need both.

Here's the pattern you'll write every single week in this course. In the training loop, you call model.train() once at the top, then iterate over batches: forward pass, compute loss, backward pass, optimizer step, then zero the gradients. In the evaluation loop, you call model.eval() at the top, wrap everything in torch.no_grad(), and just do forward passes to compute metrics. No backward pass, no optimizer step. This is the canonical PyTorch training and evaluation pattern. You'll see it in our course notebooks, you'll see it in open-source projects, you'll see it in production code. Memorize it. Once you internalize this pattern, you'll never accidentally leave dropout on during validation again.

Let me recap. model.train() turns on dropout and batch norm updates — use it when you're training. model.eval() turns them off — use it when you're evaluating or predicting. Not all models have the same dropout defaults — ModernBERT ships with zero, but most other models don't, and you might add dropout during fine-tuning. Always call model.eval() anyway — it's a non-negotiable professional habit. torch.no_grad() is a separate thing — it saves memory by not computing gradients, but it doesn't change the model's output. You need both during inference. Now go open the notebook — we'll set dropout to 0.1 and you'll see the randomness in train mode with your own eyes, then watch it disappear in eval mode. See you in Module 4.