Welcome to Module 3. This one is short but extremely important. We're going to talk about something that trips up almost every beginner in deep learning — and honestly, it still bites experienced practitioners too. The issue is that a PyTorch model has two modes: training mode and evaluation mode. If you don't switch between them at the right times, your model can silently give you wrong answers. No error message, no crash — just subtly incorrect results. Let's understand why.
Here's the core issue. A PyTorch model is not always a pure function. If it contains layers like dropout or batch norm, the same model with the exact same weights can give you different outputs depending on whether it's in train mode or eval mode. You switch between the two with explicit method calls, model.train() and model.eval(), and you, the engineer, are responsible for calling them at the right time. A freshly constructed nn.Module starts out in train mode, so doing nothing is not a safe default. If you forget, PyTorch will not warn you. It will just silently do the wrong thing.
Let's start with model.train(). When you call this, two important things happen. First, dropout layers become active, provided the model has dropout configured above zero. Dropout randomly zeros out a fraction of the activations on every forward pass and scales the surviving ones up by 1/(1 - p), so the expected value stays the same. The idea is to prevent the model from relying too heavily on any single neuron; it's a regularization technique. The key consequence for us is that every forward pass gives you a slightly different output, even for the exact same input. Second, batch normalization layers normalize each batch using that batch's own mean and variance, and they update their running statistics along the way. ModernBERT doesn't use batch norm, but you'll encounter it in other architectures, so it's worth knowing about.
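To make the dropout behavior concrete, here is a minimal sketch using a standalone nn.Dropout layer on a vector of ones. The rate of 0.5 and the vector length are arbitrary choices for illustration, not values from the course notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)  # illustrative rate, chosen so the effect is easy to see
drop.train()              # modules start in train mode; made explicit here

x = torch.ones(1000)
out_a = drop(x)
out_b = drop(x)

# In train mode each element is either zeroed or scaled by 1 / (1 - p) = 2.0.
unique_values = sorted(out_a.unique().tolist())
zero_fraction = (out_a == 0).float().mean().item()

# Two passes over the same input draw two different random masks.
same_output = torch.equal(out_a, out_b)

print("values seen:", unique_values)
print("fraction zeroed:", round(zero_fraction, 2))
print("identical across passes:", same_output)
```

Roughly half the elements come back as zero and the rest as 2.0, and the two passes disagree about which elements were dropped.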
Now model.eval(). This flips both switches. Dropout turns off — every neuron stays active, no randomness. If you pass the same input twice, you get the exact same output. Batch normalization stops updating its running statistics and instead uses the ones it accumulated during training. Think of model.eval() as telling the model: "I'm done learning, just give me your best, consistent prediction." This is what you want when you're evaluating on a validation set, running a test set, or deploying to production. Consistent, deterministic outputs.
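Here is a small sketch of both switches flipping. The model is a toy stand-in with one batch norm layer and one dropout layer (layer sizes are arbitrary assumptions); it is not ModernBERT, which has no batch norm.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in model with both mode-sensitive layer types.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.BatchNorm1d(8),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(8, 2),
)
bn = model[1]
x = torch.randn(16, 4)

# Train mode: a forward pass updates the batch norm running statistics.
model.train()
before = bn.running_mean.clone()
model(x)
updated_in_train = not torch.equal(before, bn.running_mean)

# Eval mode: running statistics are frozen and dropout is a no-op.
model.eval()
frozen = bn.running_mean.clone()
y1 = model(x)
y2 = model(x)
unchanged_in_eval = torch.equal(frozen, bn.running_mean)
deterministic = torch.equal(y1, y2)

print(updated_in_train, unchanged_in_eval, deterministic, model.training)
```

The model.training attribute is a quick way to check which mode a model is currently in; model.eval() sets it to False on the model and all of its submodules.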
Now here's a nuance that's actually an important lesson. ModernBERT, the model we're using this semester, ships with dropout set to zero. That means if you load it and run the same input in train mode five times, you'll get the same answer every time. So why am I teaching you this? Three reasons. First, during fine-tuning — which you'll start doing in Week 1 — you may add dropout as regularization. Many practitioners set it to 0.1 or 0.2, and you might experiment with that in Week 2. Second, most other models you'll encounter in your career have non-zero dropout by default. BERT, DeBERTa, GPT-2 — they all ship with dropout turned on. The habit of calling model.eval() needs to be automatic regardless of which model you're using. Third, it's just professional discipline. In the notebook, we'll explicitly set dropout to 0.1 so you can actually see the difference. This way you understand the mechanism, not just the rule.
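One way to turn dropout on in a model that ships with it at zero is sketched below. The set_dropout helper is hypothetical, written for this example, and the model is a toy stand-in rather than ModernBERT. It only reaches nn.Dropout submodules; some models apply dropout through functional calls driven by their config, and for those you would set the rate in the config when loading instead.

```python
import torch
import torch.nn as nn

def set_dropout(model: nn.Module, p: float) -> int:
    """Hypothetical helper: set the rate on every nn.Dropout submodule.

    Returns the number of dropout layers that were changed.
    """
    changed = 0
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p
            changed += 1
    return changed

torch.manual_seed(0)

# Toy stand-in for a model that ships with dropout set to zero.
toy = nn.Sequential(nn.Linear(32, 32), nn.Dropout(p=0.0), nn.Linear(32, 2))
x = torch.randn(64, 32)

toy.train()
# With p = 0.0, train mode is still deterministic.
same_with_zero_dropout = torch.equal(toy(x), toy(x))

n_changed = set_dropout(toy, 0.1)
# With p = 0.1, two train-mode passes over the same input now differ.
same_with_dropout = torch.equal(toy(x), toy(x))

print(same_with_zero_dropout, n_changed, same_with_dropout)
```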
So what happens if you forget to call model.eval() and your model has dropout enabled? During validation, your metrics bounce around randomly between runs. You think the model improved, but actually dropout just happened to leave different neurons active this time. You make decisions based on noise. In deployment, it's even worse — the same complaint gets a different prediction each time you run it. A customer submits the same complaint twice and gets two different answers. You'll spend hours trying to figure out why your model is "flaky" when the fix is literally one line of code: model.eval(). Even if your current model has zero dropout, build the habit now. The one time you forget on a model that does have dropout, it'll cost you hours.
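The bouncing metrics are easy to reproduce. In this sketch the model is untrained and the validation data is random, so the accuracy itself is meaningless; the only point is how much it moves between runs. The accuracy_without_mode_switch function is a hypothetical helper that deliberately leaves the mode alone.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Untrained toy classifier with dropout; sizes and rate are illustrative.
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 2)
)
val_x = torch.randn(512, 16)          # fake validation inputs
val_y = torch.randint(0, 2, (512,))   # fake validation labels

def accuracy_without_mode_switch(net: nn.Module) -> float:
    # Deliberately does NOT call net.eval(); it uses whatever mode
    # the caller left the model in.
    preds = net(val_x).argmax(dim=1)
    return (preds == val_y).float().mean().item()

model.train()  # the bug: validating while still in train mode
train_mode_scores = [accuracy_without_mode_switch(model) for _ in range(5)]

model.eval()   # the one-line fix
eval_mode_scores = [accuracy_without_mode_switch(model) for _ in range(5)]

print("train mode:", [round(s, 3) for s in train_mode_scores])
print("eval mode: ", [round(s, 3) for s in eval_mode_scores])
```

Same weights, same data, five runs each: the train-mode scores wander while the eval-mode scores are identical.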
There's a companion to model.eval() that you'll almost always see alongside it: torch.no_grad(). These two are often confused, so let me be very clear about what each does. model.eval() changes the model's behavior: it turns off dropout and makes batch norm use its stored running statistics instead of updating them. It can change the numbers that come out, depending on your model's configuration. torch.no_grad() does not change the output values at all, and it does not turn off dropout. What it does is tell PyTorch to stop recording the computation graph, which means it no longer keeps the intermediate activations that backpropagation would need. During inference, you're not going to call loss.backward(), so there's no reason to store any of that. Skipping it saves a significant amount of memory, and some time, which matters when you're processing large batches. They solve different problems: eval() gives you correct outputs, no_grad() keeps inference cheap and helps you avoid out-of-memory errors. You need both.
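A short sketch of the difference, using a toy model with arbitrary layer sizes. The first half shows that no_grad() leaves the output values alone and only drops the gradient bookkeeping; the second half shows that no_grad() on its own does nothing about dropout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
model.eval()
x = torch.randn(4, 8)

tracked = model(x)            # autograd records the graph
with torch.no_grad():
    untracked = model(x)      # same math, no graph recorded

same_values = torch.equal(tracked, untracked)
print("same values:", same_values)
print("tracked has grad_fn:", tracked.grad_fn is not None)
print("untracked requires_grad:", untracked.requires_grad)

# no_grad() does not switch modes: a dropout layer left in train mode
# is still random inside the no_grad block.
noisy = nn.Dropout(p=0.5)     # new modules start in train mode
ones = torch.ones(1000)
with torch.no_grad():
    still_random = not torch.equal(noisy(ones), noisy(ones))
print("dropout still random under no_grad:", still_random)
```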
Here's the pattern you'll write every single week in this course. In the training loop, you call model.train() at the top of every epoch, then iterate over batches: zero the gradients, forward pass, compute loss, backward pass, optimizer step. Calling it every epoch matters, because if you evaluate between epochs, the evaluation loop leaves the model in eval mode. In the evaluation loop, you call model.eval() at the top, wrap everything in torch.no_grad(), and just do forward passes to compute metrics. No backward pass, no optimizer step. This is the canonical PyTorch training and evaluation pattern. You'll see it in our course notebooks, you'll see it in open-source projects, you'll see it in production code. Memorize it. Once you internalize this pattern, you'll never accidentally leave dropout on during validation again.
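Written out, the pattern looks roughly like this. Everything concrete here is an assumption made for the sketch: the random toy data, the tiny model, the AdamW settings, and the function names train_one_epoch and evaluate. Only the placement of model.train(), model.eval(), and torch.no_grad() is the part to memorize.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy data: the label depends on the sign of the first feature.
features = torch.randn(256, 10)
labels = (features[:, 0] > 0).long()
train_loader = DataLoader(
    TensorDataset(features[:192], labels[:192]), batch_size=32, shuffle=True
)
val_loader = DataLoader(TensorDataset(features[192:], labels[192:]), batch_size=32)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Dropout(0.1), nn.Linear(32, 2))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def train_one_epoch() -> float:
    model.train()                      # dropout on, at the top of every epoch
    total_loss = 0.0
    for xb, yb in train_loader:
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = loss_fn(model(xb), yb)  # forward pass and loss
        loss.backward()                # backward pass
        optimizer.step()               # weight update
        total_loss += loss.item()
    return total_loss / len(train_loader)

def evaluate() -> float:
    model.eval()                       # dropout off
    correct, total = 0, 0
    with torch.no_grad():              # no graph, less memory
        for xb, yb in val_loader:
            preds = model(xb).argmax(dim=1)
            correct += (preds == yb).sum().item()
            total += yb.numel()
    return correct / total

history = []
for epoch in range(3):
    train_loss = train_one_epoch()
    val_acc = evaluate()
    history.append((train_loss, val_acc))
    print(f"epoch {epoch}: loss={train_loss:.3f} val_acc={val_acc:.3f}")
```

Putting the mode switch inside each function, rather than in the outer loop, means neither function depends on what the caller did before it.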
Let me recap. model.train() turns on dropout and batch norm updates, so use it when you're training. model.eval() turns dropout off and makes batch norm use its stored running statistics, so use it when you're evaluating or predicting. Not all models have the same dropout defaults: ModernBERT ships with zero, but most other models don't, and you might add dropout during fine-tuning. Always call model.eval() anyway. It's a non-negotiable professional habit. torch.no_grad() is a separate thing: it saves memory by not recording what backpropagation would need, but it doesn't change the model's output. You need both during inference. Now go open the notebook. We'll set dropout to 0.1 and you'll see the randomness in train mode with your own eyes, then watch it disappear in eval mode. See you in Module 4.