Welcome to Module 06. In this module we're going to talk about LoRA and PEFT — two terms you'll hear constantly when people talk about fine-tuning large language models efficiently. We won't implement anything yet — that happens in Week 3. Right now I just want you to understand the core idea and why it matters. By the end of this, you should be able to explain to someone at a whiteboard why we don't always need to update every single parameter in a model.
Here's the problem. When you fine-tune a pretrained model the standard way — which is what we do in Weeks 1 and 2 — you update every single parameter. For ModernBERT-base, that's about 149 million parameters. Each one gets a gradient computed for it, each one gets optimizer state stored for it — things like momentum and variance for Adam, which actually means your optimizer is storing two extra copies of every parameter. And every time you save a checkpoint, you're saving all 149 million values to disk. That's a lot of compute, a lot of memory, and a lot of storage. And the natural question is: do we really need to update all of them? What if most of those changes are small and redundant?
The idea is surprisingly simple. What if we just froze most of the pretrained weights, literally set them to not receive gradient updates, and instead added a small number of brand new trainable parameters? If we could get nearly the same performance as full fine-tuning, we'd save enormous amounts of compute and memory. This general approach is called Parameter-Efficient Fine-Tuning, or PEFT. There are several PEFT methods out there, but the one that has become dominant in practice is called LoRA, short for Low-Rank Adaptation. It's what we'll use in Week 3, and it's very widely used in industry right now.
Here's how LoRA works conceptually. Think of the pretrained model as a textbook. It's full of knowledge learned during pretraining. In standard fine-tuning, you'd be rewriting pages of that textbook, changing every weight. With LoRA, you leave the textbook alone and instead attach small sticky notes that say "adjust this part slightly for our specific task." Technically, what LoRA does is add two small matrices alongside each target weight matrix. These two small matrices multiply together to produce an adjustment that gets added to the original frozen weight. The key insight is that this adjustment has a very low rank, and that's the "Low-Rank" in LoRA. So instead of training a 768-by-768 matrix with nearly 600,000 parameters, you train a 768-by-8 and an 8-by-768, which is about 12,000 parameters total. That's 48 times fewer trainable parameters for each weight matrix you adapt.
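If you want to check that arithmetic before the notebook, here's a minimal sketch in plain Python. The function names `lora_param_counts`, `matvec`, and `lora_forward` are made up for illustration, and the tiny 2-by-2 matrices are toy values, not anything from ModernBERT. It assumes the adjustment is simply added to the frozen output, with an optional `scale` factor.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full_matrix_params, lora_params) for one weight matrix."""
    full = d_in * d_out                # the frozen pretrained matrix W
    lora = d_in * rank + rank * d_out  # the two small trainable matrices
    return full, lora


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, scale=1.0):
    """Frozen path W @ x plus the low-rank adjustment B @ (A @ x)."""
    base = matvec(W, x)                # pretrained weights, never updated
    update = matvec(B, matvec(A, x))   # passes through the small rank-r bottleneck
    return [b + scale * u for b, u in zip(base, update)]


# The 768-by-768, rank-8 example from the text.
full, lora = lora_param_counts(768, 768, 8)
ratio = full / lora
print(full, lora, ratio)  # 589824 12288 48.0

# Toy check: if B is all zeros, the adapter changes nothing.
W = [[1, 0], [0, 1]]
A = [[1, 1]]        # rank 1: shape 1 x 2
B_zero = [[0], [0]]  # shape 2 x 1
unchanged = lora_forward(W, A, B_zero, [3, 4])

# With a nonzero B, the output shifts by B @ (A @ x).
B_one = [[1], [0]]
shifted = lora_forward(W, A, B_one, [3, 4])
```

Notice that the full 768-by-768 adjustment never has to be stored: you only ever keep the two small matrices and multiply through them.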
Let's put real numbers on this. With full fine-tuning of ModernBERT-base, you're training about 149 million parameters and your optimizer alone needs about 1.2 gigabytes of memory, because Adam keeps two 32-bit values for every parameter. With LoRA using a rank of 8, you're training roughly 1 to 2 million parameters, where the exact number depends on which layers you target, and your optimizer needs roughly 8 to 16 megabytes. Your checkpoints shrink from around 600 megabytes to somewhere between 4 and 8 megabytes. That's a massive difference. And the base model is identical: same architecture, same pretrained knowledge. You're just changing how much of it you adapt for your specific task. You'll compute the exact numbers yourself in the notebook.
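Here's a rough back-of-the-envelope version of those memory numbers. This is a sketch under stated assumptions: every value is stored as a 32-bit float (4 bytes), Adam keeps two state values per trainable parameter, and a megabyte is one million bytes. The helper names are hypothetical, and the 2 million figure is just the upper end of the range mentioned above.

```python
BYTES_PER_FLOAT32 = 4      # assumption: 32-bit storage everywhere
ADAM_STATES_PER_PARAM = 2  # momentum and variance


def optimizer_bytes(trainable_params):
    """Memory for Adam's state: two float32 values per trainable parameter."""
    return trainable_params * ADAM_STATES_PER_PARAM * BYTES_PER_FLOAT32


def checkpoint_bytes(saved_params):
    """Disk size of the saved weights: one float32 value per parameter."""
    return saved_params * BYTES_PER_FLOAT32


full_params = 149_000_000  # ModernBERT-base, full fine-tuning
lora_params = 2_000_000    # upper end of the 1 to 2 million range

full_opt_gb = optimizer_bytes(full_params) / 1e9    # 1.192
full_ckpt_mb = checkpoint_bytes(full_params) / 1e6  # 596.0
lora_opt_mb = optimizer_bytes(lora_params) / 1e6    # 16.0
lora_ckpt_mb = checkpoint_bytes(lora_params) / 1e6  # 8.0

print(f"full: optimizer {full_opt_gb:.2f} GB, checkpoint {full_ckpt_mb:.0f} MB")
print(f"LoRA: optimizer {lora_opt_mb:.0f} MB, checkpoint {lora_ckpt_mb:.0f} MB")
```

With 1 million trainable parameters instead of 2 million, both LoRA numbers simply halve.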
Now, LoRA is not magic. It's a trade-off, and you should understand both sides. On the plus side, training is faster because you're computing gradients for far fewer parameters. Checkpoints are tiny. You use less GPU memory. And here's a nice bonus — you can keep one base model and swap in different LoRA adapters for different tasks. That's really useful in production. On the minus side, you might lose a small amount of quality compared to full fine-tuning. How much depends entirely on the task. For many tasks, the difference is negligible. For some, it matters. You also have new hyperparameters to choose — the rank, which modules to target, the learning rate. And for some tasks, especially when you have plenty of compute, full fine-tuning still gives better results. The right answer depends on your constraints. In this course, we'll try both approaches and compare them directly.
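To make the adapter-swapping point concrete, here's a toy sketch of one frozen weight matrix shared by two tasks. Everything in it is hypothetical: the `AdaptedLayer` class, the task names, and the tiny matrices are invented for illustration, and this is not how any particular library organizes its adapters.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class AdaptedLayer:
    """One frozen weight matrix plus named low-rank adapters (toy example)."""

    def __init__(self, W):
        self.W = W          # frozen base weights, shared by every task
        self.adapters = {}  # task name -> (A, B) pair of small matrices

    def add_adapter(self, name, A, B):
        self.adapters[name] = (A, B)

    def forward(self, x, adapter=None):
        out = matvec(self.W, x)  # base model output
        if adapter is not None:
            A, B = self.adapters[adapter]
            update = matvec(B, matvec(A, x))  # task-specific adjustment
            out = [o + u for o, u in zip(out, update)]
        return out


layer = AdaptedLayer([[1, 0], [0, 1]])
layer.add_adapter("complaints", A=[[1, 0]], B=[[1], [0]])
layer.add_adapter("sentiment", A=[[0, 1]], B=[[0], [2]])

x = [3, 4]
base_out = layer.forward(x)                              # [3, 4]
complaints_out = layer.forward(x, adapter="complaints")  # [6, 4]
sentiment_out = layer.forward(x, adapter="sentiment")    # [3, 12]
```

The base weights are stored once and never change. Switching tasks only means picking a different pair of small matrices.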
In Week 3, you'll actually implement all of this. You'll take the same ModernBERT-base model you used in Weeks 1 and 2, apply LoRA adapters, fine-tune it on complaint classification, and directly compare the results against full fine-tuning. You'll see how the macro F1 compares, how fast training runs, and how big the checkpoints are. But that's Week 3. For now, I just want the concept to click. Freeze the pretrained weights. Add small trainable matrices alongside them. Train only those small matrices. That's LoRA. Go run the notebook — it takes about a minute on CPU — and you'll see the exact parameter counts that make this all concrete.
Let me recap. Full fine-tuning updates every parameter in the model — it works, but it's expensive. PEFT methods like LoRA take a smarter approach: freeze most of the pretrained weights and only train a small number of new parameters. LoRA specifically adds pairs of small matrices alongside the frozen layers. The parameter reduction is dramatic — we go from 149 million trainable parameters down to about 1 to 2 million. There are trade-offs — you might lose a tiny bit of quality — but for many tasks the difference is negligible and the savings are enormous. You'll implement LoRA yourself in Week 3 and see how it compares to full fine-tuning. For now, go run the notebook. It takes a minute on CPU, no GPU needed, and you'll see exactly what happens to the parameter counts when you apply LoRA. See you in Module 7.