ECBS5200 Pre-Work

LoRA & PEFT Basics

Pre-Work Module 06

ECBS5200 — Practical Deep Learning Engineering

Module 06: LoRA/PEFT Basics

The problem: fine-tuning is expensive

When we fine-tune ModernBERT-base, we update all 149 million parameters.

  • Every parameter gets a gradient
  • Every parameter gets an optimizer state (momentum, variance)
  • Every checkpoint saves all 149M parameters
  • Training is slow, memory-hungry, and wasteful

What if most of those updates aren't really necessary?

The idea: freeze most, train a few

What if we could:

  1. Freeze the pretrained weights (don't update them at all)
  2. Add a small number of new, trainable parameters
  3. Get nearly the same performance as full fine-tuning

This is the core idea behind Parameter-Efficient Fine-Tuning (PEFT).

LoRA is the most popular PEFT method — and the one we'll use.

LoRA: sticky notes on a textbook

LoRA = Low-Rank Adaptation

Instead of rewriting the textbook, we add sticky notes.

  • The original weight matrix stays frozen (the textbook)
  • We add two small matrices that together produce an adjustment (the sticky notes)
  • During inference: original weight + adjustment = adapted behavior
  • The adjustment matrices are much smaller than the original

Original: 768 × 768 = 589,824 parameters  (frozen)
LoRA:     768 × 8 + 8 × 768 = 12,288 parameters  (trainable)

That's 48x fewer trainable parameters for that weight matrix.
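The arithmetic above can be checked in a few lines. This is a minimal sketch, not the course notebook: the helper names `full_matrix_params` and `lora_pair_params` are made up for illustration, and the shapes (768 × 768, rank 8) are the ones from this slide.

```python
def full_matrix_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense d_in x d_out weight matrix."""
    return d_in * d_out


def lora_pair_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: (d_in x rank) and (rank x d_out)."""
    return d_in * rank + rank * d_out


# Shapes from the slide: one 768 x 768 weight matrix, LoRA rank 8.
frozen = full_matrix_params(768, 768)
trainable = lora_pair_params(768, 768, rank=8)
reduction = frozen / trainable

print(f"frozen: {frozen:,}  trainable: {trainable:,}  reduction: {reduction:.0f}x")
# prints: frozen: 589,824  trainable: 12,288  reduction: 48x
```

Note that the LoRA count grows linearly with the rank, so doubling the rank doubles the trainable parameters for that matrix.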

The numbers

For ModernBERT-base with a typical LoRA configuration:

                    Parameters         Optimizer memory
Full fine-tuning    ~149M trainable    ~1.2 GB
LoRA (r=8)          ~1-2M trainable    ~16 MB

  • Same base model, same architecture, same inference
  • Fraction of the training cost
  • Checkpoints go from ~600 MB to ~8 MB

You'll see the exact numbers in the notebook.
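The memory and checkpoint figures on this slide follow from simple arithmetic. The sketch below rests on assumptions the slide does not state: values are stored as 32-bit floats (4 bytes each), the optimizer is Adam with two state values per trainable parameter, and the LoRA case uses 2M parameters, the upper end of the ~1-2M range. The function names are illustrative only.

```python
BYTES_PER_FP32 = 4  # assumption: everything is stored as 32-bit floats
ADAM_STATES = 2     # momentum and variance, one value each per trainable parameter


def optimizer_bytes(trainable_params: int) -> int:
    """Memory for Adam's state, which exists only for trainable parameters."""
    return trainable_params * ADAM_STATES * BYTES_PER_FP32


def checkpoint_bytes(saved_params: int) -> int:
    """Size of a checkpoint that stores one fp32 value per saved parameter."""
    return saved_params * BYTES_PER_FP32


full_params = 149_000_000  # full fine-tuning: everything is trainable and saved
lora_params = 2_000_000    # assumption: upper end of the ~1-2M range

full_opt_gb = optimizer_bytes(full_params) / 1e9
lora_opt_mb = optimizer_bytes(lora_params) / 1e6
full_ckpt_mb = checkpoint_bytes(full_params) / 1e6
lora_ckpt_mb = checkpoint_bytes(lora_params) / 1e6

print(f"optimizer:  {full_opt_gb:.1f} GB -> {lora_opt_mb:.0f} MB")
print(f"checkpoint: {full_ckpt_mb:.0f} MB -> {lora_ckpt_mb:.0f} MB")
```

Under these assumptions the results land on the slide's rounded figures: about 1.2 GB versus 16 MB of optimizer state, and about 600 MB versus 8 MB per checkpoint.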

Trade-offs

LoRA is not a free lunch. Here's what you're trading:

Advantages:

  • Usually faster training (gradients and optimizer updates for far fewer parameters)
  • Much smaller checkpoints (only save the adapters)
  • Lower GPU memory requirements
  • Can swap adapters for different tasks on the same base model

Potential downsides:

  • Might lose a small amount of quality vs. full fine-tuning
  • Need to choose rank, target modules, learning rate
  • For some tasks, full fine-tuning still wins

Whether the trade-off is worth it depends on your task and constraints.

Week 3 preview

In Week 3, you will:

  1. Take the same ModernBERT-base model from Weeks 1-2
  2. Apply LoRA adapters to it
  3. Fine-tune on our complaint classification task
  4. Compare performance and training cost vs. full fine-tuning

For now, just understand the concept:

  • Freeze the base model
  • Add small trainable matrices
  • Train only those

The notebook for this module shows the setup — no training, just the parameter counts.

Key takeaways

  1. Full fine-tuning updates all parameters — expensive in compute, memory, and storage
  2. PEFT methods freeze most weights and add a small number of trainable ones
  3. LoRA adds low-rank matrices alongside frozen weight matrices — like sticky notes on a textbook
  4. The parameter reduction is dramatic — from ~149M to ~1-2M trainable parameters
  5. Trade-offs exist: a small quality loss is possible, but it is often negligible
  6. You'll implement this in Week 3 — for now, understand the concept and see the numbers

Next: run the notebook and see the parameter counts yourself!

Module 06: LoRA/PEFT Basics

Welcome to Module 06. In this module we're going to talk about LoRA and PEFT — two terms you'll hear constantly when people talk about fine-tuning large language models efficiently. We won't implement anything yet — that happens in Week 3. Right now I just want you to understand the core idea and why it matters. By the end of this, you should be able to explain to someone at a whiteboard why we don't always need to update every single parameter in a model.

Here's the problem. When you fine-tune a pretrained model the standard way — which is what we do in Weeks 1 and 2 — you update every single parameter. For ModernBERT-base, that's about 149 million parameters. Each one gets a gradient computed for it, each one gets optimizer state stored for it — things like momentum and variance for Adam, which actually means your optimizer is storing two extra copies of every parameter. And every time you save a checkpoint, you're saving all 149 million values to disk. That's a lot of compute, a lot of memory, and a lot of storage. And the natural question is: do we really need to update all of them? What if most of those changes are small and redundant?

The idea is surprisingly simple. What if we just froze most of the pretrained weights, so they literally receive no gradient updates, and instead added a small number of brand new trainable parameters? If we could get nearly the same performance as full fine-tuning, we'd save enormous amounts of compute and memory. This general approach is called Parameter-Efficient Fine-Tuning, or PEFT. There are several PEFT methods out there, but the most popular one in practice is LoRA, short for Low-Rank Adaptation. It's what we'll use in Week 3, and it's widely used in industry right now.

Here's how LoRA works conceptually. Think of the pretrained model as a textbook. It's full of knowledge learned during pretraining. In standard fine-tuning, you'd be rewriting pages of that textbook, changing every weight. With LoRA, you leave the textbook alone and instead attach small sticky notes that say "adjust this part slightly for our specific task." Technically, what LoRA does is add two small matrices alongside each target weight matrix. These two small matrices multiply together to produce an adjustment that gets added to the original frozen weight. The key insight is that the adjustment has a very low rank, at most 8 in our example, and that's the "Low-Rank" in LoRA. So instead of training a 768-by-768 matrix with nearly 600,000 parameters, you train a 768-by-8 and an 8-by-768, about 12,000 parameters total. That's 48 times fewer trainable parameters for that weight matrix.
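Written as a formula, the sticky-note picture looks roughly like this. The symbols are not from this module: W_0 (frozen weight), B and A (the two small matrices), r (rank), and the scaling factor α/r follow the common LoRA formulation, and α is a hyperparameter the narration does not cover.

```latex
\[
W' = W_0 + \Delta W, \qquad
\Delta W = \frac{\alpha}{r}\, B A, \qquad
B \in \mathbb{R}^{768 \times 8},\quad
A \in \mathbb{R}^{8 \times 768},\quad
r = 8
\]
\[
\underbrace{768 \cdot 8 + 8 \cdot 768}_{12{,}288 \text{ trainable}}
\quad \text{vs.} \quad
\underbrace{768 \cdot 768}_{589{,}824 \text{ frozen}}
\]
```

Because B A is a product through an inner dimension of 8, the adjustment can never have rank higher than 8, however long training runs.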

Let's put real numbers on this. With full fine-tuning of ModernBERT-base, you're training about 149 million parameters and your optimizer alone needs about 1.2 gigabytes of memory. With LoRA using a rank of 8, you're training roughly 1 to 2 million parameters — the exact number depends on which layers you target — and your optimizer needs maybe 16 megabytes. Your checkpoints shrink from around 600 megabytes to about 8 megabytes. That's a massive difference. And the base model is identical — same architecture, same pretrained knowledge. You're just changing how much of it you adapt for your specific task. You'll compute the exact numbers yourself in the notebook.
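To see why the trainable count depends on which layers you target, here is a small sketch. Everything in it is hypothetical: the 12-layer encoder, the module names, and the 768 × 768 shapes are illustrative assumptions and are not ModernBERT's actual layout. The notebook reports the real numbers.

```python
def lora_trainable_params(num_layers: int, target_shapes: dict, rank: int) -> int:
    """Total LoRA parameters when every listed matrix in every layer gets an adapter.

    Each adapted (d_in x d_out) matrix adds rank * (d_in + d_out) parameters.
    """
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in target_shapes.values())
    return num_layers * per_layer


# Hypothetical encoder for illustration only (not ModernBERT's real structure).
NUM_LAYERS = 12
ATTENTION = {
    "query": (768, 768),
    "key": (768, 768),
    "value": (768, 768),
    "output": (768, 768),
}

# Option 1: adapt only the query and value projections.
query_value_only = {name: ATTENTION[name] for name in ("query", "value")}
small = lora_trainable_params(NUM_LAYERS, query_value_only, rank=8)

# Option 2: adapt all four attention projections.
large = lora_trainable_params(NUM_LAYERS, ATTENTION, rank=8)

print(f"query+value only: {small:,}")  # prints: query+value only: 294,912
print(f"all attention:    {large:,}")  # prints: all attention:    589,824
```

Targeting twice as many matrices doubles the trainable count, which is why the range quoted for a given rank is fairly wide.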

Now, LoRA is not magic. It's a trade-off, and you should understand both sides. On the plus side, training is faster because you're computing gradients for far fewer parameters. Checkpoints are tiny. You use less GPU memory. And here's a nice bonus — you can keep one base model and swap in different LoRA adapters for different tasks. That's really useful in production. On the minus side, you might lose a small amount of quality compared to full fine-tuning. How much depends entirely on the task. For many tasks, the difference is negligible. For some, it matters. You also have new hyperparameters to choose — the rank, which modules to target, the learning rate. And for some tasks, especially when you have plenty of compute, full fine-tuning still gives better results. The right answer depends on your constraints. In this course, we'll try both approaches and compare them directly.
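The adapter-swapping idea can be sketched with tiny matrices in plain Python. This is an illustration under simplifying assumptions, not how a real library stores adapters: the 3 × 3 base weight, the rank-1 adapters, and the names `task_a` and `task_b` are made up, and the usual α/r scaling factor is left out.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(base_weight, adapter, x):
    """Compute y = W x + B (A x); with adapter=None this is just the base model."""
    base_out = matvec(base_weight, x)
    if adapter is None:
        return base_out
    A, B = adapter
    low_rank = matvec(A, x)      # project down to r values (here r = 1)
    delta = matvec(B, low_rank)  # project back up to the output size
    return [b + d for b, d in zip(base_out, delta)]


# One frozen base weight, shared by every task (identity matrix for readability).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

# Two hypothetical rank-1 adapters, each stored as (A, B): A is 1 x 3, B is 3 x 1.
adapters = {
    "task_a": ([[1.0, 0.0, 0.0]], [[0.5], [0.0], [0.0]]),
    "task_b": ([[0.0, 0.0, 1.0]], [[0.0], [0.0], [-1.0]]),
}

x = [1.0, 2.0, 3.0]
y_base = lora_forward(W, None, x)
y_a = lora_forward(W, adapters["task_a"], x)
y_b = lora_forward(W, adapters["task_b"], x)

# An adapter whose B is all zeros leaves the base output unchanged.
fresh = ([[1.0, 1.0, 1.0]], [[0.0], [0.0], [0.0]])
y_fresh = lora_forward(W, fresh, x)
```

The base weight `W` is never modified, so switching tasks only means picking a different small (A, B) pair. The zero-B case mirrors the standard LoRA initialization, where training starts from exactly the pretrained behavior.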

In Week 3, you'll actually implement all of this. You'll take the same ModernBERT-base model you used in Weeks 1 and 2, apply LoRA adapters, fine-tune it on complaint classification, and directly compare the results against full fine-tuning. You'll see how the macro F1 compares, how fast training runs, and how big the checkpoints are. But that's Week 3. For now, I just want the concept to click. Freeze the pretrained weights. Add small trainable matrices alongside them. Train only those small matrices. That's LoRA. Go run the notebook — it takes about a minute on CPU — and you'll see the exact parameter counts that make this all concrete.

Let me recap. Full fine-tuning updates every parameter in the model — it works, but it's expensive. PEFT methods like LoRA take a smarter approach: freeze most of the pretrained weights and only train a small number of new parameters. LoRA specifically adds pairs of small matrices alongside the frozen layers. The parameter reduction is dramatic — we go from 149 million trainable parameters down to about 1 to 2 million. There are trade-offs — you might lose a tiny bit of quality — but for many tasks the difference is negligible and the savings are enormous. You'll implement LoRA yourself in Week 3 and see how it compares to full fine-tuning. For now, go run the notebook. It takes a minute on CPU, no GPU needed, and you'll see exactly what happens to the parameter counts when you apply LoRA. See you in Module 7.