From the original paper:
| | Full FT | LoRA (r=4) |
|---|---|---|
| Trainable params | 175B | 17.5M |
| GPU memory | 1.2 TB | 350 GB |
| Checkpoint size | 350 GB | 35 MB |
| Inference latency | baseline | same |
| GLUE performance | baseline | same or better |
10,000× fewer parameters. 3× less memory. Same quality. Same inference speed.
The "same inference latency" line matters: at deploy time, you compute W₀ + AB once, merge it into the base weights, and the model runs exactly like the original. No extra layers. No added latency. That's why LoRA beat earlier approaches like adapters.
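The merge step can be checked on a toy example. This is a minimal sketch in plain Python, assuming a row-vector convention (`x @ W`) and tiny made-up matrices; the helpers `matmul` and `add` are illustrative names, not part of any library.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(X, Y)]

# Toy sizes: d_in = 3, d_out = 3, rank r = 1 (all values are made up).
W0 = [[1.0, 0.0, 2.0],
      [0.0, 1.0, 0.0],
      [3.0, 0.0, 1.0]]          # frozen pretrained weight, shape (3, 3)
A = [[0.5], [0.0], [-1.0]]      # trainable, shape (3, 1)
B = [[2.0, 1.0, 0.0]]           # trainable, shape (1, 3)
x = [[1.0, 2.0, 3.0]]           # one input row vector, shape (1, 3)

# Training-time forward pass: frozen path plus low-rank path.
unmerged = add(matmul(x, W0), matmul(matmul(x, A), B))

# Deploy time: fold A @ B into the base weight once, then a single matmul.
W_merged = add(W0, matmul(A, B))
merged = matmul(x, W_merged)

print(unmerged, merged)
```

Both paths give the same output, which is why a merged LoRA model needs no extra layers at inference.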
ModernBERT's Wqkv layer: 2,304 × 768 = 1,769,472 parameters.
With LoRA rank 16: A is 768 × 16 and B is 16 × 2,304, so 49,152 trainable parameters, about 2.8% of the layer.
Each layer has 4 module types (Wqkv, Wo, Wi, Wo2) — different sizes, about 150K trainable params per layer. Across 22 layers + the classifier head: ~3.5M total.
2.3% of the model.
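The Wqkv arithmetic can be reproduced in a few lines. A minimal sketch: `lora_params` is a hypothetical helper, and the shapes and totals are the ones quoted on this slide.

```python
def lora_params(d_in, d_out, rank):
    """Trainable values in one LoRA pair: A is (d_in x rank), B is (rank x d_out)."""
    return rank * (d_in + d_out)

full_wqkv = 2304 * 768                   # frozen Wqkv weights
lora_wqkv = lora_params(768, 2304, 16)   # A: 768 x 16, B: 16 x 2304
layer_share = lora_wqkv / full_wqkv

# Model-level share, using the approximate totals quoted above.
model_share = 3.5e6 / 149e6

print(full_wqkv, lora_wqkv, f"{layer_share:.1%}", f"{model_share:.1%}")
```

The same helper applies to the other target modules once you have read their shapes from the model printout.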
In the lab, you'll configure LoRA on ModernBERT:
LoraConfig(
task_type="SEQ_CLS",
r=???, # Rank — you choose
lora_alpha=32,
target_modules=???, # Which layers — you choose
)
You'll explore the model's layer names and decide which modules to target.
The result: ~3.5M trainable parameters out of 149M. About 2.3%.
ModernBERT, BERT, RoBERTa, DeBERTa
ModernBERT-base: 149M parameters. Pretrained on ~2 trillion tokens of English web text.
GPT, Llama, Qwen, Gemma
Qwen 2.5-0.5B: 494M parameters. Pretrained on ~18 trillion tokens (multilingual).
Why 2.5, not the newer Qwen 3? We tested Qwen 3-0.6B — it scored slightly higher (+0.009 F1) but took 2× longer to train on T4. Not worth the trade-off for a homework you need to complete.
The default instinct is still: encoders for classification, decoders for generation.
So why would we use a decoder on a classification task?
One reason: we're betting that more pretraining data buys us better representations of rare concepts — and we're willing to pay more at inference to get them.
Yousefiramandi & Cooney (2025) tested exactly this.
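Two approaches:
1. Classification head on the last token's hidden state, fine-tuned with LoRA.
2. Instruction tuning: the decoder generates the label text.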
The classification head approach significantly outperformed instruction tuning.
And was competitive with fine-tuned BERT baselines.
readings/week3/yousefiramandi2025_decoder_cls.pdf
The decoder reads left-to-right. The last token has attended to everything before it.
[complaint text tokens...] → last hidden state → classification head → 113 logits
The last token's representation is a summary of the entire input.
Analogous to the encoder's [CLS] token — but built left-to-right.
But there's a catch: padding must go on the left, not the right. Otherwise the "last token" is a pad token.
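The padding catch is easy to see with token ids. A minimal sketch, assuming a head that always reads the final position; `PAD`, `pad_batch`, and the token ids are made up for illustration.

```python
PAD = 0  # hypothetical pad token id

def pad_batch(seqs, side):
    """Pad every sequence to the batch's max length on the given side."""
    width = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        fill = [PAD] * (width - len(s))
        padded.append(fill + s if side == "left" else s + fill)
    return padded

batch = [[11, 12, 13, 14], [21, 22]]   # made-up token ids for two complaints

right = pad_batch(batch, "right")
left = pad_batch(batch, "left")

# What a head reading position -1 would see in each case:
last_right = [row[-1] for row in right]
last_left = [row[-1] for row in left]
print(last_right, last_left)
```

With right padding the shorter complaint ends in a pad id; with left padding every row ends in a real token.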
Input: "I was charged a fee for a convenience check I never ordered"
Encoder: every token attends to every other token (bidirectional)
→ pool all tokens → classify
Decoder: each token attends to tokens BEFORE it (causal)
→ last token has seen everything → classify from that token
Both produce a single 113-dimensional logit vector. Same loss function. Same evaluation.
The architectural difference is in how the representation is built, not what it's used for.
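The two attention patterns can be written as visibility masks. A minimal sketch with a hypothetical `visibility` helper; it follows the standard causal-mask convention in which a token also sees itself.

```python
def visibility(n_tokens, causal):
    """Row i marks which positions token i may attend to (1 = visible)."""
    return [[1 if (not causal or j <= i) else 0 for j in range(n_tokens)]
            for i in range(n_tokens)]

tokens = ["I", "was", "charged", "a", "fee"]
enc = visibility(len(tokens), causal=False)   # bidirectional: full matrix
dec = visibility(len(tokens), causal=True)    # causal: lower triangle

for tok, row in zip(tokens, dec):
    print(f"{tok:>8} sees {sum(row)} token(s)")
```

Only the last row of the causal mask covers the whole input, which is why the decoder classifies from the last token.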
In the lab, you'll compare these two models head-to-head:
| | Encoder (ModernBERT) | Decoder (Qwen 0.5B) |
|---|---|---|
| Parameters | 149M | 494M |
| LoRA trainable | ~2.3% | ~0.46% |
| Pretraining data | ~2T tokens | ~18T tokens |
| Architecture | Bidirectional | Left-to-right |
| Training | Same data, same epochs, same LoRA rank | Same data, same epochs, same LoRA rank |
Same task. Same LoRA. Different backbone. What do you expect?
Important: this is an applied comparison, not a controlled experiment. Architecture, model size, and pretraining data all differ. That's realistic — in practice you rarely get to isolate one variable. Your job is to reason about the result given all the differences.
Your encoder gets zero F1 on the rarest ~29 classes.
Those classes have 4-27 training examples each.
Is that enough to learn from?
The encoder was pretrained on ~2T tokens — it knows something about language. But its exposure to concepts like "Convenience checks" or "Applying for a mortgage" may be thin. Fine-tuning on 4-27 examples has to bridge whatever gap remains.
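If the decoder outperforms on rare classes, why? At least three explanations:
1. Pretraining exposure: ~18T tokens vs ~2T.
2. Parameter count: 494M vs 149M.
3. Representation structure: causal last-token summary vs bidirectional pooling.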
You cannot distinguish these from your data alone.
The evidence will point in one direction; multiple mechanisms could be responsible.
Kandpal et al., "Large Language Models Struggle to Learn Long-Tail Knowledge" (ICML 2023)
Key finding: model accuracy is strongly correlated with how many relevant documents appeared in pretraining.
readings/week3/kandpal2023_long_tail_knowledge.pdf
The encoder and decoder differ in architecture, scale, and pretraining exposure.
The question: do those differences show up in your metrics? And if so, where — in the aggregates, or somewhere more specific?
Keep Kandpal's finding in mind when you look at rare-class behavior.
Quality is half the story. The other half: how fast does it run?
| | Encoder | Decoder | Ratio |
|---|---|---|---|
| Parameters | 149M | 494M | 3.3× |
| Batched inference (ms/ex) | ~10 | ~27 | 2.8× |
| Throughput (ex/sec) | ~100 | ~37 | 2.7× |
| 64K complaints | ~10 min | ~29 min | 2.9× |
At scale: 2.8× means 2.8× the GPUs to serve the same traffic.
The difference between $1M/year and $2.8M/year in GPU spend is not nothing.
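The table rows follow from the per-example latency. A minimal sketch using the rounded figures above; because the inputs are rounded, the computed ratio can differ slightly from the table, which presumably reflects unrounded measurements. The helper names are illustrative.

```python
def throughput(ms_per_example):
    """Examples per second for a given per-example latency."""
    return 1000.0 / ms_per_example

def minutes_for(n_examples, ms_per_example):
    """Minutes needed to process n_examples at a fixed per-example latency."""
    return n_examples * ms_per_example / 1000.0 / 60.0

enc_ms, dec_ms = 10.0, 27.0          # approximate batched latency from the table

print(round(throughput(enc_ms)), round(throughput(dec_ms)))
print(round(minutes_for(64_000, enc_ms), 1), round(minutes_for(64_000, dec_ms), 1))
print(f"latency ratio: {dec_ms / enc_ms:.1f}x")
```

Swap in your own measured latencies when you write the deployment section of the memo.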
In Week 2, class weighting improved rare-class F1 on the encoder.
Would the same trick work on the decoder?
Think about what class weighting does: it changes the loss function to pay more attention to rare classes.
Form a hypothesis before you run it.
You'll test this in the homework.
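To make "changes the loss function" concrete, here is a minimal sketch of square-root inverse-frequency weights applied to cross-entropy. The class counts and logits are made up, and rescaling the weights to mean 1 is an assumption of this sketch, not necessarily what the Week 2 notebook does.

```python
import math

def sqrt_inverse_weights(counts):
    """Weight each class by 1/sqrt(count), rescaled so the mean weight is 1."""
    raw = [1.0 / math.sqrt(c) for c in counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

def weighted_cross_entropy(logits, label, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return weights[label] * (log_sum_exp - logits[label])

counts = [10_000, 400, 4]              # made-up class frequencies
weights = sqrt_inverse_weights(counts)
logits = [2.0, 0.5, -1.0]              # made-up model outputs for one example

print([round(w, 3) for w in weights])
print(weighted_cross_entropy(logits, 0, weights),
      weighted_cross_entropy(logits, 2, weights))
```

A mistake on the rarest class costs far more than a mistake on the most common one, so gradients pull harder toward rare classes.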
Input
↓
Attention block → (residual + norm)
↓
Feed-forward block → (residual + norm)
↓
Output
Each block is a handful of linear layers stacked together.
Every linear layer is a candidate for LoRA.
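Attention lets each token look at other tokens. To do that, every token gets projected three ways:
Query (Q): what am I looking for in the other tokens?
Key (K): what do I have to offer?
Value (V): the information I contribute.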
Attention = similarity(Q, K) → weights → weighted sum of V.
Then an Output (O) projection reshapes the result back into the token stream.
Four linear layers per attention block.
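The similarity, weights, weighted-sum recipe can be sketched in plain Python. This assumes scaled dot-product similarity with a softmax, a single head, and no mask; the four linear projections are left out, so Q, K, and V here are made-up values that would normally come from those layers.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head; inputs are lists of row vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        # similarity(Q, K): dot product of this query with every key
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # weighted sum of V
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
print(out)
```

Each output row is a blend of the value rows, tilted toward the token whose key best matches the query.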
Same four projections — different names in different models.
ModernBERT (fused):
Wqkv: one matrix does Q, K, and V together (faster: single matmul)
Wo: output projection
Qwen (separate):
q_proj, k_proj, v_proj: one matrix each
o_proj: output projection
Math is identical. Fusion is a speed optimization.
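After attention, each token goes through a small 2-layer MLP:
Up projection: expand to a wider hidden size.
Activation function.
Down projection: contract back to the original size.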
Each token "thinks" on its own in a wider space, then gets compressed back.
Classic FFN: 2 linear layers. SwiGLU: 3 (adds a gate).
| Role | ModernBERT | Qwen |
|---|---|---|
| Q/K/V projections | Wqkv (fused) | q_proj, k_proj, v_proj |
| Attention output | Wo | o_proj |
| FFN expansion | Wi | up_proj + gate_proj |
| FFN contraction | Wo2 | down_proj |
In the lab, run print(model) to see the real names.
Then decide: attention only (classic LoRA), or attention + FFN (more capacity, more params).
The rank r controls the size of the low-rank matrices.
| Rank | Params per adapted 768 × 768 matrix | Typical use |
|---|---|---|
| 4 | ~6K | Light adaptation |
| 16 | ~25K | Standard choice |
| 64 | ~100K | Heavy adaptation |
Higher rank = more capacity, but diminishing returns.
Biderman et al. (2024) showed: LoRA learns less than full fine-tuning, but also forgets less. The rank controls the trade-off.
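The parameter counts in the rank table match a single square 768 × 768 projection. A minimal sketch of that arithmetic; other module shapes give different counts.

```python
d_in = d_out = 768                     # one square projection, e.g. attention output
full = d_in * d_out                    # values in the frozen matrix

# A is (d_in x rank) and B is (rank x d_out), so the pair holds rank * (d_in + d_out) values.
counts = {rank: rank * (d_in + d_out) for rank in (4, 16, 64)}

for rank, n in counts.items():
    print(f"r={rank:<2}  {n:>6,} params  ({n / full:.1%} of the full matrix)")
```

The count grows linearly with rank, so rank 64 costs 16 times as many trainable values as rank 4 on the same matrix.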
Three things to get right for the decoder:
padding_side = "left": causal LMs read from the left; classify from the right
modules_to_save = ["score"]: the classification head is randomly initialized; must be saved
dtype = torch.float32 at load, fp16=True in training: T4 can't handle bfloat16 GradScaler
Get any of these wrong and training fails silently.
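One way these three settings might fit together with the Hugging Face `transformers` and `peft` libraries is sketched below. The model id `Qwen/Qwen2.5-0.5B`, the default rank, and the default `target_modules` are placeholder assumptions for illustration, not the lab's answers; `build_decoder_classifier` is a hypothetical helper name.

```python
def build_decoder_classifier(model_name="Qwen/Qwen2.5-0.5B", num_labels=113,
                             rank=16, target_modules=("q_proj", "v_proj")):
    """Sketch: load a decoder for sequence classification and attach LoRA."""
    # Local imports so the function can be defined without the libraries installed.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from peft import LoraConfig, get_peft_model

    # Left padding, so the final position holds a real token.
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load in float32; mixed precision (fp16=True) is set later in TrainingArguments.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels, torch_dtype=torch.float32
    )
    model.config.pad_token_id = tokenizer.pad_token_id

    # Keep the randomly initialized head ("score") trainable and saved with the adapter.
    config = LoraConfig(
        task_type="SEQ_CLS",
        r=rank,
        lora_alpha=32,
        target_modules=list(target_modules),
        modules_to_save=["score"],
    )
    return tokenizer, get_peft_model(model, config)
```

Calling the function downloads the model, so run it only in the lab environment; check the module names against your own model printout before trusting the defaults.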
The aggregate metrics will tell you who wins.
The per-class analysis will tell you where and why.
Break the 113 classes into frequency tiers. Compare encoder vs decoder on each tier.
This is the analysis that earns the most points on the rubric.
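The tier comparison can be sketched as follows. The two-tier split and the 27-example threshold are assumptions based on the rare-class range mentioned earlier, `tier_means` is a hypothetical helper, and every number in the example is made up to show the shape of the output, not a result.

```python
def tier_means(train_counts, f1_by_class, rare_max=27):
    """Average per-class F1 within 'rare' and 'common' tiers."""
    tiers = {"rare": [], "common": []}
    for cls, n in train_counts.items():
        tiers["rare" if n <= rare_max else "common"].append(f1_by_class[cls])
    return {name: sum(v) / len(v) if v else float("nan")
            for name, v in tiers.items()}

# Made-up counts and scores for three classes.
train_counts = {"A": 5000, "B": 20, "C": 4}
encoder_f1 = {"A": 0.80, "B": 0.00, "C": 0.00}
decoder_f1 = {"A": 0.82, "B": 0.30, "C": 0.10}

enc_tiers = tier_means(train_counts, encoder_f1)
dec_tiers = tier_means(train_counts, decoder_f1)
print(enc_tiers)
print(dec_tiers)
```

Comparing the two dictionaries tier by tier shows where a gap lives, which a single macro F1 number cannot.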
Before the lab, write down:
Will LoRA match full fine-tuning on the encoder? (your full FT: ~56.6% acc, ~0.209 F1)
Which model wins on macro F1 — encoder or decoder? By how much?
Where will the decoder's advantage be largest — rare classes or common classes?
Hold these predictions. You'll check them in 90 minutes.
There is a pre-trained encoder fallback on HuggingFace Hub if training doesn't finish.
You train the decoder yourself.
The central question: given your results, what would you deploy — and in what context?
Due: Wednesday morning before Week 4 class. HTML via Moodle.
| Section | Points | Focus |
|---|---|---|
| 1. Systems-level comparison | 20 | Architecture, parameter efficiency, aggregates |
| 2. Class weighting on decoder | 20 | Does the Week 2 trick transfer? Why or why not? |
| 3. Per-class analysis | 25 | Where do the models differ? Rare vs common |
| 4. Latency + deployment | 20 | Translate speed into a recommendation |
| 5. What you'd do next | 15 | Identify the bottleneck, propose an experiment |
Section 3 carries the most weight. The aggregate metrics hide the important story.
| Paper | Year | Key idea |
|---|---|---|
| Hu et al. — LoRA | 2021 | Low-rank weight updates = full FT quality |
| Yousefiramandi & Cooney — Decoder Classification | 2025 | Cls heads on decoders beat generation |
| Kandpal et al. — Long-Tail Knowledge | 2023 | Accuracy tracks pretraining document count |
| Houlsby et al. — Adapters | 2019 | The PEFT predecessor |
| He et al. — Unified View of PEFT | 2022 | Why LoRA, adapters, and prefix tuning are variations on one idea |
| Dettmers et al. — QLoRA | 2023 | LoRA + 4-bit quantization |
| Weller et al. — Seq vs Seq | 2025 | Controlled encoder-vs-decoder comparison |
| BehnamGhader et al. — LLM2Vec | 2024 | Decoders are secretly good encoders |
| Valdes Gonzalez — Cost-Aware Selection | 2026 | Pareto frontier for quality vs latency |
All PDFs in readings/week3/. Bold = covered in lecture.
When you face this decision in the real world, the shape is:
| Situation | Reach for |
|---|---|
| Common classes, tight latency budget | Encoder + LoRA |
| Rare classes matter, latency budget exists | Decoder + LoRA |
| Maximum accuracy, cost is no object | Full fine-tune (or a larger model) |
| Many fine-tuned variants served concurrently | Adapters (not LoRA) |
The homework's deployment question asks you to identify which situation you're in — and justify it with evidence from your own results.
We go deeper on the encoder-vs-decoder trade-off.
New tools. New interventions. The question sharpens.
Welcome back. Two weeks in and you've now fine-tuned a 149-million parameter model, run controlled experiments, and discovered that 47 of your 113 classes still get zero F1 no matter what you try. This week we ask: do you actually need to update all 149 million parameters? And what happens if you try a completely different kind of model? By the end of today you'll have answers to both. And at least one of those answers will surprise you.
Here's where we are after two weeks. Full fine-tuning got you to about 56.6 percent accuracy, 0.209 macro F1, 46 classes at zero. Class weighting improved F1 to about 0.24 — the best macro F1 you've seen — but accuracy dropped to about 52 percent, and 37 classes are still invisible. That 0.24 came from training all 149 million parameters for 3 epochs on 58,000 examples. What if I told you we could get the same result while training 2 percent of the parameters? And what if I told you there's a model that can beat 0.24 on macro F1 — and it trained even fewer parameters? That's where we're going today.
This week has two ideas. First, LoRA — Low-Rank Adaptation. Instead of updating all 149 million parameters, you freeze the model and inject small trainable matrices into specific layers. You train about 2 percent of the parameters. Does quality hold? Second, you'll apply LoRA to a completely different architecture — a decoder model — and compare it head-to-head with your encoder on the same data, same task. The lecture gives you the conceptual tools. The lab gives you the empirical evidence.
Here's the plan. In the lecture, we'll cover what makes fine-tuning expensive, how LoRA addresses that, and the key architectural difference between encoders and decoders. Then we'll ask a question that most practitioners would answer wrong: can a decoder beat an encoder at classification? In the lab, you'll configure LoRA on the encoder you already know, train it, and then compare it against a decoder that was trained the same way. The comparison is the centerpiece.
Before we get into the technical content, a quick poll. You fine-tuned all 149 million parameters in Week 1. How many of those do you think actually needed to change? How much of the model's knowledge was already useful as-is, and how much did you actually modify for consumer complaints? Think about it. We'll come back to this in a few minutes.
Before LoRA makes sense, you need to understand the problem it solves. Why does fine-tuning cost what it costs?
Let's do the math. ModernBERT-base has 149 million parameters. In float32, that's about 600 megabytes. During training, you also need gradients for every parameter: another 600 megabytes. AdamW keeps two states per parameter, a running mean of the gradient and a running mean of its square: 1.2 gigabytes more. Before a single training example touches the GPU, you've consumed about 2.4 gigabytes just for the parameter state. Storage scales linearly with the number of tasks. If you want 10 fine-tuned models for 10 tasks, you store 10 copies of 600 megabytes. Six gigabytes of checkpoints, all of which are 99 percent identical to each other. LoRA asks: do we really need to update all 149 million parameters to adapt this model?
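The memory budget above is straightforward to reproduce. A minimal sketch that counts only parameter state (weights, gradients, optimizer states) in float32 and ignores activations and framework overhead.

```python
params = 149_000_000
bytes_per_value = 4                       # float32

weights_gb = params * bytes_per_value / 1e9
grads_gb = weights_gb                     # one gradient per parameter
adam_gb = 2 * weights_gb                  # two AdamW state tensors per parameter
total_gb = weights_gb + grads_gb + adam_gb

print(f"weights {weights_gb:.2f} GB, grads {grads_gb:.2f} GB, "
      f"optimizer {adam_gb:.2f} GB, total {total_gb:.2f} GB")
print(f"10 full checkpoints: {10 * weights_gb:.1f} GB")
```

Rounded, that is the 600 MB, 2.4 GB, and 6 GB quoted in the talk.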
Here's the key insight. When you fine-tune a pretrained model, the weight updates are not random perturbations across all 589,824 values in a matrix. Aghajanyan and colleagues showed in 2020 that the updates live in a low-dimensional subspace. The useful information in the weight change can be captured in far fewer dimensions than the full matrix. The analogy: if the weight matrix is a 768-dimensional space, the fine-tuning update only needs about 16 dimensions to represent what the model actually learned. LoRA — Low-Rank Adaptation — exploits exactly this.
Here's how LoRA works. You freeze the pretrained weight matrix W-zero entirely. Then you add two small matrices — A, which is 768 by 16, and B, which is 16 by 768. During the forward pass, the output is W-zero times x plus A times B times x. The product A times B has the same shape as W-zero — 768 by 768 — but it's parameterized by only 24,576 values instead of 589,824. That's a 24x reduction per layer. You apply this to every attention and feed-forward layer. The 16 in those dimensions is the rank — you choose it. Higher rank means more capacity, more parameters, but still far fewer than full fine-tuning. The Hu et al. paper from 2021 is in your readings. It's short and practical — I recommend reading at least the first five pages.
Here's what Hu et al. actually measured. On GPT-3 with 175 billion parameters, LoRA at rank 4 trained only 17.5 million parameters — ten thousand times fewer. GPU memory dropped by 3x. The checkpoint went from 350 gigabytes to 35 megabytes. And the quality on GLUE benchmarks was the same or better than full fine-tuning. That last part is the surprising one. You'd expect a 10,000x reduction in trainable parameters to cost you something. It doesn't — because the weight updates live in that low-rank subspace we talked about. If the update only needs 4 dimensions, giving it 175 billion dimensions is waste, not capacity. And pay attention to that "same inference latency" line — that's why LoRA won over earlier methods like adapters. At deploy time, you compute W-zero plus A times B once, merge it into the base weights, and ship a single model that runs exactly like the original. No extra layers in the forward pass. No added latency. Adapters and prefix tuning couldn't do that — they live in the forward pass forever. You'll see this matter again next week when we benchmark latency: merged LoRA and full fine-tuning are indistinguishable in speed.
Let's do the arithmetic so you can verify it in the lab. ModernBERT's Wqkv layer is 2,304 by 768 — that's the fused query-key-value projection. Nearly 1.8 million parameters per layer. With LoRA at rank 16, you replace the update with two matrices: A is 768 by 16, B is 16 by 2,304. That's about 49,000 parameters — 2.8 percent of the original layer. Each transformer layer actually has four module types you could target — Wqkv for attention, Wo for attention output, Wi and Wo2 for the feed-forward network. They're not all the same size, so the per-layer total varies, but it averages about 150,000 trainable parameters per layer. Across 22 layers plus the classification head, you get roughly 3.5 million trainable parameters — 2.3 percent of the model. You'll verify this in the lab when you see the "Trainable: X / Y" printout after applying LoRA.
In the lab, you'll configure LoRA yourself. You'll inspect the model's layer names, decide which modules to target, and choose a rank. I'm not going to tell you the answers — you'll see the layers, read about what each one does, and make a decision. The result will be about 3.5 million trainable parameters out of 149 million — roughly 2.3 percent. Then you'll train it and compare to your full fine-tuning results from Week 1. The question you should have in your head: does training 2 percent of the parameters sacrifice quality?
Now the second idea. You've been working with ModernBERT, which is an encoder. But there's another family of models — decoders — and we're about to put one on the same task.
Encoders like ModernBERT are trained to predict masked tokens. The key property: every token can attend to every other token — past and future. This makes them naturally suited for tasks where you need to understand the whole input, like classification. ModernBERT has 149 million parameters and was pretrained on about 2 trillion tokens of English web text. This is what you've been using since Week 1.
Decoders like Qwen are trained to predict the next token. The key difference: each token can only attend to tokens before it — a causal mask. This makes them natural text generators, which is why they power chatbots and code assistants. Qwen 2.5-0.5B has 494 million parameters — about 3.3 times larger than ModernBERT — and was pretrained on roughly 18 trillion tokens of multilingual text. That's 9 times more pretraining data than the encoder. Quick note on model choice: there is a newer Qwen 3 with 0.6 billion parameters, trained on 36 trillion tokens. We tested it. It scored about 0.009 higher on macro F1 — real but small. The problem: it took 108 minutes to train instead of 57. For a homework where you need to run two training jobs plus analysis, doubling the training time isn't worth a marginal quality gain. Engineering trade-offs start here.
The default instinct in NLP is still to reach for an encoder when you have a classification task and a decoder when you need to generate text. So why would we use a decoder on a classification task at all? Here's the bet: we're trading inference speed and memory for representation quality, specifically on rare concepts. If that trade-off pays off, we get a better model for the long tail. If it doesn't, we've paid the cost for nothing. That's the question this week asks. Not "are decoders secretly better at classification in general" — but "on this specific task with this specific class distribution, is the pretraining advantage worth the inference cost?" You'll answer that for yourself this afternoon.
Yousefiramandi and Cooney published a paper in December 2025 testing this directly. They compared two ways of using a decoder for classification. Approach one: attach a classification head to the last token's hidden state and fine-tune with LoRA. One forward pass, 113 logits, done. Approach two: instruction-tune the decoder to generate the label text. They found the classification head approach significantly outperformed instruction tuning on F1, and was competitive with fine-tuned BERT models. This is exactly what you're going to do in the lab. Not text generation — a classification head, same as the encoder, but on a decoder backbone. The paper is in your readings.
Why might this work? In a decoder, each token attends to all previous tokens. The last token has seen the entire input — its hidden state is a summary of the whole complaint. That's analogous to how BERT's CLS token works, except the decoder builds it left-to-right instead of bidirectionally. You put a linear classification head on top of that hidden state and train it. One critical detail: padding must go on the left, not the right. If you right-pad like you do with an encoder, the last token in the sequence is a pad token — and the model classifies padding instead of your complaint. Left padding pushes the real content to the right edge where the classification head reads it. You'll set this up in the lab.
Let me make the architectural difference concrete. Take a complaint: "I was charged a fee for a convenience check I never ordered." The encoder processes this bidirectionally — every token attends to every other token. Then you pool across all positions and classify. The decoder processes left-to-right — "I" sees nothing, "was" sees "I", "charged" sees "I was", and so on. By the time you reach "ordered" — the last token — it has attended to everything before it. You classify from that last token's hidden state. Both approaches produce a single 113-dimensional logit vector. Same cross-entropy loss. Same evaluation. The difference is entirely in how the representation is built. And that difference matters — because the decoder's left-to-right representations were shaped by 18 trillion tokens of pretraining, while the encoder's bidirectional representations were shaped by 2 trillion.
Here's what the comparison looks like. Same dataset. Same 113 classes. Same number of training epochs. Same LoRA rank. Two different models. The encoder has 149 million parameters pretrained on about 2 trillion tokens. The decoder has 494 million parameters — 3.3 times larger — pretrained on 18 trillion tokens — 9 times more data. The decoder trains fewer parameters as a percentage — 0.46 percent versus 2.3 percent — because it's a bigger model. Now, I want to be upfront: this is an applied comparison, not a perfectly controlled experiment. Architecture, model size, and pretraining corpus all differ simultaneously. That's realistic. In practice you almost never get to isolate one variable when comparing models. Your job is to reason about the result given all the differences, not to pretend it's a clean ablation. Make a prediction right now, before we go further. Which model has higher macro F1? Write it down.
Before you go to the lab, I want to give you one more conceptual tool. It's about pretraining — and what it means for the rare classes you've been struggling with.
Let's think about the rare-class problem from the model's perspective. Your encoder has 29 classes where it gets zero F1 — literally never predicts them correctly. These classes have between 4 and 27 training examples each. Now, the encoder isn't starting from nothing — it was pretrained on about 2 trillion tokens and knows a lot about language. But its exposure to specific concepts like "convenience checks" or "applying for a mortgage" may be thin. Fine-tuning on 4 to 27 examples has to bridge whatever gap remains between the pretrained representation and a useful classifier. That's a lot to ask of a few examples.
If the decoder outperforms the encoder on rare classes, the natural question is why. There are at least three competing explanations and the evidence in your lab won't let you distinguish between them cleanly. First explanation: pretraining exposure. The decoder saw 18 trillion tokens versus the encoder's 2 trillion, and somewhere in that extra 16 trillion tokens there's more content about rare complaint categories. Second explanation: raw parameter count. The decoder has 494 million parameters versus the encoder's 149 million — more capacity, which might help with long-tail separation regardless of what the pretraining data contained. Third explanation: architectural representation structure. A causal last-token summary may encode rare-class information differently than a bidirectional pooled representation — not better or worse intrinsically, but different, and possibly more suitable for some class distributions. I want to be honest: your lab data won't let you isolate which of these is doing the work. You're comparing two models that differ on all three dimensions simultaneously. The question for your memo isn't "which explanation is correct" but "what does your evidence rule out, and what does it leave on the table."
Kandpal and colleagues at ICML 2023 established this rigorously. They measured how well language models answer factual questions and compared that to how many relevant documents appeared in the pretraining data. The relationship is nearly linear on a log scale — more pretraining exposure, better accuracy. Larger models help, but the improvement is gradual. Their estimate: to achieve competitive performance on truly rare knowledge, models would need to scale by many orders of magnitude. The implication for your task: those 29 rare classes sit at the far left of that curve. The encoder, pretrained on 2 trillion tokens, may have very little exposure to them. The decoder, pretrained on 18 trillion tokens, has more — but even 18 trillion may not be enough for some of them. The paper is in your readings. The first figure alone is worth the read.
So here's the frame for the lab. You have two models that differ in multiple ways: architecture, parameter count, and pretraining data. The question is whether those differences produce measurably different behavior on your task — and if they do, whether the difference is visible in the aggregate metrics or only when you look class by class. Keep Kandpal's finding in mind: performance tracks pretraining exposure, and the long tail is where the gap should be widest. But treat that as a hypothesis to test, not a conclusion to confirm.
Let's talk about speed. The decoder has 3.3 times more parameters than the encoder. On batched inference — processing 32 complaints at a time — the encoder takes about 10 milliseconds per example, the decoder takes about 27. That's a 2.8x gap. At scale, 2.8x means you need 2.8 times the GPUs to serve the same traffic. If your encoder deployment costs a million dollars a year in GPU, the decoder costs 2.8 million. That's not a theoretical number — it's a procurement conversation. You'll measure this yourself in the homework. And you'll need these numbers for the deployment recommendation in your memo.
One more thing to think about before the lab. In Week 2, class weighting gave the encoder a big boost — about 15 percent improvement in macro F1. The homework asks you to try the same trick on the decoder. Before you run it, form a hypothesis: will class weighting help the decoder as much as it helped the encoder? More? Less? The same? Think about what class weighting actually does and what you've heard today about the two architectures. There's no guaranteed answer here — you'll run the experiment and reason from your own results.
Let me cover the practical decisions you'll face in the lab, so you're prepared.
Before you pick which modules to target, you need to know what the modules actually are. Every transformer layer — and both models have twenty-plus of these — has the same two blocks. First, an attention block, where tokens exchange information with each other. Second, a feed-forward block, where each token does some processing on its own. Around each block there's a residual connection and a layer norm. Inside each block, a handful of linear layers. That's it. LoRA works by adding a small trainable update to any linear layer you want. So when you ask "which modules do I target?", you're really asking "which linear layers get a LoRA adapter attached?" Let's look at what's inside each block.
Attention is how tokens talk to each other. Every token gets projected three different ways. The query projection asks "what am I looking for in the other tokens?" The key projection advertises "here's what I have to offer." The value projection carries "here's the actual information I'm contributing." Attention computes the similarity between queries and keys — how much does token A want what token B offers — turns those similarities into weights, and takes a weighted sum of the values. So after attention, each token now carries a blend of information from the tokens it cared about. One more projection — the output — reshapes the result so it can rejoin the token stream. Four linear layers per attention block: Q, K, V, O. Every one is a potential LoRA target.
Here's where the naming gets confusing. Both models do the same four projections, but they organize them differently. ModernBERT fuses the query, key, and value projections into one big matrix called Wqkv. Why? One matrix multiplication is faster than three, and the result is identical. The output projection is Wo. Qwen keeps them as separate matrices — q_proj, k_proj, v_proj, and o_proj. Neither is mathematically superior; fusion is just a speed trick on the forward pass. But this affects your LoRA config. ModernBERT's attention has two targetable modules. Qwen's has four. Same roles, different number of names.
After attention, each token takes a solo trip through a small two-layer MLP. First, the up projection expands the representation. ModernBERT goes from 768 dimensions to 3072 — four times wider. This gives the token room to do some nonlinear processing. Then an activation function. Then the down projection contracts it back to the original size so it can rejoin the stream. Classic FFN: two linear layers, up and down. But Qwen uses SwiGLU, which adds a gating mechanism. Instead of one up projection, you get a gate projection and an up projection. Their element-wise product goes through the down projection. Slightly better quality empirically, slightly more parameters, and — for our purposes — one more module name on the list. Don't get hung up on why gating helps. Just know that when you look at Qwen, you'll see three FFN modules instead of two.
Here's the summary, and this is the slide to photograph if you only remember one. Same roles in both architectures. Different names. Query, key, value projections — fused in ModernBERT, separate in Qwen. Attention output — Wo versus o_proj. FFN expansion — Wi in ModernBERT, up_proj plus gate_proj in Qwen because of SwiGLU. FFN contraction — Wo2 versus down_proj. In the lab, you'll run print of the model and see the actual names, and the hierarchy they're nested in. Then you decide which ones to adapt. A common default is attention only — the original LoRA paper targeted just q_proj and v_proj. You can go broader. More modules means more capacity, more trainable parameters, more memory. You'll make this call yourself and see the consequences.
The rank controls how many dimensions the low-rank update uses. Rank 4 is very light — about 6,000 parameters per adapted layer. Rank 16 is the standard choice — about 25,000 parameters. Rank 64 is heavy. Higher rank gives you more capacity, but there are diminishing returns. Biderman and colleagues published a detailed study in 2024 showing that LoRA consistently learns less than full fine-tuning — it's a constrained approximation — but it also forgets less of the base model's knowledge. The rank controls that trade-off. For your 113-class task, rank 16 is a reasonable starting point, but you'll choose in the lab.
When you work with the decoder in the homework, there are three things you must get right. First, left padding. We covered why. Second, modules to save. The classification head — called "score" in Qwen — is randomly initialized. If you don't include it in modules_to_save, it won't be saved with the LoRA adapter. You'd reload the model, get a fresh random head, and your metrics would look random. Third, load the model in float32. Qwen defaults to bfloat16, but the T4 GPU can't handle bfloat16 gradient scaling. Load in float32 and let the Trainer handle mixed precision. Miss any of these and training fails silently — it runs, it produces numbers, but the numbers are wrong. I'm telling you this now so you know what to watch for.
When you get to the comparison, don't stop at the aggregate numbers. Break the 113 classes into rare and common tiers — the notebook does this for you — and compare encoder versus decoder within each tier. The chart on the right shows what this kind of analysis looks like: rare classes on the left, common classes on the right, encoder in blue, decoder in orange. The pattern you see will explain more than any single macro F1 number. This is also the analysis that carries the most weight on the rubric — 25 points — so invest time here.
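The notebook does the tier split for you, but it helps to see how little is involved. This is a minimal sketch in plain Python; the function names and the `rare_max=100` cutoff are hypothetical, and the notebook's own tier boundaries may differ.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {class: F1} for every class that appears in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1  # missed an example of the true class
            fp[p] += 1  # wrongly predicted the other class
    scores = {}
    for c in set(y_true):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def macro_f1_by_tier(y_true, y_pred, train_counts, rare_max=100):
    """Average per-class F1 separately for rare and common classes.

    train_counts maps class -> number of training examples.
    rare_max is an assumed cutoff, not the notebook's value.
    """
    tiers = {"rare": [], "common": []}
    for c, score in per_class_f1(y_true, y_pred).items():
        tier = "rare" if train_counts[c] <= rare_max else "common"
        tiers[tier].append(score)
    return {name: sum(v) / len(v) if v else None for name, v in tiers.items()}
```

Run it once with the encoder's predictions and once with the decoder's, on the same test labels, and compare the two dictionaries tier by tier.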
I want you to make three predictions before we break for the lab. Write them down: in your notebook, on paper, wherever. First: will encoder LoRA match the full fine-tuning quality from Week 1? You trained all 149 million parameters then. Now you're training about 2.3 percent of them. Second: which model wins on macro F1, encoder LoRA or decoder LoRA? By how much? Third: where will the decoder's advantage be largest, on the rare classes or the common ones? Hold these predictions. Don't change them once you see the numbers. The value is in comparing what you expected to what happened.
Let me tell you what the lab looks like.
Here's the lab structure. You start by configuring LoRA on the encoder: inspect the layers, choose your modules and rank. Then train for 3 epochs, about 32 minutes on a T4. While it trains, you write down predictions about what you expect. After training, you evaluate and compare to your full fine-tuning results from Week 1. Then you load a decoder from the Hugging Face Hub that has already been fine-tuned: same data, same task, same LoRA approach. The reveal is a side-by-side metrics table. After that, you dig into the per-class analysis: break it down by frequency tier, look at the scatter plot, interpret what you see. If encoder training doesn't finish in time, there's an already fine-tuned fallback on the Hugging Face Hub, so the reveal doesn't depend on your training run completing.
The homework extends the lab in three directions. First, you train the decoder yourself instead of loading an already fine-tuned checkpoint. You'll configure LoRA, set up left padding, and handle the classification head. Second, you apply class weighting, using the same sqrt-inverse weights from Week 2, and see whether it helps the decoder the way it helped the encoder. Third, you benchmark latency for both models. The memo has five sections. The central question is a deployment recommendation: given everything you've measured (quality, per-class behavior, latency), what would you deploy, and for what use case? There is no single right answer. But "it depends" without saying what it depends on doesn't count.
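As a reminder of what sqrt-inverse weighting computes, here is a minimal sketch. Week 2's exact normalization isn't restated here, so rescaling the weights to average 1 is an assumption, and `sqrt_inverse_weights` is a hypothetical helper name.

```python
import math
from collections import Counter


def sqrt_inverse_weights(labels, num_classes):
    """Weight each class by 1 / sqrt(count), rescaled so the weights average to 1.

    labels are integer class ids in range(num_classes).
    Classes with no examples get weight 0 (an assumption for this sketch).
    """
    counts = Counter(labels)
    raw = [1.0 / math.sqrt(counts[c]) if counts[c] else 0.0 for c in range(num_classes)]
    mean = sum(raw) / num_classes
    return [w / mean for w in raw]


# Toy label list: class 0 has 100 examples, class 1 has 4.
weights = sqrt_inverse_weights([0] * 100 + [1] * 4, num_classes=2)
print(weights)
```

In the toy case the rare class is 25 times less frequent but only gets 5 times the weight, which is the point of taking the square root: it softens the correction compared with plain inverse frequency.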
Here's the point breakdown for the memo. I want to be upfront about what matters most. Section 3 — the per-class analysis — carries 25 points, the most of any section. That's deliberate. If you only report "the decoder has higher macro F1" and stop there, you've missed the week. The aggregate metrics are real, but they hide the most interesting story. The rubric is in the course repo.
Here's your reading list. Nine papers. The three in bold — LoRA, decoder classification, and long-tail knowledge — we covered in the lecture. The other six are referenced on individual slides or provide deeper context. You don't need to read all nine. Pick two or three that connect to what you find most interesting in your results. If you're curious about why LoRA won over adapters, read He et al. If you want the cleanest encoder-vs-decoder comparison, read Weller et al. — it's an ICLR 2026 paper where they trained paired models on identical data to control for every variable except architecture. If you're thinking about the deployment question for your memo, Valdes Gonzalez frames model selection as a Pareto frontier problem — quality versus cost — which is exactly the framework you need.
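The Pareto framing is simple to make concrete. The sketch below keeps only the candidates that no other candidate beats on both quality and cost. The function name is hypothetical and the numbers in the example are placeholders, not measurements from this course; substitute your own macro F1 and latency.

```python
def pareto_frontier(candidates):
    """Return names of candidates not dominated on (quality, cost).

    candidates: list of (name, quality, cost) tuples.
    Higher quality is better; lower cost is better.
    """
    frontier = []
    for name, q, c in candidates:
        dominated = any(
            q2 >= q and c2 <= c and (q2 > q or c2 < c)
            for _, q2, c2 in candidates
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder values for illustration only.
example = [
    ("model_a", 0.70, 5.0),
    ("model_b", 0.75, 20.0),
    ("model_c", 0.68, 25.0),
]
print(pareto_frontier(example))  # ['model_a', 'model_b']
```

Anything off the frontier is never the right pick. Choosing among the models on the frontier is where the use case comes in.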
Here's the durable mental model I want you to leave with. When you face an adaptation decision in practice, the shape is roughly this. If your task is dominated by common classes and you have a tight latency budget — a real-time classification API, for example — reach for an encoder with LoRA. If rare classes matter and you have some latency budget to spend, the decoder with LoRA is worth considering. If you can spend unlimited compute and accuracy is all that matters, full fine-tuning or a larger model is the right move. And if you're serving many fine-tuned variants concurrently and need to swap between them without reloading the base, adapters are still the right choice. The homework's deployment question asks you to identify which situation you're in for the consumer complaints task and justify it with your own evidence. There's no universal answer.
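Whether a model fits a latency budget is something to measure rather than guess. Here is a minimal timing sketch, assuming `predict` is any callable that runs one inference; the function name and its defaults are hypothetical, not the homework's benchmark harness.

```python
import statistics
import time


def benchmark_latency(predict, inputs, warmup=3, repeats=20):
    """Time predict(x) per call; return median and p95 latency in milliseconds.

    On a GPU, predict should synchronize (for example with
    torch.cuda.synchronize()) before returning, because CUDA kernels
    launch asynchronously and the timer would otherwise stop too early.
    """
    for x in inputs[:warmup]:
        predict(x)  # warm-up calls are not timed
    timings = []
    for _ in range(repeats):
        for x in inputs:
            start = time.perf_counter()
            predict(x)
            timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p95 = timings[min(len(timings) - 1, int(0.95 * len(timings)))]
    return {"median_ms": statistics.median(timings), "p95_ms": p95, "n": len(timings)}
```

Use the same inputs, batch size, and hardware for both models, and report the tail (p95) alongside the median, since a real-time API is judged by its slow requests.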
Next week we build on everything from today. New tools, new experiments. The encoder-vs-decoder question gets sharper. See you in the lab.