LoRA / PEFT: Encoder vs Decoder

Model	Accuracy	Macro F1	Classes at zero F1
TF-IDF + LogReg	54.2%	0.132	70
Full fine-tune (3 ep)	56.6%	0.209	46
+ class weighting	~52%	~0.24	~37

	Full FT	LoRA (r=4)
Trainable params	175B	17.5M
GPU memory	1.2 TB	350 GB
Checkpoint size	350 GB	35 MB
Inference latency	baseline	same
Benchmark quality	baseline	same or better

	Encoder (ModernBERT)	Decoder (Qwen 0.5B)
Parameters	149M	494M
LoRA trainable	~2.3%	~0.46%
Pretraining data	~2T tokens	~18T tokens
Architecture	Bidirectional	Left-to-right
Training	Same data, same epochs, same LoRA rank

	Encoder	Decoder	Ratio
Parameters	149M	494M	3.3×
Batched inference (ms/ex)	~10	~27	2.8×
Throughput (ex/sec)	~100	~37	2.7×
64K complaints	~10 min	~29 min	2.9×

Role	ModernBERT	Qwen
Q/K/V projections	`Wqkv` (fused)	`q_proj`, `k_proj`, `v_proj`
Attention output	`Wo`	`o_proj`
FFN expansion	`Wi`	`up_proj` + `gate_proj`
FFN contraction	`Wo2`	`down_proj`

Rank	Params per adapted layer	Typical use
4	~6K	Light adaptation
16	~25K	Standard choice
64	~100K	Heavy adaptation

Section	Points	Focus
1. Systems-level comparison	20	Architecture, parameter efficiency, aggregates
2. Class weighting on decoder	20	Does the Week 2 trick transfer? Why or why not?
3. Per-class analysis	25	Where do the models differ? Rare vs common
4. Latency + deployment	20	Translate speed into a recommendation
5. What you'd do next	15	Identify the bottleneck, propose an experiment

Paper	Year	Key idea
Hu et al. — LoRA	2021	Low-rank weight updates = full FT quality
Yousefiramandi & Cooney — Decoder Classification	2025	Cls heads on decoders beat generation
Kandpal et al. — Long-Tail Knowledge	2023	Accuracy tracks pretraining document count
Houlsby et al. — Adapters	2019	The PEFT predecessor
He et al. — Unified View of PEFT	2022	Why LoRA, adapters, and prefix tuning are variations on one idea
Dettmers et al. — QLoRA	2023	LoRA + 4-bit quantization
Weller et al. — Seq vs Seq	2025	Controlled encoder-vs-decoder comparison
BehnamGhader et al. — LLM2Vec	2024	Decoders are secretly good encoders
Valdes Gonzalez — Cost-Aware Selection	2026	Pareto frontier for quality vs latency

Situation	Reach for
Common classes, tight latency budget	Encoder + LoRA
Rare classes matter, latency budget exists	Decoder + LoRA
Maximum accuracy, cost is no object	Full fine-tune (or a larger model)
Many fine-tuned variants served concurrently	Adapters (not LoRA)

Welcome back. Two weeks in and you've now fine-tuned a 149-million parameter model, run controlled experiments, and discovered that dozens of your 113 classes (46 with plain fine-tuning, about 37 with class weighting) still get zero F1 no matter what you try. This week we ask: do you actually need to update all 149 million parameters? And what happens if you try a completely different kind of model? By the end of today you'll have answers to both. And at least one of those answers will surprise you.

Here's where we are after two weeks. Full fine-tuning got you to about 56.6 percent accuracy, 0.209 macro F1, 46 classes at zero. Class weighting improved F1 to about 0.24 — the best macro F1 you've seen — but accuracy dropped to about 52 percent, and 37 classes are still invisible. That 0.24 came from training all 149 million parameters for 3 epochs on 58,000 examples. What if I told you we could get the same result while training 2 percent of the parameters? And what if I told you there's a model that can beat 0.24 on macro F1 — and it trained even fewer parameters? That's where we're going today.

This week has two ideas. First, LoRA — Low-Rank Adaptation. Instead of updating all 149 million parameters, you freeze the model and inject small trainable matrices into specific layers. You train about 2 percent of the parameters. Does quality hold? Second, you'll apply LoRA to a completely different architecture — a decoder model — and compare it head-to-head with your encoder on the same data, same task. The lecture gives you the conceptual tools. The lab gives you the empirical evidence.

Here's the plan. In the lecture, we'll cover what makes fine-tuning expensive, how LoRA addresses that, and the key architectural difference between encoders and decoders. Then we'll ask a question that most practitioners would answer wrong: can a decoder beat an encoder at classification? In the lab, you'll configure LoRA on the encoder you already know, train it, and then compare it against a decoder that was trained the same way. The comparison is the centerpiece.

Before we get into the technical content, a quick poll. You fine-tuned all 149 million parameters in Week 1. How many of those do you think actually needed to change? How much of the model's knowledge was already useful as-is, and how much did you actually modify for consumer complaints? Think about it. We'll come back to this in a few minutes.

Before LoRA makes sense, you need to understand the problem it solves. Why does fine-tuning cost what it costs?

Let's do the math. ModernBERT-base has 149 million parameters. In float32, at 4 bytes each, that's about 600 megabytes. During training, you also need gradients for every parameter: another 600 megabytes. AdamW keeps two moment estimates per parameter, a running mean of the gradient and a running mean of its square, which adds 1.2 gigabytes more. Before a single training example touches the GPU, you've consumed about 2.4 gigabytes just for the parameter state. This scales linearly with parameter count. Storage scales the same way with the number of tasks: if you want 10 fine-tuned models for 10 tasks, you store 10 copies of 600 megabytes. Six gigabytes of checkpoints, all of which are 99 percent identical to each other. LoRA asks: do we really need to update all 149 million parameters to adapt this model?
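The arithmetic above can be reproduced in a few lines. This is a back-of-envelope sketch that assumes float32 everywhere and ignores activations and framework overhead; the function name is made up for illustration.

```python
# Back-of-envelope training-state memory for full fine-tuning with AdamW.
# Assumes float32 (4 bytes) for weights, gradients, and both optimizer states.
# Activations and framework overhead are deliberately ignored.

def training_state_bytes(n_params: int, bytes_per_value: int = 4) -> dict:
    weights = n_params * bytes_per_value
    grads = n_params * bytes_per_value
    adam_states = 2 * n_params * bytes_per_value  # first and second moment
    return {
        "weights": weights,
        "grads": grads,
        "adam_states": adam_states,
        "total": weights + grads + adam_states,
    }

state = training_state_bytes(149_000_000)
for name, n_bytes in state.items():
    print(f"{name:12s} {n_bytes / 1e9:.2f} GB")

# Ten separately fine-tuned copies of the weights, one per task.
ten_checkpoints_gb = 10 * state["weights"] / 1e9
print(f"10 checkpoints: {ten_checkpoints_gb:.2f} GB")
```

The exact figures come out slightly under the rounded numbers in the lecture (596 MB rather than 600 MB per copy), which is the expected size of the rounding.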

Here's the key insight. When you fine-tune a pretrained model, the weight updates are not random perturbations across all 589,824 values in a matrix. Aghajanyan and colleagues showed in 2020 that the updates live in a low-dimensional subspace. The useful information in the weight change can be captured in far fewer dimensions than the full matrix. The analogy: if the weight matrix is a 768-dimensional space, the fine-tuning update only needs about 16 dimensions to represent what the model actually learned. LoRA — Low-Rank Adaptation — exploits exactly this.

Here's how LoRA works. You freeze the pretrained weight matrix W-zero entirely. Then you add two small matrices — A, which is 768 by 16, and B, which is 16 by 768. During the forward pass, the output is W-zero times x plus A times B times x. The product A times B has the same shape as W-zero — 768 by 768 — but it's parameterized by only 24,576 values instead of 589,824. That's a 24x reduction per layer. You apply this to every attention and feed-forward layer. The 16 in those dimensions is the rank — you choose it. Higher rank means more capacity, more parameters, but still far fewer than full fine-tuning. The Hu et al. paper from 2021 is in your readings. It's short and practical — I recommend reading at least the first five pages.
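To make the forward pass concrete, here is a minimal plain-Python sketch using the same shapes as above: A is d by r, B is r by d, and the output is W-zero times x plus A times B times x. The tiny dimensions and hand-picked values are illustrative assumptions; a real implementation would use a tensor library and would initialize one of the two factors to zero so the update starts as a no-op.

```python
# Minimal LoRA forward pass in plain Python.
# Shapes follow the lecture: A is (d x r), B is (r x d).

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W0, A, B, x):
    base = matvec(W0, x)               # frozen pretrained path
    update = matvec(A, matvec(B, x))   # project down to r dims, then back up to d
    return [b + u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, r):
    return d_out * r + r * d_in

d, r = 4, 2
W0 = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # identity, for readability
A = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0], [0.0, 0.0]]                 # d x r
B = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]                     # r x d
x = [2.0, 4.0, 6.0, 8.0]

y = lora_forward(W0, A, B, x)
print(y)

# Parameter counts for the 768 x 768 case at rank 16.
full = 768 * 768
low_rank = lora_param_count(768, 768, 16)
print(full, low_rank, full // low_rank)
```

Note that the product A times B is never materialized during training: applying B first and then A keeps the intermediate vector at r dimensions.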

Here's what Hu et al. actually measured. On GPT-3 with 175 billion parameters, LoRA at rank 4 trained only 17.5 million parameters: ten thousand times fewer. GPU memory dropped by 3x. The checkpoint went from 350 gigabytes to 35 megabytes. And quality was on par with or better than full fine-tuning, both on the GPT-3 tasks in the paper (WikiSQL, MNLI, and SAMSum) and on GLUE for the smaller RoBERTa and DeBERTa models. That last part is the surprising one. You'd expect a 10,000x reduction in trainable parameters to cost you something. It doesn't, because the weight updates live in that low-rank subspace we talked about. If the update only needs 4 dimensions, giving it 175 billion dimensions is waste, not capacity. And pay attention to that "same inference latency" line, because that's why LoRA won over earlier methods like adapters. At deploy time, you compute W-zero plus A times B once, merge it into the base weights, and ship a single model that runs exactly like the original. No extra layers in the forward pass. No added latency. Adapters and prefix tuning couldn't do that; they live in the forward pass forever. You'll see this matter again next week when we benchmark latency: merged LoRA and full fine-tuning are indistinguishable in speed.
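The merge step is easy to check by hand. The sketch below, with small integer matrices chosen purely for illustration, shows that folding A times B into W-zero once gives the same output as running the frozen path and the low-rank path separately.

```python
# Merging a LoRA update into the base weights: W_merged = W0 + A @ B.
# Integer toy values are an assumption made so the equality is exact.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def merge_lora(W0, A, B):
    AB = matmul(A, B)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, AB)]

W0 = [[1, 2], [3, 4]]
A = [[1], [2]]      # 2 x 1 (rank 1)
B = [[3, -1]]       # 1 x 2
x = [1, 2]

# Training-time view: two paths, summed.
unmerged_out = [b + u for b, u in zip(matvec(W0, x), matvec(A, matvec(B, x)))]

# Deploy-time view: one matrix, one matmul.
W_merged = merge_lora(W0, A, B)
merged_out = matvec(W_merged, x)

print(unmerged_out, merged_out)
```

After merging, the model has exactly the same shape and layer count as the original, which is where the "same latency" claim comes from.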

Let's do the arithmetic so you can verify it in the lab. ModernBERT's Wqkv layer is 2,304 by 768 — that's the fused query-key-value projection. Nearly 1.8 million parameters per layer. With LoRA at rank 16, you replace the update with two matrices: A is 768 by 16, B is 16 by 2,304. That's about 49,000 parameters — 2.8 percent of the original layer. Each transformer layer actually has four module types you could target — Wqkv for attention, Wo for attention output, Wi and Wo2 for the feed-forward network. They're not all the same size, so the per-layer total varies, but it averages about 150,000 trainable parameters per layer. Across 22 layers plus the classification head, you get roughly 3.5 million trainable parameters — 2.3 percent of the model. You'll verify this in the lab when you see the "Trainable: X / Y" printout after applying LoRA.
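Here is the same arithmetic as a short script. The Wqkv shape and the 150,000 per-layer average are taken from the paragraph above; the helper name is hypothetical, and the classification head is left out because its exact size depends on your configuration.

```python
# LoRA parameter arithmetic for ModernBERT's fused Wqkv layer at rank 16.

def lora_params(d_in, d_out, r):
    # A is (d_in x r), B is (r x d_out)
    return r * (d_in + d_out)

wqkv_full = 2304 * 768
wqkv_lora = lora_params(768, 2304, 16)
wqkv_fraction = wqkv_lora / wqkv_full
print(f"Wqkv: {wqkv_full:,} full vs {wqkv_lora:,} LoRA ({100 * wqkv_fraction:.1f}%)")

# Whole-model estimate using the lecture's per-layer average (head excluded).
layers_total = 22 * 150_000
print(f"22 layers: about {layers_total:,} trainable parameters before the head")

# The lecture's rounded total, as a share of the 149M-parameter model.
model_fraction = 3_500_000 / 149_000_000
print(f"Share of model: {100 * model_fraction:.1f}%")
```

If the "Trainable: X / Y" printout in the lab is far from these numbers, the most likely cause is a different set of target modules or a different rank.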

In the lab, you'll configure LoRA yourself. You'll inspect the model's layer names, decide which modules to target, and choose a rank. I'm not going to tell you the answers — you'll see the layers, read about what each one does, and make a decision. The result will be about 3.5 million trainable parameters out of 149 million — roughly 2.3 percent. Then you'll train it and compare to your full fine-tuning results from Week 1. The question you should have in your head: does training 2 percent of the parameters sacrifice quality?

Now the second idea. You've been working with ModernBERT, which is an encoder. But there's another family of models — decoders — and we're about to put one on the same task.

Encoders like ModernBERT are trained to predict masked tokens. The key property: every token can attend to every other token, past and future. This makes them naturally suited for tasks where you need to understand the whole input, like classification. ModernBERT has 149 million parameters and was pretrained on about 2 trillion tokens of mostly English text and code. This is what you've been using since Week 1.

Decoders like Qwen are trained to predict the next token. The key difference: each token can only attend to tokens before it — a causal mask. This makes them natural text generators, which is why they power chatbots and code assistants. Qwen 2.5-0.5B has 494 million parameters — about 3.3 times larger than ModernBERT — and was pretrained on roughly 18 trillion tokens of multilingual text. That's 9 times more pretraining data than the encoder. Quick note on model choice: there is a newer Qwen 3 with 0.6 billion parameters, trained on 36 trillion tokens. We tested it. It scored about 0.009 higher on macro F1 — real but small. The problem: it took 108 minutes to train instead of 57. For a homework where you need to run two training jobs plus analysis, doubling the training time isn't worth a marginal quality gain. Engineering trade-offs start here.

The default instinct in NLP is still to reach for an encoder when you have a classification task and a decoder when you need to generate text. So why would we use a decoder on a classification task at all? Here's the bet: we're trading inference speed and memory for representation quality, specifically on rare concepts. If that trade-off pays off, we get a better model for the long tail. If it doesn't, we've paid the cost for nothing. That's the question this week asks. Not "are decoders secretly better at classification in general" — but "on this specific task with this specific class distribution, is the pretraining advantage worth the inference cost?" You'll answer that for yourself this afternoon.

Yousefiramandi and Cooney published a paper in December 2025 testing this directly. They compared two ways of using a decoder for classification. Approach one: attach a classification head to the last token's hidden state and fine-tune with LoRA. One forward pass, 113 logits, done. Approach two: instruction-tune the decoder to generate the label text. They found the classification head approach significantly outperformed instruction tuning on F1, and was competitive with fine-tuned BERT models. This is exactly what you're going to do in the lab. Not text generation — a classification head, same as the encoder, but on a decoder backbone. The paper is in your readings.

Why might this work? In a decoder, each token attends to all previous tokens. The last token has seen the entire input — its hidden state is a summary of the whole complaint. That's analogous to how BERT's CLS token works, except the decoder builds it left-to-right instead of bidirectionally. You put a linear classification head on top of that hidden state and train it. One critical detail: padding must go on the left, not the right. If you right-pad like you do with an encoder, the last token in the sequence is a pad token — and the model classifies padding instead of your complaint. Left padding pushes the real content to the right edge where the classification head reads it. You'll set this up in the lab.
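The padding problem is easy to see with a toy batch. This sketch uses whole words as tokens and a naive classifier that always reads the final position; both are simplifying assumptions. With a Hugging Face tokenizer, the corresponding setting is the `padding_side` attribute.

```python
# Why padding side matters when you classify from the last position.
# Words stand in for tokens; reading position -1 is a naive illustration.

PAD = "<pad>"

def pad_batch(sequences, side):
    width = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        fill = [PAD] * (width - len(s))
        padded.append(fill + s if side == "left" else s + fill)
    return padded

batch = [
    ["I", "was", "charged", "a", "fee"],
    ["wrong", "balance"],
]

right = pad_batch(batch, "right")
left = pad_batch(batch, "left")

# What a last-position classification head would read in each case.
last_right = [row[-1] for row in right]
last_left = [row[-1] for row in left]

print("right-padded:", last_right)
print("left-padded: ", last_left)
```

With right padding, the shorter complaint ends in a pad token at the read position. With left padding, every row ends in real content.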

Let me make the architectural difference concrete. Take a complaint: "I was charged a fee for a convenience check I never ordered." The encoder processes this bidirectionally — every token attends to every other token. Then you pool across all positions and classify. The decoder processes left-to-right — "I" sees nothing, "was" sees "I", "charged" sees "I was", and so on. By the time you reach "ordered" — the last token — it has attended to everything before it. You classify from that last token's hidden state. Both approaches produce a single 113-dimensional logit vector. Same cross-entropy loss. Same evaluation. The difference is entirely in how the representation is built. And that difference matters — because the decoder's left-to-right representations were shaped by 18 trillion tokens of pretraining, while the encoder's bidirectional representations were shaped by 2 trillion.
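The two attention patterns can be written as masks. In this sketch a 1 means "row token may attend to column token"; the shortened example sentence is an assumption for readability. One detail the spoken version glosses over: in a causal mask each token also attends to itself.

```python
# Bidirectional (encoder) vs causal (decoder) attention masks.
# mask[i][j] == 1 means token i may attend to token j.

def attention_mask(n_tokens, causal):
    return [[1 if (not causal or j <= i) else 0 for j in range(n_tokens)]
            for i in range(n_tokens)]

def visible(tokens, mask, i):
    return [t for t, m in zip(tokens, mask[i]) if m]

tokens = ["I", "was", "charged", "a", "fee"]
enc = attention_mask(len(tokens), causal=False)
dec = attention_mask(len(tokens), causal=True)

print("encoder, first token sees:", visible(tokens, enc, 0))
print("decoder, first token sees:", visible(tokens, dec, 0))
print("decoder, last token sees: ", visible(tokens, dec, len(tokens) - 1))
```

Only the last row of the causal mask covers the whole input, which is why the decoder's classification head reads from the last token.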

Here's what the comparison looks like. Same dataset. Same 113 classes. Same number of training epochs. Same LoRA rank. Two different models. The encoder has 149 million parameters pretrained on about 2 trillion tokens. The decoder has 494 million parameters — 3.3 times larger — pretrained on 18 trillion tokens — 9 times more data. The decoder trains fewer parameters as a percentage — 0.46 percent versus 2.3 percent — because it's a bigger model. Now, I want to be upfront: this is an applied comparison, not a perfectly controlled experiment. Architecture, model size, and pretraining corpus all differ simultaneously. That's realistic. In practice you almost never get to isolate one variable when comparing models. Your job is to reason about the result given all the differences, not to pretend it's a clean ablation. Make a prediction right now, before we go further. Which model has higher macro F1? Write it down.

Before you go to the lab, I want to give you one more conceptual tool. It's about pretraining — and what it means for the rare classes you've been struggling with.

Let's think about the rare-class problem from the model's perspective. Your encoder has 29 classes where it gets zero F1 — literally never predicts them correctly. These classes have between 4 and 27 training examples each. Now, the encoder isn't starting from nothing — it was pretrained on about 2 trillion tokens and knows a lot about language. But its exposure to specific concepts like "convenience checks" or "applying for a mortgage" may be thin. Fine-tuning on 4 to 27 examples has to bridge whatever gap remains between the pretrained representation and a useful classifier. That's a lot to ask of a few examples.

If the decoder outperforms the encoder on rare classes, the natural question is why. There are at least three competing explanations and the evidence in your lab won't let you distinguish between them cleanly. First explanation: pretraining exposure. The decoder saw 18 trillion tokens versus the encoder's 2 trillion, and somewhere in that extra 16 trillion tokens there's more content about rare complaint categories. Second explanation: raw parameter count. The decoder has 494 million parameters versus the encoder's 149 million — more capacity, which might help with long-tail separation regardless of what the pretraining data contained. Third explanation: architectural representation structure. A causal last-token summary may encode rare-class information differently than a bidirectional pooled representation — not better or worse intrinsically, but different, and possibly more suitable for some class distributions. I want to be honest: your lab data won't let you isolate which of these is doing the work. You're comparing two models that differ on all three dimensions simultaneously. The question for your memo isn't "which explanation is correct" but "what does your evidence rule out, and what does it leave on the table."

Kandpal and colleagues at ICML 2023 established this rigorously. They measured how well language models answer factual questions and compared that to how many relevant documents appeared in the pretraining data. The relationship is nearly linear on a log scale — more pretraining exposure, better accuracy. Larger models help, but the improvement is gradual. Their estimate: to achieve competitive performance on truly rare knowledge, models would need to scale by many orders of magnitude. The implication for your task: those 29 rare classes sit at the far left of that curve. The encoder, pretrained on 2 trillion tokens, may have very little exposure to them. The decoder, pretrained on 18 trillion tokens, has more — but even 18 trillion may not be enough for some of them. The paper is in your readings. The first figure alone is worth the read.

So here's the frame for the lab. You have two models that differ in multiple ways: architecture, parameter count, and pretraining data. The question is whether those differences produce measurably different behavior on your task — and if they do, whether the difference is visible in the aggregate metrics or only when you look class by class. Keep Kandpal's finding in mind: performance tracks pretraining exposure, and the long tail is where the gap should be widest. But treat that as a hypothesis to test, not a conclusion to confirm.

Let's talk about speed. The decoder has 3.3 times more parameters than the encoder. On batched inference — processing 32 complaints at a time — the encoder takes about 10 milliseconds per example, the decoder takes about 27. That's a 2.8x gap. At scale, 2.8x means you need 2.8 times the GPUs to serve the same traffic. If your encoder deployment costs a million dollars a year in GPU, the decoder costs 2.8 million. That's not a theoretical number — it's a procurement conversation. You'll measure this yourself in the homework. And you'll need these numbers for the deployment recommendation in your memo.
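For the memo, it helps to turn per-example latency into job time and cost. This sketch uses the rounded table values of 10 and 27 milliseconds, which give a ratio of 2.7 rather than the 2.8 quoted from the unrounded measurements. It also assumes GPU cost scales in direct proportion to latency, which is a simplification.

```python
# Turning per-example latency into batch-job time and relative serving cost.
# Inputs are the rounded lecture figures; substitute your own measurements.

def batch_job_minutes(n_examples, ms_per_example):
    return n_examples * ms_per_example / 1000 / 60

def relative_gpu_cost(baseline_cost, baseline_ms, candidate_ms):
    # Assumes cost scales linearly with time per example.
    return baseline_cost * candidate_ms / baseline_ms

encoder_ms, decoder_ms = 10, 27
n_complaints = 64_000

enc_minutes = batch_job_minutes(n_complaints, encoder_ms)
dec_minutes = batch_job_minutes(n_complaints, decoder_ms)
decoder_cost = relative_gpu_cost(1_000_000, encoder_ms, decoder_ms)

print(f"encoder: {enc_minutes:.1f} min, decoder: {dec_minutes:.1f} min")
print(f"decoder yearly cost if encoder costs $1M: ${decoder_cost:,.0f}")
```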

One more thing to think about before the lab. In Week 2, class weighting gave the encoder a big boost: macro F1 rose from 0.209 to about 0.24, a relative improvement of about 15 percent. The homework asks you to try the same trick on the decoder. Before you run it, form a hypothesis: will class weighting help the decoder as much as it helped the encoder? More? Less? The same? Think about what class weighting actually does and what you've heard today about the two architectures. There's no guaranteed answer here; you'll run the experiment and reason from your own results.

Let me cover the practical decisions you'll face in the lab, so you're prepared.

Before you pick which modules to target, you need to know what the modules actually are. Every transformer layer — and both models have twenty-plus of these — has the same two blocks. First, an attention block, where tokens exchange information with each other. Second, a feed-forward block, where each token does some processing on its own. Around each block there's a residual connection and a layer norm. Inside each block, a handful of linear layers. That's it. LoRA works by adding a small trainable update to any linear layer you want. So when you ask "which modules do I target?", you're really asking "which linear layers get a LoRA adapter attached?" Let's look at what's inside each block.

Attention is how tokens talk to each other. Every token gets projected three different ways. The query projection asks "what am I looking for in the other tokens?" The key projection advertises "here's what I have to offer." The value projection carries "here's the actual information I'm contributing." Attention computes the similarity between queries and keys — how much does token A want what token B offers — turns those similarities into weights, and takes a weighted sum of the values. So after attention, each token now carries a blend of information from the tokens it cared about. One more projection — the output — reshapes the result so it can rejoin the token stream. Four linear layers per attention block: Q, K, V, O. Every one is a potential LoRA target.
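The similarity, weighting, and blending steps can be sketched in plain Python. This assumes the query, key, and value vectors have already been produced by their projections, uses a single head, and includes the standard scaling by the square root of the dimension, which the description above leaves out. The output projection is omitted.

```python
# Single-head scaled dot-product attention on already-projected Q, K, V.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d = len(Q[0])
    out = []
    for q in Q:
        # How much does this query want what each key offers?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the values.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query, two identical keys: the weights are 0.5 each,
# so the output is the plain average of the two values.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[2.0, 0.0], [4.0, 0.0]]

out = attention(Q, K, V)
print(out)
```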

Here's where the naming gets confusing. Both models do the same four projections, but they organize them differently. ModernBERT fuses the query, key, and value projections into one big matrix called Wqkv. Why? One matrix multiplication is faster than three, and the result is identical. The output projection is Wo. Qwen keeps them as separate matrices — q_proj, k_proj, v_proj, and o_proj. Neither is mathematically superior; fusion is just a speed trick on the forward pass. But this affects your LoRA config. ModernBERT's attention has two targetable modules. Qwen's has four. Same roles, different number of names.
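The claim that fusion gives an identical result can be checked directly. In this sketch the fused matrix is built by stacking the three projection matrices row-wise in query, key, value order; that ordering and the toy values are assumptions for illustration.

```python
# Fused vs separate Q/K/V projections produce the same vectors.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

Wq = [[1, 0], [0, 1]]
Wk = [[2, 0], [0, 2]]
Wv = [[0, 1], [1, 0]]
x = [3, 5]

# Separate: three matrix multiplications (Qwen-style naming).
q_sep, k_sep, v_sep = matvec(Wq, x), matvec(Wk, x), matvec(Wv, x)

# Fused: one multiplication, then split the output into thirds.
Wqkv = Wq + Wk + Wv            # 6 x 2, rows stacked
fused = matvec(Wqkv, x)
q_fused, k_fused, v_fused = fused[0:2], fused[2:4], fused[4:6]

print(q_fused, k_fused, v_fused)
```

For LoRA this means a single adapter on the fused matrix covers all three projections at once, while the separate layout lets you adapt any subset.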

After attention, each token takes a solo trip through a small two-layer MLP. First, the up projection expands the representation. In a classic BERT-base FFN that means going from 768 dimensions to 3,072, four times wider. This gives the token room to do some nonlinear processing. Then an activation function. Then the down projection contracts it back to the original size so it can rejoin the stream. Classic FFN: two linear layers, up and down. ModernBERT exposes exactly two, Wi and Wo2, although its Wi is a fused matrix that also carries a gate, so its expansion factor differs from the classic 4x. Qwen uses SwiGLU, which keeps the gating mechanism as a separate module. Instead of one up projection, you get a gate projection and an up projection. Their element-wise product, with the activation applied to the gate, goes through the down projection. Slightly better quality empirically, slightly more parameters, and, for our purposes, one more module name on the list. Don't get hung up on why gating helps. Just know that when you look at Qwen, you'll see three FFN modules instead of two.
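The two feed-forward shapes described above can be sketched side by side. The weights are toy values and the ReLU in the classic variant is a stand-in activation chosen for easy arithmetic; the gated variant uses SiLU on the gate, which is what SwiGLU means.

```python
# Classic two-layer FFN vs a SwiGLU-style gated FFN.
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def relu(z):
    return max(0.0, z)

def silu(z):
    return z / (1.0 + math.exp(-z))

def classic_ffn(x, W_up, W_down, act):
    # expand -> activation -> contract
    return matvec(W_down, [act(h) for h in matvec(W_up, x)])

def swiglu_ffn(x, W_gate, W_up, W_down):
    # gate and up run in parallel; their element-wise product is contracted
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_down, [g * u for g, u in zip(gate, up)])

# Toy sizes: model dimension 1, hidden dimension 2.
classic_out = classic_ffn([2.0], [[1.0], [-1.0]], [[1.0, 1.0]], relu)
gated_out = swiglu_ffn([1.0], [[1.0], [0.0]], [[2.0], [3.0]], [[1.0, 1.0]])

print(classic_out, gated_out)
```

The classic variant has two weight matrices to target and the gated one has three, which matches the module counts you will see when you print each model.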

Here's the summary, and this is the slide to photograph if you only remember one. Same roles in both architectures. Different names. Query, key, value projections: fused in ModernBERT, separate in Qwen. Attention output: Wo versus o_proj. FFN expansion: Wi in ModernBERT, up_proj plus gate_proj in Qwen because of SwiGLU. FFN contraction: Wo2 versus down_proj. In the lab, you'll print the model and see the actual names and the hierarchy they're nested in. Then you decide which ones to adapt. A common default is attention only: the original LoRA paper adapted just the query and value projections, which are q_proj and v_proj in Qwen's naming. You can go broader. More modules means more capacity, more trainable parameters, more memory. You'll make this call yourself and see the consequences.

The rank controls how many dimensions the low-rank update uses. Rank 4 is very light — about 6,000 parameters per adapted layer. Rank 16 is the standard choice — about 25,000 parameters. Rank 64 is heavy. Higher rank gives you more capacity, but there are diminishing returns. Biderman and colleagues published a detailed study in 2024 showing that LoRA consistently learns less than full fine-tuning — it's a constrained approximation — but it also forgets less of the base model's knowledge. The rank controls that trade-off. For your 113-class task, rank 16 is a reasonable starting point, but you'll choose in the lab.

When you work with the decoder in the homework, there are three things you must get right. First, left padding. We covered why. Second, modules to save. The classification head, called "score" in Qwen, is randomly initialized. If you don't include it in modules_to_save, it won't be saved with the LoRA adapter. You'd reload the model, get a fresh random head, and your metrics would look random. Third, load the model in float32. Qwen defaults to bfloat16, but the T4 GPU has no native bfloat16 support. Load in float32 and let the Trainer handle mixed precision. Miss any of these and training fails silently: it runs, it produces numbers, but the numbers are wrong. I'm telling you this now so you know what to watch for.
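Because all three mistakes fail silently, a small pre-flight check before training can save a wasted run. The helper below is hypothetical and takes plain values; in practice you would pass in whatever your tokenizer reports for its padding side, the `modules_to_save` list from your LoRA config, and the dtype you loaded the model in.

```python
# Hypothetical pre-flight check for the three decoder setup requirements.
# Returns a list of human-readable problems; an empty list means all clear.

def decoder_setup_problems(padding_side, modules_to_save, load_dtype):
    problems = []
    if padding_side != "left":
        problems.append("padding side is not 'left': the head may read a pad token")
    if "score" not in modules_to_save:
        problems.append("'score' missing from modules_to_save: the head will not be saved")
    if load_dtype != "float32":
        problems.append("model not loaded in float32: bfloat16 is not native on a T4")
    return problems

good = decoder_setup_problems("left", ["score"], "float32")
bad = decoder_setup_problems("right", [], "bfloat16")

print("good setup:", good)
for problem in bad:
    print("problem:", problem)
```

Turning the returned list into an assertion at the top of the training cell makes the failure loud instead of silent.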

When you get to the comparison, don't stop at the aggregate numbers. Break the 113 classes into rare and common tiers — the notebook does this for you — and compare encoder versus decoder within each tier. The chart on the right shows what this kind of analysis looks like: rare classes on the left, common classes on the right, encoder in blue, decoder in orange. The pattern you see will explain more than any single macro F1 number. This is also the analysis that carries the most weight on the rubric — 25 points — so invest time here.
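The tiered comparison boils down to per-class F1 grouped by training frequency. The notebook does this for you; the sketch below shows the idea on made-up toy labels, with a rare-class cutoff of 27 training examples assumed from the range mentioned earlier in the lecture.

```python
# Per-class F1, averaged within "rare" and "common" tiers.
# Labels, predictions, and counts are toy data, not course results.

def per_class_f1(y_true, y_pred):
    f1 = {}
    for c in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom else 0.0
    return f1

def mean_f1_by_tier(f1, train_counts, rare_max):
    tiers = {"rare": [], "common": []}
    for c, score in f1.items():
        tiers["rare" if train_counts[c] <= rare_max else "common"].append(score)
    return {t: sum(v) / len(v) if v else float("nan") for t, v in tiers.items()}

train_counts = {"a": 1000, "b": 10}
y_true = ["a", "a", "a", "b"]
encoder_pred = ["a", "a", "a", "a"]   # never predicts the rare class
decoder_pred = ["a", "a", "b", "b"]   # trades a common-class error for a rare hit

enc_tiers = mean_f1_by_tier(per_class_f1(y_true, encoder_pred), train_counts, rare_max=27)
dec_tiers = mean_f1_by_tier(per_class_f1(y_true, decoder_pred), train_counts, rare_max=27)

print("encoder:", enc_tiers)
print("decoder:", dec_tiers)
```

In this toy case the two models look similar on the common tier and very different on the rare tier, which is the kind of pattern a single macro F1 number would blur.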

I want you to make three predictions before we break for the lab. Write them down — in your notebook, on paper, wherever. First: will encoder LoRA match the full fine-tuning quality from Week 1? You trained all 149 million parameters then. Now you're training 2 percent. Second: which model wins on macro F1 — encoder LoRA or decoder LoRA? By how much? Third: where will the decoder's advantage be largest — on the rare classes or the common ones? Hold these predictions. Don't change them once you see the numbers. The value is in comparing what you expected to what happened.

Let me tell you what the lab looks like.

Here's the lab structure. You start by configuring LoRA on the encoder: inspect the layers, choose your modules and rank. Then train for 3 epochs, about 32 minutes on a T4. While it trains, you write predictions about what you expect. After training, you evaluate and compare to your full fine-tuning results from Week 1. Then you load a decoder from HuggingFace Hub that has already been fine-tuned for you: same data, same task, same LoRA approach. The reveal is a side-by-side metrics table. After that, you dig into the per-class analysis: break it down by frequency tier, look at the scatter plot, interpret what you see. If encoder training doesn't finish in time, there's an already-trained fallback on HuggingFace Hub, so the reveal doesn't depend on your training run completing.

The homework extends the lab in three directions. First, you train the decoder yourself instead of loading a pre-trained checkpoint. You'll configure LoRA, set up left padding, handle the classification head. Second, you apply class weighting — the same sqrt-inverse weights from Week 2 — and see whether it helps the decoder the way it helped the encoder. Third, you benchmark latency for both models. The memo has five sections. The central question is a deployment recommendation: given everything you've measured — quality, per-class behavior, latency — what would you deploy, and for what use case? There is no single right answer. But "it depends" without saying on what doesn't count.
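As a reminder of what sqrt-inverse weighting does, here is a minimal sketch. Each class gets a weight proportional to one over the square root of its training count. Rescaling so the weights average to 1 is an assumption here and may differ from the exact normalization used in Week 2.

```python
# Sqrt-inverse class weights: w_c proportional to 1 / sqrt(n_c).
# Normalizing to a mean of 1 is an assumption for this sketch.
import math

def sqrt_inverse_weights(class_counts):
    raw = {c: 1.0 / math.sqrt(n) for c, n in class_counts.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {c: w / mean_raw for c, w in raw.items()}

counts = {"common": 10_000, "rare": 4}
weights = sqrt_inverse_weights(counts)
ratio = weights["rare"] / weights["common"]

print(weights)
print(f"rare class is weighted {ratio:.0f}x the common class")
```

The square root softens the correction: a class that is 2,500 times rarer gets 50 times the weight, not 2,500 times, which keeps a handful of examples from dominating the loss.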

Here's the point breakdown for the memo. I want to be upfront about what matters most. Section 3 — the per-class analysis — carries 25 points, the most of any section. That's deliberate. If you only report "the decoder has higher macro F1" and stop there, you've missed the week. The aggregate metrics are real, but they hide the most interesting story. The rubric is in the course repo.

Here's your reading list. Nine papers. The three in bold — LoRA, decoder classification, and long-tail knowledge — we covered in the lecture. The other six are referenced on individual slides or provide deeper context. You don't need to read all nine. Pick two or three that connect to what you find most interesting in your results. If you're curious about why LoRA won over adapters, read He et al. If you want the cleanest encoder-vs-decoder comparison, read Weller et al. — it's an ICLR 2026 paper where they trained paired models on identical data to control for every variable except architecture. If you're thinking about the deployment question for your memo, Valdes Gonzalez frames model selection as a Pareto frontier problem — quality versus cost — which is exactly the framework you need.

Here's the durable mental model I want you to leave with. When you face an adaptation decision in practice, the shape is roughly this. If your task is dominated by common classes and you have a tight latency budget — a real-time classification API, for example — reach for an encoder with LoRA. If rare classes matter and you have some latency budget to spend, the decoder with LoRA is worth considering. If you can spend unlimited compute and accuracy is all that matters, full fine-tuning or a larger model is the right move. And if you're serving many fine-tuned variants concurrently and need to swap between them without reloading the base, adapters are still the right choice. The homework's deployment question asks you to identify which situation you're in for the consumer complaints task and justify it with your own evidence. There's no universal answer.

Next week we build on everything from today. New tools, new experiments. The encoder-vs-decoder question gets sharper. See you in the lab.

ECBS5200 — Week 3

What happens when you only train 2% of the model?

Where we left off

This week

Today's plan

Quick poll

The Cost of Fine-Tuning

Why do we care about parameter efficiency?

What full fine-tuning actually requires

The intuition

LoRA: Low-Rank Adaptation

The numbers: LoRA on GPT-3 175B

Worked example: the math

LoRA in practice

Encoders vs Decoders

Two families of transformer. Same building blocks, different training.

Encoder: bidirectional

Decoder: left-to-right

A decoder for classification?

Classification head on a decoder

Why might a decoder work for classification?

What the decoder sees

The comparison

What pretraining buys you

Especially for the long tail.

The rare-class problem

Competing hypotheses for rare-class performance

The paper: Kandpal et al. 2023

What this means for your lab

The speed question

A different question about rare classes

Practical LoRA decisions

Things you need to know for the lab.

Every transformer layer has the same parts

Attention: Q, K, V, O

Attention naming: fused vs separate

Feed-forward: expand and contract

Summary: same roles, different names

Rank: how many dimensions?

Decoder-specific setup

Where to look in the lab

Make your predictions now

The lab

Encoder LoRA → Decoder reveal → Per-class comparison

Lab structure (~80 min)

Your homework

The rubric at a glance

Week 3 Reading List

A mental model for model choice

Next week