ECBS5200 Week 1

Practical Deep Learning Engineering for Applied ML

ECBS5200 — Week 1

A model is an artifact, not a service endpoint.

Fine-tuning a Real Model and Auditing the Data
ECBS5200 Week 1

What this course IS

Post-training engineering: you take a pretrained model and make it yours.

  • Fine-tune it on your data
  • Adapt it cheaply with LoRA
  • Analyze its errors systematically
  • Compress it via quantization
  • Distill its knowledge into a smaller model
  • Justify your final recommendation with evidence

You will do all of this on one real task, building up a single model line over the semester.

What this course IS NOT

  • Not prompt engineering. You won't be writing system prompts.
  • Not an agents course. No tool-use, no chains, no orchestration.
  • Not an LLM apps course. You may have done that already — great.
  • Not a theory course. We care about what works, what it costs, and why.

We are inside the model, not outside it.

If your prior ML experience taught you to call APIs and parse outputs, this course teaches you what's happening on the other side of that API.

The semester arc

Week  Topic
1     Fine-tune a real model, audit the data
2     Improve: learning rate schedules, class weights, error analysis
3     Adapt cheaply with LoRA / PEFT
4     Analyze: confusion matrices, per-class deep dives
5     Compress: quantization (INT8, INT4)
6     Distill + final engineering recommendation

Each week builds on the last. You keep the same dataset. You keep the same model family. The artifact evolves.

The task: consumer complaint classification

Dataset: determined-ai/consumer_complaints_medium

Real consumer complaints filed with the US Consumer Financial Protection Bureau.

  • 113 classes (after label cleanup — more on this shortly)
  • 57,846 train / 6,430 val / 21,432 test examples
  • Real text, real labels, real class imbalance

Base model: ModernBERT-base (149M parameters, Apache 2.0 license)

You will work with this dataset and this model all semester.

Individual work, one model line

This is individual work. Everyone gets the same dataset and base model. Your decisions are yours.

Over the semester you will make choices:

  • What learning rate? What schedule?
  • Which LoRA rank? Which layers to adapt?
  • How aggressively to quantize?
  • Whether to distill, and from what teacher?

Different students will make different choices and get different results. That's the point.

Assessment

Component                   Weight  What it measures
Weekly memos                60%     Can you analyze results and write clearly about trade-offs?
Weekly quizzes (Weeks 2-6)  15%     Do you understand last week's concepts?
Final exam                  20%     Do you understand the concepts, not just the code?
Participation               5%      Are you present and engaged?

The memos are where you demonstrate engineering judgment. Not "I got X accuracy" but "I chose Y because Z, and here's what happened."

Today's plan

Block 1 — Lecture (13:30-15:10)

  • Course framing (you are here)
  • ModernBERT architecture
  • Dataset audit: what are we working with?
  • Baseline discipline: TF-IDF + logistic regression
  • Anatomy of a fine-tuning loop
  • The motivating benchmark: encoder vs. decoder

Block 2 — Lab (15:30-17:10)

  • Hands-on: data audit + classical baseline notebook (completable in class)

Homework: Fine-tune ModernBERT on full data (separate notebook)

Why ModernBERT-base?

The base model for the semester

Why ModernBERT-base?

  • Modern encoder (2024, updated 2025), not legacy BERT from 2018
  • 149M parameters, Apache 2.0 license — you can use it anywhere
  • Works with LoRA/PEFT, quantization, and distillation on free-tier GPUs
  • Uses SDPA (Scaled Dot-Product Attention) — faster and more memory-efficient
  • Same tokenizer/architecture patterns as classic BERT, but with modern training

What's inside ModernBERT-base

  • 22 Transformer layers — reads the whole sequence at once (bidirectional, not left-to-right)
  • 768 dimensions per token
  • [CLS] token represents the full sequence
  • Classification head — tiny (768 → 113, <0.1% of params)
  • Pretrained encoder = 149M params (blue)
  • New classification head = ~87K params (red)
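
The head size quoted above can be checked with a little arithmetic. This is a minimal sketch that assumes the head is a single linear layer (one weight matrix plus one bias vector) and uses the rounded 149M encoder figure from the slide; the variable names are illustrative only.

```python
# Parameter count of a linear classification head: hidden_size -> num_labels.
# Assumption: one weight matrix plus one bias vector, nothing else.
hidden_size = 768
num_labels = 113
encoder_params = 149_000_000  # rounded figure from the slide

head_params = hidden_size * num_labels + num_labels  # weights + biases
head_share = head_params / (encoder_params + head_params)

print(f"head parameters: {head_params:,}")
print(f"share of all parameters: {head_share:.3%}")
```

Under these assumptions the head lands near the ~87K figure and well under 0.1% of the total.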

Dataset Audit for Supervised NLP

Before you train anything, look at what you're training on.

Why audit your data?

Training a model on data you haven't examined is malpractice.

Things that go wrong when you skip the audit:

  • Labels that look different but mean the same thing → inflated class count
  • Classes with 2 examples → model can't learn them, but they tank your macro-F1
  • Redacted or missing text → model learns patterns in the redaction, not the content
  • Extreme class imbalance → model ignores minority classes entirely

Every one of these is present in our dataset. We'll deal with them.

The CFPB consumer complaints dataset

Source: US Consumer Financial Protection Bureau (CFPB)

Real complaints from real consumers about financial products and services.

Field           What it contains
complaint_text  The consumer's written complaint (free text)
issue           The category label assigned to the complaint

On Hugging Face as: determined-ai/consumer_complaints_medium

This is not a toy dataset. These are real people with real problems.

Label distribution: the long tail

113 classes after cleanup.

The top class alone covers nearly 1 in 4 complaints.

Many tail classes have fewer than 50 training examples.

Any model that learns only the common classes will look decent on accuracy and terrible on macro F1.

Why 113 classes? The label merge story

The raw CFPB data has 153 unique issue labels.

Many are near-duplicates from a CFPB form change:

"Incorrect information on credit report"     ← old form
"Incorrect information on your report"        ← new form

Same complaint type, different wording. The form changed; the meaning didn't.

What we did:

  1. Merged near-duplicate labels → reduced to ~146
  2. Dropped classes with fewer than 5 total examples → 113 canonical classes

This is a normal data-engineering step. Messy labels are the rule, not the exception.
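
The two cleanup steps can be sketched in a few lines. This is an illustration, not the course's actual cleanup script: `MERGE_MAP`, `MIN_EXAMPLES`, and `clean_labels` are hypothetical names, and the mapping holds only the one label pair shown on the slide.

```python
from collections import Counter

# Hypothetical mapping from old-form wording to the canonical label.
# This one pair comes from the slide; a real mapping would be longer.
MERGE_MAP = {
    "Incorrect information on credit report": "Incorrect information on your report",
}
MIN_EXAMPLES = 5  # classes with fewer total examples are dropped

def clean_labels(labels):
    """Merge near-duplicate labels, then drop classes that are too small."""
    merged = [MERGE_MAP.get(label, label) for label in labels]
    counts = Counter(merged)
    return [label for label in merged if counts[label] >= MIN_EXAMPLES]

# Toy data: two wordings of one class, plus a class with only 2 examples.
raw = (["Incorrect information on credit report"] * 3
       + ["Incorrect information on your report"] * 4
       + ["Some very rare issue"] * 2)
cleaned = clean_labels(raw)
print(Counter(cleaned))
```

In a real pipeline the same filter would be applied to the rows, so each complaint text is dropped along with its label. Merging has to happen before counting, otherwise a class split across two wordings could be dropped by mistake.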

What do complaints actually look like?

Short complaint:

"There are many mistakes appear in my report without my understanding."

Typical complaint:

"I have a prior XXXX account that is being reported as a charge off. I have contacted the company numerous times to resolve this matter. The account was paid in full XXXX XXXX, XXXX. Please update my credit report to reflect paid in full."

Redacted complaint:

"On XXXX XXXX, XXXX I contacted XXXX XXXX XXXX regarding charges of {$XXX} on my account. I was told XXXX XXXX XXXX would investigate..."

Notice the XXXX markers. They're everywhere.

Redaction prevalence

63.5% of all complaints contain at least one XXXX redaction marker.

This means for nearly two-thirds of the data, the model is working with incomplete text.

What's redacted:

  • Names → XXXX
  • Dates → XXXX
  • Account numbers → XXXX
  • Dollar amounts → {$XXX}

Consequence: The model cannot learn from specific dates, names, or amounts. It must learn from the structure and vocabulary of the complaint.

This is actually fine for classification — but you need to know it's happening.
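
A prevalence figure like this can be measured with a single regular expression. The sketch below is one possible approach, run on shortened versions of the sample complaints above; the pattern and the `redaction_rate` helper are assumptions, not the course's audit code.

```python
import re

# Matches runs of three or more X characters, which covers both the
# "XXXX" markers and the "{$XXX}" amount placeholder shown above.
# Assumption: genuine words never contain "XXX".
REDACTION = re.compile(r"X{3,}")

def redaction_rate(texts):
    """Fraction of texts containing at least one redaction marker."""
    flagged = sum(1 for text in texts if REDACTION.search(text))
    return flagged / len(texts)

samples = [
    "There are many mistakes appear in my report without my understanding.",
    "I have a prior XXXX account that is being reported as a charge off.",
    "I contacted XXXX XXXX XXXX regarding charges of {$XXX} on my account.",
]
rate = redaction_rate(samples)
print(f"{rate:.1%} of samples contain a redaction marker")
```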

Text length distribution

Statistic                    Value
Median length                ~50 words
Share fitting in 128 tokens  85–90% of complaints
Mean length                  Longer than the median (right-skewed)

Most complaints are short: a few sentences describing the problem.

A small fraction are very long — multi-paragraph narratives.

With max_length = 128 tokens, we cover 85–90% of complaints without any truncation.

The 10–15% that get truncated lose their tail end — but the complaint type is usually clear from the first few sentences.
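
Choosing max_length comes down to asking what fraction of examples fit under each candidate limit. A minimal sketch, assuming you already have a token count per complaint; the lengths below are made up for illustration and are not dataset statistics.

```python
def coverage(token_lengths, max_length=128):
    """Fraction of examples whose token count fits within max_length."""
    fits = sum(1 for n in token_lengths if n <= max_length)
    return fits / len(token_lengths)

# Made-up token counts for ten complaints (not real dataset statistics).
lengths = [22, 35, 41, 48, 55, 63, 80, 97, 120, 410]
for limit in (64, 128, 256):
    print(f"max_length={limit}: {coverage(lengths, limit):.0%} untruncated")
```

On a right-skewed distribution like this one, raising the limit beyond a certain point buys very little extra coverage while making every padded batch more expensive.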

Canonical data split

Split       Examples  Purpose
Train       57,846    Model learns from these
Validation  6,430     Tune hyperparameters, monitor overfitting
Test        21,432    Final evaluation only; never train on this

The split is fixed for the entire semester. Everyone uses the same split.

This is non-negotiable: you never, ever look at test set performance to make training decisions. That's data leakage.

Dataset audit: key takeaways

  1. Audit before you train. Know your data before you touch a model.
  2. 113 classes after merging near-duplicates and dropping tiny classes from 153 raw labels.
  3. Extreme long tail — top class is 23%, many classes < 0.1%.
  4. 63.5% of complaints contain at least one redaction marker, so the model learns from structure, not specifics.
  5. Most text fits in 128 tokens — truncation affects only 10–15% of examples.
  6. Fixed split — 57,846 / 6,430 / 21,432. Same for everyone, all semester.

This is the terrain. Now let's see what a simple baseline can do on it.

Baseline Discipline

Always know what "dumb" gets you before you try "smart."

Why baselines matter

A model is only impressive relative to something.

Without a baseline:

  • "58% accuracy" — is that good? Bad? You have no idea.
  • "0.30 macro-F1" — better than what?

With a baseline:

  • "58% accuracy vs. 54.2% for TF-IDF + logistic regression" — small but real improvement.
  • "0.30 macro-F1 vs. 0.132 for TF-IDF" — the neural model more than doubled F1.

The baseline turns a number into a story.

The simplest baseline: majority class

Always predict the most common class.

In our dataset, "Incorrect information on your report" = 23.0% of all examples.

Metric    Value
Accuracy  23.0%
Macro-F1  ~0.003

This model has zero intelligence. It learns nothing. It ignores every input.

But it gets 23% accuracy. That's the floor. If your model can't beat this, it's worthless.
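
Both numbers in the table follow from the class share alone. The sketch below works through the arithmetic, assuming the top class makes up 23.0% of the evaluation split and that all 113 classes appear in it.

```python
# Majority-class baseline: always predict the most frequent label.
# Inputs are the figures from the slide; the F1 formula is standard.
majority_share = 0.230  # share of examples in the top class
num_classes = 113

accuracy = majority_share        # correct exactly when the true label is the top class
precision = majority_share       # every prediction is the top class
recall = 1.0                     # every top-class example is predicted as such
f1_top = 2 * precision * recall / (precision + recall)
macro_f1 = f1_top / num_classes  # the other 112 classes each score F1 = 0

print(f"accuracy = {accuracy:.1%}, macro-F1 = {macro_f1:.4f}")
```

The top class gets a respectable F1 on its own, but macro-F1 averages it with 112 zeros, which is why the result rounds to about 0.003.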

The classical baseline: TF-IDF + logistic regression

TF-IDF: Turn text into a sparse vector of word-importance scores.
Logistic regression: Learn a linear decision boundary in that vector space.

This takes about 30 seconds to train. No GPU. No pretrained weights. No deep learning.

It's the "can a simple model handle this?" test.
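
To make "word-importance scores" concrete, here is a from-scratch sketch of the textbook TF-IDF formula on three made-up sentences. It is for intuition only: the lab presumably uses a library vectorizer, and library implementations often add smoothing and normalization that this version omits.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {word: score} dict per document.

    Textbook variant: tf = count / doc length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return vectors

docs = [
    "my credit report has an error",
    "my loan payment was late",
    "my credit card was charged twice",
]
vectors = tfidf(docs)
print(sorted(vectors[0].items(), key=lambda kv: -kv[1])[:3])
```

A word that appears in every document ("my") scores zero, while a word unique to one document ("report") scores highest. That is the whole idea: rare words carry the signal.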

TF-IDF + logistic regression: results

Metric        Value
Val accuracy  54.2%
Macro-F1      0.132
Weighted F1   0.479

At first glance, 54.2% accuracy looks decent — more than double the majority baseline.

But look at that macro-F1: 0.132.

Something is very wrong.

The accuracy vs. macro-F1 gap

54.2% accuracy, 0.132 macro-F1.

The model learned ~40 classes and completely ignores the other ~70.

70 out of 113 classes get F1 = 0. The model never predicts them. Not once.

The red bars are classes with zero F1. That's most of the chart.

Why this gap is the core lesson

The accuracy–macro-F1 gap tells you:

"Your model is biased toward common classes and ignoring rare ones."

This is the default failure mode of any model trained on imbalanced data without intervention.

It happens because:

  • Common classes dominate the loss function → model optimizes for them
  • Rare classes contribute little to the loss → model learns to ignore them
  • Accuracy rewards this behavior → you won't notice unless you check macro-F1

Checking only accuracy is how you ship a model that fails on 60% of complaint types.
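
The gap is easy to reproduce on toy data. The sketch below computes accuracy and macro-F1 from scratch on made-up labels in which one class dominates; the `per_class_f1` helper and the data are illustrative, not course results.

```python
def per_class_f1(y_true, y_pred):
    """F1 for each class in y_true (0.0 when there are no true positives)."""
    scores = {}
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        scores[cls] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return scores

# Toy data: class "a" dominates; the model never predicts "c" or "d".
y_true = ["a"] * 8 + ["b"] * 2 + ["c", "d"]
y_pred = ["a"] * 8 + ["b", "a"] + ["a", "a"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_f1(y_true, y_pred)
macro_f1 = sum(scores.values()) / len(scores)
zero_f1 = [cls for cls, f1 in scores.items() if f1 == 0.0]

print(f"accuracy = {accuracy:.2f}, macro-F1 = {macro_f1:.2f}, zero-F1 classes = {zero_f1}")
```

Accuracy counts examples, so the dominant class carries it. Macro-F1 counts classes, so every ignored class pulls the average down by the same amount regardless of how few examples it has.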

The scoreboard so far

Model               Accuracy  Macro-F1  Weighted F1  Cost
Majority class      23.0%     ~0.003    -            Free
TF-IDF + LogReg     54.2%     0.132     0.479        Free, 30 sec
Fine-tuned encoder  ?         ?         ?            Free (Kaggle T4)

Can a neural model do better?

More importantly: can it do better where the baseline fails — on those 70 ignored classes?

Baseline discipline: key takeaways

  1. Always run a baseline first. Numbers without a reference point are meaningless.
  2. Majority class = 23.0% accuracy. That's the absolute floor.
  3. TF-IDF + LogReg = 54.2% accuracy, 0.132 macro-F1. Learns ~40 classes, ignores 70.
  4. The accuracy–macro-F1 gap is your single most important diagnostic signal for imbalanced classification.
  5. 70 out of 113 classes get F1 = 0 from the classical baseline. Those are the classes where a neural model needs to prove its value.

Anatomy of a Fine-Tuning Loop

What actually happens when you fine-tune a pretrained encoder.

The big picture

Fine-tuning = take a model that already understands language and teach it your specific task.

Pretrained encoder (ModernBERT-base, 149M params)
  ↓
+ Classification head (new, randomly initialized)
  ↓
Train on your labeled data
  ↓
Model that classifies consumer complaints

You're not training from scratch. You're adapting a model that already knows English grammar, word meaning, sentence structure. You just need to teach it your 113 labels.

Step 1: Load pretrained weights

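from transformers import AutoModelForSequenceClassification

# Pretrained encoder plus a new, randomly initialized 113-way head
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=113,
)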

This does two things:

  1. Loads the pretrained encoder (149M params) — already knows language
  2. Adds a classification head (random weights) — needs to learn our task

The encoder is the foundation. The classification head starts from random weights, so it has the most to learn, but in full fine-tuning the encoder weights are updated too.

Step 2: Tokenize the data

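from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

def tokenize(example):
    # "text" is the column name assumed here; adjust if yours is complaint_text.
    # Every complaint is truncated or padded to exactly 128 tokens.
    return tokenizer(example["text"],
                     max_length=128,
                     truncation=True,
                     padding="max_length")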

max_length = 128 covers 85–90% of complaints without truncation.

The tokenizer and the model must use the same vocabulary. They were trained together.

You saw this in Pre-Work Module 01. Same concept, now applied to 57,846 training examples.

Step 3: DataLoader — batching

Training processes data in batches, not one example at a time.

Batch size  GPU memory  Training speed  Gradient noise
8           Low         Slow            High
16          Moderate    Moderate        Moderate
32          High        Fast            Low
64          Very high   Very fast       Very low

We use batch_size = 32 — fits comfortably on a free Kaggle T4 GPU (16 GB) with max_length=128.

Each batch: 32 tokenized complaints → model → 32 predictions → one gradient update.
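
Batch size also fixes how many gradient updates one epoch contains. A quick calculation, assuming the final partial batch is kept rather than dropped:

```python
import math

train_examples = 57_846
batch_size = 32

# One gradient update per batch; the last batch is smaller than the rest.
steps_per_epoch = math.ceil(train_examples / batch_size)
last_batch_size = train_examples - (steps_per_epoch - 1) * batch_size

print(f"{steps_per_epoch} updates per epoch, last batch has {last_batch_size} examples")
```

Halving the batch size doubles the number of updates per epoch, which is one reason batch size and learning rate are usually tuned together.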

Step 4: Forward pass

For each batch:

Input tokens  →  Encoder  →  Hidden states  →  [CLS] embedding  →  Head  →  113 logits
  1. Tokens enter the encoder (22 transformer layers in ModernBERT-base)
  2. The encoder produces a hidden state for every token position
  3. We take the [CLS] token's hidden state as the sequence representation
  4. The classification head maps it to 113 logits (one per class)

The logits are raw scores — not probabilities yet.

Step 5: Loss — cross-entropy

Cross-entropy loss = −log(p), where p is the model's predicted probability for the true class.

Scenario             p(true class)  Loss
Confident & correct  0.85           0.16
Uncertain            0.10           2.30
Confident & wrong    0.02           3.91

The curve is steep on the left — confident wrong answers get hammered.
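
The table values can be reproduced directly, and the same few lines show how logits become the probability p in the first place. This is a plain-Python sketch for intuition (a framework's built-in loss would be used in practice); the three logits at the end are made up.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_index):
    """Negative log of the probability assigned to the true class."""
    return -math.log(softmax(logits)[true_index])

# Reproduce the table directly from p(true class).
for p in (0.85, 0.10, 0.02):
    print(f"p = {p:.2f} -> loss = {-math.log(p):.2f}")

# Three made-up logits; the true class is index 0.
loss = cross_entropy([2.0, 0.5, -1.0], true_index=0)
print(f"loss for the made-up logits: {loss:.3f}")
```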

Why cross-entropy explains the 70 ignored classes

Cross-entropy averages over every example in a batch.

Class                            Training examples  Share of total loss
"Incorrect info on your report"  14,815             ~25% of loss
A rare class with 8 examples     8                  ~0.01% of loss

The optimizer follows the gradient. The gradient is dominated by common classes.

Getting a rare class right barely moves the loss. Getting it wrong barely hurts. So the model learns to ignore it.

This is the same mechanism that leaves 70 classes ignored by the TF-IDF + logistic regression baseline, and it will happen to our neural model too unless we intervene (Week 2: class weighting).
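
The shares in the table are, to a first approximation, just example counts divided by the training set size. The sketch below makes that simplification explicit: it assumes every example contributes the same loss, which is not true once training is under way but is close enough to show the scale of the imbalance.

```python
train_examples = 57_846
class_counts = {
    "Incorrect info on your report": 14_815,
    "rare class (8 examples)": 8,
}

# Simplification: if every example contributed an equal loss, a class's share
# of the total loss would equal its share of the training examples.
shares = {name: count / train_examples for name, count in class_counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2%} of examples")
```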

Step 6: Backward pass + optimizer

Backward pass (backpropagation): Compute the gradient of the loss with respect to every parameter.

"How should each weight change to reduce the loss?"

Optimizer step: Update every parameter by a small amount in the direction that reduces the loss.

new_weight = old_weight - learning_rate × gradient

The learning rate controls how big each step is.

For fine-tuning: much smaller than training from scratch (typically 1e-5 to 5e-5).
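
The update rule is easiest to see on a one-parameter toy problem. The sketch below minimizes a simple quadratic with plain gradient descent; the loss function and both learning rates are chosen for illustration and are not fine-tuning settings. Real fine-tuning typically uses an adaptive optimizer rather than this bare rule.

```python
def gradient_descent(weight, learning_rate, steps):
    """Minimize loss(w) = (w - 3)^2 with the plain update rule."""
    for _ in range(steps):
        gradient = 2 * (weight - 3)  # d/dw of (w - 3)^2
        weight = weight - learning_rate * gradient
    return weight

start = 0.0
for lr in (0.001, 0.1):
    final = gradient_descent(start, lr, steps=50)
    print(f"lr={lr}: weight after 50 steps = {final:.4f} (minimum is at 3)")
```

With the smaller rate the weight has barely moved after 50 steps; with the larger one it has essentially reached the minimum. The right step size depends on how far the weights need to travel, which is exactly what differs between training from scratch and fine-tuning.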

The learning rate: why fine-tuning is different

Training from scratch: Large learning rate (1e-3). Weights are random, need big updates.

Fine-tuning: Small learning rate (2e-5). Weights are already good, need gentle nudges.

Training from scratch:  lr = 0.001     (50x larger)
Fine-tuning:            lr = 0.00002   (gentle updates)

Too high → catastrophic forgetting: you overwrite what the model already knows.

Too low → underfitting: the model doesn't adapt enough to your task.

This is the single most important hyperparameter in fine-tuning.

Step 7: Validate

After each epoch (one pass through all training data):

  1. Switch model to eval mode: turns off dropout (batch norm would also be frozen in models that use it)
  2. Run the full validation set through the model — no gradient computation
  3. Compute val loss, val accuracy, val macro-F1
  4. Compare to previous epoch — is the model still improving?

Train mode vs eval mode is a real source of bugs.

If you forget to switch to eval mode → dropout is still active → your validation numbers are noisy and wrong.

Step 8: Checkpoint

A checkpoint saves the model state so you can resume or deploy later.

What gets saved:

  • All model weights (encoder + classification head)
  • Optimizer state (momentum, adaptive learning rates)
  • Training metadata (epoch number, best validation score)

Why this matters:

  • If your GPU crashes at epoch 3, you can resume from the last checkpoint
  • At the end of training, you load the checkpoint with the best validation score — not the last one

The best epoch is rarely the last epoch.
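
Picking the best checkpoint is a matter of tracking the best validation score as training runs. A minimal sketch with made-up macro-F1 values (not course results); the actual saving of weights is left out and only marked by a comment.

```python
# Made-up validation macro-F1 per epoch (not course results).
val_macro_f1 = [0.150, 0.195, 0.209, 0.204, 0.198]

best_epoch, best_score = None, float("-inf")
for epoch, score in enumerate(val_macro_f1, start=1):
    if score > best_score:
        best_epoch, best_score = epoch, score
        # In a real loop, this is where the "best" checkpoint would be written.

last_epoch = len(val_macro_f1)
print(f"best epoch: {best_epoch} (score {best_score}), last epoch: {last_epoch}")
```

Which metric defines "best" is itself a decision. On this dataset, selecting on validation accuracy and selecting on macro-F1 may pick different epochs.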

The complete loop

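import torch
import torch.nn.functional as F

for epoch in range(num_epochs):

    model.train()                                 # Train mode ON (dropout active)
    for batch in train_dataloader:
        labels = batch.pop("labels")              # Separate targets from inputs
        logits = model(**batch).logits            # Forward pass
        loss = F.cross_entropy(logits, labels)    # Compute loss
        loss.backward()                           # Backward pass
        optimizer.step()                          # Update weights
        optimizer.zero_grad()                     # Reset gradients

    model.eval()                                  # Eval mode ON (dropout off)
    with torch.no_grad():                         # No gradients during validation
        validate(model, val_dataloader)           # Measure performance
    save_checkpoint(model, optimizer, epoch)      # Save state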
That's it. This is the entire training loop. Everything else is details.

Fine-tuning anatomy: key takeaways

  1. Fine-tuning adapts pretrained knowledge — you're not training from scratch.
  2. Encoder + classification head — the encoder knows language, the head learns your task.
  3. The training loop: tokenize → batch → forward → loss → backward → optimize → validate → checkpoint.
  4. Learning rate is critical — too high destroys pretrained knowledge (catastrophic forgetting), too low won't adapt.
  5. Train vs eval mode — forgetting to switch is a real bug with real consequences.
  6. Best checkpoint ≠ last checkpoint — save checkpoints, use the best one.

What you're building

For the first two weeks, you build the best encoder you can. One model, one dataset, one evolving artifact. You learn the training loop, improve it, analyze where it fails.

In Week 3, you meet a challenger: a decoder that trains on the same data, on the same free GPU, and beats your encoder on rare-class performance.

The rest of the semester: understand the trade-off and make the recommendation.

The full picture

Model                            Accuracy  Macro-F1  Rare classes rescued  Latency
Majority class                   23.0%     ~0.003    0 / 113               -
TF-IDF + LogReg                  54.2%     0.132     43 / 113              instant
Fine-tuned encoder (149M, 3 ep)  56.6%     0.209     67 / 113              3 ms
Decoder + LoRA (494M, 3 ep)      57.0%     0.240     76 / 113              58 ms
Opus decoder (zero-shot)         44.0%     0.174     -                     2,300 ms

Both the encoder and the decoder train on the same data, on the same free Kaggle T4.

The decoder wins on quality. The encoder wins on speed. Neither dominates.

The trade-off

Metric                   Encoder (149M)  Decoder (494M)
Accuracy                 56.6%           57.0%
Macro F1 (rare classes)  0.209           0.240 (+15%)
Zero-F1 classes          46              37 (9 more rescued)
Latency per example      3 ms            58 ms
64K complaints           ~3 min          ~7 min
Training time (T4)       32 min          54 min
Parameters trained       149M (100%)     2.3M (0.46%)

The decoder trains less than 1% of its parameters and beats the encoder that trains all of them.
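
The headline ratios can be recomputed from the table. The inputs below are the rounded figures shown on the slide, so the last digit of a result may differ slightly from a value computed on unrounded counts; the dictionary keys are illustrative names.

```python
encoder = {"latency_ms": 3, "macro_f1": 0.209, "trained": 149e6, "total": 149e6}
decoder = {"latency_ms": 58, "macro_f1": 0.240, "trained": 2.3e6, "total": 494e6}

slowdown = decoder["latency_ms"] / encoder["latency_ms"]
f1_gain = decoder["macro_f1"] / encoder["macro_f1"] - 1
trained_share = decoder["trained"] / decoder["total"]

print(f"decoder is {slowdown:.0f}x slower per example")
print(f"decoder macro F1 is {f1_gain:.0%} higher")
print(f"decoder trains {trained_share:.1%} of its parameters")
```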

Why doesn't the encoder just win?

The decoder has 3.3x as many parameters and was pretrained on 9x as much data.

Scaling laws predict: more parameters + more data = richer representations.

The encoder has to pick up both the domain and the task from about 58,000 labeled examples. The decoder already understood the domain; it just needed LoRA to learn the 113 label boundaries.

The accuracy gap grows with decoder size:

Decoder  Accuracy  Macro F1
0.5B     57.0%     0.240
1.5B     58.3%     0.252
3B       58.7%     0.250

On accuracy, the bigger the decoder, the further the encoder falls behind. Macro F1 levels off after 1.5B (0.252 vs. 0.250 at 3B).

The semester question

This course is not about picking the winner.

Both models are useful. The question is: which one do you deploy, for what use case, and why?

  • A real-time complaint router that must respond in under 10ms? → Encoder.
  • A weekly batch analysis where quality on rare classes matters? → Decoder.
  • A budget-constrained startup with no GPU? → TF-IDF might be fine.

Your job this semester: build both, understand both, and make the recommendation.

Week by week, you'll fine-tune, adapt with LoRA, analyze errors, compress, and distill — on both architectures. At the end, you write the engineering recommendation.

Key takeaways

  1. The encoder (56.6%) and decoder (57.0%) are close on accuracy — the real gap is on rare classes (macro F1: 0.209 vs 0.240).
  2. The encoder is 19x faster per example (3 ms vs 58 ms) — real but not 1000x.
  3. The decoder trains 0.46% of its parameters and beats the encoder training 100% — scaling laws at work.
  4. Bigger decoders are more accurate: 0.5B → 1.5B → 3B, accuracy keeps climbing, while macro F1 levels off after 1.5B.
  5. Neither model dominates. Your semester: build both, understand both, recommend one.

Your homework this week

Block 2 (right now): Open week1_lab.ipynb and work through it. Data audit + classical baseline. You should finish in class.

After class — homework notebook: week1_homework.ipynb

  • Fine-tune ModernBERT-base on the full dataset (2 epochs, ~20 min training on T4)
  • Analyze: which of the 70 zero-F1 classes did the neural model rescue?
  • Experiment: change one hyperparameter and compare
  • Write your memo (prompts are embedded in the notebook)

Due: Wednesday morning before Week 2 class. Submit as HTML via Moodle.

Expected time: 5-6 hours outside of class.

Welcome to ECBS5200, Practical Deep Learning Engineering for Applied ML. My name is Eduardo, and this semester we're going to do something different from what you've seen before. If you've taken an applied LLMs course, you've learned how to use models as services — prompt them, chain them, build apps around them. That's valuable. But in this course, we treat a model as a trainable, inspectable, compressible artifact. Something you own, something you can open up, something you can make smaller and faster and cheaper. That's a fundamentally different engineering relationship with a model.

So what is this course? It's post-training engineering. You start with a pretrained model — someone else spent millions of dollars training it on a massive corpus — and you make it yours. You fine-tune it on your specific task. You adapt it cheaply using techniques like LoRA. You analyze where it fails and why. You compress it with quantization so it runs faster and cheaper. You distill its knowledge into something smaller. And at the end, you write up a justified recommendation: here's what I built, here's what it costs, here's what it can and can't do. Every one of these steps happens on the same task, the same dataset, the same model family. You're building one cumulative model line, not jumping between toy problems.

Let me be equally clear about what this course is NOT. It's not prompt engineering — you won't be writing system prompts or few-shot examples. It's not an agents course — no tool-use, no chains. It's not about building LLM applications. If you've taken a course like that, great, that's useful background. But we're doing something different. We're inside the model. We're looking at loss curves, gradient updates, weight matrices, attention patterns. If your prior experience was calling an API and parsing the JSON that comes back, this course teaches you what's happening on the other side of that API call.

Here's the semester arc. Six Wednesdays, six weeks. Week 1 — that's today — we fine-tune a real model and audit the dataset. Week 2, we improve it: learning rate schedules, class weights, targeted error analysis. Week 3, we adapt it cheaply with LoRA, which lets you fine-tune a fraction of the parameters. Week 4, systematic error analysis — confusion matrices, per-class deep dives, figuring out where the model actually fails. Week 5, compression — quantization to INT8 and INT4, making the model smaller and faster without destroying accuracy. Week 6, distillation and your final engineering recommendation — using a bigger model to teach a smaller one, and synthesizing the term's measurements into a defended deployment choice. Each week builds on the previous one. Same dataset, same model family. The artifact evolves.

The task for the entire course is consumer complaint classification. These are real complaints filed by real people with the US Consumer Financial Protection Bureau — the CFPB. Someone writes in saying "my credit card company charged me twice" or "the debt collector is contacting me about a debt I already paid," and the complaint gets assigned to one of 113 categories. The dataset has about 58,000 training examples, 6,400 validation, and 21,000 test. The class distribution is wildly imbalanced — we'll look at that in detail shortly. Our base model is ModernBERT-base, 149 million parameters, Apache 2.0 licensed, so you can use it for anything. This is your dataset and your model for the entire semester.

One thing I want to be explicit about: this is individual work. Everyone starts from the same place — same dataset, same base model. But the engineering decisions you make are yours. What learning rate do you pick? What schedule? When you do LoRA, what rank? When you quantize, how aggressively? These choices have consequences, and different students will make different choices and get different results. That's the point. There's no single right answer — there's a space of reasonable decisions, and your job is to navigate it and defend your choices with evidence.

Assessment. Weekly memos are 60 percent of your grade — well over half. These are short write-ups where you analyze your results and explain your trade-offs. I'm not looking for "I got 58 percent accuracy." I'm looking for "I chose this learning rate because of X, the validation loss plateaued at epoch 3, and here's what the per-class breakdown tells me about where the model struggles." Engineering judgment, in writing. Weekly quizzes are 15 percent — these are closed-book quizzes taken at the start of each class from Week 2 through Week 6. They test the previous week's material. Did you actually understand what we covered? If you engaged with the material, you'll do fine. The final exam is 20 percent — conceptual understanding, not code recall. And participation is 5 percent. Show up, engage, ask questions.

Here's the plan for today. In the first block — the lecture — we'll finish this course framing, then look at the ModernBERT architecture so you know what's inside the model you'll be training. Then we audit the dataset, run a classical baseline to establish a floor, walk through the anatomy of a fine-tuning loop, and finish with the motivating benchmark that frames the entire semester. In the second block — the lab — you'll get hands-on with the data audit and classical baseline notebook, which is designed to be completable in class. Then for homework, you'll fine-tune ModernBERT on the full dataset in a separate notebook. Let's get into it.

Before we dive into the data, let's talk about the model we'll be using all semester. We're not using the original BERT from 2018 — that's eight years old at this point. ModernBERT came out in 2024 and benefits from six years of architecture improvements while keeping the same encoder-only design that makes BERT so effective for classification. The Apache 2.0 license means no licensing restrictions — you can use it anywhere, for anything, commercially or otherwise. And SDPA — Scaled Dot-Product Attention — is a practical detail that matters: it means attention is computed more efficiently on the T4 GPUs you'll be using on Kaggle. Faster training, lower memory usage, same results.

So why this model specifically? First, it's modern. BERT was groundbreaking in 2018, but a lot has improved since then — training recipes, attention implementations, data quality. ModernBERT incorporates all of that while keeping the encoder-only architecture that's ideal for classification tasks. Second, it's practical for our constraints: 149 million parameters fits comfortably on free-tier GPUs, and it works with all the techniques we'll cover — LoRA, quantization, distillation. Third, the Apache 2.0 license means zero restrictions. In industry, licensing matters. You don't want to build a production system on a model with usage restrictions. And finally, SDPA — Scaled Dot-Product Attention — is a concrete engineering win. It computes attention more efficiently on the T4 GPUs you'll be using, which means faster training and lower memory usage for the same model quality.

Let me walk you through the architecture from input to output. Your text goes into the tokenizer, which converts it to token IDs — up to 128 of them. Those IDs get mapped to 768-dimensional embedding vectors — each token becomes a vector of 768 numbers. Then those vectors pass through 22 Transformer layers, each with self-attention and a feed-forward network. The key thing about an encoder: it reads ALL tokens simultaneously. Bidirectional attention. Unlike a decoder, which reads left-to-right and can only see what came before, the encoder sees the entire input at once. That's why encoders are so much faster for classification — no autoregressive generation, just one forward pass. After 22 layers, we take the hidden state of the CLS token — the special token at the start — as the representation of the entire sequence. That 768-dimensional vector gets fed into the classification head, which is a tiny linear layer mapping 768 dimensions to 113 logits, one per class. The classification head is less than 1 percent of the model's parameters. Most of the intelligence is in those 22 pretrained encoder layers.
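If you want to check that last claim yourself, here is a back-of-the-envelope sketch. It assumes the head is a single linear layer with a bias term and takes 149 million as the total parameter count; the head the library actually builds may include a few extra pieces, so treat the exact number as approximate.

```python
# Rough size of the classification head relative to the whole model.
# Assumption: the head is one linear layer (weights + bias), 768 -> 113.
hidden_size = 768
num_classes = 113
total_params = 149_000_000  # approximate size quoted for ModernBERT-base

head_params = hidden_size * num_classes + num_classes  # weights + biases
head_share = head_params / total_params

print(f"head parameters: {head_params:,}")
print(f"share of model:  {head_share:.4%}")
```

Under those assumptions the head is well under a tenth of a percent of the model, comfortably inside the "less than 1 percent" figure.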

Before we touch a model, before we write a single line of training code, we're going to look at our data. This sounds obvious, but you'd be amazed how many people skip this step. They download a dataset, point a model at it, and start optimizing a number. Then three weeks later they discover the labels are noisy, or the class distribution is pathological, or half the text is redacted. We're not going to do that. We're going to audit the dataset first.

Why bother? Because training a model on data you haven't examined is malpractice. I don't use that word lightly. Here's what goes wrong when you skip the audit. You might have labels that look different but are actually the same thing — we have that. You might have classes with only two examples — we have that too. You might have redacted text where the model learns the pattern of the redaction markers instead of the actual content — yep, we have that. And you will almost certainly have extreme class imbalance — we definitely have that. Every single one of these problems is present in our dataset. The audit is how we find them and decide what to do about them.

Our dataset comes from the US Consumer Financial Protection Bureau, the CFPB. When a consumer has a problem with a financial product — their credit card, their mortgage, a debt collector — they can file a complaint with the CFPB. That complaint includes free-text describing the problem, and it gets categorized by issue type. We're using the "determined-ai/consumer_complaints_medium" version on Hugging Face. The two fields we care about are the complaint text — what the person wrote — and the issue label — what category it was assigned to. This is not a toy dataset. These are real complaints from real people about real financial problems.

Let's look at the label distribution. After cleanup — which I'll explain in a moment — we have 113 classes. The distribution is extremely long-tailed. The single largest class, "Incorrect information on your report," accounts for 23 percent of all complaints. Nearly one in four. The top 10 classes together cover about 60 percent of the data. And the bottom 63 classes — more than half of all categories — share less than 10 percent of the data between them. Many of those tail classes have fewer than 50 training examples. This is a hard classification problem. Any model that just learns the popular classes and ignores the tail will look decent on accuracy and terrible on macro-F1.

You might wonder why we have 113 classes instead of the 153 in the raw data. Here's the story. The CFPB changed their complaint form at some point, and when they did, they reworded some of the issue categories. So you get pairs like "Incorrect information on credit report" and "Incorrect information on your report" — same complaint type, different wording. If we treated those as separate classes, we'd be asking the model to distinguish between identical categories based on an administrative change. So we merged those near-duplicates. That brought us down to about 146. Then we dropped any class with fewer than 5 total examples — you simply cannot learn a class from 2 or 3 examples, and they add noise to the evaluation. That gives us 113 canonical classes. This kind of label cleanup is completely normal. In any real-world dataset, messy labels are the rule, not the exception. The first thing you do is look at them and clean them up.
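The two cleanup passes can be sketched in a few lines. This is an illustration, not the script that produced our 113 classes: the mapping holds only the one pair mentioned above, the rare label is made up, and `clean_labels` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical mapping from old form wording to the canonical label.
CANONICAL = {
    "Incorrect information on credit report": "Incorrect information on your report",
}
MIN_EXAMPLES = 5  # classes with fewer total examples are dropped


def clean_labels(labels):
    """Merge reworded duplicates, then drop classes that are too small."""
    merged = [CANONICAL.get(label, label) for label in labels]
    counts = Counter(merged)
    kept = {label for label, n in counts.items() if n >= MIN_EXAMPLES}
    # None marks an example whose class was dropped.
    return [label if label in kept else None for label in merged]


raw = (
    ["Incorrect information on credit report"] * 3
    + ["Incorrect information on your report"] * 4
    + ["Some very rare issue"] * 2  # made-up label for illustration
)
cleaned = clean_labels(raw)
print(Counter(cleaned))
```

Note the order: merging happens first, so two wordings that are each below the threshold can still survive once their counts are combined.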

Let's look at what the actual text looks like. Complaints range from one sentence — "there are many mistakes in my report" — to multiple paragraphs. A typical complaint is a few sentences describing the problem and what the consumer wants done about it. But here's something you'll notice immediately: those XXXX markers. The CFPB redacts personally identifiable information — names, dates, account numbers, dollar amounts — and replaces them with XXXX or curly-brace placeholders. This redaction is aggressive and pervasive, and it fundamentally changes what the model can and can't learn from the text.

Sixty-three and a half percent. Nearly two-thirds of all complaints contain at least one XXXX redaction marker. Names, dates, account numbers, dollar amounts — all replaced with XXXX. This means the model is working with incomplete text for most of the dataset. It cannot learn that complaints filed in January are different from complaints filed in June. It cannot learn that complaints about $500 charges are different from complaints about $5,000 charges. What it CAN learn is the structure and vocabulary of the complaint — the words people use to describe different types of problems. For classification purposes, that's actually fine. The issue category depends on what kind of problem it is, not on the specific dates or amounts involved. But you need to know this is happening, because if you ever tried to do something like severity estimation or dollar-amount prediction on this data, you'd be in trouble.
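Measuring this is a one-regex job. The sketch below assumes a redaction is a run of four or more capital X characters and ignores the curly-brace placeholders; the sample complaints are invented, so the printed share says nothing about the real dataset.

```python
import re

# Assumption: a redaction is a run of four or more capital X characters.
REDACTION = re.compile(r"X{4,}")


def redacted_share(texts):
    """Fraction of texts containing at least one redaction marker."""
    if not texts:
        return 0.0
    hits = sum(1 for text in texts if REDACTION.search(text))
    return hits / len(texts)


# Made-up complaints for illustration only.
sample = [
    "On XXXX I was charged twice by XXXX XXXX bank.",
    "there are many mistakes in my report",
    "The collector called about account XXXX.",
    "My payment was not applied to my loan.",
]
print(f"{redacted_share(sample):.1%} of sample complaints are redacted")
```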

How long are these complaints? The median is about 50 words — a few sentences. The distribution is right-skewed, so the mean is higher, pulled up by a minority of very long complaints. The key number for us: 85 to 90 percent of complaints fit entirely within 128 tokens. That's our truncation threshold from the pre-work. So for the vast majority of complaints, the model sees the full text. For the 10 to 15 percent that exceed 128 tokens, we lose the tail end. But here's the thing — the complaint type is almost always clear from the first few sentences. People lead with their problem. "I was charged twice." "The debt collector is calling about a debt I don't owe." The specific details and timeline that follow are less important for classification than that opening framing.

The data split is fixed for the entire course. 57,846 training examples, 6,430 validation, 21,432 test. The training set is what the model learns from. The validation set is for monitoring overfitting and tuning hyperparameters — learning rate, number of epochs, that sort of thing. The test set is for final evaluation only. You run on it once, at the end, to report your results. You never use test set performance to make training decisions. If you do, you've leaked information and your results are meaningless. This is a basic principle, but it's worth stating explicitly because the temptation to peek is real. Everyone in the class uses the same split, so your results are directly comparable.

Let me recap. Always audit your data before training. We have 113 classes after cleaning up 153 raw labels — merging near-duplicates and dropping classes with fewer than 5 examples. The distribution is extremely long-tailed, with the top class at 23 percent. Nearly two-thirds of the text is partially redacted, so the model has to learn from complaint structure and vocabulary rather than specific details. Most complaints fit within our 128-token limit. And the data split is fixed for the entire semester. That's the terrain we're working with. Now let's see what a simple baseline can do on it.

We've looked at the data. Now, before we do anything neural, anything fancy, anything with transformers or GPUs or pretrained weights, we're going to run the simplest reasonable model we can think of. This is called baseline discipline, and it's one of the most important habits in applied machine learning. If you don't know what a simple model gets you, you have no way to judge whether your complex model is actually contributing anything.

Here's the problem. If I tell you my fine-tuned model gets 58 percent accuracy, what do you think? Is that good? Bad? You have no idea. You need a reference point. The baseline provides that reference point. When I tell you the model gets 58 percent accuracy and the TF-IDF baseline gets 54.2 percent, now you know the neural model is only a few points better on accuracy. But when I tell you the neural model gets 0.30 macro-F1 and the baseline gets 0.132, now you see something dramatic — the neural model more than doubled the F1 score. The baseline turns an isolated number into a story about where the model is actually adding value.

Let's start with the absolute simplest baseline: the majority-class predictor. You literally ignore the input and always predict the most common class. In our dataset, that's "Incorrect information on your report," which accounts for 23 percent of all complaints. So this zero-intelligence model gets 23 percent accuracy just by always guessing the same thing. Its macro-F1 is essentially zero — about 0.003 — because it gets zero F1 on 112 out of 113 classes. This is the floor. If any model you build this semester can't beat 23 percent accuracy, it has learned literally nothing beyond "guess the popular class."
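You can derive that macro-F1 by hand. A minimal sketch, using the rounded 23 percent share, so the result is approximate:

```python
# Macro-F1 of a predictor that always outputs the majority class.
num_classes = 113
majority_share = 0.23  # fraction of examples in the largest class

# For the majority class: every example of it is found (recall = 1),
# and the share of predictions that are correct equals its frequency.
precision = majority_share
recall = 1.0
f1_majority = 2 * precision * recall / (precision + recall)

# All other classes are never predicted, so their F1 is 0.
macro_f1 = f1_majority / num_classes

print(f"accuracy:                 {majority_share:.2f}")
print(f"F1 on the majority class: {f1_majority:.3f}")
print(f"macro-F1:                 {macro_f1:.4f}")
```

One decent F1 score divided across 113 classes is where the "about 0.003" comes from.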

Now let's run a real baseline — TF-IDF plus logistic regression. TF-IDF turns each complaint into a sparse vector where each dimension represents a word, and the value reflects how important that word is in this document relative to the corpus. Logistic regression then learns a linear decision boundary in that high-dimensional space. This is fast — about 30 seconds to train on a laptop, no GPU needed, no pretrained anything. It's the classic "can a simple model handle this" test for text classification. And it's important to run because if TF-IDF plus logistic regression solves your problem, you don't need deep learning at all.
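To make the weighting concrete, here is a pure-Python sketch of the textbook TF-IDF formula on three invented complaints. It is not what the lab notebook runs: libraries such as scikit-learn add smoothing and normalization, so their numbers will differ, and the logistic regression step is left out entirely.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document (textbook TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return vectors


# Made-up complaints for illustration only.
docs = [
    "i was charged twice on my card",
    "the debt collector keeps calling my phone",
    "my card was charged a late fee",
]
vectors = tfidf(docs)
for word in ("my", "charged", "twice"):
    print(f"{word:>8}: {vectors[0][word]:.3f}")
```

The word "my" appears in every document, so its weight is zero. "charged" appears in two of three and gets a small weight. "twice" appears only once in the corpus and gets the largest. That is the whole idea: words that separate documents count for more.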

Here are the results. The TF-IDF baseline gets 54.2 percent accuracy on the validation set. That's more than double the majority baseline's 23 percent. Weighted F1 is 0.479 — not great, but not terrible. But look at the macro-F1: 0.132. That's terrible. Remember, macro-F1 averages F1 across all 113 classes equally. If macro-F1 is 0.132 while accuracy is 54.2 percent, that tells you the model is doing well on some classes and catastrophically failing on others. Let's dig into what's happening.

Here's the core lesson of this slide. The model gets 54.2 percent accuracy but only 0.132 macro-F1. How is that possible? Because the model learned about 40 classes — the common ones — and completely ignores the other 70. It never predicts those 70 classes. Not once. For those 70 classes, F1 is zero. Now, those 70 classes are rare — they each have few examples — so ignoring them barely hurts accuracy. If a class represents 0.1 percent of the data, never predicting it only costs you 0.1 percentage points of accuracy. But macro-F1 treats every class equally. Seventy zeros in your average destroys the macro-F1, even if the other 40 classes have decent F1 scores. This is the accuracy vs. macro-F1 gap, and it's the most important diagnostic insight you'll learn today.
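The same gap shows up on a toy problem you can run in a second. This sketch uses invented data with one common class and four rare ones, and a "model" that only ever predicts the common class. The `macro_f1` helper is written out by hand so nothing is hidden.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: one common class (0) and four rare classes (1 to 4).
y_true = [0] * 96 + [1, 2, 3, 4]
# A model that predicts the common class for everything.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")
print(f"macro-F1: {macro_f1(y_true, y_pred):.3f}")
```

Accuracy comes out at 96 percent while macro-F1 sits near 0.2, because four of the five per-class scores are zero.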

This gap is not just a fun fact — it's the central diagnostic pattern you need to internalize. When accuracy is high but macro-F1 is low, your model is biased toward common classes and ignoring rare ones. This is the default behavior. It's not a bug in the algorithm — it's the natural consequence of training on imbalanced data. Common classes dominate the cross-entropy loss, so the model optimizes for them. Rare classes contribute almost nothing to the total loss, so the model learns to ignore them. And accuracy rewards this behavior — you can't tell it's happening unless you look at macro-F1 or the per-class breakdown. This is how you ship a model that works fine for 40 complaint types and fails completely for 70 others. In a real deployment, that means 70 categories of consumers get misrouted, mishandled, or ignored.

Here's our scoreboard so far. The majority baseline: 23 percent accuracy, essentially zero macro-F1. The TF-IDF baseline: 54.2 percent accuracy, 0.132 macro-F1 — it learned 40 classes and ignores 70. Now the question is: can a neural model do better? And specifically, can it do better where the baseline fails? Can it learn some of those 70 classes that the linear model completely ignores? That's what we're going to find out when we fine-tune ModernBERT. The bar has been set. Let's see if deep learning can clear it.

Key takeaways. Always run a baseline — you cannot evaluate a model without a reference point. The majority baseline is 23 percent accuracy. The TF-IDF baseline is 54.2 percent accuracy but only 0.132 macro-F1 — it learned about 40 classes and completely ignores 70. The gap between accuracy and macro-F1 is the most important thing to watch in imbalanced classification. And those 70 classes with F1 of zero — that's exactly where a neural model needs to prove it's worth the additional complexity and compute cost.

Now we're going to walk through the mechanics of fine-tuning. Not the code — you'll see that in the notebook — but the concepts. What actually happens when you take a pretrained model like ModernBERT and adapt it to classify consumer complaints? There are about eight steps in the loop, and each one involves a design decision you need to understand.

Here's the big picture. Fine-tuning means taking a model that has already been trained on a huge corpus of text — it already understands English grammar, word meanings, sentence structure — and teaching it your specific task. In our case, we take ModernBERT-base, which has 149 million parameters that were trained on billions of words of text. We add a small classification head on top — a new, randomly initialized layer — and then we train on our labeled complaint data. We're not training 149 million parameters from scratch. We're nudging them slightly so that the representations they produce are useful for distinguishing between our 113 complaint categories. That's a much easier job than learning English from nothing.

Step one: load the pretrained model. When you call from_pretrained with num_labels=113, the library does two things. It loads the full pretrained encoder — all 149 million parameters that were trained on a massive text corpus. Those weights already encode useful representations of English. Then it adds a classification head on top — a linear layer that maps from the encoder's hidden dimension to our 113 classes. That classification head is randomly initialized. It knows nothing about our task yet. So at the start of training, the encoder is sophisticated and the head is random. Training will adjust both, but the head has the most to learn.

Step two: tokenize the data. You load the tokenizer that matches your model. They must use the same vocabulary, because the model's embedding layer expects specific token IDs. We set max_length to 128, enable truncation for the 10 to 15 percent of complaints that run longer, and pad shorter ones to the full length. You covered this in pre-work module one. The only difference now is scale: we're tokenizing 57,846 training examples, not one complaint at a time. This step is done once before training starts, not inside the training loop.
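What truncation and padding do to a sequence of token IDs can be sketched without any library. The pad ID of 0 and the helper name `pad_or_truncate` are assumptions for illustration; the real tokenizer supplies its own pad ID and does this for you.

```python
MAX_LENGTH = 128
PAD_ID = 0  # assumption: the real pad ID comes from the tokenizer


def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]      # truncation drops the tail
    mask = [1] * len(ids)             # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


short_ids, short_mask = pad_or_truncate(list(range(1, 51)))   # 50 tokens
long_ids, long_mask = pad_or_truncate(list(range(1, 201)))    # 200 tokens
print(len(short_ids), sum(short_mask))
print(len(long_ids), sum(long_mask))
```

Both outputs have the same length, which is what lets them be stacked into a batch. The attention mask tells the model which positions are padding and should be ignored.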

Step three: the DataLoader. We don't feed examples to the model one at a time — that would be extremely slow. Instead, we batch them. The DataLoader takes our tokenized dataset and serves up batches of examples. We use a batch size of 32, which fits comfortably in the 16 gigabytes of memory on a free Kaggle T4 GPU when our max sequence length is 128 tokens. Larger batches would be faster but might not fit in memory. Smaller batches would add more noise to the gradient estimates. 32 is a reasonable choice for our setup. Each iteration of the training loop processes 32 complaints simultaneously: 32 tokenized inputs go in, 32 predictions come out, and we do one gradient update.
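It helps to know how many gradient updates that works out to. A quick sketch, assuming the final partial batch is kept rather than dropped and a 2-epoch run like the homework:

```python
import math

train_examples = 57_846
batch_size = 32
epochs = 2  # assumption: a 2-epoch run

steps_per_epoch = math.ceil(train_examples / batch_size)
last_batch_size = train_examples - (steps_per_epoch - 1) * batch_size
total_steps = steps_per_epoch * epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"last batch size: {last_batch_size}")
print(f"total updates:   {total_steps}")
```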

Step four: the forward pass. Each batch of tokenized inputs goes through the encoder — that's 22 transformer layers in ModernBERT-base. The encoder produces a hidden state vector for every token position. We take the hidden state at the CLS token position — that's the special token at the start of every sequence — and use it as the representation of the entire complaint. That CLS embedding gets fed into the classification head, which is a linear layer that produces 113 numbers, one for each class. Those numbers are called logits — raw scores, not probabilities. To get probabilities you'd apply softmax, but for training we don't need to because the loss function handles that.
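For reference, this is what softmax does to a vector of logits. The three scores are made up; our model would produce 113 of them per complaint.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]  # made-up scores for a 3-class toy example
probs = softmax(logits)
predicted = probs.index(max(probs))

print([round(p, 3) for p in probs])
print(f"predicted class: {predicted}")
```

Softmax preserves the ranking of the logits, so if all you need is the predicted class you can take the largest logit directly.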

Step five: the loss function. Cross-entropy loss is minus log of the probability the model assigns to the true class. Look at the curve. When the model is confident and correct — assigning 85% probability to the right class — the loss is only 0.16. When the model is uncertain, assigning only 10%, the loss jumps to 2.3. And when the model is confident but wrong — assigning just 2% to the true class — the loss is 3.9. That steep curve on the left is important: it means cross-entropy disproportionately punishes confident wrong answers. The model can't just be right on average — it really pays for being confidently wrong. This is computed for every example in every batch, then averaged to give a single number. Training means making that number go down.
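The three points on the curve are easy to reproduce, using the natural logarithm:

```python
import math


def cross_entropy(p_true_class):
    """Loss for one example, given the probability of the true class."""
    return -math.log(p_true_class)


for p in (0.85, 0.10, 0.02):
    print(f"p = {p:.2f} -> loss = {cross_entropy(p):.2f}")
```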

Here's the insight that connects the loss function to the 70 ignored classes we saw in the baseline. Cross-entropy loss is averaged over all examples in a batch. If a class has 14,000 training examples, it contributes roughly 25 percent of the total loss. If a class has 8 examples, it contributes about one hundredth of a percent. The optimizer follows the gradient, and the gradient is dominated by the common classes. Getting a rare class right barely reduces the total loss. Getting it wrong barely increases it. So what does the model learn to do? Ignore the rare classes. Focus on the common ones where loss reduction is easy. This isn't a bug in the algorithm — it's the natural consequence of the loss function interacting with imbalanced data. The TF-IDF model did exactly this — ignored 70 classes. Our neural model will do the same unless we intervene. That intervention is coming in Week 2, when we talk about class weighting and other strategies. For now, just understand the mechanism.
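Those shares are just class frequency. The sketch below makes the simplifying assumption that every example carries about the same loss, so a class's share of the averaged loss equals its share of the training set.

```python
train_examples = 57_846

# Assumption: equal per-example loss, so loss share = data share.
shares = {count: count / train_examples for count in (14_000, 8)}

for count, share in shares.items():
    print(f"{count:>6} examples -> {share:.3%} of the averaged loss")
```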

Step six: the backward pass and optimizer step. Backpropagation computes the gradient of the loss with respect to every single parameter in the model — all 149 million of them plus the classification head. The gradient tells you: for each weight, which direction should it move to reduce the loss, and by how much? Then the optimizer takes those gradients and updates the weights. The simplest version is: new weight equals old weight minus learning rate times gradient. The learning rate is crucial. For fine-tuning, you use a much smaller learning rate than you'd use for training from scratch — typically 1e-5 to 5e-5. Why? Because the pretrained weights are already good. You want to nudge them gently, not overwrite them. If your learning rate is too high, you destroy the pretrained representations. That's called catastrophic forgetting.

Let me emphasize this point because it's the single most important hyperparameter you'll deal with. When training a model from scratch, you use a learning rate around 1e-3 — the weights are random, they need to move a lot. When fine-tuning, you use something like 2e-5 — that's 50 times smaller. The pretrained weights already encode useful knowledge about language. If you update them too aggressively, you overwrite that knowledge. The model forgets how English works while trying to learn your 113 categories. That's catastrophic forgetting. On the other hand, if your learning rate is too small, the model doesn't adapt enough to your specific task. Finding the right learning rate is the most impactful tuning decision you'll make, and we'll explore this more in Week 2.
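Here is the plain update rule with both learning rates applied to the same made-up weight and gradient. This is the simplest form of gradient descent; real fine-tuning runs usually use an adaptive optimizer, so take it as an illustration of scale rather than the exact update.

```python
def sgd_step(weight, grad, lr):
    """Plain gradient descent: move against the gradient, scaled by lr."""
    return weight - lr * grad


pretrained_weight = 0.5  # made-up value
gradient = 4.0           # made-up gradient

updates = {lr: lr * gradient for lr in (1e-3, 2e-5)}
for lr, change in updates.items():
    new_weight = sgd_step(pretrained_weight, gradient, lr)
    print(f"lr = {lr:g}: weight moves by {change:g} to {new_weight:.5f}")

ratio = updates[1e-3] / updates[2e-5]
print(f"from-scratch step is {ratio:.0f}x larger")
```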

Step seven: validation. After each pass through the training data (that's one epoch), you evaluate on the validation set. But first, you have to switch the model to eval mode. This turns off dropout and, in models that have batch normalization, freezes its running statistics. ModernBERT uses layer normalization rather than batch normalization, so dropout is the part that matters for us. If you forget this step, dropout is still randomly zeroing out neurons during validation, and your validation metrics will be noisy and unreliable. This is a real, common bug. You also disable gradient computation during validation, because you're not updating weights, just measuring performance. You compute the validation loss, accuracy, and macro-F1, and you compare to the previous epoch. Is the model still improving? If validation loss starts going up while training loss keeps going down, you're overfitting.
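To see why the mode switch matters, here is a toy version of dropout in plain Python. It is a sketch of the standard "inverted dropout" idea, not the library's implementation, and the `training` flag stands in for the model's train and eval modes.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout: active only in training mode."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    # Zero each value with probability p; rescale survivors by 1 / keep.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
hidden = [1.0] * 8

train_out = dropout(hidden, p=0.5, training=True, rng=rng)
eval_a = dropout(hidden, p=0.5, training=False, rng=rng)
eval_b = dropout(hidden, p=0.5, training=False, rng=rng)

print("train:", train_out)
print("eval: ", eval_a)
```

In training mode each call zeroes a different random subset, so the same input gives different outputs. In eval mode the input passes through unchanged, which is what makes validation metrics repeatable.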

Step eight: checkpointing. A checkpoint saves everything you need to resume training or deploy the model — the model weights, the optimizer state, and metadata like which epoch you're on and what the best validation score was. This matters for two reasons. First, practical: GPU time on Kaggle is limited and sometimes things crash. If you saved a checkpoint, you can resume. Second, methodological: you want to use the model weights from the epoch with the best validation performance, not the weights from the last epoch. Typically, validation performance improves for a few epochs and then starts to degrade as the model overfits. The best epoch is rarely the last one. Your checkpoint strategy determines whether you can recover the best model.

Here's the complete loop in pseudocode. For each epoch: switch to train mode, iterate over batches, do the forward pass, compute the loss, do the backward pass, update weights, zero the gradients. Then switch to eval mode, run validation, save a checkpoint. That's the entire fine-tuning loop. Every fine-tuning job you'll ever run — whether it's BERT, GPT, or anything else — follows this exact structure. The details vary — different optimizers, different schedulers, different regularization — but the loop is the same. You'll see the actual code in the notebook, and it will map directly to these eight steps.
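The same structure can be run end to end on a toy problem. This sketch swaps ModernBERT for a two-parameter logistic regression in plain Python, with invented data and a hypothetical `fine_tune` function, so only the shape of the loop carries over to the notebook. Gradients are rebuilt from zero for every batch, which plays the role of zeroing them.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(params, data):
    """Average binary cross-entropy; no parameters are updated here."""
    w, b = params
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def fine_tune(train, val, params, lr=0.1, epochs=5, batch_size=4):
    best = {"epoch": None, "val_loss": float("inf"), "params": params}
    history = []
    for epoch in range(epochs):
        # Training phase (a real model would be put in train mode here).
        for start in range(0, len(train), batch_size):
            batch = train[start:start + batch_size]
            w, b = params
            grad_w = grad_b = 0.0            # gradients start at zero
            for x, y in batch:
                p = sigmoid(w * x + b)       # forward pass
                grad_w += (p - y) * x        # backward pass for the
                grad_b += (p - y)            # cross-entropy loss
            # Optimizer step on the batch-averaged gradient.
            params = (w - lr * grad_w / len(batch),
                      b - lr * grad_b / len(batch))
        # Validation phase (eval mode, no gradient computation).
        val_loss = mean_loss(params, val)
        history.append(val_loss)
        # Checkpoint: keep the best epoch, not just the last one.
        if val_loss < best["val_loss"]:
            best = {"epoch": epoch, "val_loss": val_loss, "params": params}
    return best, history


rng = random.Random(0)
xs = [rng.uniform(-2, 2) for _ in range(40)]
points = [(x, 1 if x > 0 else 0) for x in xs]  # made-up, separable data
train, val = points[:32], points[32:]

best, history = fine_tune(train, val, params=(0.0, 0.0))
print(f"best epoch: {best['epoch']}, val loss: {best['val_loss']:.3f}")
```

The `best` dictionary is the checkpoint: it records which epoch had the lowest validation loss along with the parameters from that epoch.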

Let me summarize. Fine-tuning adapts a pretrained model to your task: you're leveraging billions of words of pretraining, not starting from nothing. The architecture is a pretrained encoder plus a new classification head. The training loop has eight steps: load the pretrained model, tokenize, batch, forward pass, loss, backward pass and optimizer step, validation, checkpoint. The learning rate is the most important hyperparameter, and fine-tuning needs much smaller rates than training from scratch. Train mode vs eval mode is a real source of bugs. And always save checkpoints so you can use the best epoch, not just the last one.

This is the bridge slide. Everything before this was encoder-focused: how to train, how to improve, how to evaluate. Students might be thinking "this is my model for the semester." It is — but it's about to get competition. I want to be upfront about that. The encoder is their primary artifact for Weeks 1 and 2. In Week 3 they'll train a decoder with LoRA and discover it's better on quality but slower. From that point on, the course is about understanding that trade-off: which model do you deploy, for what scenario, and why? They're not maintaining two parallel model lines. They're building one, meeting a challenger, and learning to reason about the engineering decision.

Look at this table. The fine-tuned encoder — 149 million parameters, all of them trained, running on a free Kaggle T4 — gets 56.6 percent accuracy and 0.209 macro-F1. The decoder — 494 million parameters, with LoRA adapting less than half a percent of them, on the same free T4 — gets 57 percent accuracy and 0.240 macro-F1. The decoder is better on quality. It rescues 9 more rare classes than the encoder. But look at the latency: 3 milliseconds versus 58 milliseconds per example. The encoder is 19 times faster. Both models train on the same data, on the same hardware, for roughly the same time. Neither dominates. The decoder is better. The encoder is faster. That's the trade-off. You'll notice the Opus row doesn't have a "rare classes rescued" number — that's because the zero-shot evaluation was on a 500-example sample, not the full validation set, so the per-class numbers aren't directly comparable. What we know: Opus got about 29 classes right out of the 71 that appeared in the sample. The fine-tuned models were evaluated on all 6,430 validation examples, which is why those numbers are reliable.

Let me make the trade-off concrete. On quality: the decoder wins. Higher accuracy, 15 percent higher macro-F1, and 9 more rare classes rescued from zero. On speed: the encoder wins. 3 milliseconds versus 58 milliseconds per example when you process one complaint at a time. For batch work the gap narrows: for 64,000 complaints, a realistic monthly volume, the encoder finishes in about 3 minutes and the decoder in about 7. Both are fast enough for production. In that batch setting the speed gap is real, but it's roughly 2.5x, not 1000x. And here's the part that should really make you think: the decoder trained less than 1 percent of its parameters with LoRA. The encoder trained all 149 million of its parameters. The decoder got more out of less adaptation because it started with richer representations from pre-training on 9 times more data.
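It is worth checking how the per-example latencies relate to the batch timings. The sketch below multiplies latency by volume, which is what strictly one-at-a-time processing would cost. The decoder comes out at about an hour rather than 7 minutes, so the batch timings presumably reflect batched inference; that is an inference from the arithmetic, not something the benchmark table states.

```python
volume = 64_000  # complaints
latency_ms = {"encoder": 3, "decoder": 58}  # per-example latency

# Time if every complaint were processed one at a time.
sequential_minutes = {
    name: volume * ms / 1000 / 60 for name, ms in latency_ms.items()
}
for name, minutes in sequential_minutes.items():
    print(f"{name}: {minutes:.1f} minutes one example at a time")

latency_ratio = latency_ms["decoder"] / latency_ms["encoder"]
macro_f1_gain = 0.240 / 0.209 - 1  # decoder vs encoder macro-F1

print(f"per-example latency ratio: {latency_ratio:.1f}x")
print(f"relative macro-F1 gain:    {macro_f1_gain:.1%}")
```

So there are two different speed comparisons: roughly 19x on single-example latency, which matters for real-time routing, and a much smaller gap on batch throughput.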

Why does a model training less than 1 percent of its parameters beat a model training all of them? Scaling laws. The decoder has 3.3 times more parameters, trained on 9 times more data during pre-training. It arrives with a richer understanding of language, financial concepts, and consumer complaints. It probably read CFPB complaints during pre-training, since the database is public. The encoder is pretrained too, but with a smaller model and less data it has to pick up far more of the domain from our 58,000 labeled examples. The decoder already knew the domain; LoRA just teaches it the 113 label boundaries. And this effect gets stronger with model size. A 1.5 billion parameter decoder gets 58.3 percent accuracy. A 3 billion parameter decoder gets 58.7 percent. The bigger the decoder, the more the encoder falls behind. This is not a trend that reverses.

This course is not about picking the winner. If it were, I'd tell you the answer on slide one and we'd go home. Both models are useful. The encoder is fast, simple, and well-understood. The decoder is better on quality, especially on rare classes. The question — the real engineering question — is which one you deploy for a specific use case. A real-time complaint router that needs to respond in under 10 milliseconds? The encoder. A weekly batch analysis where getting rare classes right directly affects which customers get helped? The decoder. A startup with no GPU budget? TF-IDF might be fine. Your job this semester is to build both models, understand their failure modes, compress them, try to transfer knowledge between them, and write a final recommendation that accounts for quality, latency, cost, and the specific needs of the deployment scenario. That's what the memos are for. There is no single right answer.

To recap. The encoder and decoder are close on accuracy. The real gap is on rare classes — the decoder's macro F1 is 15 percent higher, and it rescues 9 more classes from zero. The encoder is 19 times faster per example, which matters for real-time applications but not for batch processing. The decoder achieves this by training less than half a percent of its parameters — the rest came free from pre-training on trillions of tokens. And this effect scales: bigger decoders are better. Neither model dominates. This semester you'll build both, analyze both, compress both, and at the end you'll make a principled engineering recommendation about which to deploy and when. That recommendation — not a leaderboard number — is what this course is about.

Here's what you're doing. Right now, in Block 2, open the lab notebook — week1_lab.ipynb. It walks you through the data audit and the classical baseline. You should finish it in class. After class, open the homework notebook — week1_homework.ipynb. This is where you fine-tune ModernBERT-base on the full dataset. Training takes about 20 minutes for 2 epochs on a Kaggle T4 — use that time to monitor the loss curve. After training, the notebook has analysis exercises: which of the 70 classes that TF-IDF ignored did the neural model rescue? Which are still at zero? You'll also run one experiment where you change a hyperparameter and see what happens. Finally, the notebook has your memo prompts built in — one section for each part of the rubric. Write your observations right there in the notebook, export to HTML, and submit via Moodle by Wednesday morning before next class. Plan for about 5 to 6 hours of work outside of class. See you next week.