Practical Deep Learning Engineering for Applied ML

Week	Topic
1	Fine-tune a real model, audit the data
2	Improve: learning rate schedules, class weights, error analysis
3	Adapt cheaply with LoRA / PEFT
4	Analyze: confusion matrices, per-class deep dives
5	Compress: quantization (INT8, INT4)
6	Distill + final engineering recommendation

Component	Weight	What it measures
Weekly memos	60%	Can you analyze results and write clearly about trade-offs?
Weekly quizzes (Weeks 2-6)	15%	Do you understand last week's concepts?
Final exam	20%	Do you understand the concepts, not just the code?
Participation	5%	Are you present and engaged?

Field	What it contains
complaint_text	The consumer's written complaint (free text)
issue	The category label assigned to the complaint

Statistic	Value
Median length	~50 words
Share within 128 tokens	85–90% of complaints
Mean length	Higher than the median (right-skewed)

Split	Examples	Purpose
Train	57,846	Model learns from these
Validation	6,430	Tune hyperparameters, monitor overfitting
Test	21,432	Final evaluation only — never train on this

Metric	Value
Accuracy	23.0%
Macro-F1	~0.003

Metric	Value
Val accuracy	54.2%
Macro-F1	0.132
Weighted F1	0.479

Model	Accuracy	Macro-F1	Weighted F1	Cost
Majority class	23.0%	~0.003	—	Free
TF-IDF + LogReg	54.2%	0.132	0.479	Free, 30 sec
Fine-tuned encoder	?	?	?	Free (Kaggle T4)

Batch size	GPU memory	Training speed	Gradient noise
8	Low	Slow	High
16	Moderate	Moderate	Moderate
32	High	Fast	Low
64	Very high	Very fast	Very low

Scenario	p(true class)	Loss
Confident & correct	0.85	0.16
Uncertain	0.10	2.30
Confident & wrong	0.02	3.91

Class	Training examples	Share of total loss
"Incorrect info on your report"	14,815	~25% of loss
A rare class with 8 examples	8	~0.01% of loss

Model	Accuracy	Macro-F1	Rare classes rescued	Latency
Majority class	23.0%	~0.003	0 / 113	—
TF-IDF + LogReg	54.2%	0.132	43 / 113	instant
Fine-tuned encoder (149M, 3 ep)	56.6%	0.209	67 / 113	3 ms
Decoder + LoRA (494M, 3 ep)	57.0%	0.240	76 / 113	58 ms
Opus decoder (zero-shot)	44.0%	0.174	—	2,300 ms

	Encoder (149M)	Decoder (494M)
Accuracy	56.6%	57.0%
Macro-F1 (rare classes)	0.209	0.240 (+15%)
Zero-F1 classes	46	37 (9 more rescued)
Latency per example	3 ms	58 ms
64K complaints	~3 min	~7 min
Training time (T4)	32 min	54 min
Parameters trained	149M (100%)	2.3M (0.46%)

Decoder	Accuracy	Macro-F1
0.5B	57.0%	0.240
1.5B	58.3%	0.252
3B	58.7%	0.250

Welcome to ECBS5200, Practical Deep Learning Engineering for Applied ML. My name is Eduardo, and this semester we're going to do something different from what you've seen before. If you've taken an applied LLMs course, you've learned how to use models as services — prompt them, chain them, build apps around them. That's valuable. But in this course, we treat a model as a trainable, inspectable, compressible artifact. Something you own, something you can open up, something you can make smaller and faster and cheaper. That's a fundamentally different engineering relationship with a model.

So what is this course? It's post-training engineering. You start with a pretrained model — someone else spent millions of dollars training it on a massive corpus — and you make it yours. You fine-tune it on your specific task. You adapt it cheaply using techniques like LoRA. You analyze where it fails and why. You compress it with quantization so it runs faster and cheaper. You distill its knowledge into something smaller. And at the end, you write up a justified recommendation: here's what I built, here's what it costs, here's what it can and can't do. Every one of these steps happens on the same task, the same dataset, the same model family. You're building one cumulative model line, not jumping between toy problems.

Let me be equally clear about what this course is NOT. It's not prompt engineering — you won't be writing system prompts or few-shot examples. It's not an agents course — no tool-use, no chains. It's not about building LLM applications. If you've taken a course like that, great, that's useful background. But we're doing something different. We're inside the model. We're looking at loss curves, gradient updates, weight matrices, attention patterns. If your prior experience was calling an API and parsing the JSON that comes back, this course teaches you what's happening on the other side of that API call.

Here's the semester arc. Six Wednesdays, six weeks. Week 1 — that's today — we fine-tune a real model and audit the dataset. Week 2, we improve it: learning rate schedules, class weights, targeted error analysis. Week 3, we adapt it cheaply with LoRA, which lets you fine-tune a fraction of the parameters. Week 4, systematic error analysis — confusion matrices, per-class deep dives, figuring out where the model actually fails. Week 5, compression — quantization to INT8 and INT4, making the model smaller and faster without destroying accuracy. Week 6, distillation and your final engineering recommendation — using a bigger model to teach a smaller one, and synthesizing the term's measurements into a defended deployment choice. Each week builds on the previous one. Same dataset, same model family. The artifact evolves.

The task for the entire course is consumer complaint classification. These are real complaints filed by real people with the US Consumer Financial Protection Bureau — the CFPB. Someone writes in saying "my credit card company charged me twice" or "the debt collector is contacting me about a debt I already paid," and the complaint gets assigned to one of 113 categories. The dataset has about 58,000 training examples, 6,400 validation, and 21,000 test. The class distribution is wildly imbalanced — we'll look at that in detail shortly. Our base model is ModernBERT-base, 149 million parameters, Apache 2.0 licensed, so you can use it for anything. This is your dataset and your model for the entire semester.

One thing I want to be explicit about: this is individual work. Everyone starts from the same place — same dataset, same base model. But the engineering decisions you make are yours. What learning rate do you pick? What schedule? When you do LoRA, what rank? When you quantize, how aggressively? These choices have consequences, and different students will make different choices and get different results. That's the point. There's no single right answer — there's a space of reasonable decisions, and your job is to navigate it and defend your choices with evidence.

Assessment. Weekly memos are 60 percent of your grade — well over half. These are short write-ups where you analyze your results and explain your trade-offs. I'm not looking for "I got 58 percent accuracy." I'm looking for "I chose this learning rate because of X, the validation loss plateaued at epoch 3, and here's what the per-class breakdown tells me about where the model struggles." Engineering judgment, in writing. Weekly quizzes are 15 percent — these are closed-book quizzes taken at the start of each class from Week 2 through Week 6. They test the previous week's material. Did you actually understand what we covered? If you engaged with the material, you'll do fine. The final exam is 20 percent — conceptual understanding, not code recall. And participation is 5 percent. Show up, engage, ask questions.

Here's the plan for today. In the first block — the lecture — we'll finish this course framing, then look at the ModernBERT architecture so you know what's inside the model you'll be training. Then we audit the dataset, run a classical baseline to establish a floor, walk through the anatomy of a fine-tuning loop, and finish with the motivating benchmark that frames the entire semester. In the second block — the lab — you'll get hands-on with the data audit and classical baseline notebook, which is designed to be completable in class. Then for homework, you'll fine-tune ModernBERT on the full dataset in a separate notebook. Let's get into it.

Before we dive into the data, let's talk about the model we'll be using all semester. We're not using the original BERT from 2018 — that's eight years old at this point. ModernBERT came out in 2024 and benefits from six years of architecture improvements while keeping the same encoder-only design that makes BERT so effective for classification. The Apache 2.0 license means no licensing restrictions — you can use it anywhere, for anything, commercially or otherwise. And SDPA — Scaled Dot-Product Attention — is a practical detail that matters: it means attention is computed more efficiently on the T4 GPUs you'll be using on Kaggle. Faster training, lower memory usage, same results.

So why this model specifically? First, it's modern. BERT was groundbreaking in 2018, but a lot has improved since then — training recipes, attention implementations, data quality. ModernBERT incorporates all of that while keeping the encoder-only architecture that's ideal for classification tasks. Second, it's practical for our constraints: 149 million parameters fits comfortably on free-tier GPUs, and it works with all the techniques we'll cover — LoRA, quantization, distillation. Third, the Apache 2.0 license means zero restrictions. In industry, licensing matters. You don't want to build a production system on a model with usage restrictions. And finally, SDPA — Scaled Dot-Product Attention — is a concrete engineering win. It computes attention more efficiently on the T4 GPUs you'll be using, which means faster training and lower memory usage for the same model quality.

Let me walk you through the architecture from input to output. Your text goes into the tokenizer, which converts it to token IDs — up to 128 of them. Those IDs get mapped to 768-dimensional embedding vectors — each token becomes a vector of 768 numbers. Then those vectors pass through 22 Transformer layers, each with self-attention and a feed-forward network. The key thing about an encoder: it reads ALL tokens simultaneously. Bidirectional attention. Unlike a decoder, which reads left-to-right and can only see what came before, the encoder sees the entire input at once. That's why encoders are so much faster for classification — no autoregressive generation, just one forward pass. After 22 layers, we take the hidden state of the CLS token — the special token at the start — as the representation of the entire sequence. That 768-dimensional vector gets fed into the classification head, which is a tiny linear layer mapping 768 dimensions to 113 logits, one per class. The classification head is less than 1 percent of the model's parameters. Most of the intelligence is in those 22 pretrained encoder layers.
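If you want to check the "less than 1 percent" claim yourself, here's a minimal sketch of the arithmetic. It assumes the head is a single linear layer with a bias term, which is how I described it. The actual Hugging Face implementation may add a few small layers on top, so treat the number as approximate.

```python
# Sizes quoted in the lecture: 768-dim hidden state, 113 classes, ~149M parameters.
HIDDEN_DIM = 768
NUM_CLASSES = 113
TOTAL_PARAMS = 149_000_000  # rounded figure for ModernBERT-base

# A linear layer has one weight per (input, output) pair plus one bias per output.
head_params = HIDDEN_DIM * NUM_CLASSES + NUM_CLASSES
head_share = head_params / TOTAL_PARAMS

print(f"head parameters: {head_params:,}")
print(f"share of model:  {head_share:.4%}")
```

Under these assumptions the head is a rounding error next to the encoder, which is the point: almost everything you fine-tune was already pretrained.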

Before we touch a model, before we write a single line of training code, we're going to look at our data. This sounds obvious, but you'd be amazed how many people skip this step. They download a dataset, point a model at it, and start optimizing a number. Then three weeks later they discover the labels are noisy, or the class distribution is pathological, or half the text is redacted. We're not going to do that. We're going to audit the dataset first.

Why bother? Because training a model on data you haven't examined is malpractice. I don't use that word lightly. Here's what goes wrong when you skip the audit. You might have labels that look different but are actually the same thing — we have that. You might have classes with only two examples — we have that too. You might have redacted text where the model learns the pattern of the redaction markers instead of the actual content — yep, we have that. And you will almost certainly have extreme class imbalance — we definitely have that. Every single one of these problems is present in our dataset. The audit is how we find them and decide what to do about them.

Our dataset comes from the US Consumer Financial Protection Bureau, the CFPB. When a consumer has a problem with a financial product — their credit card, their mortgage, a debt collector — they can file a complaint with the CFPB. That complaint includes free-text describing the problem, and it gets categorized by issue type. We're using the "determined-ai/consumer_complaints_medium" version on Hugging Face. The two fields we care about are the complaint text — what the person wrote — and the issue label — what category it was assigned to. This is not a toy dataset. These are real complaints from real people about real financial problems.

Let's look at the label distribution. After cleanup — which I'll explain in a moment — we have 113 classes. The distribution is extremely long-tailed. The single largest class, "Incorrect information on your report," accounts for 23 percent of all complaints. Nearly one in four. The top 10 classes together cover about 60 percent of the data. And the bottom 63 classes — more than half of all categories — share less than 10 percent of the data between them. Many of those tail classes have fewer than 50 training examples. This is a hard classification problem. Any model that just learns the popular classes and ignores the tail will look decent on accuracy and terrible on macro-F1.

You might wonder why we have 113 classes instead of the 153 in the raw data. Here's the story. The CFPB changed their complaint form at some point, and when they did, they reworded some of the issue categories. So you get pairs like "Incorrect information on credit report" and "Incorrect information on your report" — same complaint type, different wording. If we treated those as separate classes, we'd be asking the model to distinguish between identical categories based on an administrative change. So we merged those near-duplicates. That brought us down to about 146. Then we dropped any class with fewer than 5 total examples — you simply cannot learn a class from 2 or 3 examples, and they add noise to the evaluation. That gives us 113 canonical classes. This kind of label cleanup is completely normal. In any real-world dataset, messy labels are the rule, not the exception. The first thing you do is look at them and clean them up.
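The two cleanup steps I just described can be sketched in a few lines. This is an illustration under assumptions, not the script that produced our 113 classes: the alias table holds only the one pair mentioned above, the toy labels are invented, and the name `clean_labels` is hypothetical.

```python
from collections import Counter

# Hypothetical alias table: maps a reworded label to its canonical form.
CANONICAL = {
    "Incorrect information on credit report": "Incorrect information on your report",
}
MIN_EXAMPLES = 5  # threshold quoted in the lecture

def clean_labels(labels, aliases=CANONICAL, min_examples=MIN_EXAMPLES):
    """Merge aliased labels, then mark examples from too-small classes as None."""
    merged = [aliases.get(label, label) for label in labels]
    counts = Counter(merged)
    return [label if counts[label] >= min_examples else None for label in merged]

# Toy data, invented for illustration.
raw = (
    ["Incorrect information on credit report"] * 3
    + ["Incorrect information on your report"] * 4
    + ["Some very rare issue"] * 2
)
cleaned = clean_labels(raw)
kept = [label for label in cleaned if label is not None]
print(Counter(kept))
```

Note the order: merging comes first, so two wordings that are each under the threshold can still survive as one combined class.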

Let's look at what the actual text looks like. Complaints range from one sentence — "there are many mistakes in my report" — to multiple paragraphs. A typical complaint is a few sentences describing the problem and what the consumer wants done about it. But here's something you'll notice immediately: those XXXX markers. The CFPB redacts personally identifiable information — names, dates, account numbers, dollar amounts — and replaces them with XXXX or curly-brace placeholders. This redaction is aggressive and pervasive, and it fundamentally changes what the model can and can't learn from the text.

Sixty-three and a half percent. Nearly two-thirds of all complaints contain at least one XXXX redaction marker. Names, dates, account numbers, dollar amounts — all replaced with XXXX. This means the model is working with incomplete text for most of the dataset. It cannot learn that complaints filed in January are different from complaints filed in June. It cannot learn that complaints about $500 charges are different from complaints about $5,000 charges. What it CAN learn is the structure and vocabulary of the complaint — the words people use to describe different types of problems. For classification purposes, that's actually fine. The issue category depends on what kind of problem it is, not on the specific dates or amounts involved. But you need to know this is happening, because if you ever tried to do something like severity estimation or dollar-amount prediction on this data, you'd be in trouble.
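Measuring that share is a one-liner once you decide what counts as a marker. Here's a minimal sketch that assumes a marker is the literal string XXXX. The toy complaints are invented, and `redacted_share` is a hypothetical helper name.

```python
import re

# Assumption: a redaction marker is the literal string "XXXX".
REDACTION = re.compile(r"XXXX")

def redacted_share(texts):
    """Fraction of texts containing at least one redaction marker."""
    if not texts:
        return 0.0
    hits = sum(1 for text in texts if REDACTION.search(text))
    return hits / len(texts)

# Toy complaints, invented for illustration.
complaints = [
    "On XXXX I was charged twice by XXXX bank.",
    "There are many mistakes in my report.",
    "The collector called me on XXXX about a paid debt.",
    "I was charged twice.",
]
share = redacted_share(complaints)
print(f"{share:.1%} of complaints contain a redaction marker")
```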

How long are these complaints? The median is about 50 words — a few sentences. The distribution is right-skewed, so the mean is higher, pulled up by a minority of very long complaints. The key number for us: 85 to 90 percent of complaints fit entirely within 128 tokens. That's our truncation threshold from the pre-work. So for the vast majority of complaints, the model sees the full text. For the 10 to 15 percent that exceed 128 tokens, we lose the tail end. But here's the thing — the complaint type is almost always clear from the first few sentences. People lead with their problem. "I was charged twice." "The debt collector is calling about a debt I don't owe." The specific details and timeline that follow are less important for classification than that opening framing.

The data split is fixed for the entire course. 57,846 training examples, 6,430 validation, 21,432 test. The training set is what the model learns from. The validation set is for monitoring overfitting and tuning hyperparameters — learning rate, number of epochs, that sort of thing. The test set is for final evaluation only. You run on it once, at the end, to report your results. You never use test set performance to make training decisions. If you do, you've leaked information and your results are meaningless. This is a basic principle, but it's worth stating explicitly because the temptation to peek is real. Everyone in the class uses the same split, so your results are directly comparable.

Let me recap. Always audit your data before training. We have 113 classes after cleaning up 153 raw labels — merging near-duplicates and dropping classes with fewer than 5 examples. The distribution is extremely long-tailed, with the top class at 23 percent. Nearly two-thirds of the text is partially redacted, so the model has to learn from complaint structure and vocabulary rather than specific details. Most complaints fit within our 128-token limit. And the data split is fixed for the entire semester. That's the terrain we're working with. Now let's see what a simple baseline can do on it.

We've looked at the data. Now, before we do anything neural, anything fancy, anything with transformers or GPUs or pretrained weights, we're going to run the simplest reasonable model we can think of. This is called baseline discipline, and it's one of the most important habits in applied machine learning. If you don't know what a simple model gets you, you have no way to judge whether your complex model is actually contributing anything.

Here's the problem. If I tell you my fine-tuned model gets 58 percent accuracy, what do you think? Is that good? Bad? You have no idea. You need a reference point. The baseline provides that reference point. When I tell you the model gets 58 percent accuracy and the TF-IDF baseline gets 54.2 percent, now you know the neural model is only a few points better on accuracy. But when I tell you the neural model gets 0.30 macro-F1 and the baseline gets 0.132, now you see something dramatic — the neural model more than doubled the F1 score. The baseline turns an isolated number into a story about where the model is actually adding value.

Let's start with the absolute simplest baseline: the majority-class predictor. You literally ignore the input and always predict the most common class. In our dataset, that's "Incorrect information on your report," which accounts for 23 percent of all complaints. So this zero-intelligence model gets 23 percent accuracy just by always guessing the same thing. Its macro-F1 is essentially zero — about 0.003 — because it gets zero F1 on 112 out of 113 classes. This is the floor. If any model you build this semester can't beat 23 percent accuracy, it has learned literally nothing beyond "guess the popular class."
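You can reproduce that 0.003 by hand. The sketch below assumes the top class is exactly 23 percent of the evaluation data and uses the standard definition of F1. Nothing here is measured, it's just the arithmetic.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

NUM_CLASSES = 113
MAJORITY_SHARE = 0.23  # share of the top class, from the lecture

# Always predicting the top class: every prediction is that class, so its
# precision equals its share of the data, and its recall is 1.0.
majority_f1 = f1(precision=MAJORITY_SHARE, recall=1.0)

# Every other class is never predicted, so its F1 is 0.
macro_f1 = (majority_f1 + 0.0 * (NUM_CLASSES - 1)) / NUM_CLASSES

print(f"accuracy:  {MAJORITY_SHARE:.1%}")
print(f"macro-F1:  {macro_f1:.4f}")
```

One decent F1 score divided by 113 is almost nothing, which is why the macro-F1 of this baseline sits so close to zero.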

Now let's run a real baseline — TF-IDF plus logistic regression. TF-IDF turns each complaint into a sparse vector where each dimension represents a word, and the value reflects how important that word is in this document relative to the corpus. Logistic regression then learns a linear decision boundary in that high-dimensional space. This is fast — about 30 seconds to train on a laptop, no GPU needed, no pretrained anything. It's the classic "can a simple model handle this" test for text classification. And it's important to run because if TF-IDF plus logistic regression solves your problem, you don't need deep learning at all.
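To make the TF-IDF idea concrete, here's a minimal sketch of the textbook weighting, term frequency times log(N / document frequency), in plain Python. It's an assumption-laden illustration: libraries typically use smoothed and normalized variants, the tokenizer here is just a whitespace split, and the three documents are invented.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return vectors

docs = [
    "i was charged twice on my card",
    "the debt collector called about my debt",
    "my report has incorrect information",
]
vectors = tfidf(docs)
# "my" appears in every document, so it carries no weight; "debt" is specific.
print(vectors[1]["my"], round(vectors[1]["debt"], 3))
```

Words that show up everywhere get a weight of zero, and words concentrated in a few documents get the highest weights. Logistic regression then only has to learn which of those distinctive words point to which class.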

Here are the results. The TF-IDF baseline gets 54.2 percent accuracy on the validation set. That's more than double the majority baseline's 23 percent. Weighted F1 is 0.479 — not great, but not terrible. But look at the macro-F1: 0.132. That's terrible. Remember, macro-F1 averages F1 across all 113 classes equally. If macro-F1 is 0.132 while accuracy is 54.2 percent, that tells you the model is doing well on some classes and catastrophically failing on others. Let's dig into what's happening.

Here's the core lesson of this slide. The model gets 54.2 percent accuracy but only 0.132 macro-F1. How is that possible? Because the model learned about 40 classes — the common ones — and completely ignores the other 70. It never predicts those 70 classes. Not once. For those 70 classes, F1 is zero. Now, those 70 classes are rare — they each have few examples — so ignoring them barely hurts accuracy. If a class represents 0.1 percent of the data, never predicting it only costs you 0.1 percentage points of accuracy. But macro-F1 treats every class equally. Seventy zeros in your average destroys the macro-F1, even if the other 40 classes have decent F1 scores. This is the accuracy vs. macro-F1 gap, and it's the most important diagnostic insight you'll learn today.
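You can back out what those zeros cost from the numbers already on the table. This is a sketch of the arithmetic only, and it assumes the 70 ignored classes score exactly zero F1.

```python
NUM_CLASSES = 113
ZERO_F1_CLASSES = 70      # classes the baseline never predicts
MACRO_F1 = 0.132          # reported for the TF-IDF baseline

learned_classes = NUM_CLASSES - ZERO_F1_CLASSES

# Macro-F1 is the plain mean over classes, so the sum of per-class F1 is
# macro_f1 * num_classes. All of that sum comes from the learned classes.
avg_f1_learned = MACRO_F1 * NUM_CLASSES / learned_classes

print(f"learned classes:             {learned_classes}")
print(f"average F1 on those classes: {avg_f1_learned:.3f}")
```

So the classes the model does learn average a much higher F1 than 0.132. The zeros are what pull the macro average down.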

This gap is not just a fun fact — it's the central diagnostic pattern you need to internalize. When accuracy is high but macro-F1 is low, your model is biased toward common classes and ignoring rare ones. This is the default behavior. It's not a bug in the algorithm — it's the natural consequence of training on imbalanced data. Common classes dominate the cross-entropy loss, so the model optimizes for them. Rare classes contribute almost nothing to the total loss, so the model learns to ignore them. And accuracy rewards this behavior — you can't tell it's happening unless you look at macro-F1 or the per-class breakdown. This is how you ship a model that works fine for 40 complaint types and fails completely for 70 others. In a real deployment, that means 70 categories of consumers get misrouted, mishandled, or ignored.

Here's our scoreboard so far. The majority baseline: 23 percent accuracy, essentially zero macro-F1. The TF-IDF baseline: 54.2 percent accuracy, 0.132 macro-F1. It learned about 40 classes and ignores 70. Now the question is: can a neural model do better? And specifically, can it do better where the baseline fails? Can it learn some of those 70 classes that the linear model completely ignores? That's what we're going to find out when we fine-tune ModernBERT. The bar has been set. Let's see if deep learning can clear it.

Key takeaways. Always run a baseline — you cannot evaluate a model without a reference point. The majority baseline is 23 percent accuracy. The TF-IDF baseline is 54.2 percent accuracy but only 0.132 macro-F1 — it learned about 40 classes and completely ignores 70. The gap between accuracy and macro-F1 is the most important thing to watch in imbalanced classification. And those 70 classes with F1 of zero — that's exactly where a neural model needs to prove it's worth the additional complexity and compute cost.

Now we're going to walk through the mechanics of fine-tuning. Not the code — you'll see that in the notebook — but the concepts. What actually happens when you take a pretrained model like ModernBERT and adapt it to classify consumer complaints? There are about eight steps in the loop, and each one involves a design decision you need to understand.

Here's the big picture. Fine-tuning means taking a model that has already been trained on a huge corpus of text — it already understands English grammar, word meanings, sentence structure — and teaching it your specific task. In our case, we take ModernBERT-base, which has 149 million parameters that were trained on billions of words of text. We add a small classification head on top — a new, randomly initialized layer — and then we train on our labeled complaint data. We're not training 149 million parameters from scratch. We're nudging them slightly so that the representations they produce are useful for distinguishing between our 113 complaint categories. That's a much easier job than learning English from nothing.

Step one: load the pretrained model. When you call from_pretrained with num_labels=113, the library does two things. It loads the full pretrained encoder — all 149 million parameters that were trained on a massive text corpus. Those weights already encode useful representations of English. Then it adds a classification head on top — a linear layer that maps from the encoder's hidden dimension to our 113 classes. That classification head is randomly initialized. It knows nothing about our task yet. So at the start of training, the encoder is sophisticated and the head is random. Training will adjust both, but the head has the most to learn.

Step two: tokenize the data. You load the tokenizer that matches your model — they must use the same vocabulary, because the model's embedding layer expects specific token IDs. We set max_length to 128, enable truncation for the rare long complaints, and pad shorter ones to the full length. You covered this in pre-work module one. The only difference now is scale — we're tokenizing 57,846 training examples, not one complaint at a time. This step is done once before training starts, not inside the training loop.
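Truncation and padding are simple enough to sketch without a tokenizer. This is a simplified illustration, assuming token IDs are already available as a list and that the padding ID is 0. The real padding ID comes from the tokenizer, and real tokenizers also keep their special tokens in place when they truncate. `pad_or_truncate` is a hypothetical helper name.

```python
MAX_LENGTH = 128
PAD_ID = 0  # hypothetical padding ID; the real value comes from the tokenizer

def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

short_ids, short_mask = pad_or_truncate(list(range(1, 51)))    # 50 tokens
long_ids, long_mask = pad_or_truncate(list(range(1, 301)))     # 300 tokens

print(len(short_ids), sum(short_mask))  # padded up to 128, 50 real tokens
print(len(long_ids), sum(long_mask))    # cut down to 128, all real tokens
```

The attention mask is how the model knows which positions are real text and which are padding it should ignore.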

Step three: the DataLoader. We don't feed examples to the model one at a time — that would be extremely slow. Instead, we batch them. The DataLoader takes our tokenized dataset and serves up batches of examples. We use a batch size of 32, which fits comfortably in the 16 gigabytes of memory on a free Kaggle T4 GPU when our max sequence length is 128 tokens. Larger batches would be faster but might not fit in memory. Smaller batches would add more noise to the gradient estimates. 32 is a reasonable choice for our setup. Each iteration of the training loop processes 32 complaints simultaneously: 32 tokenized inputs go in, 32 predictions come out, and we do one gradient update.
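Here's what batching means in numbers, as a minimal sketch. The `batches` generator is a hypothetical stand-in for what a DataLoader does. It leaves out shuffling and tensor collation, which a real one would handle.

```python
import math

def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

TRAIN_SIZE = 57_846
BATCH_SIZE = 32

steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)
last_batch = TRAIN_SIZE - (steps_per_epoch - 1) * BATCH_SIZE

sizes = [len(b) for b in batches(list(range(TRAIN_SIZE)), BATCH_SIZE)]
print(steps_per_epoch, last_batch, len(sizes), sizes[-1])
```

Each of those steps is one gradient update, so the batch size directly sets how many updates the model gets per epoch.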

Step four: the forward pass. Each batch of tokenized inputs goes through the encoder — that's 22 transformer layers in ModernBERT-base. The encoder produces a hidden state vector for every token position. We take the hidden state at the CLS token position — that's the special token at the start of every sequence — and use it as the representation of the entire complaint. That CLS embedding gets fed into the classification head, which is a linear layer that produces 113 numbers, one for each class. Those numbers are called logits — raw scores, not probabilities. To get probabilities you'd apply softmax, but for training we don't need to because the loss function handles that.
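If you're wondering what "apply softmax" actually does to the logits, here's a minimal sketch in plain Python. The four logits are invented, and our model would output 113 of them per complaint.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-class toy problem.
logits = [2.0, 0.5, -1.0, 0.0]
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs], predicted)
```

Softmax preserves the ranking of the logits, so the predicted class is the same whether you take the largest logit or the largest probability.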

Step five: the loss function. Cross-entropy loss is minus log of the probability the model assigns to the true class. Look at the curve. When the model is confident and correct — assigning 85% probability to the right class — the loss is only 0.16. When the model is uncertain, assigning only 10%, the loss jumps to 2.3. And when the model is confident but wrong — assigning just 2% to the true class — the loss is 3.9. That steep curve on the left is important: it means cross-entropy disproportionately punishes confident wrong answers. The model can't just be right on average — it really pays for being confidently wrong. This is computed for every example in every batch, then averaged to give a single number. Training means making that number go down.
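Those three loss values come straight from the formula, and you can verify them in a few lines. This sketch uses the natural log and the three probabilities from the slide.

```python
import math

def cross_entropy(p_true):
    """Loss for one example: minus the natural log of the true-class probability."""
    return -math.log(p_true)

scenarios = {
    "confident and correct": 0.85,
    "uncertain": 0.10,
    "confident and wrong": 0.02,
}
losses = {name: cross_entropy(p) for name, p in scenarios.items()}
for name, loss in losses.items():
    print(f"{name:22s} p={scenarios[name]:.2f}  loss={loss:.2f}")
```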

Here's the insight that connects the loss function to the 70 ignored classes we saw in the baseline. Cross-entropy loss is averaged over all examples in a batch. If a class has 14,000 training examples, it contributes roughly 25 percent of the total loss. If a class has 8 examples, it contributes about one hundredth of a percent. The optimizer follows the gradient, and the gradient is dominated by the common classes. Getting a rare class right barely reduces the total loss. Getting it wrong barely increases it. So what does the model learn to do? Ignore the rare classes. Focus on the common ones where loss reduction is easy. This isn't a bug in the algorithm — it's the natural consequence of the loss function interacting with imbalanced data. The TF-IDF model did exactly this — ignored 70 classes. Our neural model will do the same unless we intervene. That intervention is coming in Week 2, when we talk about class weighting and other strategies. For now, just understand the mechanism.
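The percentages I quoted are just class counts divided by the training set size. Here's that arithmetic as a sketch. It rests on a simplifying assumption that every example contributes about the same loss, which won't hold exactly in practice.

```python
TRAIN_SIZE = 57_846
class_counts = {
    "Incorrect info on your report": 14_815,
    "a rare class": 8,
}

# If every example contributed the same loss, a class's share of the averaged
# loss would equal its share of the examples.
shares = {name: count / TRAIN_SIZE for name, count in class_counts.items()}
for name, share in shares.items():
    print(f"{name:32s} {share:.3%}")
```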

Step six: the backward pass and optimizer step. Backpropagation computes the gradient of the loss with respect to every single parameter in the model — all 149 million of them plus the classification head. The gradient tells you: for each weight, which direction should it move to reduce the loss, and by how much? Then the optimizer takes those gradients and updates the weights. The simplest version is: new weight equals old weight minus learning rate times gradient. The learning rate is crucial. For fine-tuning, you use a much smaller learning rate than you'd use for training from scratch — typically 1e-5 to 5e-5. Why? Because the pretrained weights are already good. You want to nudge them gently, not overwrite them. If your learning rate is too high, you destroy the pretrained representations. That's called catastrophic forgetting.

Let me emphasize this point because it's the single most important hyperparameter you'll deal with. When training a model from scratch, you use a learning rate around 1e-3 — the weights are random, they need to move a lot. When fine-tuning, you use something like 2e-5 — that's 50 times smaller. The pretrained weights already encode useful knowledge about language. If you update them too aggressively, you overwrite that knowledge. The model forgets how English works while trying to learn your 113 categories. That's catastrophic forgetting. On the other hand, if your learning rate is too small, the model doesn't adapt enough to your specific task. Finding the right learning rate is the most impactful tuning decision you'll make, and we'll explore this more in Week 2.
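To see what a 50-times-smaller learning rate does to the weights, here's a minimal sketch of the plain update rule I described. The weights and gradients are invented. In practice you'd use an optimizer such as AdamW, which rescales each step, so take this as the idea rather than the exact update.

```python
def sgd_step(weights, grads, lr):
    """Plain gradient descent: new weight = old weight - lr * gradient."""
    return [w - lr * g for w, g in zip(weights, grads)]

# Hypothetical pretrained weights and gradients for three parameters.
weights = [0.80, -0.30, 0.10]
grads = [1.5, -2.0, 0.5]

fine_tune = sgd_step(weights, grads, lr=2e-5)   # fine-tuning rate from the lecture
scratch = sgd_step(weights, grads, lr=1e-3)     # from-scratch rate from the lecture

drift_fine = max(abs(a - b) for a, b in zip(fine_tune, weights))
drift_scratch = max(abs(a - b) for a, b in zip(scratch, weights))
print(f"largest change at 2e-5: {drift_fine:.6f}")
print(f"largest change at 1e-3: {drift_scratch:.6f}")
```

Same gradients, same weights, but the from-scratch rate moves every weight 50 times further per step. Repeat that over thousands of steps and the pretrained values are gone.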

Step seven: validation. After each pass through the training data (that's one epoch) you evaluate on the validation set. But first, you have to switch the model to eval mode. This turns off dropout and, in models that use batch normalization, freezes its running statistics. ModernBERT uses layer normalization rather than batch normalization, so that second part doesn't apply here, but the habit carries over to every model you'll train. If you forget this step, dropout is still randomly zeroing out neurons during validation, and your validation metrics will be noisy and unreliable. This is a real, common bug. You also disable gradient computation during validation: you're not updating weights, just measuring performance. You compute the validation loss, accuracy, and macro-F1, and you compare to the previous epoch. Is the model still improving? If validation loss starts going up while training loss keeps going down, you're overfitting.

Step eight: checkpointing. A checkpoint saves everything you need to resume training or deploy the model — the model weights, the optimizer state, and metadata like which epoch you're on and what the best validation score was. This matters for two reasons. First, practical: GPU time on Kaggle is limited and sometimes things crash. If you saved a checkpoint, you can resume. Second, methodological: you want to use the model weights from the epoch with the best validation performance, not the weights from the last epoch. Typically, validation performance improves for a few epochs and then starts to degrade as the model overfits. The best epoch is rarely the last one. Your checkpoint strategy determines whether you can recover the best model.
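Picking the best epoch is a small piece of bookkeeping. The sketch below assumes you've logged metrics per epoch and want the checkpoint with the highest validation macro-F1. The numbers are invented to show a typical overfitting pattern, and `best_checkpoint` is a hypothetical helper name.

```python
def best_checkpoint(history):
    """Return the epoch record with the highest validation macro-F1."""
    return max(history, key=lambda record: record["val_macro_f1"])

# Hypothetical per-epoch metrics, invented for illustration.
history = [
    {"epoch": 1, "train_loss": 2.10, "val_loss": 1.95, "val_macro_f1": 0.15},
    {"epoch": 2, "train_loss": 1.60, "val_loss": 1.70, "val_macro_f1": 0.19},
    {"epoch": 3, "train_loss": 1.25, "val_loss": 1.66, "val_macro_f1": 0.21},
    {"epoch": 4, "train_loss": 0.95, "val_loss": 1.78, "val_macro_f1": 0.20},
]
best = best_checkpoint(history)
last = history[-1]
print(f"best epoch: {best['epoch']}, last epoch: {last['epoch']}")
```

Training loss keeps falling in epoch 4 while validation gets worse. If you only kept the final weights, you'd deploy the overfit model.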

Here's the complete loop in pseudocode. For each epoch: switch to train mode, iterate over batches, do the forward pass, compute the loss, do the backward pass, update weights, zero the gradients. Then switch to eval mode, run validation, save a checkpoint. That's the entire fine-tuning loop. Every fine-tuning job you'll ever run — whether it's BERT, GPT, or anything else — follows this exact structure. The details vary — different optimizers, different schedulers, different regularization — but the loop is the same. You'll see the actual code in the notebook, and it will map directly to these eight steps.
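To show the shape of that loop without any GPU or library, here's a toy version in plain Python. It is not the notebook code. The "model" is a one-feature logistic regression, the data is synthetic, and because there's no dropout and gradients are recomputed from scratch each step, mode switching and gradient zeroing have no counterpart here.

```python
import math
import random

def forward(params, x):
    """Forward pass: a one-feature logistic model standing in for the network."""
    w, b = params
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def mean_loss(params, data):
    """Average cross-entropy over (x, y) pairs with y in {0, 1}."""
    total = 0.0
    for x, y in data:
        p = forward(params, x)
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(data)

def gradients(params, batch):
    """Backward pass: gradient of the mean loss with respect to w and b."""
    gw = gb = 0.0
    for x, y in batch:
        error = forward(params, x) - y
        gw += error * x
        gb += error
    return gw / len(batch), gb / len(batch)

random.seed(0)
# Synthetic data: the label is 1 when x is positive.
points = [random.uniform(-2, 2) for _ in range(200)]
data = [(x, 1 if x > 0 else 0) for x in points]
train, val = data[:160], data[160:]

params = (0.0, 0.0)
lr, batch_size = 0.5, 32
best = {"val_loss": float("inf"), "params": params, "epoch": 0}

for epoch in range(1, 6):
    random.shuffle(train)
    for start in range(0, len(train), batch_size):
        batch = train[start:start + batch_size]
        gw, gb = gradients(params, batch)                      # forward + backward
        params = (params[0] - lr * gw, params[1] - lr * gb)    # optimizer step
    val_loss = mean_loss(params, val)                          # validation
    if val_loss < best["val_loss"]:                            # keep the best epoch
        best = {"val_loss": val_loss, "params": params, "epoch": epoch}
    print(f"epoch {epoch}: val_loss={val_loss:.3f}")
```

Swap the logistic model for ModernBERT, the hand-written gradients for backpropagation, and the tuple of parameters for a saved checkpoint, and the structure is the same.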

Let me summarize. Fine-tuning adapts a pretrained model to your task: you're leveraging billions of words of pretraining, not starting from nothing. The architecture is a pretrained encoder plus a new classification head. The training loop has eight steps: load pretrained weights, tokenize, batch, forward pass, loss, backward pass plus optimizer step, validation, checkpoint. The learning rate is the most important hyperparameter, and fine-tuning needs much smaller rates than training from scratch. Train mode vs eval mode is a real source of bugs. And always save checkpoints so you can use the best epoch, not just the last one.

This is the bridge slide. Everything before this was encoder-focused: how to train, how to improve, how to evaluate. Students might be thinking "this is my model for the semester." It is — but it's about to get competition. I want to be upfront about that. The encoder is their primary artifact for Weeks 1 and 2. In Week 3 they'll train a decoder with LoRA and discover it's better on quality but slower. From that point on, the course is about understanding that trade-off: which model do you deploy, for what scenario, and why? They're not maintaining two parallel model lines. They're building one, meeting a challenger, and learning to reason about the engineering decision.

Look at this table. The fine-tuned encoder — 149 million parameters, all of them trained, running on a free Kaggle T4 — gets 56.6 percent accuracy and 0.209 macro-F1. The decoder — 494 million parameters, with LoRA adapting less than half a percent of them, on the same free T4 — gets 57 percent accuracy and 0.240 macro-F1. The decoder is better on quality. It rescues 9 more rare classes than the encoder. But look at the latency: 3 milliseconds versus 58 milliseconds per example. The encoder is 19 times faster. Both models train on the same data, on the same hardware, for roughly the same time. Neither dominates. The decoder is better. The encoder is faster. That's the trade-off. You'll notice the Opus row doesn't have a "rare classes rescued" number — that's because the zero-shot evaluation was on a 500-example sample, not the full validation set, so the per-class numbers aren't directly comparable. What we know: Opus got about 29 classes right out of the 71 that appeared in the sample. The fine-tuned models were evaluated on all 6,430 validation examples, which is why those numbers are reliable.

Let me make the trade-off concrete. On quality: the decoder wins. Higher accuracy, 15 percent higher macro-F1, and 9 more rare classes rescued from zero. On speed: the encoder wins. 3 milliseconds versus 58 milliseconds per example. For 64,000 complaints, a realistic monthly volume, the encoder finishes in 3 minutes, the decoder in 7 minutes. Both are fast enough for production. On that batch job the speed gap is real but it's about 2.5x, not 1000x. And here's the part that should really make you think: the decoder trained less than 1 percent of its parameters with LoRA. The encoder trained all 149 million of its parameters. The decoder got more out of less adaptation because it started with richer representations from pre-training on 9 times more data.
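The arithmetic behind those comparisons is worth doing once by hand. Here is a small sketch in plain Python using only the figures quoted in this lecture; the variable and function names are mine. Note that the per-example latency and the batch-job minutes are treated as two separately reported measurements, and the sketch does not try to derive one from the other.

```python
# Figures quoted in the lecture's scoreboard.
encoder = {"accuracy": 0.566, "macro_f1": 0.209, "rare_rescued": 67, "latency_ms": 3.0}
decoder = {"accuracy": 0.570, "macro_f1": 0.240, "rare_rescued": 76, "latency_ms": 58.0}

# Batch-job times quoted for 64,000 complaints, in minutes.
batch_minutes = {"encoder": 3.0, "decoder": 7.0}


def relative_gain(new, old):
    """Fractional improvement of `new` over `old` (0.15 means 15 percent higher)."""
    return (new - old) / old


macro_f1_gain = relative_gain(decoder["macro_f1"], encoder["macro_f1"])
accuracy_gap_points = (decoder["accuracy"] - encoder["accuracy"]) * 100
extra_rare_classes = decoder["rare_rescued"] - encoder["rare_rescued"]
per_example_slowdown = decoder["latency_ms"] / encoder["latency_ms"]
batch_slowdown = batch_minutes["decoder"] / batch_minutes["encoder"]

print(f"macro-F1 relative gain:  {macro_f1_gain:.1%}")
print(f"accuracy gap:            {accuracy_gap_points:.1f} points")
print(f"extra rare classes:      {extra_rare_classes}")
print(f"per-example slowdown:    {per_example_slowdown:.1f}x")
print(f"batch-job slowdown:      {batch_slowdown:.1f}x")
```

Keeping relative gain (macro-F1) separate from absolute gap (accuracy points) matters in your memos: a 0.4-point accuracy gap and a 15 percent macro-F1 gain describe the same two models.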

Why does a model training less than 1 percent of its parameters beat a model training all of them? Scaling laws. The decoder has 3.3 times more parameters, trained on 9 times more data during pre-training. It arrives with a richer understanding of language, financial concepts, and consumer complaints. It probably read CFPB complaints during pre-training — the database is public. The encoder has to learn everything from 58,000 labeled examples. The decoder already knew the domain; LoRA just teaches it the 113 label boundaries. And this effect gets stronger with model size. A 1.5 billion parameter decoder gets 58.3 percent accuracy. A 3 billion parameter decoder gets 58.7 percent. The bigger the decoder, the more the encoder falls behind. This is not a trend that reverses.
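To get a feel for how small "less than 1 percent" is, here is a back-of-the-envelope sketch in plain Python. LoRA adds two low-rank matrices per adapted weight matrix, so an adapted layer of shape d_out by d_in with rank r gains r × (d_in + d_out) trainable parameters. The hidden size, rank, layer count, and number of adapted matrices below are hypothetical round numbers chosen for illustration, not the configuration used in this course; only the 494 million total comes from the lecture.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one weight matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Hypothetical configuration, for illustration only.
hidden_size = 1024       # assumed width of each adapted projection
rank = 8                 # assumed LoRA rank
num_layers = 24          # assumed number of transformer layers
adapted_per_layer = 2    # assumed: two projection matrices adapted per layer

total_model_params = 494_000_000  # decoder size quoted in the lecture

per_matrix = lora_trainable_params(hidden_size, hidden_size, rank)
trainable = per_matrix * adapted_per_layer * num_layers
fraction = trainable / total_model_params

print(f"trainable LoRA parameters: {trainable:,}")
print(f"fraction of the model:     {fraction:.3%}")
```

Under these assumed numbers the adapters come to well under half a percent of the model, which is the regime the lecture describes; the exact fraction depends on the rank and on which matrices you choose to adapt.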

This course is not about picking the winner. If it were, I'd tell you the answer on slide one and we'd go home. Both models are useful. The encoder is fast, simple, and well-understood. The decoder is better on quality, especially on rare classes. The question — the real engineering question — is which one you deploy for a specific use case. A real-time complaint router that needs to respond in under 10 milliseconds? The encoder. A weekly batch analysis where getting rare classes right directly affects which customers get helped? The decoder. A startup with no GPU budget? TF-IDF might be fine. Your job this semester is to build both models, understand their failure modes, compress them, try to transfer knowledge between them, and write a final recommendation that accounts for quality, latency, cost, and the specific needs of the deployment scenario. That's what the memos are for. There is no single right answer.
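One way to structure that reasoning is "filter by hard constraints, then rank by quality." Here is a toy sketch in plain Python, using the scoreboard numbers from this lecture. The `Candidate` class and `shortlist` function are hypothetical names, the `needs_gpu` flags are my assumption, and the 0.0 ms latency for TF-IDF is a placeholder for "instant." A real recommendation also has to weigh cost, failure modes, and per-class behavior, so treat this as a starting point for a memo, not a substitute for one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    macro_f1: float
    latency_ms: float
    needs_gpu: bool  # assumption: the neural models are served on a GPU


CANDIDATES = [
    Candidate("TF-IDF + LogReg", 0.132, 0.0, False),  # 0.0 is a placeholder for "instant"
    Candidate("Fine-tuned encoder", 0.209, 3.0, True),
    Candidate("Decoder + LoRA", 0.240, 58.0, True),
]


def shortlist(latency_budget_ms, gpu_available):
    """Drop candidates that violate a hard constraint, then rank the rest by macro-F1."""
    feasible = [
        c for c in CANDIDATES
        if c.latency_ms <= latency_budget_ms and (gpu_available or not c.needs_gpu)
    ]
    return sorted(feasible, key=lambda c: c.macro_f1, reverse=True)


realtime_router = shortlist(latency_budget_ms=10, gpu_available=True)
weekly_batch = shortlist(latency_budget_ms=float("inf"), gpu_available=True)
no_gpu_startup = shortlist(latency_budget_ms=float("inf"), gpu_available=False)

for scenario, ranked in [
    ("real-time router", realtime_router),
    ("weekly batch", weekly_batch),
    ("no-GPU startup", no_gpu_startup),
]:
    print(f"{scenario}: {[c.name for c in ranked]}")
```

Ranking by macro-F1 is itself a choice: it says rare classes matter as much as common ones, which is true for some deployments and not for others.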

To recap. The encoder and decoder are close on accuracy. The real gap is on rare classes: the decoder's macro-F1 is 15 percent higher, and it rescues 9 more classes from zero. The encoder is 19 times faster per example, which matters for real-time applications but not for batch processing. The decoder gets its quality edge while training less than half a percent of its parameters; the rest came free from pre-training on trillions of tokens. And this effect scales: bigger decoders are better. Neither model dominates. This semester you'll build both, analyze both, compress both, and at the end you'll make a principled engineering recommendation about which to deploy and when. That recommendation, not a leaderboard number, is what this course is about.

Here's what you're doing. Right now, in Block 2, open the lab notebook — week1_lab.ipynb. It walks you through the data audit and the classical baseline. You should finish it in class. After class, open the homework notebook — week1_homework.ipynb. This is where you fine-tune ModernBERT-base on the full dataset. Training takes about 20 minutes for 2 epochs on a Kaggle T4 — use that time to monitor the loss curve. After training, the notebook has analysis exercises: which of the 70 classes that TF-IDF ignored did the neural model rescue? Which are still at zero? You'll also run one experiment where you change a hyperparameter and see what happens. Finally, the notebook has your memo prompts built in — one section for each part of the rubric. Write your observations right there in the notebook, export to HTML, and submit via Moodle by Wednesday morning before next class. Plan for about 5 to 6 hours of work outside of class. See you next week.

Practical Deep Learning Engineering for Applied ML

ECBS5200 — Week 1

A model is an artifact, not a service endpoint.

What this course IS

What this course IS NOT

The semester arc

The task: consumer complaint classification

Individual work, one model line

Assessment

Today's plan

Why ModernBERT-base?

The base model for the semester

Why ModernBERT-base?

What's inside ModernBERT-base

Dataset Audit for Supervised NLP

Before you train anything, look at what you're training on.

Why audit your data?

The CFPB consumer complaints dataset

Label distribution: the long tail

Why 113 classes? The label merge story

What do complaints actually look like?

Redaction prevalence

Text length distribution

Canonical data split

Dataset audit: key takeaways

Baseline Discipline

Always know what "dumb" gets you before you try "smart."

Why baselines matter

The simplest baseline: majority class

The classical baseline: TF-IDF + logistic regression

TF-IDF + logistic regression: results

The accuracy vs. macro-F1 gap

Why this gap is the core lesson

The scoreboard so far

Baseline discipline: key takeaways

Anatomy of a Fine-Tuning Loop

What actually happens when you fine-tune a pretrained encoder.

The big picture

Step 1: Load pretrained weights

Step 2: Tokenize the data

Step 3: DataLoader — batching

Step 4: Forward pass

Step 5: Loss — cross-entropy

Why cross-entropy explains the 70 ignored classes

Step 6: Backward pass + optimizer

The learning rate: why fine-tuning is different

Step 7: Validate

Step 8: Checkpoint

The complete loop

Fine-tuning anatomy: key takeaways

What you're building

The full picture

The trade-off

Why doesn't the encoder just win?

The semester question

Key takeaways

Your homework this week