This module is optional background. It covers how the models you'll use this semester were built — from raw text to the artifact you fine-tune. You don't need this to do the coursework, but it helps you understand WHY the models behave the way they do. If you're short on time, skip this module. If you're curious about what happened before you got the model, read on.
You're going to spend six weeks fine-tuning, adapting, analyzing, compressing, and distilling pretrained models. Every one of those models went through a pipeline that cost millions of dollars and months of compute before it reached you. This module walks you through that pipeline — not in detail, but enough that when you see a model card that says "pretrained on 2 trillion tokens," you understand what that means and why it matters.
The pipeline has four stages, though not every model goes through all of them. Pre-training is where the model learns language (grammar, facts, reasoning patterns) from massive amounts of text. Mid-training is optional; it adapts the model to specific domains or extends its capabilities, for example to a longer context window. Post-training is where the model learns to follow instructions and behave helpfully: this is instruction tuning and RLHF. And then there's your fine-tuning, which is stage 4 and the subject of this entire course. Most of the raw intelligence lives in stage 1. Stages 2 and 3 are about shaping how that intelligence is expressed.
Pre-training is where the model learns language. The model reads an enormous amount of text and learns to predict what comes next. Look at the scale progression. In 2020, GPT-3 trained on 300 billion tokens and it cost about 4.6 million dollars in GPU time. By 2024, Llama 3 trained on 15 trillion tokens, 50 times more data, and cost somewhere between 60 and 100 million dollars. Qwen 2.5, the decoder you'll use in this course, trained on 18 trillion tokens. ModernBERT, your encoder, trained on 2 trillion tokens. In four years, the field went from hundreds of billions of tokens to tens of trillions. The cost went from a few million dollars to as much as a hundred million. What comes out of this process is a model that has implicitly learned grammar, vocabulary, facts about the world, reasoning patterns, and even some ability to follow instructions, all as a side effect of predicting text.
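To make "predict what comes next" concrete, here's a minimal sketch of the loss for a single prediction. The probabilities and the example context are made up for illustration; a real model produces a distribution over its whole vocabulary, and the loss is averaged over every position in every document.

```python
import math

def next_token_loss(probs, target):
    """Cross-entropy for one prediction: the negative log of the
    probability the model assigned to the token that actually came next."""
    return -math.log(probs[target])

# Hypothetical model output after the context "the cat sat on the".
predicted = {"mat": 0.70, "sofa": 0.20, "moon": 0.10}

loss_good = next_token_loss(predicted, "mat")   # the likely token came next
loss_bad = next_token_loss(predicted, "moon")   # an unlikely token came next

print(f"loss when 'mat' follows:  {loss_good:.3f}")
print(f"loss when 'moon' follows: {loss_bad:.3f}")

# The token counts quoted in the text, as plain arithmetic.
gpt3_tokens = 300e9
llama3_tokens = 15e12
token_ratio = llama3_tokens / gpt3_tokens
print(f"Llama 3 / GPT-3 token ratio: {token_ratio:.0f}x")
```

The loss is small when the model put high probability on what actually followed and large when it didn't. Pre-training is just pushing that number down, trillions of times.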
Encoders and decoders are two ways to build a language model. Encoders read the entire input at once: every token can attend to every other token, in both directions. Decoders read left to right: each token can only see what came before it. This makes decoders natural for text generation, where you produce one token at a time. But here's something that surprises a lot of people: for classification, both architectures do one forward pass. When you add a classification head to a decoder, there's no autoregressive generation. The input goes through the model once, and the last token's hidden state feeds into the classification head, which outputs logits. The causal attention mask adds no meaningful computational overhead. So why is our decoder 19 times slower than our encoder? Mainly because it's bigger: it has 3.3 times more parameters, 494 million versus 149 million. At equal parameter counts, an encoder and a decoder would classify at roughly the same speed. Decoder models in practice tend to be bigger because they're trained at larger scale. That's a market reality, not an architectural limitation.
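Here's a small sketch of the two attention patterns, written as plain lists of ones and zeros. The token list is a made-up example, and real implementations apply the mask inside the attention computation, not as a standalone table. The thing to notice is the last row of the causal mask: only the last position has seen the whole input, which is why a decoder's classification head reads the last token's hidden state.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every token may attend to every other token."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: token i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Hypothetical five-token input.
tokens = ["my", "card", "was", "charged", "twice"]
n = len(tokens)

print("bidirectional:")
for row in bidirectional_mask(n):
    print(row)

print("causal:")
for row in causal_mask(n):
    print(row)

# Only the last row of the causal mask is all ones.
last_row = causal_mask(n)[-1]
print("last token sees all tokens:", all(last_row))

# Parameter counts quoted in the text.
param_ratio = 494e6 / 149e6
print(f"parameter ratio: {param_ratio:.1f}x")
```

Both masks have the same shape and cost the same to apply, so the mask itself isn't what makes one model slower than the other.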
Scaling laws are one of the most important empirical findings in machine learning. In 2020, researchers at OpenAI showed that model quality — measured by loss on held-out text — improves predictably as you increase parameters, data, and compute. This isn't a vague trend. It's a power law: double the compute, get a predictable drop in loss. This finding drove the industry to build bigger and bigger models, because the returns were reliable. Then in 2022, the Chinchilla paper from DeepMind refined the picture: it's not just about model size, it's about the balance between parameters and training data. A smaller model trained on more data can beat a larger model trained on less. This is why modern models like Qwen and Llama are trained on trillions of tokens, not just billions. The scaling laws explain something you'll see directly in this course: a 494-million parameter decoder, pretrained on vastly more data than a 149-million parameter encoder, produces better representations — and that advantage carries through to fine-tuning.
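A power law is easy to see in a few lines of code. In this sketch the constants a and alpha are illustrative placeholders, not fitted values from any paper, and the compute units are arbitrary. What it shows is the property that matters: doubling compute multiplies the loss by the same factor no matter where you start.

```python
def loss_from_compute(compute, a=10.0, alpha=0.05):
    """Power law L(C) = a * C**(-alpha). Both constants are illustrative."""
    return a * compute ** (-alpha)

# Double the compute from three very different starting points.
for start in (1e3, 1e6, 1e9):
    before = loss_from_compute(start)
    after = loss_from_compute(2 * start)
    print(f"C={start:.0e}: loss {before:.3f} -> {after:.3f} "
          f"(ratio {after / before:.4f})")

# With L = a * C**(-alpha), doubling C always scales the loss by 2**(-alpha).
doubling_ratio = 2 ** -0.05
print(f"expected ratio for every doubling: {doubling_ratio:.4f}")
```

That constant ratio is what "predictable" means here. It's also why the gains look small per doubling and still justified enormous budgets.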
Here's where scaling laws become directly relevant to your coursework. The two models you'll use are dramatically different in scale. ModernBERT has 149 million parameters and was trained on 2 trillion tokens. Qwen has 494 million parameters and was trained on 18 trillion tokens. That's 3.3 times more parameters and 9 times more data. Scaling laws predict the Qwen model will have richer internal representations of language, and that's exactly what we observe. On rare-class macro F1, the decoder, adapting less than half a percent of its parameters with LoRA, beats the fully fine-tuned encoder, where all 149 million parameters are updated. But scaling laws describe quality, not speed. The encoder is faster because it's smaller: 149 million parameters means less computation per forward pass than 494 million. That speed advantage is real and matters for deployment. The course is about navigating that trade-off: better representations versus faster inference.
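The ratios above are worth checking by hand, and the same numbers give you tokens per parameter. This sketch uses only the parameter and token counts quoted in this module. The comparison point of roughly 20 tokens per parameter is the commonly quoted rule of thumb from the Chinchilla paper, included here as an approximate reference, not a precise threshold.

```python
# Parameter and token counts as quoted in this module.
models = {
    "ModernBERT": {"params": 149e6, "tokens": 2e12},
    "Qwen 2.5": {"params": 494e6, "tokens": 18e12},
}

param_ratio = models["Qwen 2.5"]["params"] / models["ModernBERT"]["params"]
token_ratio = models["Qwen 2.5"]["tokens"] / models["ModernBERT"]["tokens"]
print(f"parameters: {param_ratio:.1f}x more")
print(f"tokens:     {token_ratio:.0f}x more")

# Training tokens seen per parameter.
tokens_per_param = {
    name: m["tokens"] / m["params"] for name, m in models.items()
}
for name, value in tokens_per_param.items():
    print(f"{name}: about {value:,.0f} tokens per parameter")

# Approximate rule of thumb often attributed to the Chinchilla paper.
CHINCHILLA_TOKENS_PER_PARAM = 20
```

Both models sit far above that rule of thumb. Small models today are trained on much more data than would be compute-optimal, because a small model that's cheap to run is worth the extra training cost.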
Mid-training is continued pre-training with a different data mix. The model keeps doing the same thing it did in pre-training, predicting the next token, but the data shifts. The clearest example is context extension. Llama 3 was pretrained with an 8K token context window. Meta took that trained model and continued training on progressively longer documents: 16K, 32K, then 128K tokens. Same loss function, same architecture, but now the data includes long documents, so the model learns to attend over longer ranges. That's how you turn an 8K model into a 128K model without starting over. Code specialization is similar: Code Llama took the base Llama 2 model and continued pre-training on 500 billion tokens of code. The model already knew English; the continued training made it much better at programming languages. Qwen 2.5, the decoder you'll use in this course, went through continued pre-training on math, code, and multilingual data after the initial pre-training phase. That's part of why it performs well across different domains. The pattern is always the same: same training process, different data diet.
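"Same training process, different data diet" can be sketched as nothing more than a change in sampling weights. The source names and the weights below are hypothetical, chosen only to show the idea; real data mixes are tuned carefully and usually aren't published in full.

```python
import random
from collections import Counter

def sample_sources(mix, n, seed=0):
    """Draw n training documents, picking each one's source by mix weight."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n))

# Hypothetical mixes: the sampler is identical, only the weights change.
pretrain_mix = {"web": 0.80, "code": 0.10, "math": 0.05, "long_documents": 0.05}
midtrain_mix = {"web": 0.30, "code": 0.40, "math": 0.20, "long_documents": 0.10}

pretrain_counts = sample_sources(pretrain_mix, 10_000)
midtrain_counts = sample_sources(midtrain_mix, 10_000)

for name in pretrain_mix:
    print(f"{name:15s} pretrain={pretrain_counts[name]:5d} "
          f"midtrain={midtrain_counts[name]:5d}")
```

Nothing about the model or the loss appears in this code, and that's the point. Mid-training changes what the model reads, not how it learns.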
Post-training is the stage that turns a text prediction engine into something that behaves like an assistant. Instruction tuning trains the model on thousands of examples of instructions paired with good responses. RLHF (reinforcement learning from human feedback), or the simpler alternative DPO (direct preference optimization), further aligns the model to produce responses humans prefer. This is why you'll see models released in two versions: Base and Instruct. The Base model is the raw output of pre-training, including any mid-training. The Instruct model has been aligned to follow instructions and be helpful. For this course, we use the Base version for our decoder classification experiments. Why? Because we're adding a classification head and fine-tuning with LoRA. We want the model's rich representations of language, not its instruction-following behavior. The Instruct version's alignment can actually interfere with classification head training.
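If you want to see what "produce responses humans prefer" looks like as a loss, here's a minimal sketch of the DPO objective for a single preference pair. The log-probabilities are made-up numbers and the beta value is just a typical-looking placeholder; in practice these come from summing token log-probabilities under the model being trained and under a frozen reference copy.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair of responses to the same prompt.

    Each argument is the log-probability of a full response. The margin
    measures how much more the policy prefers the chosen response than
    the reference model does.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))

# Hypothetical log-probabilities.
no_preference = dpo_loss(-10.0, -10.0, -10.0, -10.0)
prefers_chosen = dpo_loss(-8.0, -12.0, -10.0, -10.0)
prefers_rejected = dpo_loss(-12.0, -8.0, -10.0, -10.0)

print(f"no preference yet:       {no_preference:.4f}")
print(f"policy prefers chosen:   {prefers_chosen:.4f}")
print(f"policy prefers rejected: {prefers_rejected:.4f}")
```

The loss drops as the model shifts probability toward the response people preferred, relative to where the reference model started. You won't train with this in the course, since we start from Base models.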
And this is where the course begins. You receive a pretrained model — one that has been through the entire pipeline described in this module. It already understands language: grammar, vocabulary, sentence structure, and a vast amount of world knowledge. Your job is to teach it a specific task: classifying consumer complaints into 113 categories. That's a much easier job than learning English from scratch, which is why fine-tuning works with so much less data and compute than pre-training. You'll explore multiple approaches to this over the semester: full fine-tuning where you update every parameter, LoRA where you update less than 1 percent, quantization where you compress without retraining, and distillation where you transfer knowledge from one model to another. Each approach has trade-offs, and understanding those trade-offs is the point of the course.
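To get a feel for why LoRA touches so few parameters, here's a back-of-the-envelope count for one weight matrix. The layer size and the rank are hypothetical round numbers, not the dimensions of either course model. LoRA leaves the original matrix frozen and trains two thin matrices, A and B, in its place.

```python
def full_params(d_in, d_out):
    """Trainable weights if you fine-tune the whole matrix."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """LoRA trains two small matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank

# Hypothetical layer: a 1024 x 1024 projection with a rank-8 adapter.
d_in, d_out, rank = 1024, 1024, 8

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
fraction = lora / full

print(f"full fine-tuning: {full:,} trainable weights")
print(f"LoRA (rank {rank}):    {lora:,} trainable weights")
print(f"fraction:         {fraction:.2%} of this one matrix")
```

That percentage is for a single adapted matrix. Across a whole model the fraction is lower still, because typically only some matrices get adapters while the embeddings and everything else stay frozen.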
To summarize: the models you'll work with this semester were built through a multi-stage pipeline. Pre-training on trillions of tokens gives them language understanding. Scaling laws explain why the bigger decoder has better representations than the smaller encoder. Encoders and decoders are different architectural choices, but the speed-quality trade-off you'll see in this course comes mostly from model size, not from the architecture itself. Post-training alignment produces the Instruct variants you may have used in chat, but for classification we use the raw Base models. And your fine-tuning, the subject of this entire course, is the final stage where you specialize a general-purpose model for a specific task. Everything in this module is background. The real work starts in Week 1.
These are the foundational papers of the field. The Transformer paper introduced the architecture that everything is built on. BERT showed how to pre-train encoders for downstream tasks. GPT-2 demonstrated that decoder pre-training at scale produces surprisingly capable models. The scaling laws paper quantified how predictably bigger models get better. Chinchilla refined that understanding. InstructGPT introduced RLHF for alignment. And Megatron-LM showed how to train these models efficiently at massive scale. None of these are required reading for this course. But if you're curious about the engineering and science behind the models you'll be working with, these are the papers to start with.