ECBS5200 Pre-Work

How LLMs Are Built

Pre-Work Module 09 (Optional)

ECBS5200 — Practical Deep Learning Engineering

Module 09: How LLMs Are Built (Optional)
ECBS5200 Pre-Work

Why this matters for you

This course is about post-training engineering. You receive a pretrained model and adapt it.

But that model didn't appear from nothing. Understanding how it was built helps you:

  • Know what the model already knows (and what it doesn't)
  • Understand why a 494M decoder can beat a 149M encoder
  • Make better decisions about which model to start from
  • Speak the language of the field you're entering

The pipeline: four stages

1. Pre-training          Learns language from raw text
        ↓
2. Mid-training          (Optional) Domain or capability adaptation
        ↓
3. Post-training         Instruction tuning, alignment
        ↓
4. Your fine-tuning      What this course teaches

Most of the intelligence comes from stage 1. Stages 2-3 shape the behavior. Stage 4 is you.

Stage 1: Pre-training

The goal: learn to predict text. The model learns language as a side effect.
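As a minimal sketch of what "predict text" means in practice, the snippet below computes the average next-token cross-entropy for a toy sequence. The three-token vocabulary, the logits, and the helper names `softmax` and `next_token_loss` are all made up for illustration; a real model produces logits over tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t."""
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy sequence over a 3-token vocabulary (illustrative values only).
token_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0]] * 3          # model has learned nothing
confident_logits = [[0.0, 0.0, 5.0],            # position 0 favours token 2
                    [0.0, 5.0, 0.0],            # position 1 favours token 1
                    [0.0, 0.0, 0.0]]            # last position has no target

uniform_loss = next_token_loss(uniform_logits, token_ids)
confident_loss = next_token_loss(confident_logits, token_ids)
print(uniform_loss, confident_loss)
```

A model that guesses uniformly pays a loss of ln(vocabulary size); pre-training pushes that number down by making the correct next token more probable.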

Model        Year   Parameters   Training tokens   Estimated cost
GPT-3        2020   175B         300B              ~$4.6M
Llama 2      2023   70B          2T                ~$2M+
Llama 3      2024   405B         15T               ~$60-100M
Qwen 2.5     2024   0.5B–72B     18T               undisclosed
ModernBERT   2024   149M         2T                undisclosed

In four years, training data grew from 300B to 18T tokens (60x). Estimated training costs grew from about $5M to roughly $100M.

Encoders vs decoders: why both exist

                     Encoder                         Decoder
Reads text           All at once (bidirectional)     Left to right (causal mask)
Pre-training task    Fill in masked words            Predict next word
Good at              Understanding, classification   Generation, and also classification
For classification   One forward pass                Also one forward pass (with classification head)
Example              ModernBERT (149M)               Qwen 2.5 (0.5B–72B)

For classification, the speed difference comes from model size, not architecture. A 494M decoder is ~3x slower than a 149M encoder because it has 3x more parameters — not because it's a decoder.
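To make the "reads text" row concrete, here is a small sketch of the two attention patterns. The function name `attention_mask` and the sequence length of 4 are illustrative choices, not part of any library.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

encoder_mask = attention_mask(4, causal=False)  # bidirectional
decoder_mask = attention_mask(4, causal=True)   # left to right

for name, mask in [("encoder", encoder_mask), ("decoder", decoder_mask)]:
    print(name)
    for row in mask:
        # "x" = may attend, "." = blocked
        print(" ".join("x" if allowed else "." for allowed in row))
```

Both masks have the same shape, so the matrices involved are the same size in either case. In the decoder only the last position sees the whole input.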

Scaling laws: why bigger models are better

In 2020, OpenAI published a finding that changed the field:

Model quality scales predictably with three things:

  1. Number of parameters
  2. Amount of training data
  3. Amount of compute

Double the compute → predictable improvement in quality. No plateau within the range studied, although the trend is a power law, so each doubling buys a smaller absolute drop in loss.
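A quick way to see what "predictable" means is to plug numbers into a power law of the form L(C) = (c_ref / C) ** alpha. The sketch below uses placeholder constants (`c_ref=1.0`, `alpha=0.05`) chosen only for illustration; they are assumptions, not values taken from this module.

```python
def loss_from_compute(compute, c_ref=1.0, alpha=0.05):
    """Power-law loss curve. Both constants are illustrative assumptions."""
    return (c_ref / compute) ** alpha

budgets = [1, 2, 4, 8, 16]                      # each step doubles compute
losses = [loss_from_compute(c) for c in budgets]

# Relative change per doubling: the same factor every time.
ratios = [losses[i + 1] / losses[i] for i in range(len(losses) - 1)]
# Absolute change per doubling: shrinks as the loss itself shrinks.
drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]

for c, l in zip(budgets, losses):
    print(f"compute x{c:<2d} -> loss {l:.4f}")
```

Under this form every doubling multiplies the loss by the same factor (2 ** -alpha), which is what makes the curve easy to extrapolate.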

The Chinchilla insight (2022): most models were too big for their data. At the same compute budget, a smaller model trained on more data beats a larger model trained on less.
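The trade-off can be sketched with two widely quoted rules of thumb, both of which are assumptions brought in for illustration rather than figures from this module: training compute is roughly 6 FLOPs per parameter per token, and a compute-optimal run uses roughly 20 tokens per parameter. The function names are hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # assumption: training FLOPs ~ 6 * params * tokens
TOKENS_PER_PARAM = 20       # assumption: rough compute-optimal ratio

def training_flops(n_params, n_tokens):
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal(compute_flops):
    """Split a fixed compute budget into (params, tokens) under the rules above."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Budget of a GPT-3-sized run from the table: 175B parameters, 300B tokens.
budget = training_flops(175e9, 300e9)
opt_params, opt_tokens = compute_optimal(budget)
print(f"params: {opt_params / 1e9:.0f}B, tokens: {opt_tokens / 1e12:.2f}T")
```

Under these assumptions the same budget would be better spent on a model of roughly 51B parameters trained on roughly 1T tokens: fewer parameters, more data.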

What scaling laws mean for this course

You will work with two models:

  • ModernBERT-base: 149M parameters, trained on 2T tokens
  • Qwen2.5-0.5B: 494M parameters, trained on 18T tokens

The decoder has 3.3x more parameters and was trained on 9x more data.

Scaling laws predict it will have richer representations. That's exactly what we observe: the decoder, with LoRA on 0.46% of its parameters, beats the fully fine-tuned encoder on rare-class performance.

The encoder's advantage? It's smaller and therefore faster. Scaling laws say nothing about inference speed.
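The ratios quoted on this slide follow directly from the two model cards. The sketch below redoes the arithmetic and also reports training tokens per parameter; the dictionary layout is just a convenient illustration.

```python
models = {
    "ModernBERT-base": {"params": 149e6, "tokens": 2e12},
    "Qwen2.5-0.5B": {"params": 494e6, "tokens": 18e12},
}

enc = models["ModernBERT-base"]
dec = models["Qwen2.5-0.5B"]

param_ratio = dec["params"] / enc["params"]   # how much larger the decoder is
token_ratio = dec["tokens"] / enc["tokens"]   # how much more data it saw

# Training tokens seen per parameter, for each model.
tokens_per_param = {name: m["tokens"] / m["params"] for name, m in models.items()}

print(f"parameters: {param_ratio:.1f}x, training tokens: {token_ratio:.0f}x")
for name, tpp in tokens_per_param.items():
    print(f"{name}: about {tpp:,.0f} tokens per parameter")
```

Both models saw thousands of tokens per parameter, and the decoder saw more per parameter than the encoder did.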

Stage 2: Mid-training

Same training process, different data diet.

The model keeps predicting tokens, but the training data shifts to emphasize specific capabilities:

Llama 3 → 3.1: Meta continued training on progressively longer documents, extending the context window in stages from 8K to 128K tokens. Same loss function, longer inputs. Turned an 8K-context model into a 128K-context model without starting over.

Llama 2 → Code Llama: Continued pre-training on 500B+ tokens of code. The model already knew English; now it gets much better at Python and Java.

Qwen 2.5: Alibaba continued pre-training on math, code, and multilingual data. That's why Qwen 2.5 is strong on math benchmarks despite being a general-purpose model.
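"Different data diet" can be pictured as changing the sampling weights over data sources while leaving the training loop alone. The mixture weights below are invented for illustration and do not describe any real model's recipe; `sample_sources` and `share` are hypothetical helpers.

```python
import random

def sample_sources(mix, n_docs, seed=0):
    """Draw the source of each training document according to mixture weights."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n_docs)

def share(samples, name):
    """Fraction of sampled documents that came from one source."""
    return samples.count(name) / len(samples)

# Hypothetical mixtures, for illustration only.
pretraining_mix = {"web": 0.80, "code": 0.10, "math": 0.05, "multilingual": 0.05}
midtraining_mix = {"web": 0.30, "code": 0.35, "math": 0.20, "multilingual": 0.15}

pre = sample_sources(pretraining_mix, 10_000)
mid = sample_sources(midtraining_mix, 10_000)

for source in pretraining_mix:
    print(f"{source:<13s} pre {share(pre, source):.2f}   mid {share(mid, source):.2f}")
```

The loss function and architecture are untouched in this picture; only the proportion of each source that reaches the model changes.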

Stage 3: Post-training alignment

After pre-training, the model can predict text but doesn't know how to be helpful.

Instruction tuning: train on (instruction, response) pairs so the model follows directions.

RLHF / DPO: use human preferences to align the model's behavior — helpful, harmless, honest.
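For readers who want to see what "use human preferences" looks like as a loss, here is a minimal sketch of the DPO objective for a single preference pair. It compares how much the model being trained (the policy) and a frozen reference model prefer the chosen response over the rejected one. The log-probabilities and `beta=0.1` are made-up illustrative values, and the function name is hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of response log-probabilities."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: no preference learned yet.
loss_neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has moved towards the chosen response and away from the rejected one.
loss_better = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(loss_neutral, loss_better)
```

The loss falls as the policy raises the probability of the preferred response relative to the reference model.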

This is why Qwen2.5-0.5B-Instruct exists alongside Qwen2.5-0.5B:

  • Base: raw pre-trained model. Good representations but doesn't follow instructions.
  • Instruct: post-trained to be a helpful assistant. Better at chat, worse for classification heads.

For classification with LoRA, we use Base — we want the representations, not the chat behavior.

Stage 4: Your fine-tuning

This is where you come in.

You receive a model that already understands language. You teach it your specific task:

  • Full fine-tuning: update all parameters (Weeks 1-2)
  • LoRA: update <1% of parameters (Week 3)
  • Quantization: compress without retraining (Week 5)
  • Distillation: transfer knowledge between models (Week 6)

The model arrives knowing English, world knowledge, and reasoning. You just need to teach it 113 complaint categories.
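To give a feel for why LoRA touches so few parameters, the sketch below counts trainable weights for one linear layer. LoRA freezes the original weight matrix and trains two small matrices, A (rank x d_in) and B (d_out x rank). The layer size and rank here are hypothetical and do not describe the course models.

```python
def full_params(d_in, d_out):
    """Weights updated by full fine-tuning of one linear layer (bias ignored)."""
    return d_in * d_out

def lora_trainable_params(d_in, d_out, rank):
    """Weights in the two LoRA matrices: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

d_in = d_out = 1024   # hypothetical layer size
rank = 8              # hypothetical LoRA rank

full = full_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
fraction = lora / full

print(f"full: {full:,}  lora: {lora:,}  fraction: {fraction:.2%}")
```

For this single layer the adapter is about 1.6% of the layer's weights. Across a whole model the share is usually lower, because adapters are typically attached to only some of the weight matrices while everything else stays frozen.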

Key takeaways

  1. Pre-training is where models learn language — from trillions of tokens, at enormous cost.
  2. Scaling laws predict that more parameters + more data = better quality. The 494M decoder was trained on 9x more data than the 149M encoder.
  3. Encoders read bidirectionally; decoders read left to right. For classification, both take one forward pass, so our decoder is slower because it is larger, not because it is a decoder.
  4. Post-training (instruction tuning, RLHF) shapes behavior. For classification, we use Base models, not Instruct.
  5. Your fine-tuning is the last mile. The model already knows language. You teach it your task.

Further reading (optional)

Paper                                                                            What it's about                  Year
Vaswani et al., "Attention Is All You Need"                                      The Transformer architecture     2017
Devlin et al., "BERT"                                                            Encoder pre-training for NLP     2018
Radford et al., "Language Models are Unsupervised Multitask Learners" (GPT-2)    Decoder pre-training at scale    2019
Kaplan et al., "Scaling Laws for Neural Language Models"                         Why bigger = better              2020
Hoffmann et al., "Chinchilla"                                                    Optimal compute allocation       2022
Ouyang et al., "InstructGPT"                                                     RLHF alignment                   2022
Shoeybi et al., "Megatron-LM"                                                    Efficient large-scale training   2020

None of these are required. They're here if you want to go deeper.

Module 09: How LLMs Are Built (Optional)

This module is optional background. It covers how the models you'll use this semester were built — from raw text to the artifact you fine-tune. You don't need this to do the coursework, but it helps you understand WHY the models behave the way they do. If you're short on time, skip this module. If you're curious about what happened before you got the model, read on.

You're going to spend six weeks fine-tuning, adapting, analyzing, compressing, and distilling pretrained models. Every one of those models went through a pipeline that cost millions of dollars and months of compute before it reached you. This module walks you through that pipeline — not in detail, but enough that when you see a model card that says "pretrained on 2 trillion tokens," you understand what that means and why it matters.

The pipeline has four stages, though not every model goes through all of them. Pre-training is where the model learns language — grammar, facts, reasoning patterns — from massive amounts of text. Mid-training is optional and used to adapt the model to specific domains or extend its capabilities like longer context. Post-training is where the model learns to follow instructions and behave helpfully — this is instruction tuning and RLHF. And then there's your fine-tuning, which is stage 4 and the subject of this entire course. Most of the raw intelligence lives in stage 1. Stages 2 and 3 are about shaping how that intelligence is expressed.

Pre-training is where the model learns language. The model reads an enormous amount of text and learns to predict what comes next. Look at the scale progression. In 2020, GPT-3 trained on 300 billion tokens and it cost about 4.6 million dollars in GPU time. By 2024, Llama 3 trained on 15 trillion tokens, 50 times more data, and cost somewhere between 60 and 100 million dollars. Qwen 2.5, the decoder you'll use in this course, trained on 18 trillion tokens. ModernBERT, your encoder, trained on 2 trillion tokens. In four years, the field went from hundreds of billions of tokens to tens of trillions. The cost went from a few million dollars to as much as a hundred million. What comes out of this process is a model that has implicitly learned grammar, vocabulary, facts about the world, reasoning patterns, and even some ability to follow instructions, all as a side effect of predicting text.

Encoders and decoders are two ways to build a language model. Encoders read the entire input at once: every token can attend to every other token, in both directions. Decoders read left to right: each token can only see what came before it. This makes decoders natural for text generation, where you produce one token at a time. But here's something that surprises a lot of people: for classification, both architectures do one forward pass. When you add a classification head to a decoder, there's no autoregressive generation. The input goes through the model once, and the last token's hidden state feeds into the classification head, which outputs logits. The causal attention mask adds no meaningful computational overhead. So why is our decoder roughly 3 times slower than our encoder? Because it has 3.3 times more parameters: 494 million versus 149 million. At equal parameter counts, an encoder and a decoder would classify at roughly the same speed. Decoder models tend to be bigger because they're trained at larger scale. That's a market reality, not an architectural limitation.
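The "last token's hidden state feeds into the classification head" step can be sketched in a few lines of plain Python. All numbers are toy values, the helpers `last_token_index` and `classify` are hypothetical, and the sketch assumes right padding (padding tokens come after the real tokens).

```python
def last_token_index(attention_mask):
    """Index of the last real (non-padding) token; assumes right padding."""
    return sum(attention_mask) - 1

def classify(hidden_states, attention_mask, weight, bias):
    """Pool the last real token's hidden state and apply a linear head."""
    pooled = hidden_states[last_token_index(attention_mask)]
    return [sum(w * h for w, h in zip(row, pooled)) + b
            for row, b in zip(weight, bias)]

# Toy values: 4 positions (the last is padding), hidden size 2, 3 classes.
hidden_states = [[0.1, 0.2], [0.3, 0.4], [1.0, -1.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one row per class
bias = [0.0, 0.0, 0.5]

logits = classify(hidden_states, attention_mask, weight, bias)
print(logits)  # [1.0, -1.0, 0.5]
```

The hidden states come from a single forward pass; nothing is generated token by token. The padded position is skipped, so its hidden state never reaches the head.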

Scaling laws are one of the most important empirical findings in machine learning. In 2020, researchers at OpenAI showed that model quality — measured by loss on held-out text — improves predictably as you increase parameters, data, and compute. This isn't a vague trend. It's a power law: double the compute, get a predictable drop in loss. This finding drove the industry to build bigger and bigger models, because the returns were reliable. Then in 2022, the Chinchilla paper from DeepMind refined the picture: it's not just about model size, it's about the balance between parameters and training data. A smaller model trained on more data can beat a larger model trained on less. This is why modern models like Qwen and Llama are trained on trillions of tokens, not just billions. The scaling laws explain something you'll see directly in this course: a 494-million parameter decoder, pretrained on vastly more data than a 149-million parameter encoder, produces better representations — and that advantage carries through to fine-tuning.

Here's where scaling laws become directly relevant to your coursework. The two models you'll use are dramatically different in scale. ModernBERT has 149 million parameters and was trained on 2 trillion tokens. Qwen has 494 million parameters and was trained on 18 trillion tokens. That's 3.3 times more parameters and 9 times more data. Scaling laws predict the Qwen model will have richer internal representations of language — and that's exactly what we observe. The decoder, adapting less than half a percent of its parameters with LoRA, beats the encoder that fine-tunes all 149 million parameters on rare-class macro F1. But scaling laws describe quality, not speed. The encoder is faster because it's smaller — 149 million parameters means less computation per forward pass than 494 million. That speed advantage is real and matters for deployment. The course is about navigating that trade-off: better representations versus faster inference.
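A back-of-the-envelope way to compare the cost of one forward pass is the approximation of about 2 floating-point operations per parameter per token. That rule of thumb and the input length below are assumptions for illustration; it ignores attention's dependence on sequence length, and measured wall-clock time also depends on hardware, batch size, and numeric precision.

```python
def forward_flops(n_params, n_tokens):
    """Rough forward-pass cost, assuming ~2 FLOPs per parameter per token."""
    return 2 * n_params * n_tokens

seq_len = 256  # hypothetical input length in tokens

encoder_flops = forward_flops(149e6, seq_len)   # ModernBERT-base
decoder_flops = forward_flops(494e6, seq_len)   # Qwen2.5-0.5B
ratio = decoder_flops / encoder_flops

print(f"encoder: {encoder_flops:.2e} FLOPs")
print(f"decoder: {decoder_flops:.2e} FLOPs")
print(f"decoder / encoder: {ratio:.1f}x")
```

Under this approximation the cost ratio equals the parameter ratio, whatever the input length.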

Mid-training is continued pre-training with a different data mix. The model keeps doing the same thing it did in pre-training, predicting the next token, but the data shifts. The clearest example is context extension. Llama 3 was pretrained with an 8K token context window. Meta took that trained model and continued training on progressively longer documents, extending the context window in stages from 8K up to 128K tokens. Same loss function, same architecture, but now the data includes long documents, so the model learns to attend over longer ranges. That's how you turn an 8K model into a 128K model without starting over. Code specialization is similar: Code Llama took the base Llama 2 model and continued pre-training on 500 billion tokens of code. The model already knew English; the continued training made it much better at programming languages. Qwen 2.5, the decoder you'll use in this course, went through continued pre-training on math, code, and multilingual data after the initial pre-training phase. That's part of why it performs well across different domains. The pattern is always the same: same training process, different data diet.

Post-training is the stage that turns a text prediction engine into something that behaves like an assistant. Instruction tuning trains the model on thousands of examples of instructions paired with good responses. RLHF (reinforcement learning from human feedback), or the simpler alternative DPO (direct preference optimization), further aligns the model to produce responses humans prefer. This is why you'll see models released in two versions: Base and Instruct. The Base model is the raw output of pre-training. The Instruct model has been aligned to follow instructions and be helpful. For this course, we use the Base version for our decoder classification experiments. Why? Because we're adding a classification head and fine-tuning with LoRA. We want the model's rich representations of language, not its instruction-following behavior. The Instruct version's alignment can actually interfere with classification head training.

And this is where the course begins. You receive a pretrained model — one that has been through the entire pipeline described in this module. It already understands language: grammar, vocabulary, sentence structure, and a vast amount of world knowledge. Your job is to teach it a specific task: classifying consumer complaints into 113 categories. That's a much easier job than learning English from scratch, which is why fine-tuning works with so much less data and compute than pre-training. You'll explore multiple approaches to this over the semester: full fine-tuning where you update every parameter, LoRA where you update less than 1 percent, quantization where you compress without retraining, and distillation where you transfer knowledge from one model to another. Each approach has trade-offs, and understanding those trade-offs is the point of the course.

To summarize: the models you'll work with this semester were built through a multi-stage pipeline. Pre-training on trillions of tokens gives them language understanding. Scaling laws explain why the bigger decoder has better representations than the smaller encoder. Encoders and decoders are different architectural choices with different speed-quality trade-offs. Post-training alignment produces the Instruct variants you may have used in chat, but for classification we use the raw Base models. And your fine-tuning — the subject of this entire course — is the final stage where you specialize a general-purpose model for a specific task. Everything in this module is background. The real work starts in Week 1.

These are the foundational papers of the field. The Transformer paper introduced the architecture that everything is built on. BERT showed how to pre-train encoders for downstream tasks. GPT-2 demonstrated that decoder pre-training at scale produces surprisingly capable models. The scaling laws paper quantified why bigger models are better. Chinchilla refined that understanding. InstructGPT introduced RLHF for alignment. And Megatron-LM showed how to train these models efficiently at massive scale. None of these are required reading for this course. But if you're curious about the engineering and science behind the models you'll be working with, these are the papers to start with.