ECBS5200 Pre-Work

Pretrained Encoders for Classification

Pre-Work Module 02

ECBS5200 — Practical Deep Learning Engineering

Module 02: Pretrained Encoders for Classification

What does "pretrained" mean?

A pretrained model has already learned language by reading massive amounts of text.

  • Billions of words from books, articles, code, web pages
  • Trained for days or weeks on expensive hardware
  • Learned grammar, syntax, word meaning, context, nuance

You don't train this from scratch. Someone already did.

You download their work and build on top of it.

The encoder architecture

Input: a sequence of token IDs (from Module 01)
Output: a dense vector for every token

Tokens:      [CLS]  I   have  a  billing  problem  [SEP]
                ↓   ↓    ↓    ↓     ↓       ↓       ↓
Encoder:     [ ... transformer layers ... ]
                ↓   ↓    ↓    ↓     ↓       ↓       ↓
Vectors:     [768] [768] [768] [768] [768]  [768]  [768]

Each token becomes a 768-dimensional vector — a rich numerical representation that captures meaning in context.
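
The "meaning in context" idea can be sketched with a toy attention step. This is a minimal illustration, not ModernBERT: the 8-dimensional width, the hash-seeded `embed` function, and the single projection-free attention step in `toy_encoder_layer` are all assumptions made so the example runs without downloading a model.

```python
import math
import random

HIDDEN = 8  # toy width (assumption); the encoder in this deck uses 768


def embed(token, dim=HIDDEN):
    """Stand-in for a learned embedding: the same token always gets the same vector."""
    rng = random.Random(token)  # string seeds are deterministic
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_encoder_layer(tokens):
    """One self-attention step with no learned projections (a simplification):
    every output vector is a weighted average of all input vectors."""
    vecs = [embed(t) for t in tokens]
    outputs = []
    for query in vecs:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(HIDDEN)
            for key in vecs
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, vecs)) for d in range(HIDDEN)]
        )
    return outputs


sentence_a = ["[CLS]", "card", "charge", "[SEP]"]
sentence_b = ["[CLS]", "criminal", "charge", "[SEP]"]

out_a = toy_encoder_layer(sentence_a)
out_b = toy_encoder_layer(sentence_b)

# One vector per token, each HIDDEN numbers long.
print(len(out_a), len(out_a[0]))

# Same input token, different neighbours, different output vector.
charge_a = out_a[sentence_a.index("charge")]
charge_b = out_b[sentence_b.index("charge")]
print(charge_a != charge_b)
```

The real encoder stacks many such layers with learned weights, but the shape contract is the same: one vector in, one vector out, per token.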

The [CLS] token: your sentence summary

Remember from Module 01: [CLS] is added at the start of every input.

After passing through the encoder, the [CLS] vector becomes a summary of the entire input.

[CLS] representation → 768 numbers summarizing the whole complaint

This is the vector we'll use for classification.

Why [CLS]? Because it has no "word meaning" of its own — the model learns to pack the overall meaning of the input into this position.

The classification head

A classification head is a small network on top of the encoder:

[CLS] vector (768) → Dense (768→768) → GELU → LayerNorm → Linear (768→111)

  • A small prediction head transforms the [CLS] representation (one dense layer + activation + normalization)
  • A final linear layer maps to 111 outputs (one per complaint category, after Week 1 label cleanup)
  • The highest score = the model's prediction
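
The pipeline above can be traced by hand with toy sizes. This is a rough sketch under stated assumptions: the weights are random rather than trained, the sizes are 16 and 5 instead of 768 and 111, the dense layer has no bias, and `layer_norm` omits the learned scale that a real LayerNorm carries. The helper names are illustrative only.

```python
import math
import random


def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(vec, eps=1e-5):
    """Normalize to zero mean and unit variance (learned scale omitted)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def linear(vec, weight, bias=None):
    """weight holds one row per output unit."""
    out = [sum(w * v for w, v in zip(row, vec)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def classify(cls_vec, dense_w, out_w, out_b):
    """Dense -> GELU -> LayerNorm -> Linear, then pick the highest score."""
    hidden = linear(cls_vec, dense_w)
    hidden = [gelu(h) for h in hidden]
    hidden = layer_norm(hidden)
    logits = linear(hidden, out_w, out_b)
    prediction = max(range(len(logits)), key=logits.__getitem__)
    return logits, prediction


rng = random.Random(0)
HIDDEN, NUM_CLASSES = 16, 5  # toy sizes (assumption)

cls_vec = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]
dense_w = random_matrix(HIDDEN, HIDDEN, rng)
out_w = random_matrix(NUM_CLASSES, HIDDEN, rng)
out_b = [0.0] * NUM_CLASSES

logits, prediction = classify(cls_vec, dense_w, out_w, out_b)
print(len(logits), prediction)
```

With untrained weights the predicted index is meaningless; fine-tuning is what makes the highest score line up with the right category.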

~676K parameters total — less than 0.5% of the model. The encoder does the hard work.
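
The ~676K figure can be checked with a few lines of arithmetic. The bias settings below are assumptions (dense layer without bias, LayerNorm with a scale but no bias, final linear layer with bias), chosen because they reproduce the number on the slide; the notebook's parameter count is the authoritative check.

```python
def head_param_count(hidden=768, num_labels=111, dense_bias=False, norm_bias=False):
    """Count parameters in Dense -> GELU -> LayerNorm -> Linear.

    GELU has no parameters. The bias flags are assumptions, not read
    from a model config.
    """
    dense = hidden * hidden + (hidden if dense_bias else 0)
    norm = hidden + (hidden if norm_bias else 0)  # scale, plus optional bias
    output = hidden * num_labels + num_labels     # weights plus bias
    return dense + norm + output


head = head_param_count()
encoder = 149_000_000  # "~149 million" from the deck, rounded

print(f"{head:,}")                       # 675,951
print(f"{head / (encoder + head):.2%}")  # 0.45%
```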

Why not train from scratch?

Training a language model from scratch requires:

  • Billions of training tokens
  • Hundreds of GPU-hours (or more)
  • Thousands of dollars in compute

We have:

  • ~220,000 labeled complaints
  • Free Kaggle GPUs (30 hrs/week)
  • One semester

From-scratch training is not an option. Transfer learning is.
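
A quick back-of-the-envelope calculation shows the size of the gap. The complaint count and the 128-token length come from this deck; the batch size of 32 is a hypothetical value picked only for illustration, and one billion tokens is used as the low end of "billions".

```python
import math

NUM_COMPLAINTS = 220_000  # labeled examples (from the deck)
MAX_LEN = 128             # tokens per example after truncation (from the deck)
BATCH_SIZE = 32           # hypothetical batch size, for illustration only

steps_per_epoch = math.ceil(NUM_COMPLAINTS / BATCH_SIZE)
max_tokens_per_epoch = NUM_COMPLAINTS * MAX_LEN  # upper bound: every input full length

print(steps_per_epoch)                             # 6875
print(f"{max_tokens_per_epoch:,}")                 # 28,160,000
print(f"{max_tokens_per_epoch / 1_000_000_000:.1%} of one billion tokens")
```

Even if every complaint filled all 128 positions, one pass over the dataset would cover under 3% of a single billion tokens. That is plenty for fine-tuning, and nowhere near enough for pretraining.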

Transfer learning: stand on the shoulders of giants

Step 1: Someone trains a model on billions of words    (expensive, done once)
Step 2: You download their pretrained model             (free, takes 30 seconds)
Step 3: You add a classification head                   (a small network)
Step 4: You fine-tune on YOUR data                      (cheap, takes minutes)

The model already understands language.
You just teach it: "given this complaint, which category?"

This is why modern NLP is accessible. You don't need Google's budget.

ModernBERT-base: our model for the semester

Property              Value
Model                 answerdotai/ModernBERT-base
Parameters            ~149 million
Hidden size           768
Layers                22 transformer layers
Vocabulary            50,368 tokens
Max sequence length   8,192 tokens (we use 128)

  • A modern, efficient encoder released by Answer.AI
  • "Base" size: large enough to be powerful, small enough to fine-tune on free GPUs
  • This is the backbone we'll use for every experiment
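
The numbers in the table can be put to work with some simple arithmetic. The `EncoderSpec` class below is a hypothetical container written for this example, not part of any library; the values are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderSpec:
    """Hypothetical holder for the table's values (not a library class)."""
    name: str
    hidden_size: int
    num_layers: int
    vocab_size: int
    max_positions: int

    def token_embedding_params(self):
        # One hidden_size-long vector per vocabulary entry.
        return self.vocab_size * self.hidden_size


spec = EncoderSpec(
    name="answerdotai/ModernBERT-base",
    hidden_size=768,
    num_layers=22,
    vocab_size=50_368,
    max_positions=8_192,
)

embedding_params = spec.token_embedding_params()
print(f"{embedding_params:,}")                  # 38,682,624
print(f"{embedding_params / 149_000_000:.0%}")  # share of ~149M parameters
print(f"{128 / spec.max_positions:.2%}")        # share of the context we use
```

Roughly a quarter of the parameters sit in the token embedding table, and the 128-token inputs use only a small slice of the available context length.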

Base model vs. classification model

Two ways to load the same backbone:

Base model (AutoModel)
→ Gives you raw embeddings (768-dim vectors for each token)
→ No prediction capability by itself

Classification model (AutoModelForSequenceClassification)
→ Same backbone + a classification head
→ Gives you logits (one score per class)

from transformers import AutoModel, AutoModelForSequenceClassification

# Base: 149M parameters (encoder only); returns one 768-dim vector per token
base = AutoModel.from_pretrained("answerdotai/ModernBERT-base")

# Classification: 149M + ~676K parameters (encoder + head); returns logits
classifier = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base", num_labels=111
)
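
Logits are raw scores, not probabilities. A minimal sketch of how they are usually read, using made-up numbers for a four-class case (the real classifier returns one logit per complaint category):

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0, 0.1]  # made-up scores for illustration
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs])
print(predicted_class)  # 0, the index of the largest logit
```

Softmax preserves the ordering of the scores, so the predicted class is the same whether you take the largest logit or the largest probability.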

Key takeaways

  1. Pretrained models learned language from billions of words — you get that knowledge for free
  2. Encoders turn token sequences into dense, context-aware vectors
  3. The [CLS] token becomes a summary of the entire input
  4. The classification head is a small network (~676K params) mapping [CLS] → class scores
  5. Transfer learning lets you build on existing work instead of starting from scratch
  6. ModernBERT-base (149M params) is our backbone for every experiment this semester

Next: try it yourself in the notebook!

Module 02: Pretrained Encoders for Classification

Welcome to Module 2. In Module 1, we talked about tokenization — turning text into numbers. Now we're going to talk about what happens next: what the model actually does with those numbers. Specifically, we're going to talk about pretrained encoder models, what they are, why we use them, and how we attach a classification head to make predictions. This is the foundation for everything we'll build this semester.

So what does "pretrained" actually mean? It means someone — a research lab, a company — took a model architecture, fed it billions of words of text, and trained it for days or weeks on very expensive hardware. During that training, the model learned an enormous amount about how language works. It learned grammar. It learned that "bank" means something different in "river bank" versus "bank account." It learned how negation works, how lists work, how formal language differs from informal language. All of that knowledge is baked into the model's parameters — its weights. When we say we're using a pretrained model, we mean we're downloading all of that learned knowledge and using it as our starting point. We are not starting from zero.

Let's talk about what the encoder actually does. You give it a sequence of token IDs — the output from Module 1 — and it gives you back a dense vector for every single token. In the case of ModernBERT-base, each of those vectors is 768 numbers long. These aren't random numbers. Each vector is a rich, learned representation of what that token means in the context of the entire sentence. The word "charge" gets a different vector in "they charged my card" versus "criminal charge." That's the whole point — context-dependent representations. The model processes all the tokens simultaneously through multiple transformer layers, and each layer refines the representations further. By the final layer, each vector encodes a sophisticated understanding of that token's role in the input.

Remember the CLS token from Module 1? I told you it was special, and now I can explain why. The CLS token is inserted at the beginning of every input. It doesn't correspond to any real word. During pretraining, the model learns to use this position as a summary of the entire sequence. By the time the input has passed through all the transformer layers, the CLS token's 768-dimensional vector has absorbed information from every other token in the sequence through the self-attention mechanism. It becomes a compressed representation of the whole input. And that's exactly what we need for classification — one vector that summarizes the entire complaint, which we can then feed into a classifier. This is the bridge between Module 1 and what we're doing now.

Now here's the beautiful part. To turn this encoder into a classifier, we add what's called a classification head. It has two parts. First, a small prediction head: a dense layer that maps 768 dimensions to 768 dimensions, followed by a GELU activation and layer normalization. This gives the model a chance to transform the CLS representation before making a decision. Then a final linear layer maps those 768 values down to 111, one score for each complaint category. The whole head is about 676 thousand parameters. That sounds like a lot until you compare it to the 149 million in the encoder: it's less than half a percent. The encoder is doing all the heavy lifting, learning rich representations of language. The head's job is comparatively simple: take that representation and map it to class scores.

You might wonder — why don't we just train our own model from scratch? Let me give you the honest engineering answer. Training a model like ModernBERT from scratch required billions of tokens of training data and hundreds of GPU-hours on expensive hardware. That cost thousands of dollars. We have about 220,000 labeled complaints and access to free Kaggle GPUs with a 30-hour weekly limit. We simply do not have the data, the compute, or the time to train a language model from scratch. And we don't need to. That's the whole point of transfer learning. Someone else already invested the resources to teach a model how language works. We just need to teach it our specific task.

Transfer learning is the idea that makes this entire course possible. Here's how it works. Step one: a well-funded research lab trains a model on billions of words of text. That's the expensive part, and it's already done. Step two: you download their pretrained model. That's free and takes about 30 seconds. Step three: you add a classification head — a small network, mostly one dense layer and a final classifier. Step four: you fine-tune the model on your specific dataset. For us, that means training on our labeled consumer complaints. The fine-tuning step is cheap because the model already understands language. You're not teaching it English from scratch. You're just teaching it to map complaints to categories. This is what makes modern NLP accessible to people who don't have Google's budget. And it's exactly what we'll do starting in Week 1 of the course.

The specific model we'll use all semester is called ModernBERT-base. It's made by Answer.AI and it has about 149 million parameters. That sounds like a lot, and it is — but it's a "base" sized model, which means it's large enough to be genuinely powerful but small enough that we can fine-tune it on free Kaggle GPUs in a reasonable amount of time. It has 22 transformer layers, a hidden size of 768 — which is why our CLS vector is 768 dimensions — and a vocabulary of about 50,000 tokens. Its maximum sequence length is actually 8,192 tokens, but we'll use 128 as we discussed in Module 1. You'll get to know this model very well. Every experiment you run this semester starts with ModernBERT-base as the backbone.

In the notebook, you'll see two different ways to load this model. The first is AutoModel, which gives you just the base encoder. You put tokens in, you get 768-dimensional vectors out. It can't make predictions by itself; it just produces representations. The second is AutoModelForSequenceClassification, which wraps the same encoder and adds a classification head on top. You put tokens in, you get 111 logits out, one score for each complaint category. The difference in parameter count is tiny: about 676 thousand extra parameters for the classification head, compared to the 149 million in the encoder. That's less than half a percent. That tells you everything about where the intelligence lives: it's in the encoder, not the head. The head is just a thin decision layer on top.

Let me wrap up the key points. Pretrained models give you language understanding for free — someone else paid for that training. Encoders turn token sequences into dense vectors that capture meaning in context. The CLS token's vector becomes a summary of the whole input. The classification head is a small network — less than half a percent of the model — that maps the CLS vector to class scores. Transfer learning is what makes all of this practical on a student budget. And ModernBERT-base with its 149 million parameters is the specific model you'll use all semester. Now go open the notebook and load the model yourself. You'll see the architecture, count the parameters, and watch it process a real complaint. See you in Module 3.