Pretrained Encoders for Classification

Property	Value
Model	`answerdotai/ModernBERT-base`
Parameters	~149 million
Hidden size	768
Layers	22 transformer layers
Vocabulary	50,368 tokens
Max sequence length	8,192 tokens (we use 128)

Welcome to Module 2. In Module 1, we talked about tokenization — turning text into numbers. Now we're going to talk about what happens next: what the model actually does with those numbers. Specifically, we're going to talk about pretrained encoder models, what they are, why we use them, and how we attach a classification head to make predictions. This is the foundation for everything we'll build this semester.

So what does "pretrained" actually mean? It means someone — a research lab, a company — took a model architecture, fed it billions of words of text, and trained it for days or weeks on very expensive hardware. During that training, the model learned an enormous amount about how language works. It learned grammar. It learned that "bank" means something different in "river bank" versus "bank account." It learned how negation works, how lists work, how formal language differs from informal language. All of that knowledge is baked into the model's parameters — its weights. When we say we're using a pretrained model, we mean we're downloading all of that learned knowledge and using it as our starting point. We are not starting from zero.

Let's talk about what the encoder actually does. You give it a sequence of token IDs — the output from Module 1 — and it gives you back a dense vector for every single token. In the case of ModernBERT-base, each of those vectors is 768 numbers long. These aren't random numbers. Each vector is a rich, learned representation of what that token means in the context of the entire sentence. The word "charge" gets a different vector in "they charged my card" versus "criminal charge." That's the whole point — context-dependent representations. The model processes all the tokens simultaneously through multiple transformer layers, and each layer refines the representations further. By the final layer, each vector encodes a sophisticated understanding of that token's role in the input.
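To make "context-dependent" concrete, here is a toy sketch in plain Python. It is not ModernBERT: the embeddings are random, the width is 8 instead of 768, and the attention step leaves out the learned projections, multiple heads, residual connections, and feed-forward blocks a real transformer layer has. The token IDs are made up for illustration. It only shows the two properties described above: one vector comes out per token, and the same token ID ends up with a different vector when its neighbours change.

```python
import math
import random

HIDDEN = 8  # toy width; ModernBERT-base uses 768


def make_embeddings(vocab_size, hidden, seed=0):
    """Random lookup table: one starting vector per token ID."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(vocab_size)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(vectors):
    """Simplified self-attention: each output is a weighted mix of all inputs."""
    scale = math.sqrt(len(vectors[0]))
    mixed = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        mixed.append([
            sum(w * vec[d] for w, vec in zip(weights, vectors))
            for d in range(len(query))
        ])
    return mixed


def encode(token_ids, embeddings, num_layers=2):
    """Look up each token's vector, then refine all of them layer by layer."""
    vectors = [embeddings[t] for t in token_ids]
    for _ in range(num_layers):
        vectors = attention_layer(vectors)
    return vectors


embeddings = make_embeddings(vocab_size=10, hidden=HIDDEN)

# Hypothetical token IDs: ID 3 sits at position 1 in both inputs,
# but its right-hand neighbour differs (5 versus 7).
out_a = encode([0, 3, 5], embeddings)
out_b = encode([0, 3, 7], embeddings)

# Largest coordinate-wise difference between the two vectors for token ID 3.
gap = max(abs(x - y) for x, y in zip(out_a[1], out_b[1]))
print(len(out_a), len(out_a[0]))  # one vector per token, HIDDEN numbers each
print(gap)
```

The starting embedding for token 3 is identical in both inputs. The difference in the output comes entirely from mixing in information from the other tokens.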

Remember the CLS token from Module 1? I told you it was special, and now I can explain why. The CLS token is inserted at the beginning of every input. It doesn't correspond to any real word. Because self-attention lets every position look at every other position, the model can use this slot as a summary of the entire sequence, and fine-tuning on a classification task trains it to do exactly that. By the time the input has passed through all the transformer layers, the CLS token's 768-dimensional vector has absorbed information from every other token in the sequence. It becomes a compressed representation of the whole input. And that's exactly what we need for classification: one vector that summarizes the entire complaint, which we can then feed into a classifier. This is the bridge between Module 1 and what we're doing now.

Now here's the beautiful part. To turn this encoder into a classifier, we add what's called a classification head. It has two parts. First, a small prediction head — a dense layer that maps 768 dimensions to 768 dimensions, followed by a GELU activation and layer normalization. This gives the model a chance to transform the CLS representation before making a decision. Then a final linear layer maps those 768 values down to 113 — one score for each complaint category. The whole head is about 676 thousand parameters. That sounds like a lot until you compare it to the 149 million in the encoder — it's less than half a percent. The encoder is doing all the heavy lifting, learning rich representations of language. The head's job is comparatively simple: take that representation and map it to class scores.
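Here is a minimal sketch of that head in plain Python, with toy sizes (16 and 5 instead of 768 and 113) so it runs instantly. The weights are random, so the prediction is meaningless; the point is the order of operations. Two things are assumptions on my part rather than facts from this module: the dense layer has no bias term, and the layer norm has a learnable scale but no bias. Under those assumptions the parameter count comes to roughly 677 thousand, in the same range as the figure quoted above. The exact number depends on which bias terms the implementation includes.

```python
import math
import random


def gelu(x):
    """Exact GELU, written with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(values, eps=1e-5):
    """Normalize to zero mean and unit variance.

    The learnable scale starts at 1, so it is left out of this sketch.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def linear(values, weight, bias=None):
    """Multiply by a weight matrix (one row per output), optionally add a bias."""
    out = [sum(w * v for w, v in zip(row, values)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


def classification_head(token_vectors, dense_w, out_w, out_b):
    cls_vector = token_vectors[0]  # position 0 holds the CLS token
    hidden = layer_norm([gelu(x) for x in linear(cls_vector, dense_w)])
    return linear(hidden, out_w, out_b)  # one logit per class


def head_param_count(hidden_size, num_labels, dense_bias=False, norm_bias=False):
    """Count head weights; the bias flags are assumptions, not confirmed settings."""
    dense = hidden_size * hidden_size + (hidden_size if dense_bias else 0)
    norm = hidden_size + (hidden_size if norm_bias else 0)
    classifier = hidden_size * num_labels + num_labels
    return dense + norm + classifier


rng = random.Random(0)
HIDDEN, NUM_LABELS, SEQ_LEN = 16, 5, 4  # toy sizes for illustration

token_vectors = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(SEQ_LEN)]
dense_w = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
out_w = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN)] for _ in range(NUM_LABELS)]
out_b = [0.0] * NUM_LABELS

logits = classification_head(token_vectors, dense_w, out_w, out_b)
predicted = max(range(NUM_LABELS), key=lambda i: logits[i])  # highest score wins

total = head_param_count(768, 113)
share = total / 149_000_000  # compared with the ~149 million encoder weights
print(len(logits), predicted)
print(total, share)
```

Notice that only `token_vectors[0]` reaches the head. The other token vectors influence the prediction only through what the encoder has already mixed into the CLS position.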

You might wonder: why don't we just train our own model from scratch? Let me give you the honest engineering answer. Training a model like ModernBERT from scratch takes an enormous text corpus (its authors report roughly 2 trillion training tokens) and at least hundreds of GPU-hours on expensive hardware. That costs thousands of dollars. We have about 220,000 labeled complaints and access to free Kaggle GPUs with a 30-hour weekly limit. We simply do not have the data, the compute, or the time to train a language model from scratch. And we don't need to. That's the whole point of transfer learning. Someone else already invested the resources to teach a model how language works. We just need to teach it our specific task.

Transfer learning is the idea that makes this entire course possible. Here's how it works. Step one: a well-funded research lab trains a model on billions of words of text. That's the expensive part, and it's already done. Step two: you download their pretrained model. That's free and takes about 30 seconds. Step three: you add a classification head — a small network, mostly one dense layer and a final classifier. Step four: you fine-tune the model on your specific dataset. For us, that means training on our labeled consumer complaints. The fine-tuning step is cheap because the model already understands language. You're not teaching it English from scratch. You're just teaching it to map complaints to categories. This is what makes modern NLP accessible to people who don't have Google's budget. And it's exactly what we'll do starting in Week 1 of the course.
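To give a feel for why step four is cheap, here is a toy sketch in plain Python. It is a simplification, not the course's training code: the four 2-dimensional "feature" vectors are invented stand-ins for encoder outputs, there are two classes instead of 113, and only a linear classifier is trained with plain gradient descent. In full fine-tuning the encoder weights are typically updated as well, using a framework such as PyTorch. The idea it illustrates is that when the representations are already good, a small amount of training is enough to map them to labels.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_probs(x, weights, biases):
    """Linear layer followed by softmax: one probability per class."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return softmax(logits)


def mean_loss(features, labels, weights, biases):
    """Average cross-entropy over the dataset."""
    losses = [-math.log(predict_probs(x, weights, biases)[y]) for x, y in zip(features, labels)]
    return sum(losses) / len(losses)


def train_step(features, labels, weights, biases, lr):
    """One full-batch gradient descent update, modifying weights and biases in place."""
    n = len(features)
    grad_w = [[0.0] * len(row) for row in weights]
    grad_b = [0.0] * len(biases)
    for x, y in zip(features, labels):
        probs = predict_probs(x, weights, biases)
        for c, p in enumerate(probs):
            err = p - (1.0 if c == y else 0.0)  # gradient of the loss w.r.t. logit c
            grad_b[c] += err / n
            for d, xd in enumerate(x):
                grad_w[c][d] += err * xd / n
    for c in range(len(weights)):
        biases[c] -= lr * grad_b[c]
        for d in range(len(weights[c])):
            weights[c][d] -= lr * grad_w[c][d]


# Invented stand-ins for encoder outputs: two examples per class.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]

weights = [[0.0, 0.0], [0.0, 0.0]]  # the untrained head starts at zero
biases = [0.0, 0.0]

loss_before = mean_loss(features, labels, weights, biases)
for _ in range(100):
    train_step(features, labels, weights, biases, lr=0.5)
loss_after = mean_loss(features, labels, weights, biases)

predictions = [
    max(range(len(biases)), key=lambda c: predict_probs(x, weights, biases)[c])
    for x in features
]
print(loss_before, loss_after)
print(predictions)
```

With all-zero weights the head assigns equal probability to both classes, so the starting loss is ln 2. A hundred small updates are enough for it to separate these examples.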

The specific model we'll use all semester is called ModernBERT-base. It's made by Answer.AI and it has about 149 million parameters. That sounds like a lot, and it is — but it's a "base" sized model, which means it's large enough to be genuinely powerful but small enough that we can fine-tune it on free Kaggle GPUs in a reasonable amount of time. It has 22 transformer layers, a hidden size of 768 — which is why our CLS vector is 768 dimensions — and a vocabulary of about 50,000 tokens. Its maximum sequence length is actually 8,192 tokens, but we'll use 128 as we discussed in Module 1. You'll get to know this model very well. Every experiment you run this semester starts with ModernBERT-base as the backbone.

In the notebook, you'll see two different ways to load this model. The first is AutoModel, which gives you just the base encoder. You put tokens in, you get 768-dimensional vectors out. It can't make predictions by itself — it just produces representations. The second is AutoModelForSequenceClassification, which wraps the same encoder and adds a classification head on top. You put tokens in, you get 113 logits out — one score for each complaint category. The difference in parameter count is tiny: about 676 thousand extra parameters for the classification head, compared to the 149 million in the encoder. That's less than half a percent. That tells you everything about where the intelligence lives — it's in the encoder, not the head. The head is just a thin decision layer on top.
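A minimal sketch of that comparison follows, assuming a `transformers` release recent enough to include ModernBERT, plus PyTorch. The helper names `count_parameters`, `head_share`, and `compare_models` are mine, not something from the notebook. The import sits inside `compare_models` so the two small helpers can be used on their own without downloading anything.

```python
MODEL_NAME = "answerdotai/ModernBERT-base"
NUM_LABELS = 113  # one logit per complaint category


def count_parameters(model):
    """Total number of scalar weights in a PyTorch-style model."""
    return sum(p.numel() for p in model.parameters())


def head_share(encoder_params, classifier_params):
    """Return the extra parameters added by the head and their share of the total."""
    extra = classifier_params - encoder_params
    return extra, extra / classifier_params


def compare_models(model_name=MODEL_NAME, num_labels=NUM_LABELS):
    """Load both variants (this downloads the weights) and compare their sizes."""
    from transformers import AutoModel, AutoModelForSequenceClassification

    # Base encoder only: token IDs in, one hidden vector per token out.
    encoder = AutoModel.from_pretrained(model_name)

    # Same encoder plus a freshly initialized classification head.
    classifier = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )

    return head_share(count_parameters(encoder), count_parameters(classifier))
```

Calling `compare_models()` in the notebook returns the number of extra parameters and the fraction of the full model they represent. Expect a warning that some weights are newly initialized: that is the untrained head, which is exactly the part fine-tuning has to learn from scratch.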

Let me wrap up the key points. Pretrained models give you language understanding for free — someone else paid for that training. Encoders turn token sequences into dense vectors that capture meaning in context. The CLS token's vector becomes a summary of the whole input. The classification head is a small network — less than half a percent of the model — that maps the CLS vector to class scores. Transfer learning is what makes all of this practical on a student budget. And ModernBERT-base with its 149 million parameters is the specific model you'll use all semester. Now go open the notebook and load the model yourself. You'll see the architecture, count the parameters, and watch it process a real complaint. See you in Module 3.

Pre-Work Module 02

ECBS5200 — Practical Deep Learning Engineering

What does "pretrained" mean?

The encoder architecture

The [CLS] token: your sentence summary

The classification head

Why not train from scratch?

Transfer learning: stand on the shoulders of giants

ModernBERT-base: our model for the semester

Base model vs. classification model

Key takeaways

Next: try it yourself in the notebook!