Tokenization & Truncation

max_length	Complaints fully covered	Training time per epoch
64	~50%	~18 min
128	~99%	~37 min
256	~99.9%	~74 min
512	~100%	~148 min

Hello and welcome to the first pre-work module for our Practical Deep Learning Engineering course. In this module, we're going to talk about tokenization and truncation — two things that happen before your model ever sees any data. Understanding these is important because every single experiment you run this semester starts here.

Here's the fundamental problem. Deep learning models — all of them — don't read text. They read numbers. Sequences of numbers. So before we can do anything useful with a piece of text, we need a systematic way to convert it into a sequence of numbers. That process is called tokenization. And the choices made in that process have real consequences for what the model can and can't do.

Here's what a tokenizer does at a high level. You give it a string of text, and it gives you back a sequence of integers. Each integer is called a token ID, and it's just an index into a vocabulary, a big lookup table that was built when the model was originally trained. That vocabulary is fixed. You don't get to add words to it or change it, because each token ID points to a row of weights the model learned during pretraining. Your tokenizer and your model must agree on the same vocabulary, or nothing works.
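To make the lookup-table idea tangible, here is a minimal sketch in plain Python. The two vocabularies and the `encode_words` helper are made up for illustration; they are not the course model's real vocabulary or API.

```python
# Toy vocabularies: hypothetical lookup tables, not the real model's vocabulary.
vocab = {"[UNK]": 0, "my": 1, "report": 2, "has": 3, "mistakes": 4}
other_vocab = {"[UNK]": 0, "mistakes": 1, "has": 2, "report": 3, "my": 4}


def encode_words(text, table):
    """Map each whitespace-separated word to its token ID, or to [UNK] if missing."""
    return [table.get(word, table["[UNK]"]) for word in text.lower().split()]


ids = encode_words("My report has mistakes", vocab)
print(ids)  # [1, 2, 3, 4]

# The same text under a different vocabulary produces different IDs,
# so a model trained with one table would misread IDs from the other.
mismatched_ids = encode_words("My report has mistakes", other_vocab)
print(mismatched_ids)  # [4, 3, 2, 1]
```

The second call is the point: same words, different numbers. That is what "tokenizer and model must agree" means in practice.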

Now, you might think the tokenizer just splits on spaces and looks up each word. That would be word-level tokenization, and almost nobody does that anymore. Instead, modern tokenizers use subword tokenization. They break words into smaller pieces. So a word like "overpayment" becomes "over" and "payment." A word like "refinancing" becomes three pieces: "ref," "in," and "ancing." Even common financial words like "mortgage" get split into "mort" and "gage." Why do this? Because a word-level vocabulary would need to include every possible word — millions of entries. Subword tokenization lets you cover virtually any word with a fixed vocabulary of about 50 thousand pieces. It handles misspellings, rare terms, and words the model has never seen before, all with the same vocabulary.
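If you want to see the mechanics, here is a rough sketch of subword splitting using greedy longest-match. This is a simplification, not the algorithm any particular tokenizer uses: real tokenizers learn their pieces from data and mark word boundaries in their own way. The `pieces` set is hypothetical and was picked so the splits match the examples above.

```python
# Hypothetical subword pieces, chosen only to reproduce the examples in this module.
pieces = {"over", "payment", "ref", "in", "ancing", "mort", "gage"}


def split_into_subwords(word, known_pieces):
    """Repeatedly take the longest known piece from the front of the word."""
    result = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is a known piece.
        while end > start and word[start:end] not in known_pieces:
            end -= 1
        if end == start:
            # No known piece starts here; this toy version gives up on the word.
            return ["[UNK]"]
        result.append(word[start:end])
        start = end
    return result


print(split_into_subwords("overpayment", pieces))  # ['over', 'payment']
print(split_into_subwords("refinancing", pieces))  # ['ref', 'in', 'ancing']
print(split_into_subwords("mortgage", pieces))     # ['mort', 'gage']
```

A real subword vocabulary also contains very small pieces, so it rarely has to fall back to an unknown token the way this toy version does.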

There's one more thing the tokenizer does automatically — it adds special tokens. At the beginning of every sequence, it inserts a CLS token, which stands for "classification." At the end, it adds a SEP token, for "separator." You don't add these yourself; the tokenizer handles it. The CLS token is particularly important for us because in classification tasks, the model's representation of that CLS token is what gets fed into the classification head. It becomes the summary of the entire input. We'll see this more concretely in Module 2.

Let's make this concrete with an example from the actual dataset we'll use all semester. This is a real consumer complaint: "There are many mistakes appear in my report without my understanding." When we tokenize this, we get 14 tokens including the CLS and SEP special tokens. Each word here happened to be a single token — that's because these are all common English words. But that's a short complaint. Some complaints in our dataset are hundreds of words long, which brings us to the next problem.
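You can check that count of 14 by hand with a small sketch. It assumes every word is one token and the final period is its own token, and it writes the special tokens as "[CLS]" and "[SEP]". Those spellings are an assumption; the tokenizer in the notebook may print them differently.

```python
import re


def toy_tokenize(text):
    """Split text into words and punctuation marks, then wrap it in special tokens."""
    content = re.findall(r"\w+|[^\w\s]", text)
    return ["[CLS]"] + content + ["[SEP]"]


complaint = "There are many mistakes appear in my report without my understanding."
tokens = toy_tokenize(complaint)

print(tokens[0], tokens[-1])  # [CLS] [SEP]
print(len(tokens))            # 14: 11 words + 1 period + 2 special tokens
```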

Here's the thing: models have a maximum sequence length. They physically cannot process more tokens than that limit. Within that limit, we also pick our own max length as a setting, and in this course we're going to use a max length of 128 tokens. We chose that number carefully: the median complaint in our dataset is about 64 tokens, and over 99 percent of complaints fit within 128 tokens without any truncation at all. But for the rare longer complaints, anything past token 128 just gets cut off. Silently. The model never sees it. This is called truncation, and it's important to understand because it means your model is literally making decisions based on incomplete information for those few inputs. It's not a bug. It's a trade-off between coverage and computational cost.
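Here is a small sketch of what truncation does to a list of token IDs. It assumes the two special tokens count toward the limit, so a max length of 128 leaves room for 126 content tokens. The `truncate_ids` helper and the ID values 101 and 102 are placeholders for illustration, not values taken from the course tokenizer.

```python
def truncate_ids(content_ids, max_length, cls_id=101, sep_id=102):
    """Keep only as many content tokens as fit once CLS and SEP are added."""
    room_for_content = max_length - 2  # assumption: special tokens count toward the limit
    kept = content_ids[:room_for_content]
    return [cls_id] + kept + [sep_id]


long_complaint = list(range(1000, 1300))   # 300 made-up content token IDs
short_complaint = list(range(2000, 2012))  # 12 made-up content token IDs

truncated = truncate_ids(long_complaint, max_length=128)
untouched = truncate_ids(short_complaint, max_length=128)

print(len(truncated))                        # 128
print(len(long_complaint) - 126)             # 174 content tokens silently dropped
print(len(untouched))                        # 14, nothing was cut
```

Notice that nothing in the output tells you 174 tokens went missing. That is what "silently" means here.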

Why 128 specifically? Here's the trade-off table. At 64 tokens, we'd only fully cover about half of complaints — the median is right around 64, so we'd be truncating roughly half the data. At 128, we cover over 99 percent of complaints — less than 1 percent get truncated at all. Going up to 256 would cover essentially everything, but it doubles the training time. Going to 512 covers every last complaint but takes nearly two and a half hours per epoch. When you're running on free GPUs with time limits, that matters a lot. 128 is the clear sweet spot — almost zero truncation, reasonable training time. This is the kind of engineering trade-off you'll make throughout this course: there's no theoretically "correct" answer, just a decision you make given your constraints and then defend with evidence.
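The coverage column in that table is just a fraction: how many complaints are no longer than the limit. Here is a minimal sketch of that calculation. The `lengths` list is invented for illustration and is far too small to reproduce the real percentages; the training times are copied from the table.

```python
def coverage(token_lengths, max_length):
    """Fraction of texts whose token count fits within max_length."""
    fits = sum(1 for n in token_lengths if n <= max_length)
    return fits / len(token_lengths)


# Hypothetical token counts for ten complaints (not real course data).
lengths = [14, 30, 45, 64, 64, 80, 95, 120, 127, 300]

# Approximate minutes per epoch, taken from the trade-off table.
minutes_per_epoch = {64: 18, 128: 37, 256: 74, 512: 148}

for max_length, minutes in minutes_per_epoch.items():
    share = coverage(lengths, max_length)
    print(f"max_length={max_length}: covers {share:.0%}, about {minutes} min per epoch")
```

In the notebook you would run the same kind of calculation on the token counts of the real complaints rather than on a made-up list.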

When you actually use a tokenizer in code, and you'll do this in the notebook, you call it with your text and a few parameters. You specify the max length, tell it to truncate if needed, and tell it to pad shorter sequences to the full length. It returns a dictionary with two things: input IDs, which are the actual token numbers, and an attention mask, which tells the model which positions are real tokens and which are just padding. The padding is there because every sequence in a batch has to be the same length. If your text is only 14 tokens but max length is 128, the remaining 114 positions get filled with a padding token and the attention mask tells the model to ignore them.
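Here is a plain-Python sketch of the dictionary that comes back. With the Hugging Face transformers library the call typically looks like `tokenizer(text, max_length=128, truncation=True, padding="max_length")`, though the notebook's exact call may differ. The `encode` helper below only mimics that behavior, and its ID values (101, 102, and 0 for padding) are placeholders.

```python
def encode(content_ids, max_length=128, cls_id=101, sep_id=102, pad_id=0):
    """Add special tokens, truncate, pad, and build the matching attention mask."""
    ids = [cls_id] + content_ids[: max_length - 2] + [sep_id]
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * n_real + [0] * n_pad,  # 1 = real token, 0 = padding
    }


# 12 made-up content tokens, so 14 real tokens once CLS and SEP are added.
encoded = encode(list(range(1000, 1012)))

print(len(encoded["input_ids"]))             # 128
print(sum(encoded["attention_mask"]))        # 14 real tokens
print(encoded["attention_mask"].count(0))    # 114 padding positions
```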

Here's what padding looks like in practice. Complaint A is only 14 real tokens, so the remaining slots up to 128 get filled with the padding token. The attention mask has ones for the 14 real tokens and zeros for the rest. Complaint B happens to use all 128 positions, so its mask is all ones. The model processes these as a batch — same length, same shape — but the attention mask ensures it only pays attention to the real content. This is purely mechanical, but it's important to understand because when you look at your data in the notebook, you'll see these padding tokens and masks everywhere.
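If you're curious how a mask makes the model ignore padding, here is one common approach, sketched in plain Python: set the attention scores at padded positions to negative infinity before the softmax, so they end up with zero weight. This is an assumption about the mechanism for illustration, not a description of the course model's internals. The scores are invented.

```python
import math


def masked_softmax(scores, attention_mask):
    """Softmax over scores, with masked-out positions forced to zero weight.

    Assumes at least one position has mask value 1 (the CLS token always does).
    """
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 3.0, 3.0]   # made-up attention scores
attention_mask = [1, 1, 1, 0, 0]     # last two positions are padding

weights = masked_softmax(scores, attention_mask)
print(weights[3], weights[4])  # 0.0 0.0
```

Even though the padded positions had the highest raw scores, they receive no weight at all. All of the attention goes to the three real tokens.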

Let me recap the key points. Tokenizers turn text into sequences of numbers — subword token IDs. The vocabulary is fixed from pretraining. Special tokens get added automatically. Truncation silently drops anything past your max length. Padding fills short sequences with a padding token. And the max length itself is an engineering decision you make based on your data and your constraints. We chose 128 because it covers over 99 percent of complaints — you'll verify that yourself in the notebook. Now, go open the notebook and try this yourself on real complaints from our dataset. You'll tokenize actual text, see what subword splitting looks like, and see exactly what truncation does to a long complaint. See you in Module 2.

Pre-Work Module 01

ECBS5200 — Practical Deep Learning Engineering

The problem

What a tokenizer does

Not word-level: subword tokenization

Special tokens

Real example: a consumer complaint

The truncation problem

Why 128?

What the tokenizer actually returns

Padding: making sequences the same length

Key takeaways

Next: try it yourself in the notebook!