Welcome to Module 07. Today we are going to talk about quantization — one of the most practically important techniques in modern deep learning engineering. The core idea is deceptively simple: what if we stored the numbers inside a model using fewer bits? That is it. That is the whole idea. But the implications are enormous, and understanding the trade-offs is something every working ML engineer needs to internalize. We will cover the intuition here; you will get your hands dirty with real quantization in Week 5.
Let us start with the basics. When you train a neural network, the output of that training process is a collection of numbers — millions of them. These are the weights and biases that define how the model transforms inputs into outputs. Every single thing the model learned is encoded in these numbers. ModernBERT-base, which we have been using in this course, has roughly 149 million parameters. That is 149 million individual floating-point numbers. And the key insight for today is that the format we use to store each of those numbers has a direct impact on how much memory the model consumes, how fast it runs, and how much it costs to serve.
Here are the three formats you need to know. FP32, or 32-bit floating point, is the default that most models are trained in. Each number gets 32 bits, or 4 bytes, of storage. That gives you excellent numerical precision. FP16 cuts that in half: 16 bits, 2 bytes per number. You keep fewer significant digits, roughly three decimal digits instead of seven, and the range of representable values is much narrower, but for the vast majority of deep learning workloads, the difference in model quality is negligible. Then there is INT8: 8-bit integers, 1 byte per number. Here you are going from floating-point numbers to integers, which means you need to scale and round your values. That is a more aggressive transformation, but it gives you a 4x reduction in size compared to FP32. The key point: each step down the precision ladder trades some numerical fidelity for concrete savings in memory and compute.
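To make "scale and round" concrete, here is a minimal sketch using only the Python standard library. The weight values are made up for illustration, and the single shared scale factor (largest absolute value mapped to 127) is just one simple scheme assumed here; it is not necessarily what the tools in Week 5 do.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_int8(values, scale):
    """Scale, round to the nearest integer, and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize_int8(q_values, scale):
    """Map the stored integers back to approximate float values."""
    return [q * scale for q in q_values]


# Hypothetical weights, chosen only for illustration.
weights = [0.0213, -0.4071, 0.1558, 0.9820, -0.0007]

# Assumption: one scale for the whole list, set by the largest magnitude.
scale = max(abs(w) for w in weights) / 127

q = quantize_int8(weights, scale)
restored = dequantize_int8(q, scale)
fp16 = [to_fp16(w) for w in weights]

for w, h, qi, r in zip(weights, fp16, q, restored):
    print(f"original={w:+.6f}  fp16={h:+.6f}  int8={qi:+4d}  restored={r:+.6f}")
```

Notice that the smallest weight rounds to the integer 0 and comes back as exactly zero. That is the kind of information loss the rest of this module is about.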
Let us do the arithmetic. 149 million parameters times 4 bytes each gives you roughly 596 megabytes in FP32. Move to FP16 and you cut that in half: about 298 megabytes. Go all the way to INT8 and you are down to approximately 149 megabytes. That is a 4x reduction from where you started. And notice: you have not changed the model architecture. You have not retrained anything. You have not removed any layers or attention heads. You have simply changed the numerical precision of the stored values. This is why quantization is so appealing from an engineering perspective: it is a relatively straightforward transformation that yields dramatic savings.
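If you want to check this arithmetic yourself, a short sketch like the following will do. It assumes a round 149 million parameters and counts only the raw weight storage, ignoring any extra metadata such as INT8 scale factors. The exact figure shifts a little with the precise parameter count and with whether you count decimal megabytes or binary mebibytes, so both are printed.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}


def weight_size_bytes(num_params: int, fmt: str) -> int:
    """Raw storage for the weights alone, in bytes."""
    return num_params * BYTES_PER_PARAM[fmt]


num_params = 149_000_000  # assumed round figure for ModernBERT-base

for fmt in BYTES_PER_PARAM:
    size = weight_size_bytes(num_params, fmt)
    print(f"{fmt}: {size / 1e6:.0f} MB ({size / 2**20:.0f} MiB)")

# FP32: 596 MB (568 MiB)
# FP16: 298 MB (284 MiB)
# INT8: 149 MB (142 MiB)
```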
The best mental model I can give you is JPEG compression. When you save a photograph as a JPEG, you are throwing away some information to make the file smaller. At high quality settings, you genuinely cannot tell the difference. Crank the compression up too high, and you start seeing those blocky artifacts — the algorithm threw away information that actually mattered. Quantization works the same way. Going from FP32 to FP16 is like saving a JPEG at quality 95 — almost nobody can tell the difference, and for virtually all practical purposes the model behaves identically. Going to INT8 is more like quality 75 or 80 — it is usually fine, but you need to actually check. You need to run your evaluation metrics and confirm that the quantized model still meets your performance requirements. The reconstruction error from rounding and scaling is small, but it is not zero, and for some sensitive tasks it can matter.
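Here is a rough way to see that reconstruction error for yourself, again with only the standard library. The weights are random numbers drawn from an assumed Gaussian, not a real model, and the INT8 scheme is the same simple single-scale approach assumed for illustration. Treat the output as a feel for the relative size of the two errors, not as a substitute for running your evaluation metrics.

```python
import random
import struct


def fp16_round_trip(x: float) -> float:
    """Store a value as IEEE 754 half precision and read it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def int8_round_trip(values):
    """Quantize to INT8 with one shared scale, then dequantize."""
    scale = max(abs(v) for v in values) / 127
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Assumption: synthetic weights, zero mean, standard deviation 0.05.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.05) for _ in range(10_000)]

fp16_error = mean_abs_error(weights, [fp16_round_trip(w) for w in weights])
int8_error = mean_abs_error(weights, int8_round_trip(weights))

print(f"mean abs error, FP16: {fp16_error:.2e}")
print(f"mean abs error, INT8: {int8_error:.2e}")
```

On weights like these, the INT8 error comes out noticeably larger than the FP16 error, which matches the quality 95 versus quality 75 picture. Whether that error matters for your task is something only your evaluation suite can tell you.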
So when does this actually matter? Three scenarios come up constantly in practice. First, serving costs. If your model is 4x smaller, its weights need 4x less GPU memory when you serve it. For larger models, that can be the difference between needing an 80-gigabyte A100 and getting by with a 16-gigabyte T4, which translates directly into your cloud bill. Second, edge and mobile deployment. If you want a model running on a phone or an embedded device, you simply do not have the luxury of FP32. INT8 or even INT4 is often a hard requirement. Third, scale. When you are serving a model to millions of users, every byte of memory and every millisecond of latency has a cost. Quantization is one of the highest-leverage optimizations you can make in a production ML system. This is engineering, not research: it is about making practical trade-offs that let you ship.
I want to be very clear about scope. Today is about building intuition. You should walk away from this module understanding what quantization is, why it matters, and what the basic trade-offs look like. You do not need to know how to implement it yet. In Week 5, we will get hands-on. You will take a model you have trained and quantize it using real tools. You will measure the impact on your F1 scores and accuracy. You will compare inference latency. And you will learn about more sophisticated approaches like dynamic quantization and quantization-aware training. But all of that will make much more sense if you come in with the conceptual foundation we are building right now.
Let me leave you with the key points. First, model weights are just numbers, and the format you store them in is a choice. Second, that choice has massive implications for model size — going from FP32 to INT8 gives you a 4x reduction. Third, this is not free — you are trading numerical precision for efficiency, just like JPEG compression trades image fidelity for file size. Fourth, and this is critical — you must measure the impact. Do not just quantize and ship. Run your evaluation suite and confirm that the quality degradation is acceptable for your use case. And finally, remember that the best model in the world is useless if you cannot deploy it within your resource constraints. Quantization is one of the key tools that bridges the gap between research and production. See you in the notebook.