Welcome back. Four weeks in, and you've now fine-tuned two architectures, compared them, recommended one for deployment, and spent last week diagnosing how both models fail. This week, we compress them. You take the encoder and the decoder you trained in Week Three, you quantize their weights to int8 and int4, and you measure what changed: accuracy, per-tier accuracy, latency, peak VRAM, calibration. Six configurations, five numbers each, one deployment decision at the end. By the time you leave today you will have opinions about quantization grounded in your own measurements, not in folklore. And I'll tell you in advance: at least one of the things you'll measure is going to flatly contradict a promise quantization is routinely sold on.
One week ago you sat in this room and learned a lesson that matters today. Your encoder is one hundred forty-nine million parameters, pretrained on roughly two trillion tokens. Your decoder is four hundred ninety-four million, pretrained on eighteen trillion. Folk wisdom: bigger plus more data should mean better calibrated. Not what you measured. Encoder ECE six point two percent. Decoder ECE ten percent — substantially worse. Larger models have more expressive power to concentrate probability mass and tend to drift toward overconfidence after fine-tuning. The lesson: never predict calibration from size. You measure it. Today the same lesson applies again. We will predict a whole lot of things about quantization, and then we will measure them.
Here is the one sentence I want you to leave with today. Quantization is a toolbox, not a technique. The wording is very specific. "Technique" would imply one thing called quantization with one set of trade-offs, and you apply it or you don't. That's how it's often taught. It's wrong. Quantization in twenty twenty-six is a whole family of tools — bitsandbytes, AWQ, GPTQ, FP8, SmoothQuant, GGUF — each targeting a specific constraint, each best on some hardware and worst on others. The claims you read in blog posts about what quantization buys you are routinely true on one setup and false on another. So the professional skill is not "do I quantize." It's "which tool matches this constraint, on this hardware, for this model size." Only way to know is to measure. We'll come back to that sentence at the end of class.
Here's the rhythm of the day. Lecture first — we'll build vocabulary, set up the framing, look at what the production stack actually uses in twenty twenty-six, and walk through the measurements you'll take. Then the lab, eighty minutes, where you load both your Week Three models, quantize each one to int8 and to int4, and measure six configurations end to end. F1, per-tier accuracy deltas, inference latency, peak GPU memory, calibration. After class, the homework extends it with a split-val calibration experiment, named-class trajectories across precisions, a constraint-envelope check, and a five-section memo. That memo is where the week's intellectual work really lives. We'll walk through the rubric at the very end of lecture so you know exactly what you're writing toward when you sit down with it later.
Quick orientation on the proper nouns. bitsandbytes powered your QLoRA training in Week Three and powers the quantized loading path in today's lab. AWQ, activation-aware weight quantization, is the twenty twenty-six production default for int4 inference. GPTQ is AWQ's precursor, now moving from frontier to baseline. FP8 is the eight-bit floating point format Nvidia added for Hopper and newer: H100, H200, Blackwell. SmoothQuant is a weight-and-activation quantization method used on Turing-class hardware, which is what your T4s are, where FP8 isn't available. GGUF is the format llama dot cpp uses for CPU and edge inference. vLLM, TensorRT-LLM, and SGLang are inference servers, the software that ships these kernels to production. Every one of these names will connect to a specific use case by the time we finish Section two.
Section one. We will absolutely get to measurement, but not yet. First we have to look at the underlying operation — what actually happens numerically when you quantize a weight — and we'll look at the specific algorithm we're going to run in today's lab, which is the LLM.int8 path implemented by bitsandbytes. Once those mechanics are clear in your head, the measurement results we look at later become interpretable.
Simplest possible question. A weight is a number. You can store it in thirty-two bits, sixteen, eight, four, or fewer. Fewer bits, less precision, less memory. Your Week Three training used fp16 — half-precision floating point, sixteen bits per weight. Today's lab takes your fp16 checkpoint and maps each weight into eight-bit integers, then into four-bit integers. Int8 represents a weight as one of two hundred fifty-six discrete values. Int4 as one of sixteen. The mapping has to preserve what the model learned, which requires calibration: for each weight tensor, compute a scale factor mapping the fp16 range onto the integer range, store the scale factor separately for the inverse at compute time. Quantization is that mapping plus the inverse, every forward pass. The interesting part is what it costs you on a real model.
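To make that mapping concrete, here is a minimal sketch of symmetric absmax quantization in plain Python. Treat it as an illustration under stated assumptions, not as the bitsandbytes implementation: the function names are made up for this example, there is a single scale factor for the whole tensor, and the symmetric range gives up one of the available levels so that zero maps exactly to zero.

```python
def quantize_absmax(weights, bits):
    """Map floats to signed integers using one scale factor for the tensor."""
    qmax = 2 ** (bits - 1) - 1               # 127 for int8, 7 for int4
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer level and clamp to the symmetric range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """The inverse mapping applied at compute time."""
    return [qi * scale for qi in q]


# Hypothetical weights, chosen only to illustrate the round trip.
weights = [0.02, -0.31, 0.07, 0.54, -0.11, 0.003]
worst_error = {}
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    worst_error[bits] = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"int{bits}: levels={q} scale={scale:.5f} worst error={worst_error[bits]:.5f}")
```

The round-trip error is bounded by half a step, and for the same tensor a step at int4 is about eighteen times wider than a step at int8 (127 divided by 7). That is the precision you give up for the memory.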
The specific algorithm we run today is LLM.int8, Dettmers and colleagues, NeurIPS twenty twenty-two. The insight is clever. A small fraction of feature dimensions in a large transformer's hidden states carry outlier magnitudes that break straightforward quantization: you can't fit them into int8 alongside everything else without losing signal that's load-bearing for accuracy. So LLM.int8 splits each matrix multiplication into two paths. The bulk path quantizes the ordinary dimensions to int8, multiplies, reconstructs. The outlier path keeps a small handful of dimensions, together with the weights they multiply, in fp16, and multiplies in fp16. The two outputs sum at the end. Nearly-int8 memory savings on the weights while preserving accuracy on the outlier-sensitive dimensions. bitsandbytes implements this, and in your lab today you invoke it through exactly one flag in the model loader.
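The two-path split is easier to see on a toy example. The sketch below is plain Python and makes several assumptions: the function names are hypothetical, the input is a single hidden-state row, and the outlier threshold of 6.0 is the default reported in the LLM.int8 paper. It shows the idea of the decomposition, not the fused GPU kernels bitsandbytes actually runs.

```python
def absmax_quantize(values, qmax=127):
    """Quantize a vector to int8 levels with a single absmax scale."""
    absmax = max((abs(v) for v in values), default=0.0)
    scale = absmax / qmax if absmax > 0 else 1.0
    return [round(v / scale) for v in values], scale


def mixed_precision_matvec(x, W, threshold=6.0):
    """Compute y = x @ W, keeping outlier feature dimensions in floating point.

    x is one hidden-state row of length d; W is a d x n matrix (list of rows).
    """
    d, n = len(W), len(W[0])
    outlier = [i for i in range(d) if abs(x[i]) >= threshold]
    regular = [i for i in range(d) if abs(x[i]) < threshold]

    # Outlier path: ordinary floating-point multiply on the few large dimensions.
    y_fp = [sum(x[i] * W[i][j] for i in outlier) for j in range(n)]

    # Bulk path: quantize x once and each column of W separately,
    # accumulate in integers, then rescale back to floating point.
    xq, sx = absmax_quantize([x[i] for i in regular])
    y_int8 = []
    for j in range(n):
        wq, sw = absmax_quantize([W[i][j] for i in regular])
        acc = sum(a * b for a, b in zip(xq, wq))
        y_int8.append(acc * sx * sw)

    return [a + b for a, b in zip(y_fp, y_int8)]


# Hypothetical data: dimension 2 of x is an outlier.
x = [0.5, -1.2, 40.0, 0.8]
W = [[0.10, -0.20], [0.30, 0.05], [-0.02, 0.01], [0.25, -0.15]]
exact = [sum(x[i] * W[i][j] for i in range(4)) for j in range(2)]
mixed = mixed_precision_matvec(x, W)
naive = mixed_precision_matvec(x, W, threshold=float("inf"))  # everything in int8
mixed_err = max(abs(a - b) for a, b in zip(exact, mixed))
naive_err = max(abs(a - b) for a, b in zip(exact, naive))
print(f"mixed-precision error {mixed_err:.4f}, all-int8 error {naive_err:.4f}")
```

With everything forced through int8, the single large value sets the scale and the small values lose most of their resolution. Pulling the outlier out restores it, at the price of running two multiplications and a sum instead of one.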
Read the quote on the slide carefully. It's from the LLM.int8 paper's own runtime discussion — not from a critic of the paper, from Dettmers and his coauthors. The algorithm was designed for very large models. For models in the hundreds of billions of parameters, the LLM.int8 decomposition is genuinely a win — you save a huge amount of memory and the inference is competitive. But when you run that same decomposition on a small model, the overhead of splitting the matrix into the int8 fast path and the fp16 outlier path can actually exceed the savings of doing most of the work in int8. And the paper says this explicitly. For any model below six point seven billion parameters, the authors warn you that the int8 procedure may cost more than it saves. Now look at your two models. Your encoder is one hundred forty-nine million parameters. Your decoder is four hundred ninety-four million. Both of them are firmly, unambiguously below the six point seven billion threshold the paper flagged. So when you run the int8 lab cell today, you are running the algorithm in exactly the regime its own authors told you to be careful about. The numbers are about to show you what that means in concrete terms.
Int4 is four bits per weight, sixteen representable levels. Uniform quantization, evenly spaced from minus eight to plus seven, wastes precision, because trained weights aren't uniformly distributed. They cluster near zero with a long tail. NF4, normal float four, from the QLoRA paper by Dettmers and colleagues in twenty twenty-three, places its sixteen levels non-uniformly to match where the weights actually live: dense near zero, sparse in the tails. The same sixteen levels lose less information, because more of them sit where most of the weights are. Double quantization is a separate trick that compresses the scale factors themselves. bitsandbytes offers both: NF4 as the four-bit quantization type, and double quantization as a switch alongside it. And to connect the dots: this is the same recipe QLoRA used to fine-tune a sixty-five billion parameter model on a single forty-eight gigabyte GPU. Same machinery you ran in Week Three.
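Here is a small sketch of why non-uniform levels help. It builds sixteen levels from evenly spaced quantiles of a standard normal and compares their spacing with a uniform grid. This is in the spirit of NF4 but it is not the NF4 codebook: the real one is asymmetric and includes an exact zero, and the function names here are made up for the illustration.

```python
from statistics import NormalDist


def quantile_levels(n_levels=16):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    nd = NormalDist()
    # Use bin midpoints so the probabilities stay away from 0 and 1.
    probs = [(i + 0.5) / n_levels for i in range(n_levels)]
    raw = [nd.inv_cdf(p) for p in probs]
    top = max(abs(r) for r in raw)
    return [r / top for r in raw]


def uniform_levels(n_levels=16):
    """Evenly spaced levels on [-1, 1]."""
    return [-1 + 2 * i / (n_levels - 1) for i in range(n_levels)]


def nearest(value, levels):
    """Quantize one normalized weight to its closest level."""
    return min(levels, key=lambda lv: abs(lv - value))


levels = quantile_levels()
gaps = [b - a for a, b in zip(levels, levels[1:])]
uniform_gap = 2 / 15
print(f"gap at the centre: {min(gaps):.3f}")
print(f"gap at the edges:  {max(gaps):.3f}")
print(f"uniform gap:       {uniform_gap:.3f}")
print("0.05 snaps to", round(nearest(0.05, levels), 3), "vs", round(nearest(0.05, uniform_levels()), 3))
```

The quantile-based levels are finer than the uniform grid near zero and coarser in the tails, which is the trade you want when most weights sit near zero.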
Important framing. bitsandbytes was designed for training. Primary purpose: let a small team fine-tune a model they couldn't otherwise afford by holding it at int4 during backprop instead of fp16 or bf16. Your Week Three decoder was loaded through exactly this machinery. QLoRA on a single T4 exists because bitsandbytes exists. But it was never engineered as the inference-time path you ship to production. For inference at scale the production stack uses different tools — we'll meet them shortly. Today we're deliberately using bitsandbytes at inference time. Not because that's what you'd do at work. We're doing it so you can see, with your own measurements, what happens when you use the training tool in the deployment regime. That gap between design and usage explains almost every surprising number you'll see today.
Here is the three-promise pitch you will hear about quantization from blog posts, vendor marketing, and tutorials. Promise one: accuracy is preserved. The quantized model gives answers that are just as good, at most slightly less confident. Promise two: memory drops. The weights are half or a quarter as big, so peak memory on the GPU drops correspondingly. Promise three: latency drops. There's less data to move into the matrix-multiplication units, so inference is faster. All three of those sound reasonable, and you've probably read all three in some form. The first one is usually true for models in the regime we care about. The other two are where it gets interesting. Your lab today is going to check all three promises, on your specific hardware, with your specific tool, on your specific models. Some of them are going to hold. Some of them are going to reverse direction. Pay attention to which.
Before we move on, do the predict-then-observe exercise. For each of the three promises, write down what you think will happen on your T4 when you load the encoder or the decoder you trained, and compare fp16 to int8. Will macro F1 stay the same, drop, or improve? Will peak VRAM drop, stay the same, or go up? Will latency drop, stay the same, or go up? Not what the blog posts predict — what you predict, with whatever priors you've got. The answers you write right now will be checked against your own measurements in the lab in about ninety minutes. The rubric for your memo explicitly rewards engaging with your own numbers and engaging with how those numbers compare to your priors. Skipping this prediction step makes that engagement shallow. Take twenty seconds. Write the three predictions.
Section two. So at this point you know what bitsandbytes is, you know what LLM.int8 is, you know what NF4 is. If the question is "what tool would I reach for at work, when I need quantized inference in production," the answer is honestly none of those. The production inference tool in twenty twenty-six is something different, and this section is about naming it. These are names you'll see on job postings, in Slack channels, and in architecture diagrams at any company running large-model inference today.
This is honestly the single most important slide for your professional trajectory in applied ML. The bar chart on the right shows inference throughput on Qwen two point five thirty-two B, running through vLLM, measured in tokens per second. Four numbers to remember. AWQ with the Marlin kernel: seven hundred forty-one tokens per second. GPTQ with Marlin: seven hundred twelve. Unquantized fp16: four hundred sixty-one. bitsandbytes: one hundred sixty-eight. Read that gap one more time. AWQ is four point four times faster than bitsandbytes on the exact same hardware, on the exact same model. That gap is the difference between what you've learned to use today, in your lab, and what the production stack actually runs at companies that ship LLM inference at scale. One important caveat: these ratios are at thirty-two billion parameter scale on an H one hundred. They will not transfer one-to-one to your one hundred forty-nine million encoder or four hundred ninety-four million decoder. AWQ pays kernel overhead too, and that overhead takes a proportionally larger share of the runtime as the model gets smaller. The qualitative direction is right, the production stack is faster than bitsandbytes, but if you write in your memo "AWQ would be four point four times faster on my one hundred forty-nine million encoder," you're factually wrong. The lesson is direction, not magnitude. These numbers should inform how you write memo section four, where you're explicitly asked to acknowledge that your T4 bitsandbytes measurements understate what a real production stack would achieve. We're using the training tool. The production tool is over there.
AWQ stands for activation-aware weight quantization. It was published by Lin and colleagues at MIT and collaborators, and it won best paper at MLSys twenty twenty-four. The insight is deceptively simple, but it took the field a while to land on it. If you just quantize every weight aggressively to four bits, accuracy drops noticeably. If you quantize most weights aggressively, but you identify the roughly one percent of weights that really matter — and here's the key — ranked by the magnitude of the activations they multiply, not by their own weight magnitude — and you protect those salient weights from aggressive quantization, your accuracy holds up. The practical impact of this paper has been enormous. Major model families now ship pre-quantized AWQ checkpoints on HuggingFace. Every serious inference server — vLLM, TensorRT-LLM, SGLang — has optimized AWQ kernels. If you go into a job in applied ML and someone says "quantize this model for deployment," AWQ is what they probably mean. This is the paper you're most likely to see cited in an architecture review at work.
GPTQ came before AWQ. Frantar and colleagues from IST Austria and ETH Zurich, ICLR twenty twenty-three. One-shot post-training quantization with second-order Hessian-based error correction. Mathematically beautiful, broadly deployed, and for a year and a half it was what you reached for. In twenty twenty-six, it's being displaced. The AutoGPTQ wrapper most downstream libraries used was archived in April twenty twenty-five. vLLM's RFC thirty-nine five eighty-three, from early twenty twenty-six, proposes deprecating GPTQ and bitsandbytes from vLLM entirely on grounds of low usage versus maintenance burden. You'll still encounter GPTQ in older checkpoints and in the literature — you should be able to read the paper — but you wouldn't choose it for a new project today.
FP8 is the eight-bit floating-point format Nvidia added for Hopper-class hardware — that's H100 and newer. Kurtic and colleagues at Neural Magic published an empirical evaluation in twenty twenty-four that characterized FP8's behavior across the Llama three point one family. The headline finding is striking: FP8 in the eight-bit-weight eight-bit-activation regime is effectively lossless — accuracy stays within fractions of a percent of fp16 across every model size they tested. The catch, and it's a real catch for us, is hardware. FP8 compute paths require modern FP8-capable hardware — Hopper-class (H100, H200), Blackwell, and Ada-class GPUs (L4, RTX 40-series) all have it. Your T4s are Turing-class, compute capability seven point five, which doesn't support efficient FP8 compute at all. The operational fact for this course is simple: there is no FP8 path on T4. So when you write your memo and someone asks "why don't we just use FP8," the honest answer on your hardware is "we can't." If we had H100s in this room, it would be one of the first configurations we'd measure. We don't.
SmoothQuant is the fourth tool you need to know. Xiao Guangxuan and colleagues at MIT published this at ICML twenty twenty-three. The difference from AWQ and from bitsandbytes NF4 is important and worth pausing on. Those two are weight-only quantization: they shrink the stored weights, but activations stay in higher precision. SmoothQuant quantizes both weights AND activations to eight bits. That's the W8A8 regime, eight-bit weights, eight-bit activations. The technical insight is that activations have certain channels with very large outliers that make direct quantization hard. So SmoothQuant applies a mathematically equivalent offline rescaling that migrates the difficulty from activations onto the weights, and then both can quantize cleanly. On modern hardware like H100, FP8 has largely displaced SmoothQuant: same regime, fewer steps. But on older hardware like your T4s, where FP8 isn't available at all, SmoothQuant remains the mainstream W8A8 path in TensorRT-LLM and in ONNX Runtime.
One more tool, then we wrap the toolbox. GGUF is the format used by llama dot cpp, the C-plus-plus inference library originally written to run large language models on consumer CPUs. Not strictly a quantization algorithm — a container format wrapping schemes like Q4_K_M and Q5_K_S, each with their own trade-offs, all optimized for CPU. Production use case is simple: anywhere you don't have a GPU. Laptops, phones, embedded devices, Raspberry Pis. "Run this on-device without a server" — GGUF is the tool. Slower than GPU inference by a wide margin, but where no GPU exists, the trade is worth taking. Not on today's menu — all our hardware has a GPU. Know the name and what it points at.
Slide-summary of the toolbox. Left column is the constraint you face. Middle column is the tool. Right column is where it lives in the ecosystem. If your constraint is training memory — you want to fine-tune a model that barely fits in memory — you reach for bitsandbytes NF4 via PEFT. That is the QLoRA recipe you ran in Week Three. If your constraint is production int4 inference at scale, you reach for AWQ through vLLM or TensorRT-LLM or SGLang. If you're on H100 or Blackwell hardware and you want lossless compression, FP8 through TransformerEngine. GPTQ for older checkpoints you encounter. SmoothQuant when you need W8A8 and you don't have FP8. GGUF when you don't have a GPU at all. Each constraint has its tool. Today's lab uses the very first row — bitsandbytes — for both int8 and int4 inference. At work, over your career, you would touch most of the other rows.
Section three. You've now got the vocabulary, and you've got the tool landscape. Before we look at any numbers from the lab, we have to set the measurement frame — what exactly we're measuring, why these five things and not others, and how to read each of them. Three short slides on framing, then we go to the numbers.
Six configurations is what you're going to produce in the lab. Two models from Week Three — the one hundred forty-nine million parameter encoder and the four hundred ninety-four million parameter decoder. Three precisions each — fp16 as your baseline, int8 via LLM.int8, int4 via NF4. For every one of those six configurations, you take five measurements. Macro F1 tells you aggregate quality. Peak VRAM tells you how much GPU memory you actually paid. Latency tells you how fast the model served at batch size thirty-two — median of five timed batches after three warmup batches, so the number is stable. Per-tier delta accuracy tells you WHERE the quality landed — head, mid, or tail of the class distribution. ECE tells you whether the model's confidence distribution still matches its empirical accuracy. Six times five is thirty numbers. They collapse cleanly into a summary table and two plots.
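The latency protocol, three warmup batches and then the median of five timed ones, can be sketched in a few lines. This is a minimal illustration with a stand-in workload, not the lab's timing cell. The function name and the `run_batch` callable are assumptions. On a GPU, `run_batch` would also need to block until the device has finished (for example by calling `torch.cuda.synchronize()` at the end), otherwise the clock stops before the work does.

```python
import statistics
import time


def median_latency_ms(run_batch, batch_size=32, warmup=3, timed=5):
    """Median per-example latency in milliseconds over `timed` batches."""
    for _ in range(warmup):
        run_batch()                      # discarded: caches, lazy init, autotuning
    per_example = []
    for _ in range(timed):
        start = time.perf_counter()
        run_batch()
        elapsed = time.perf_counter() - start
        per_example.append(1000.0 * elapsed / batch_size)
    return statistics.median(per_example)


# Stand-in workload so the sketch runs anywhere; replace with a real forward pass.
calls = {"n": 0}


def fake_batch():
    calls["n"] += 1
    sum(i * i for i in range(10_000))


latency = median_latency_ms(fake_batch)
print(f"{latency:.4f} ms per example over {calls['n']} batches (3 warmup + 5 timed)")
```

The median is used rather than the mean so that one slow batch on a shared instance doesn't move the reported number.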
The Pareto frame is how you think about a multi-axis trade-off — worth investing thirty seconds on, because we come back to it. Six points, two axes. For each pair, ask: is one strictly better than the other on both axes? If yes, the worse one is Pareto-dominated — you'd never ship it, another option beats it on everything you care about. If no, both are on the Pareto frontier, and the right choice depends on which axis matters most for your deployment. Deployment decisions live on the frontier by definition. This is honestly the single most common conceptual frame in applied ML trade-off analysis. You'll see it for model size versus quality, accuracy versus latency, memory versus throughput. Same frame, different axes, every time.
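The dominance check itself is short enough to write out. The sketch below uses the usual textbook definition (no worse on every axis, strictly better on at least one), which is slightly more permissive than "strictly better on both." The encoder points are my reading of the rounded numbers quoted in this lecture, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if point a beats point b.

    Points are (latency_ms, macro_f1): lower latency is better, higher F1 is better.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_frontier(points):
    """Keep every point that no other point dominates."""
    return {
        name: p
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }


# Encoder configurations as (latency in ms per example, macro F1), rounded.
encoder = {"fp16": (2.3, 0.210), "int8": (5.8, 0.207), "int4": (3.2, 0.212)}
frontier = pareto_frontier(encoder)
print(sorted(frontier))
```

On these rounded values int8 is dominated by fp16, while fp16 and int4 both stay on the frontier, because int4's F1 is nominally higher. Whether that F1 difference is more than noise is a separate question from the dominance arithmetic.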
The per-tier frame. Your aggregate macro F1 averages over all one hundred thirteen classes — wildly unequal classes. The top class has thirteen thousand training examples; the bottom few have under ten. Averaging a quality metric over such heterogeneous classes hides where the quality lives. So we bucket. Head is the top twenty classes by training frequency. Mid is the next forty. Tail is the bottom fifty-three. For each tier we compute the change in accuracy when we compress from fp16 to int8 or int4. The result tells you whether quantization damage is distributed evenly across tiers or concentrates somewhere. This is the core measurement of memo section two, and the rubric grades it heavily.
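As a sketch of the bucketing, here is one way to assign tiers by training frequency and compute the accuracy delta per tier. The function names are assumptions, the toy data is invented, and the code assumes every validation class also appears in the training labels. The lab's own implementation may differ in details such as tie-breaking.

```python
from collections import Counter


def assign_tiers(train_labels, head=20, mid=40):
    """Rank classes by training frequency: top `head`, next `mid`, rest are tail."""
    ranked = [cls for cls, _ in Counter(train_labels).most_common()]
    tiers = {}
    for rank, cls in enumerate(ranked):
        if rank < head:
            tiers[cls] = "head"
        elif rank < head + mid:
            tiers[cls] = "mid"
        else:
            tiers[cls] = "tail"
    return tiers


def tier_accuracy(labels, preds, tiers):
    """Accuracy computed separately within each tier."""
    hits, totals = Counter(), Counter()
    for y, p in zip(labels, preds):
        totals[tiers[y]] += 1
        hits[tiers[y]] += int(y == p)
    return {t: hits[t] / totals[t] for t in totals}


def tier_deltas(labels, preds_fp16, preds_quant, tiers):
    """Quantized accuracy minus fp16 accuracy, per tier."""
    base = tier_accuracy(labels, preds_fp16, tiers)
    quant = tier_accuracy(labels, preds_quant, tiers)
    return {t: quant[t] - base[t] for t in base}


# Toy example with three classes, one per tier (head=1, mid=1).
train = ["a"] * 5 + ["b"] * 3 + ["c"]
tiers = assign_tiers(train, head=1, mid=1)
val = ["a", "a", "b", "b", "c", "c"]
fp16_preds = ["a", "a", "b", "a", "c", "a"]
int4_preds = ["a", "b", "b", "a", "c", "c"]
deltas = tier_deltas(val, fp16_preds, int4_preds, tiers)
print(deltas)
```

In the toy data the overall accuracy is identical before and after (four of six correct), while head drops and tail rises. That is the kind of movement an aggregate metric hides.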
Recall the Week Four lesson. A model can be correct fifty-seven percent of the time, and at the same time it can systematically claim to be correct ninety percent of the time. That gap between the model's claimed confidence and its empirical accuracy is the expected calibration error, ECE. For deployments that gate on model confidence — for example, route high-confidence predictions to the model, low-confidence to a human reviewer — ECE matters as much as accuracy does. Temperature scaling is a beautifully simple, one-parameter post-hoc fix. You divide all the model's logits by a single scalar T, you fit that T to minimize negative log-likelihood on a calibration set, and the resulting re-softmaxed confidences usually track accuracy much better than they did before. Quantization can shift a model's confidence distribution. So your lab measures ECE before AND after temperature scaling at each precision, on both models. That's the two-by-three ECE table that goes into memo section three.
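For reference, here is a compact sketch of both pieces: ECE with equal-width confidence bins, and temperature scaling fitted by minimizing negative log-likelihood. Several things here are assumptions rather than the lab's settings: the fifteen bins, the simple grid search in place of a gradient-based optimizer, the function names, and the toy logits.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ece(probs, labels, n_bins=15):
    """Expected calibration error: weighted gap between confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, p.index(conf) == y))
    n = len(labels)
    total = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            total += len(b) / n * abs(acc - avg_conf)
    return total


def nll(logits, labels, T):
    """Mean negative log-likelihood at temperature T."""
    return -sum(math.log(softmax(z, T)[y]) for z, y in zip(logits, labels)) / len(labels)


def fit_temperature(logits, labels):
    """Pick the T on a coarse grid (0.05 to 10.0) that minimizes NLL."""
    grid = [0.05 * k for k in range(1, 201)]
    return min(grid, key=lambda T: nll(logits, labels, T))


# Toy overconfident model: claims about 98% confidence, is right 70% of the time.
logits = [[4.0, 0.0]] * 10
labels = [0] * 7 + [1] * 3
T = fit_temperature(logits, labels)
ece_before = ece([softmax(z) for z in logits], labels)
ece_after = ece([softmax(z, T) for z in logits], labels)
print(f"T={T:.2f}  ECE before={ece_before:.3f}  after={ece_after:.3f}")
```

Dividing by T does not change which class is predicted, so accuracy is untouched. Only the confidence moves, which is why the fix is cheap and why it has to be re-checked after quantization changes the logits.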
The final framing slide. Real deployment decisions are not "which configuration has the highest F1." They are constraint satisfaction problems. Given a hardware budget, a throughput requirement, a quality floor, and a calibration ceiling — which configurations meet ALL of the constraints simultaneously? The ones that do are your viable set. Within that viable set, you pick based on which you prefer, or which gives you the most headroom for growth, or which is cheapest to operate at scale. Today's hypothetical constraint envelope is on the slide. Single T4 GPU. A hundred requests per second sustained, which translates to roughly ten milliseconds per example at batch thirty-two. Macro F1 at least zero point two zero. Post-scaling ECE at most zero point zero five. Your homework checks each of the six configurations against each of the four constraints. Memo section four is where you make the actual deployment call and defend it.
Section four. Now we look at numbers. Everything on the next several slides is what the instructor verification produced on Kaggle T4. These are the exact numbers you're going to see when you run the lab in about ninety minutes. Your bootstrap seed is fixed, so the tier delta numbers will match exactly. The latency numbers will move by five to fifteen percent run-to-run due to T4 shared-instance noise. That's expected and normal: you won't reproduce them to the millisecond, but the directions and ratios will hold.
This is the central table of the week. Six rows, one per configuration, five columns of measurement. You will see this table on your screen when you run the lab. Let me read across two of the rows so you have an anchor. Encoder fp16: macro F1 of zero point two one zero, two point three one milliseconds per example, zero point four one gigabytes of VRAM, post-scaling ECE at four point zero percent. Decoder int8, three rows down: macro F1 of zero point two four zero one, twelve point six five milliseconds per example, one point eight seven gigabytes of VRAM, post-scaling ECE at seven point zero percent. Now take a few seconds and scan across all six rows. Notice anything yet? Several things in this table are immediately surprising once you start looking carefully, and we are about to spend the next five slides unpacking them one at a time.
First Pareto plot. X-axis is latency in milliseconds per example. Y-axis is macro F1. Higher is better on Y. Lower is better on X. The top-left corner is dominance. Blue points are the encoder, red points are the decoder. Circles are fp16, squares are int8, triangles are int4. Look at the encoder first. Blue circle at two point three milliseconds — that's the fastest. Blue square at five point eight milliseconds — slowest of the three. Blue triangle at three point two — middle. Within the encoder, fp16 is faster than int4, which is faster than int8. That is the exact opposite of what the folk pitch promised you. Same pattern on the decoder. Red circle fastest, red square slowest. Among all six points on this plot, nothing dominates fp16-encoder on latency at comparable F1. That is the Pareto claim you make in memo section one.
Second Pareto plot. Same Y-axis and same markers; the X-axis is now peak VRAM in gigabytes. Look at the encoder. Blue circle at zero point four one: that's fp16, the LEAST memory of the three. Blue square at zero point five nine. Blue triangle at zero point six two. fp16 uses less memory than int8, which uses less than int4. The exact opposite of what the bit-count argument would predict. Same story on the decoder: fp16 at one point one five gigabytes, both int8 and int4 above one point eight. The weights in int8 are half as large as in fp16; in int4, a quarter. And yet peak VRAM at inference went UP. Something else is consuming the memory: the bitsandbytes quantization state, the fp16 outlier branches, scratch space for the decomposed matmuls. At small model scale the overhead exceeds the weight savings. The second broken promise.
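The bit-count argument is worth writing down, because it shows how small the weights are next to the measured peaks. The sketch below is back-of-envelope arithmetic under loud assumptions: every parameter is stored at the stated width, scale factors are ignored, and a gigabyte is ten to the ninth bytes. In practice some layers usually stay in fp16, so real quantized weights are somewhat larger than this.

```python
def weight_gigabytes(n_params, bits_per_weight):
    """Storage for the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


models = {"encoder": 149e6, "decoder": 494e6}
for name, n_params in models.items():
    for bits in (16, 8, 4):
        print(f"{name:8s} {bits:2d}-bit weights: {weight_gigabytes(n_params, bits):.3f} GB")

# Measured encoder peaks quoted in the lecture, for comparison.
measured_encoder_peak = {16: 0.41, 8: 0.59, 4: 0.62}
for bits, peak in measured_encoder_peak.items():
    extra = peak - weight_gigabytes(models["encoder"], bits)
    print(f"encoder {bits:2d}-bit: peak {peak:.2f} GB, of which roughly {extra:.2f} GB is not weights")
```

On these assumptions the encoder's weights shrink from about 0.30 GB to about 0.15 GB at int8, yet the measured peak rises from 0.41 to 0.59. The part of the peak that is not weights grows by more than the weights shrink.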
Pull it all together. Three promises, checked against your own numbers. For int8 via bitsandbytes on your T4, only accuracy held. Encoder F1 from twenty-one zero-nine to twenty zero-seven, basically noise. Decoder F1 from twenty-three ninety-nine to twenty-four-oh-one, unchanged to three decimal places. Promise one, kept. Promise two, VRAM drops: reversed. Up forty-four percent on encoder, sixty-three percent on decoder. Promise three, latency drops: reversed. Two and a half times slower on both models. Two of three promises broken at this scale on this tool. This is the toolbox thesis in a single observation: most quantization claims are true somewhere and false elsewhere. A production quantization path on modern hardware, AWQ or FP8 through vLLM or TensorRT-LLM on a larger model, may keep all three. The point isn't "same algorithm, better hardware." It's that the result depends on the tool, the model scale, and the hardware. Your measurement is specific to all three. That's the lesson, made empirical.
Per-tier damage chart. Two panels — encoder left, decoder right. Gray bars int8, red bars int4. Y-axis is delta accuracy versus fp16, with ninety-five percent CI whiskers on every bar. On the encoder, the gray int8 bars are all essentially at zero. None of those CIs cleanly exclude zero — int8 barely moved the tier-level predictions. The red int4 bars tell a different story: head down about three points, tail looks up about three points. On the decoder, int8 again is nothing statistically distinguishable. int4 on the decoder shows real damage — head and mid both down about three points, CIs cleanly exclude zero. Those are the asterisks you'll see in your lab table. But the asterisk has a limit. Next slide.
This is an important pedagogical slide. Pause on it. Your lab's tier table is going to show an asterisk next to the encoder tail delta at int4. The number is positive. The confidence interval is narrow. It cleanly excludes zero. By the simple read of your bootstrap output, this looks like a real effect, and a student who stops at "the CI excludes zero, therefore the effect is real" is going to land in the Satisfactory band on memo section two. So let me set up the stronger reading. The tail tier has two hundred ten validation examples spread across fifty-three classes. A three-percentage-point delta on that tier corresponds to roughly six flipped predictions, total. Six. A bootstrap confidence interval measures the sampling variance of those specific six flips under resampling the val set you have right now. It tells you, given those six flips, how much they would jitter around if you bootstrapped your existing data. It does NOT prove that a different stratified split, a draw with a seed other than forty-two, would show the same six examples moving. That kind of generalization is called external validity, and bootstrap doesn't give you external validity for free. External validity requires either enough flipped predictions to make resampling robust, or actual reshuffle replication on a fresh val set. Six predictions, frankly, isn't enough. The rubric for memo section two explicitly rewards students who flag this kind of finding as directional, pending a reshuffle test.
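To see how six flips can produce an interval that excludes zero, here is a minimal paired bootstrap on synthetic data shaped like the tail tier: two hundred ten examples and six predictions that flip from wrong to right. The data is invented for illustration, the function name and the percentile method are assumptions, and the lab's bootstrap may differ in detail.

```python
import random


def paired_bootstrap_delta(correct_base, correct_quant, n_boot=2000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for the accuracy delta (quantized minus baseline).

    Inputs are 0/1 correctness flags for the same examples under each model.
    """
    rng = random.Random(seed)
    n = len(correct_base)
    diffs = [q - b for b, q in zip(correct_base, correct_quant)]
    point = sum(diffs) / n
    # Resample examples with replacement and recompute the delta each time.
    samples = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = samples[int((alpha / 2) * n_boot)]
    hi = samples[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi


# Synthetic tail tier: 210 examples, 40 correct at fp16, the same 40 plus 6 more at int4.
n_tail = 210
fp16_correct = [1] * 40 + [0] * (n_tail - 40)
int4_correct = [1] * 46 + [0] * (n_tail - 46)
point, lo, hi = paired_bootstrap_delta(fp16_correct, int4_correct)
flips = round(point * n_tail)
print(f"delta={point:.4f}  95% CI=({lo:.4f}, {hi:.4f})  from {flips} flipped predictions")
```

Every resample draws from the same six flipped examples, so the interval describes how those six jitter under resampling. It says nothing about whether a fresh validation split would contain comparable flips at all.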
Section five. We've covered accuracy, latency, and memory. The fifth measurement is calibration, and it has its own story. Calibration in Week Five is the payoff of Week Four's groundwork — you already know what ECE is, you know how to compute it, you know how to fix it with temperature scaling. Now what we check is what quantization does to it. Does compressing weights move the calibration picture, and does T-scaling still rescue you when it does?
Bar chart. Six configurations along the x-axis. Pre-scaling ECE in lighter shade, post-scaling in darker. Black dashed horizontal line at zero point zero five, the post-scaling ECE ceiling in today's deployment envelope. Notice two patterns. First, the encoder bars all sit low: around three to six percent pre-scaling, all three post-scaling under four percent. Second, the decoder bars. All three pre-scaling around ten percent. Post-scaling brings them to about seven percent: closer to the ceiling, but still above it. All three encoder configurations sit below the ceiling line. All three decoder configurations sit above it. The decoder fails the ECE ceiling at every precision. This finding will matter when we get to deployment in a few minutes.
Important homework exercise. In the lab you fit T on all of val and evaluated on all of val — an in-sample estimate that might overstate how much T-scaling actually helps. In the homework you fit T on half A of val and evaluate the resulting T on half B, then compare. On the decoder, T-scaling drops half B's ECE by about three and a third percentage points across all three precisions — nearly identical recovery to the full-val number. T from half A generalizes cleanly. On the encoder at int4, the recovery shrinks visibly — from about two points at fp16 down to less than one point at int4. That's a real finding for memo section three. It suggests that when the logit dynamic range is compressed by int4, a scalar temperature has less headroom to work with. The rubric rewards a specific deployment implication tied to that number.
Here's why this calibration work matters. Many production deployments gate on confidence. You set a threshold — say zero point eight — above which you trust the model, below which you route to human review. That design assumes calibration. Overconfident models send too few examples to review and miss errors. Underconfident models flood reviewers with low-confidence examples and waste their time. Quantization can break calibration in both directions. Three options at deployment time. One: re-fit temperature scaling for each quantized checkpoint you ship. Two: accept a shifted threshold. Three: pick a configuration whose calibration already meets your ceiling without scaling. Memo section three asks you to make this concrete — which option fits your chosen deployment, and why.
Section six. We've now measured everything we set out to measure. Now is the synthesis moment — where measurement discipline becomes actual engineering judgment. This is the work of memo section four, which is the heaviest-weighted section in the entire memo at thirty points out of a hundred. Get this section right and you've shown me you can do the job.
Four constraints, on the slide. Let me explain each one. Hardware is a single T4. That's a product decision about your cloud spend, given to you, not negotiated. Throughput is a hundred requests per second sustained, which works out to a budget of ten milliseconds per example, or three hundred twenty milliseconds per batch at batch size thirty-two. Quality is a macro F1 floor of zero point two zero: low enough to admit our models, high enough to rule out a truly broken configuration. Calibration is post-scaling ECE at most zero point zero five, a threshold that corresponds to being able to deploy a confidence-gated system without re-engineering the threshold logic for each model swap. For each of your six configurations, you check all four constraints. The output is a six-by-four table of true-false. Configurations that pass all four are your viable set.
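The check itself can be sketched as below. Two assumptions, stated loudly: "hardware" is read as "peak VRAM fits on one T4", and the two demo rows are placeholders, not lab results. Substitute your own six measurements.

```python
from dataclasses import dataclass

T4_VRAM_GB = 15.6                                   # T4 capacity as quoted in lecture
REQUESTS_PER_SECOND = 100
BATCH_SIZE = 32
LATENCY_CEILING_MS = 1000 / REQUESTS_PER_SECOND     # 10 ms per example
BATCH_BUDGET_MS = LATENCY_CEILING_MS * BATCH_SIZE   # 320 ms per batch of 32
F1_FLOOR = 0.20
ECE_CEILING = 0.05


@dataclass
class Measurement:
    name: str
    peak_vram_gb: float
    latency_ms: float   # per example, amortized over the batch
    macro_f1: float
    ece_post: float     # ECE after temperature scaling


def check(m):
    """One row of the six-by-four true/false table."""
    return {
        "hardware": m.peak_vram_gb <= T4_VRAM_GB,
        "latency": m.latency_ms <= LATENCY_CEILING_MS,
        "f1": m.macro_f1 >= F1_FLOOR,
        "ece": m.ece_post <= ECE_CEILING,
    }


def viable(measurements):
    """Names of the configurations that pass all four constraints."""
    return [m.name for m in measurements if all(check(m).values())]


# Placeholder rows for illustration only; replace with your measured numbers.
demo = [
    Measurement("config-a", 0.5, 2.0, 0.25, 0.03),
    Measurement("config-b", 1.0, 12.5, 0.25, 0.07),
]
for row in demo:
    print(row.name, check(row))
print("viable:", viable(demo))
```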
Here is the check. Encoder fp16, encoder int8, encoder int4: all three pass all four constraints. Decoder fp16 passes latency and F1, fails ECE. Decoder int8 fails ECE and additionally fails latency: twelve and a half milliseconds exceeds the ten-millisecond ceiling. Decoder int4 passes latency and F1, fails ECE. All three decoder configurations are out because of calibration. Post-scaling ECE around seven percent, ceiling zero point zero five: fail at every precision. The earlier deployment memo that said "ship the decoder because it has higher F1" is reversed here. Not because the decoder got worse. Its F1 is still higher. But because a constraint we weren't measuring in Week Three turns out to bind. The decoder is disqualified at every precision on an axis orthogonal to the one that made it look better in Week Three.
So within the three viable configurations, which do you pick? The F1 numbers are so close, zero point two one zero, zero point two zero seven, zero point two one two, that they're within measurement noise of each other. No reasonable story separates them on quality. On latency and VRAM, fp16 is the clear winner. Two point three milliseconds beats three point two and five point eight. Zero point four one gigabytes beats zero point five nine and zero point six two. Both Pareto axes point at fp16. Recommendation: encoder fp16. That's your memo section four answer, given the stated envelope. Notice what happened: the decoder's F1 advantage from Week Three stopped mattering under calibration scrutiny, and within the encoder set, quantization's supposed benefits reversed. The answer that survives is the simplest one: no quantization, smaller model. You earn the right to recommend the boring answer by having the measurements that show why.
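The Pareto step can be sketched with the encoder numbers just quoted. F1 is left out because the three values are within noise. The fp16 figures are as stated; pairing the remaining latencies and VRAM figures by the order they were read out is an assumption, which is why the other two rows carry neutral labels. The conclusion holds for either pairing.

```python
# Lower is better on both axes.
ENCODERS = {
    "encoder-fp16": {"latency_ms": 2.3, "vram_gb": 0.41},
    "encoder-quantized-a": {"latency_ms": 3.2, "vram_gb": 0.59},
    "encoder-quantized-b": {"latency_ms": 5.8, "vram_gb": 0.62},
}


def dominates(a, b):
    """a dominates b if it is no worse on every axis and better on at least one."""
    no_worse = all(a[axis] <= b[axis] for axis in a)
    better = any(a[axis] < b[axis] for axis in a)
    return no_worse and better


def pareto_front(configs):
    """Names of configurations that no other configuration dominates."""
    front = []
    for name, metrics in configs.items():
        others = (m for other, m in configs.items() if other != name)
        if not any(dominates(m, metrics) for m in others):
            front.append(name)
    return front


print(pareto_front(ENCODERS))  # -> ['encoder-fp16']
```

When the front has a single member, as here, there is no trade-off left to argue about. With your own numbers it may have two or three, and then the memo has to say which axis you weight.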
Your answer is T4-plus-bitsandbytes-specific. Four axes shift it. First, if we moved to H100 with vLLM and AWQ, the AWQ int4 path becomes roughly four times faster than bitsandbytes int4, and more memory-efficient too, per the Lin twenty twenty-four benchmarks at the model scales they were measured. Note that those numbers are from a thirty-two billion parameter model, so the exact ratio doesn't transfer to our small models. The qualitative point is that the int4 path likely becomes the dominant Pareto point, not fp16. Second, if we raise the ECE ceiling from zero point zero five up to zero point zero eight, the decoder configurations become viable, and decoder fp16 wins on F1 again. Third, if we moved to an L4 or A10 GPU, decoder int8's latency might come under the ten-millisecond ceiling, and if ECE relaxed too, decoder int8 becomes viable. Fourth, if we moved to on-device with no GPU, we'd be in GGUF territory and quantization becomes mandatory. Section four of the rubric rewards naming at least one of these axes. Pick the one most relevant to your imagined deployment and write a counterfactual paragraph.
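The second axis is the easiest to turn into a counterfactual, because it's one number. A minimal sketch, using the rounded post-scaling figures from the chart ("under four percent" is taken as an upper bound of 0.04, "about seven" as 0.07); use your measured values in the memo.

```python
# Approximate post-scaling ECE per model family, rounded from the lecture.
POST_SCALING_ECE = {"encoder": 0.04, "decoder": 0.07}


def families_passing(ceiling, ece_by_family=POST_SCALING_ECE):
    """Model families whose post-scaling ECE sits on or below the ceiling."""
    return sorted(f for f, ece in ece_by_family.items() if ece <= ceiling)


for ceiling in (0.05, 0.08):
    print(f"ceiling {ceiling:.2f}: {families_passing(ceiling)}")
```

Sweeping a constraint like this shows the reader exactly where your recommendation flips.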
And a quick Week Six preview. Distillation. You've been compressing within a model family this week: same model, fewer bits per weight. Distillation compresses across families. You take a teacher, here your decoder, and you train a smaller student model to match the teacher's predictions on a dataset. The goal is a student that keeps most of the teacher's quality at a fraction of the inference cost. Whether it does is something you measure. It's another tool in the same toolbox: a different compression axis, but the same measurement discipline. Same per-tier checks. Same Pareto reasoning. Same calibration concern. Same deployment envelope. The mental model you build this week transfers directly to next week. We'll do it with your Week Three decoder as teacher and the encoder as student, and you'll measure what transfers, what doesn't, and where the distilled student sits on the Pareto frontier relative to both original models.
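"Match the teacher's predictions" usually means a soft-target loss. This sketch uses the common formulation: KL divergence between temperature-softened teacher and student distributions, scaled by T squared. That choice and the default temperature of 2.0 are assumptions; next week's notebook may define the loss differently.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a scalar temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, times T squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(
        pi * math.log(pi / max(qi, 1e-12))  # floor guards against division by zero
        for pi, qi in zip(p, q)
        if pi > 0
    )
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as they disagree.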
Lab logistics. Eighty minutes in the lab. T4 accelerator on, persistence on, internet on, HF token attached as a secret. The notebook installs bitsandbytes on top of pre-installed packages — no --upgrade flag, that's been a source of crashes. About twenty-five minutes is actual compute; the remaining fifty-five is reading, predicting, interpreting. For the homework, attach your lab notebook's output to the homework notebook via Kaggle's "Add Input" → "Your Work" — that makes the four output files visible without creating a separate dataset. The discovery code finds them regardless of what slug Kaggle assigns. Memo is embedded in the homework notebook, same structure as Week Four. Rubric at assessments/week5_memo_rubric.md — read it before you write. Due Wednesday morning before next week's class. Questions, then break, then lab at three thirty.
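If you're curious how slug-independent discovery can work, here is one way to sketch it. This is not the notebook's actual code. It assumes attached inputs are mounted under `/kaggle/input`, and the file names you pass in are whatever your lab run wrote; the ones in the usage line below are hypothetical.

```python
from pathlib import Path


def find_outputs(filenames, root="/kaggle/input"):
    """Map each wanted file name to the first match found anywhere under root.

    Searching recursively by file name means the directory name that
    Kaggle assigns to the attached notebook output does not matter.
    """
    found = {}
    root_path = Path(root)
    for name in filenames:
        matches = sorted(root_path.rglob(name))
        if matches:
            found[name] = matches[0]
    return found
```

A call like `find_outputs(["results.json"])` returns a dict of paths, with any file it could not find simply missing from the result. Check for that before you load anything.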
One sentence to take with you, and we're done. Quantization is not a single thing you apply to a model. It is a cabinet of named tools — and you've now met all six of them — and your job as a practitioner is to choose from that cabinet based on the constraint you're actually under, measure what your chosen tool actually does on your specific hardware and your specific model, and defend the choice with your own numbers against a specific deployment envelope. That's the week. That's the memo. That's the skill. See you in the lab at three thirty.
If we have time, three more slides as a bonus. The story of trying to live-load a fourteen billion parameter model at int4 on a single T4 — same hardware your encoder fits on six times over — and what it taught me about the tool we've been using all class. If we don't have time today, the slides are in the deck, you can read them like a blog post. Toolbox thesis applies either way.
So I wanted to give you a wow moment. The plan was simple. Live load a fourteen billion parameter model at int4 on a single T4, in front of you, while you watch. The model is twenty-nine gigabytes at fp16. Your T4 has fifteen point six gigabytes of VRAM. The model is literally almost twice as big as the GPU you'd run it on. And the demo is supposed to be — quantization makes this work. So I built the verification notebook. Uploaded it to Kaggle. The notebook tries to download the fp16 weights into the working directory. Kaggle gives you twenty point nine gigabytes of working directory. The fp16 weights are twenty-nine point four. The download died partway through with operating system error twenty-eight, no space left on device. I could not even DOWNLOAD the demonstration model on the GPU instance I was going to run it on. That's a story by itself. The constraint that bound was not VRAM at inference time. It was disk space at download time. Production constraints are weirder than the textbook prepares you for.
Pre-quantized variants exist for exactly this reason. The Unsloth team has already done the int4 quantization step, taken the resulting weights, and uploaded them to Hugging Face as a separate repo. The bytes on disk are already int4. The download is about ten gigabytes — well inside the working directory budget. You point your loader at the unsloth slash Qwen two point five fourteen B Instruct bnb dash four bit repo, transformers reads the config, sees that the weights are pre-quantized, loads them straight into bitsandbytes' four-bit format with no intermediate fp16 step. Twelve seconds of load time. Peak VRAM nine point nine three gigabytes on a single T4 with five point seven gigabytes of headroom. And the model works. That quote on the slide is the actual completion the model generated when I asked it to classify a credit card late fee complaint. Coherent answer. On task. A fourteen point seven billion parameter model running on the same hardware your one hundred forty-nine million parameter encoder fits on six times over. THAT is what bitsandbytes was built for. It's not the inference acceleration story I've been making fun of all class. It's the fitting-things-on-hardware-they-shouldn't-fit-on story. Today we deliberately misused the tool, in your lab, to teach you what its failure mode looks like. The other half of the picture is up here.
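The arithmetic behind those two slides fits in a few lines. This is a back-of-envelope sketch that counts weight bytes only and treats a gigabyte as ten to the ninth bytes; both are simplifying assumptions.

```python
PARAMS = 14.7e9          # parameter count quoted for the 14B model
T4_VRAM_GB = 15.6        # T4 capacity as quoted
KAGGLE_DISK_GB = 20.9    # working directory budget as quoted
MEASURED_PEAK_GB = 9.93  # peak VRAM observed for the pre-quantized load


def weight_gb(n_params, bits_per_weight):
    """Size of the raw weights alone, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_gb(PARAMS, 16)
int4_gb = weight_gb(PARAMS, 4)
headroom_gb = T4_VRAM_GB - MEASURED_PEAK_GB

print(f"fp16 weights: {fp16_gb:.1f} GB, fits on disk: {fp16_gb <= KAGGLE_DISK_GB}")
print(f"fp16 weights fit in VRAM: {fp16_gb <= T4_VRAM_GB}")
print(f"int4 weights: {int4_gb:.2f} GB, fits in VRAM: {int4_gb <= T4_VRAM_GB}")
print(f"headroom at measured peak: {headroom_gb:.1f} GB")
```

The raw int4 estimate comes out below both the ten-gigabyte download and the measured peak. The gap is presumably quantization metadata, any layers kept at higher precision, and runtime buffers, which is one more reason to measure rather than trust the estimate.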
But — and this is the closer — you wouldn't ship this. Six point three tokens per second. That is slow enough that a user typing a message into a chat box and waiting for a response would feel the lag. It is not a product. Compare to the production stack slide from Act two. AWQ with Marlin via vLLM, seven hundred forty-one tokens per second on a thirty-two billion parameter model. Different model size, different hardware, so the numbers don't line up apples to apples — but the qualitative gap is enormous, and it points at the right shape of the answer. Bitsandbytes, here on the T4, fits the model. It got the fourteen billion parameter model loaded onto fifteen gigabytes of VRAM. It did its job — the job it was actually designed for. But it did NOT make the model fast. To make a fourteen billion parameter model fast at inference time you reach for AWQ, GPTQ, FP8 — the production stack. Different tool, different job. Same toolbox. That's the thesis we opened with — quantization is a toolbox, not a technique. You don't pick a technique. You pick a tool, you measure what it does on your hardware and your model, and you defend the choice. Now go to the lab.