ECBS5200 Week 5

Quantization

ECBS5200 — Week 5

You diagnosed them. Now compress them.

Quantization — Toolbox, Measurement, Deployment
ECBS5200 Week 5

Where we left off

Week 4 handed you five diagnostic tools. The biggest finding:

Model            Macro F1   ECE (pre-scaling)
Encoder (149M)   0.209      0.062
Decoder (494M)   0.240      0.101

Decoder wins on F1. Decoder is also more overconfident.

Size doesn't predict calibration. You measure it.


This week's thesis

Quantization is a toolbox, not a technique.

Each method targets a different constraint — training memory, deployment scale, inference latency, edge hardware. The trade-off depends on your tool, your model, and your hardware.

Most claims are true somewhere and false elsewhere. Measure on your setup before you ship.


Today's shape

Lecture → Lab (80 min) → Homework + memo.

Lab: load both Week 3 checkpoints at fp16 / int8 / int4. Measure six configurations. F1, per-tier Δacc, latency, peak VRAM, ECE.

Homework: extend the per-tier and calibration analysis, write a 5-section deployment memo.

Six rows of measurements. One defensible deployment decision.


Vocabulary you'll hear today

Tool names: bitsandbytes, AWQ, GPTQ, FP8, SmoothQuant, GGUF
Numerical formats: fp16, bf16, int8, int4, NF4
Frameworks: vLLM, TensorRT-LLM, SGLang, llama.cpp, TransformerEngine

Each name has a place. By the end of today you will know which place.


Act 1: What quantization actually does

Before we measure anything, look at the operation.


Precision: what a weight costs to store

A weight is a number. How many bits you spend on it determines its precision AND its storage cost.

Format   Bits / weight   Range                      Your usage
fp32     32              ±3.4×10³⁸, very precise    original pretraining
fp16     16              ±65,504, less precise      your Week 3 training default
bf16     16              same range as fp32         Ampere+ hardware
int8     8               −128 to +127               today, Act 1
int4     4               −8 to +7                   today, Act 2
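
A minimal sketch of the storage arithmetic behind this table, assuming weight-only storage (parameters × bits per weight). It ignores activations, quantization scale factors, and framework overhead, so it is a lower bound and not a prediction of peak VRAM. The function name is hypothetical.

```python
# Weight-only storage estimate: parameters x bits per weight.
# Assumption: nothing but the weights is counted (no activations,
# no scale factors, no CUDA context), so real peak VRAM will be higher.
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "bf16": 16, "int8": 8, "int4": 4}


def weight_storage_gb(n_params: int, fmt: str) -> float:
    """Return the size of the weights alone, in decimal gigabytes."""
    total_bits = n_params * BITS_PER_WEIGHT[fmt]
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    for name, n_params in [("encoder", 149_000_000), ("decoder", 494_000_000)]:
        for fmt in ("fp16", "int8", "int4"):
            print(f"{name:8s} {fmt:5s} {weight_storage_gb(n_params, fmt):.3f} GB")
```

Compare these lower bounds with the VRAM you measure in the lab; the gap is everything the estimate leaves out.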

int8 via LLM.int8() — the algorithm we use

Dettmers et al. 2022 (NeurIPS). Each matrix multiply decomposes into:

  1. INT8 fast path — the bulk of feature dimensions
  2. FP16 outlier path — small number of feature dimensions whose activation outliers break quantization
  3. Recompose — sum the two paths

The algorithm powering BitsAndBytesConfig(load_in_8bit=True) in today's lab.


The footnote that matters today

From the runtime discussion in the LLM.int8 paper:

"The quantization overhead can slow inference for models with less than 6.7B parameters, as compared to a FP16 baseline."

Paper's threshold   Your encoder   Your decoder
6,700M              149M           494M
                    45× below      14× below

You are firmly in the regime the paper warned about.


int4: four bits per weight

Format   Bits   Values representable
fp16     16     floating-point grid, ~65k distinct values
int8     8      256 integer levels
int4     4      16 integer levels
NF4      4      16 levels placed non-uniformly to match normal weight distributions

NF4 + double quant (QLoRA paper) is what bnb_4bit_quant_type="nf4" gives you.
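
A hedged sketch of how the three lab precisions could map onto loader configs. BitsAndBytesConfig and its arguments are the Hugging Face transformers API; the helper name quant_config and the choice of fp16 as compute dtype are assumptions, and the lab notebook may wire this differently.

```python
def quant_config(precision: str):
    """Return a BitsAndBytesConfig for 'int8' or 'int4', or None for 'fp16'."""
    if precision not in ("fp16", "int8", "int4"):
        raise ValueError(f"unknown precision: {precision}")
    if precision == "fp16":
        return None  # plain half-precision load, no quantization config

    # Imported lazily so the fp16 path works without bitsandbytes installed.
    import torch
    from transformers import BitsAndBytesConfig

    if precision == "int8":
        return BitsAndBytesConfig(load_in_8bit=True)  # LLM.int8() path
    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NF4 levels
        bnb_4bit_use_double_quant=True,        # also quantize the scale factors
        bnb_4bit_compute_dtype=torch.float16,  # assumption: fp16 compute on T4
    )
```

The returned object would typically be passed as quantization_config= to a from_pretrained call; for fp16 you would pass a half-precision dtype and no quantization config.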


bitsandbytes: training memory, not deployment latency

Your Week 3 QLoRA trained through bitsandbytes. It's an excellent training-memory tool.

"bitsandbytes is the on-ramp tool. It's designed to let you fine-tune models you couldn't otherwise afford. It was never engineered as the path you ship to production."

Today's lab uses it for inference anyway — so you can see the trade-offs with your own numbers.


The three promises of quantization

Quantization is routinely pitched on three promises:

  1. Accuracy preserved — model quality barely drops
  2. Memory drops — smaller weights, less VRAM
  3. Latency drops — less data to move, faster inference

The first promise is usually true on small models. The other two depend.

Today you will check all three against your own T4 measurements.


Before the measurement — predict

For each promise you are about to check, on T4 with bitsandbytes, on 149M and 494M models:

  • Macro F1 at int8 vs fp16: goes UP, DOWN, or stays the same?
  • Peak VRAM at int8 vs fp16: DOWN, UP, or same?
  • Latency at int8 vs fp16: DOWN, UP, or same?

Write your predictions down. You'll check them in the lab. The memo rewards specific numbers over folk wisdom.


Act 2: The 2026 production stack

bitsandbytes is the training tool. What's the inference tool?


The ecosystem in 2026

Inference throughput, Qwen2.5-32B via vLLM:

Tool               tok/s   × vs bitsandbytes
AWQ + Marlin       741     4.4×
GPTQ + Marlin      712     4.2×
FP16 (no quant)    461     2.7×
bitsandbytes       168     1.0×
GGUF (llama.cpp)   93      0.6×

These numbers matter for your memo.
Caveat: ratios are at 32B scale on H100. They don't transfer 1:1 to your 149M / 494M models — the direction does, the magnitude does not.


AWQ — the 2026 default for int4

Lin et al. 2024 (MLSys Best Paper). [Paper: 2306.00978]

Key insight: ~1% of weights are "salient" and deserve protection. Rank them by activation statistics (not weight magnitude), protect those, aggressively quantize the rest.

Major model families ship pre-quantized AWQ checkpoints on HuggingFace. Supported by vLLM, TensorRT-LLM, SGLang.

This is the paper you'll see cited at work for int4 inference.


GPTQ — AWQ's precursor, still encountered

Frantar et al. 2023 (ICLR). [Paper: 2210.17323]

One-shot second-order post-training quantization. Used to be the state of the art.

In 2026:

  • AutoGPTQ wrapper was archived April 2025
  • vLLM RFC #39583 proposes deprecating it
  • Still encountered in checkpoints, in papers, in older production systems

You should understand it to read the literature. You would not pick it for a new project today.


FP8 — Hopper-era lossless

Kurtic et al. 2024, "Give Me BF16 or Give Me Death?" [2411.02355]

Llama-3.1 family across FP8 / INT8 / INT4 via vLLM:

  • FP8 W8A8 is effectively lossless across all model scales.
  • Requires modern FP8-capable hardware. T4 does not support this path.

FP8 is the format you'll see Nvidia actively pushing for modern inference.


SmoothQuant — weights AND activations

Xiao et al. 2023 (ICML). [2211.10438]

bitsandbytes and AWQ are weight-only quantization. SmoothQuant also quantizes activations — W8A8.

Rescales activations and weights via an offline mathematical equivalence.

On Turing/T4 (where FP8 is unavailable) this is the mainstream W8A8 path in TRT-LLM and ONNX Runtime.


GGUF — CPU and edge inference

Format used by llama.cpp and its ecosystem.

Not a quantization algorithm per se — a container format that supports multiple quantization schemes (Q4_K_M, Q5_K_S, etc.) optimized for CPU inference.

Production use case: run quantized LLMs on laptops, phones, Raspberry Pis, anywhere without a GPU.

If your constraint is "no GPU," GGUF is the tool.


The toolbox, organized

Constraint                  Tool               Where it lives
Training memory (QLoRA)     bitsandbytes NF4   HuggingFace, PEFT
Production int4 inference   AWQ                vLLM, TRT-LLM, SGLang
Modern-hardware inference   FP8                TransformerEngine on H100+
Legacy / research           GPTQ               Older checkpoints
W8A8 on Turing              SmoothQuant        TRT-LLM, ONNX Runtime
No GPU                      GGUF               llama.cpp

Today's lab uses row one. At work you'd touch all six.


Act 3: What we'll measure

The frame. Then the results.


Six configurations

Two models × three precisions:

  • encoder fp16, encoder int8, encoder int4
  • decoder fp16, decoder int8, decoder int4

For each configuration, five measurements:

  1. Macro F1 — quality
  2. Peak VRAM — memory footprint at inference
  3. Latency — milliseconds per example, batched
  4. Per-tier Δacc — where damage lands (head / mid / tail)
  5. ECE — calibration, pre and post temperature-scaling

The Pareto frame

Six points on two axes.

Pareto-dominated: another point is at least as good on BOTH axes and strictly better on one. You'd never ship a dominated point.

Pareto-frontier: the points that no other point dominates on the axes you care about.

The deployment choice lives on the frontier.


The per-tier frame

Aggregate macro F1 averages over 113 classes. The average can hide structure.

Tiers by training frequency:

  • Head — top 20 classes (most training data)
  • Mid — next 40 classes
  • Tail — bottom 53 classes (least training data, fewer val examples)

Per-tier Δaccuracy shows you whether compression damage is uniform across tiers or concentrated somewhere.
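
One way the tier split and per-tier accuracy could be computed, as a sketch. The function names are hypothetical, ties in class frequency are broken by first appearance, and every validation label is assumed to appear in the training labels.

```python
from collections import Counter


def assign_tiers(train_labels, n_head=20, n_mid=40):
    """Rank classes by training frequency: top n_head, next n_mid, rest is tail."""
    ranked = [c for c, _ in Counter(train_labels).most_common()]
    tiers = {}
    for rank, c in enumerate(ranked):
        if rank < n_head:
            tiers[c] = "head"
        elif rank < n_head + n_mid:
            tiers[c] = "mid"
        else:
            tiers[c] = "tail"
    return tiers


def per_tier_accuracy(y_true, y_pred, tiers):
    """Accuracy computed separately over the examples of each tier."""
    hits, totals = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[tiers[t]] += 1
        hits[tiers[t]] += int(t == p)
    return {tier: hits[tier] / totals[tier] for tier in totals}


def per_tier_delta(acc_fp16, acc_quant):
    """Delta accuracy per tier: quantized minus fp16."""
    return {tier: acc_quant[tier] - acc_fp16[tier] for tier in acc_fp16}
```

Call per_tier_accuracy once per precision with the same tiers mapping, then per_tier_delta on the two results.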


The calibration frame

A model can be 57% accurate AND be catastrophically overconfident about it.

ECE = expected gap between stated confidence and empirical accuracy.

Temperature scaling = one-parameter fix: divide all logits by a scalar T before softmax.

Quantization can shift calibration. You measure ECE before and after T-scaling at each precision.
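
A minimal sketch of ECE as a binned estimate of the confidence/accuracy gap. The bin count of 15 and the equal-width binning are assumptions; the lab notebook may bin differently.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Weighted average of |accuracy - mean confidence| over equal-width bins.

    confidences: top-class probability per example, in [0, 1]
    correct:     1 if the top-class prediction was right, else 0
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

A model that says 0.9 and is right 9 times in 10 scores near zero; the same confidence with 5 right in 10 scores 0.4.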


The deployment frame

Real deployment is not "which model is best." It's:

"Given constraints X Y Z, which configuration satisfies all of them?"

Today's hypothetical:

  • Single T4 GPU
  • 100 requests/second sustained → ≈10 ms/ex batched
  • Macro F1 floor 0.20
  • Post-scaling ECE ceiling 0.05

Six configurations, four constraints. You'll check each against each.


Act 4: What the numbers show

Now the results. Your lab will reproduce these.


The six-row summary

Model     Precision   Macro F1   Latency (ms/ex)   VRAM (GB)   ECE (post)
encoder   fp16        0.210      2.31              0.41        0.040
encoder   int8        0.207      5.80              0.59        0.039
encoder   int4        0.212      3.19              0.62        0.027
decoder   fp16        0.240      5.05              1.15        0.069
decoder   int8        0.240      12.65             1.87        0.072
decoder   int4        0.221      7.92              1.89        0.070

Six configs. Four metric columns per config. Everything else today reads this table.


Pareto 1: quality vs latency

Look at which precision is fastest on each model.


Pareto 2: quality vs memory

Look at which precision uses the least memory.
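
A sketch of the frontier computation for both Pareto plots, using the values from the six-row summary. The helper names are hypothetical; dominance here means at least as good on both axes and strictly better on one.

```python
# (name, macro F1, latency ms/ex, VRAM GB), copied from the six-row summary.
CONFIGS = [
    ("encoder fp16", 0.210, 2.31, 0.41),
    ("encoder int8", 0.207, 5.80, 0.59),
    ("encoder int4", 0.212, 3.19, 0.62),
    ("decoder fp16", 0.240, 5.05, 1.15),
    ("decoder int8", 0.240, 12.65, 1.87),
    ("decoder int4", 0.221, 7.92, 1.89),
]
LATENCY, VRAM = 2, 3  # column index of the cost axis


def dominates(a, b, cost_idx):
    """a dominates b: F1 no lower, cost no higher, and strictly better on one."""
    no_worse = a[1] >= b[1] and a[cost_idx] <= b[cost_idx]
    strictly_better = a[1] > b[1] or a[cost_idx] < b[cost_idx]
    return no_worse and strictly_better


def pareto_frontier(configs, cost_idx):
    """Names of the configs that no other config dominates."""
    return [c[0] for c in configs
            if not any(dominates(other, c, cost_idx) for other in configs)]


if __name__ == "__main__":
    print("quality vs latency:", pareto_frontier(CONFIGS, LATENCY))
    print("quality vs memory: ", pareto_frontier(CONFIGS, VRAM))
```

On these numbers no int8 configuration reaches the frontier on either axis.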


The three promises — checked

On your T4 with bitsandbytes, int8 vs fp16:

  1. Macro F1 preserved → kept
  2. Peak VRAM drops → reversed (+44% to +63%)
  3. Latency drops → reversed (2.5× slower)

Two of three promises broken at this scale on this tool.

The lesson of the week, empirically.


Per-tier Δacc — where damage lands

int8 bars: nothing statistically distinguishable.
int4 bars: head and mid take real damage on the decoder.

A real effect needs BOTH a CI that excludes zero AND enough flipped predictions to survive a reshuffle. We'll come back to that.


The bootstrap caveat

The encoder tail at int4 has CI [+0.005, +0.053] — it excludes zero.

But: 210 val examples, 53 classes. A 3pp Δacc on that tier ≈ 6 flipped predictions.

A bootstrap CI measures how these 6 flips would vary under resampling this val set. It does NOT tell you the effect would survive under a different stratified split.

Real effect = narrow CI AND enough flipped predictions. One criterion without the other is directional, not proof.
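
A sketch of both criteria side by side: a paired percentile bootstrap for Δacc and a raw count of flipped predictions. The function names, the 2,000 resamples, and the fixed seed are assumptions, not the lab's exact procedure.

```python
import random


def bootstrap_delta_ci(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(b) - accuracy(a).

    correct_a / correct_b: 0/1 correctness of two precisions on the SAME examples.
    Examples are resampled in pairs, so the CI reflects this val set only.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [b - a for a, b in zip(correct_a, correct_b)]
    point = sum(diffs) / n
    stats = []
    for _ in range(n_boot):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi


def n_flips(correct_a, correct_b):
    """Number of examples whose correctness differs between the two precisions."""
    return sum(1 for a, b in zip(correct_a, correct_b) if a != b)
```

Report the CI and the flip count together: on a 210-example tier, a 3pp shift corresponds to about six flipped predictions.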


Act 5: Calibration under compression

ECE is orthogonal to accuracy. Quantization can move it.


ECE under compression

Six configs, pre- and post-T-scaling.

Decoder is more miscalibrated than encoder at every precision.
Temperature scaling helps both, but the decoder bottoms out around 0.07.


Does T-scaling generalize?

Homework exercise: fit T on half of val, evaluate on the other half.

If T from half A drops ECE on half B by the same amount it dropped on the full val set, the fit generalized.

If not, T is partially memorizing the fit set.

For the encoder at int4, the generalization is weaker than at fp16. That's a memo-worthy observation.
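
A sketch of the fitting step for the split-half exercise, assuming T is chosen by minimizing negative log-likelihood over a grid. The grid search stands in for whatever optimizer the homework notebook uses, and the even/odd split is a simplification that is not stratified by class.

```python
import math


def softmax(logits, T=1.0):
    """Softmax of logits / T, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, T):
    """Mean negative log-likelihood at temperature T."""
    losses = [-math.log(max(softmax(row, T)[y], 1e-12))
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels):
    """Pick T from a grid over [0.5, 5.0] that minimizes NLL on the fit set."""
    grid = [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda T: nll(logit_rows, labels, T))


def split_halves(items):
    """Even-indexed items form half A, odd-indexed items form half B."""
    return items[0::2], items[1::2]
```

Fit T on half A, apply it to half B's logits, and compare the resulting ECE drop with the drop you get when fitting and evaluating on the full val set.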


The deployment implication

If your deployment gates on confidence thresholds (e.g., route below 0.8 confidence to human review):

  • A miscalibrated model routes the wrong examples to review
  • Temperature scaling must be re-fit on quantized logits, not reused from fp16
  • If T-scaling only partially recovers, the threshold itself shifts

This is why calibration lives in the rubric — it changes your deployment recipe.


Act 6: Deployment reasoning

Six configs. Four constraints. One recommendation.


The constraint envelope

Constraint    Threshold                 Rationale
Hardware      Single T4                 Fixed budget
Throughput    100 req/s → ≤10 ms/ex     Product SLA
Quality       macro F1 ≥ 0.20           Floor for useful triage
Calibration   post-scaling ECE ≤ 0.05   Confidence-gating viability

Each constraint binds or does not. You check each config against each constraint.


Which configs pass?

Config         Latency   F1   ECE   All 4?
encoder fp16   ✅        ✅   ✅    ✅
encoder int8   ✅        ✅   ✅    ✅
encoder int4   ✅        ✅   ✅    ✅
decoder fp16   ✅        ✅   ❌    ❌
decoder int8   ❌        ✅   ❌    ❌
decoder int4   ✅        ✅   ❌    ❌

Three viable configs. All encoder. The decoder fails on calibration at every precision.
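
The envelope check as a short sketch, using the six-row summary values and the thresholds from the constraint table. Names are hypothetical. The hardware constraint is treated as satisfied for every row because all six were measured on a single T4.

```python
# name -> (macro F1, latency ms/ex, post-scaling ECE), from the six-row summary.
CONFIGS = {
    "encoder fp16": (0.210, 2.31, 0.040),
    "encoder int8": (0.207, 5.80, 0.039),
    "encoder int4": (0.212, 3.19, 0.027),
    "decoder fp16": (0.240, 5.05, 0.069),
    "decoder int8": (0.240, 12.65, 0.072),
    "decoder int4": (0.221, 7.92, 0.070),
}

REQUESTS_PER_SECOND = 100
MAX_LATENCY_MS = 1000 / REQUESTS_PER_SECOND  # 100 req/s -> 10 ms per example
MIN_F1 = 0.20
MAX_ECE = 0.05


def check(f1, latency_ms, ece_post):
    """Return pass/fail per constraint for one configuration."""
    return {
        "latency": latency_ms <= MAX_LATENCY_MS,
        "f1": f1 >= MIN_F1,
        "ece": ece_post <= MAX_ECE,
    }


def viable(configs):
    """Names of the configurations that satisfy every constraint."""
    return [name for name, row in configs.items() if all(check(*row).values())]


if __name__ == "__main__":
    for name, row in CONFIGS.items():
        print(name, check(*row))
    print("viable:", viable(CONFIGS))
```

Changing MAX_ECE to 0.08 is a quick way to see how sensitive the viable set is to the calibration ceiling.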


Within the viable set

Three encoder configurations meet the envelope. Which do you pick?

                encoder fp16   encoder int8   encoder int4
Macro F1        0.210          0.207          0.212
Latency ms/ex   2.31           5.80           3.19
VRAM GB         0.41           0.59           0.62

F1 differences are within measurement noise. fp16 dominates on latency AND memory.

Encoder fp16.


Hardware-dependent claims

Your answer is stack-specific. If we changed the setup:

Change                       What would move
Move to H100 + vLLM + AWQ    int4 path becomes fastest AND smallest (Lin 2024 benchmarks)
Raise ECE ceiling to 0.08    Decoder configs become viable; dec fp16 wins F1
Move to L4 or A10            dec int8 passes latency; decoder viable if ECE relaxes too
Move to on-device (no GPU)   GGUF and quantization become mandatory

Memo section 4 rewards naming this explicitly.


Week 6 preview: another compression tool

You compressed within a family (same model, fewer bits per weight).

Next week: compress the family. Take the bigger model and train a smaller one to match its outputs.

Distillation is another row in the compression toolbox: same measurement discipline, different mechanism.


Lab and homework — logistics

Lab (80 min, T4): week5_lab.ipynb. Six quantized-inference runs. ~25 min compute + ~55 min reading and interpretation.

Homework (~5 hours): week5_homework.ipynb. Attach your lab notebook's outputs via Add Input → Your Work. No separate dataset upload needed.

Memo (100 points): embedded in the homework notebook. Rubric: assessments/week5_memo_rubric.md.

Due: Wednesday morning before Week 6 class.


Today in one sentence

Quantization is not a thing you do.

It's a cabinet of tools you choose from, measure on your setup, and defend against constraints.


If you have time...

I tried to live-load a 14B model at int4 on a single T4. Three slides. ~3 minutes.

The toolbox thesis — and a ridiculous Kaggle story about disk space.


I tried to give you a wow moment

Plan: live-load a 14B model at int4 on a single T4 in front of you. The fp16 model is bigger than the GPU. Quantization makes it fit. Magic.

I couldn't even download it.

Qwen2.5-14B fp16 weights       29.4 GB
Kaggle /kaggle/working quota   20.9 GB
Outcome                        OSError: No space left on device

Pre-quantized variants exist for exactly this

unsloth/Qwen2.5-14B-Instruct-bnb-4bit — same model, weights stored as int4 on disk. Download: ~10 GB. Fits.

Single T4
Peak VRAM after load   9.93 GB / 15.64 GB
Headroom               5.70 GB
Load time              12 s

"Category: Fee Dispute. The complaint falls under the 'Fee Dispute' category because it involves a disagreement over an unexpected late fee charged despite timely payment."

— actual generated output, batch 1, on a T4


But you wouldn't ship this

Generation throughput: 6.3 tokens/second.

A user typing into a chat box would feel the lag. That's not a product.

                                        tok/s
14B via bitsandbytes int4 (this demo)   6.3
32B via AWQ + Marlin on H100 (Act 2)    741

Different model, different hardware, different regime — qualitative gap, not a measured ratio.

bitsandbytes fits the model. The production stack runs it.

Toolbox, not technique.


Welcome back. Four weeks in, and you've now fine-tuned two architectures, compared them, recommended one for deployment, and spent last week diagnosing how both models fail. This week, we compress them. You take the encoder and the decoder you trained in Week 3, you quantize their weights to int8 and int4, and you measure what changed: accuracy, per-tier accuracy, latency, peak VRAM, calibration. Six configurations, five numbers each, one deployment decision at the end. By the time you leave today you will have opinions about quantization grounded in your own measurements, not in folklore. And I'll tell you in advance — at least one of the things you'll measure is going to flatly contradict a promise quantization is routinely sold on.

One week ago you sat in this room and learned a lesson that matters today. Your encoder is one hundred forty-nine million parameters, pretrained on roughly two trillion tokens. Your decoder is four hundred ninety-four million, pretrained on eighteen trillion. Folk wisdom: bigger plus more data should mean better calibrated. Not what you measured. Encoder ECE six point two percent. Decoder ECE ten percent — substantially worse. Larger models have more expressive power to concentrate probability mass and tend to drift toward overconfidence after fine-tuning. The lesson: never predict calibration from size. You measure it. Today the same lesson applies again. We will predict a whole lot of things about quantization, and then we will measure them.

Here is the one sentence I want you to leave with today. Quantization is a toolbox, not a technique. The wording is very specific. "Technique" would imply one thing called quantization with one set of trade-offs, and you apply it or you don't. That's how it's often taught. It's wrong. Quantization in twenty twenty-six is a whole family of tools — bitsandbytes, AWQ, GPTQ, FP8, SmoothQuant, GGUF — each targeting a specific constraint, each best on some hardware and worst on others. The claims you read in blog posts about what quantization buys you are routinely true on one setup and false on another. So the professional skill is not "do I quantize." It's "which tool matches this constraint, on this hardware, for this model size." Only way to know is to measure. We'll come back to that sentence at the end of class.

Here's the rhythm of the day. Lecture first — we'll build vocabulary, set up the framing, look at what the production stack actually uses in twenty twenty-six, and walk through the measurements you'll take. Then the lab, eighty minutes, where you load both your Week Three models, quantize each one to int8 and to int4, and measure six configurations end to end. F1, per-tier accuracy deltas, inference latency, peak GPU memory, calibration. After class, the homework extends it with a split-val calibration experiment, named-class trajectories across precisions, a constraint-envelope check, and a five-section memo. That memo is where the week's intellectual work really lives. We'll walk through the rubric at the very end of lecture so you know exactly what you're writing toward when you sit down with it later.

Quick orientation on the proper nouns. bitsandbytes powered your QLoRA training in Week Three and powers the quantized loading path in today's lab. AWQ — activation-aware weight quantization — is the twenty twenty-six production default for int4 inference. GPTQ is AWQ's precursor, now moving from frontier to baseline. FP8 is the eight-bit floating point format Nvidia added for Hopper and newer — H100, H200, Blackwell. SmoothQuant is a weight-and-activation quantization method used on Turing-class hardware, which is what your T4s are, where FP8 isn't available. GGUF is the format llama dot cpp uses for CPU and edge inference. vLLM, TensorRT-LLM, and SGLang are inference servers — what ships these kernels to production. Every one of these names will connect to a specific use case by the time we finish Act Two.

Section one. We will absolutely get to measurement, but not yet. First we have to look at the underlying operation — what actually happens numerically when you quantize a weight — and we'll look at the specific algorithm we're going to run in today's lab, which is the LLM.int8 path implemented by bitsandbytes. Once those mechanics are clear in your head, the measurement results we look at later become interpretable.

Simplest possible question. A weight is a number. You can store it in thirty-two bits, sixteen, eight, four, or fewer. Fewer bits, less precision, less memory. Your Week Three training used fp16 — half-precision floating point, sixteen bits per weight. Today's lab takes your fp16 checkpoint and maps each weight into eight-bit integers, then into four-bit integers. Int8 represents a weight as one of two hundred fifty-six discrete values. Int4 as one of sixteen. The mapping has to preserve what the model learned, which requires calibration: for each weight tensor, compute a scale factor mapping the fp16 range onto the integer range, store the scale factor separately for the inverse at compute time. Quantization is that mapping plus the inverse, every forward pass. The interesting part is what it costs you on a real model.
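
Here is that mapping and its inverse as a minimal sketch, assuming symmetric absmax scaling with a single scale factor per tensor. Real libraries use per-block or per-channel scales, and this sketch uses the symmetric range ±127 (±7 for four bits) and leaves the most negative code unused. The function names are hypothetical.

```python
def quantize_absmax(weights, bits=8):
    """Map floats onto signed integers using one scale factor for the tensor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 7 for int4
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Inverse mapping, applied at compute time."""
    return [v * scale for v in q]


if __name__ == "__main__":
    weights = [0.02, -0.5, 0.13, 0.31, -0.07]
    for bits in (8, 4):
        q, scale = quantize_absmax(weights, bits)
        restored = dequantize(q, scale)
        worst = max(abs(w - r) for w, r in zip(weights, restored))
        print(f"int{bits}: codes={q} scale={scale:.5f} worst error={worst:.5f}")
```

The round-trip error per weight is at most half the scale factor, which is why going from eight bits to four makes the worst case roughly eighteen times coarser for the same tensor.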

The specific algorithm we run today is LLM.int8, Dettmers and colleagues, NeurIPS twenty twenty-two. The insight is clever. In large transformers, a small number of feature dimensions carry outlier magnitudes in the activations that break straightforward quantization: you can't fit them into int8 without losing signal that's load-bearing for accuracy. So LLM.int8 splits each matrix multiplication into two paths. The bulk path quantizes the ordinary dimensions to int8, multiplies, and rescales the result. The outlier path keeps the small handful of outlier dimensions, together with the weight rows they meet, in fp16 and multiplies in fp16. The two outputs sum at the end. You get nearly-int8 memory savings on the weights while preserving accuracy on the outlier-sensitive dimensions. bitsandbytes implements this, and in your lab today you invoke it through exactly one flag in the model loader.
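
To make the two paths concrete, here is a toy sketch of the decomposition in plain Python. It is an illustration under assumptions, not the bitsandbytes kernel: outlier dimensions are picked by a magnitude threshold of 6.0 on the activations, the int8 path uses one scale per activation row and one per weight column, and all names are hypothetical.

```python
def absmax_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax."""
    m = max((abs(v) for v in values), default=0.0)
    return m / qmax if m > 0 else 1.0


def matmul_int8_with_outliers(X, W, threshold=6.0):
    """Compute X @ W with an int8 bulk path and a full-precision outlier path.

    X: tokens x features (activations), W: features x outputs (weights).
    """
    n_tokens, n_feat, n_out = len(X), len(W), len(W[0])

    # 1. Outlier feature dimensions: any activation magnitude >= threshold.
    outliers = [j for j in range(n_feat)
                if any(abs(X[i][j]) >= threshold for i in range(n_tokens))]
    regular = [j for j in range(n_feat) if j not in outliers]

    # 2. Quantize the regular part: row-wise scales for X, column-wise for W.
    sx = [absmax_scale([X[i][j] for j in regular]) for i in range(n_tokens)]
    sw = [absmax_scale([W[j][k] for j in regular]) for k in range(n_out)]
    Xq = [[round(X[i][j] / sx[i]) for j in regular] for i in range(n_tokens)]
    Wq = [[round(W[j][k] / sw[k]) for k in range(n_out)] for j in regular]

    Y = [[0.0] * n_out for _ in range(n_tokens)]
    for i in range(n_tokens):
        for k in range(n_out):
            # Integer accumulate, then rescale back to floating point.
            acc = sum(Xq[i][r] * Wq[r][k] for r in range(len(regular)))
            int8_part = acc * sx[i] * sw[k]
            # 3. Outlier dimensions stay in floating point.
            fp_part = sum(X[i][j] * W[j][k] for j in outliers)
            Y[i][k] = int8_part + fp_part  # recompose
    return Y, outliers
```

Notice how much bookkeeping surrounds the integer multiply: finding outliers, computing scales, rescaling, and summing two results. On a small model that bookkeeping is a large share of the total work.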

Read the quote on the slide carefully. It's from the LLM.int8 paper's own runtime discussion — not from a critic of the paper, from Dettmers and his coauthors. The algorithm was designed for very large models. For models in the hundreds of billions of parameters, the LLM.int8 decomposition is genuinely a win — you save a huge amount of memory and the inference is competitive. But when you run that same decomposition on a small model, the overhead of splitting the matrix into the int8 fast path and the fp16 outlier path can actually exceed the savings of doing most of the work in int8. And the paper says this explicitly. For any model below six point seven billion parameters, the authors warn you that the int8 procedure may cost more than it saves. Now look at your two models. Your encoder is one hundred forty-nine million parameters. Your decoder is four hundred ninety-four million. Both of them are firmly, unambiguously below the six point seven billion threshold the paper flagged. So when you run the int8 lab cell today, you are running the algorithm in exactly the regime its own authors told you to be careful about. The numbers are about to show you what that means in concrete terms.

Int4 is four bits per weight, sixteen representable levels. Uniform quantization — evenly spaced from minus eight to plus seven — wastes precision, because trained weights aren't uniformly distributed. They cluster near zero with a long tail. NF4, normal float four, from the QLoRA paper by Dettmers and colleagues in twenty twenty-three, places its sixteen levels non-uniformly to match where the weights actually live. A given number of levels covers more of the actual information. Double quantization is a separate trick that compresses the scale factors themselves. bitsandbytes packages both together under the nf4 flag. And to connect the dots: this is the same recipe QLoRA used to fine-tune a sixty-five billion parameter model on a single forty-eight gigabyte GPU. Same machinery you ran in Week Three.
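
A small sketch of the idea behind non-uniform levels, assuming levels placed at evenly spaced quantiles of a standard normal. This is an illustration only: the real NF4 codebook is built differently in its details (for example, it contains an exact zero), so treat these values as NF4-like and not as the bitsandbytes table.

```python
from statistics import NormalDist


def uniform_levels(n=16):
    """Evenly spaced levels on [-1, 1]."""
    return [-1 + 2 * i / (n - 1) for i in range(n)]


def normal_quantile_levels(n=16):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]


def nearest_level(x, levels):
    """Quantize x to the closest representable level."""
    return min(levels, key=lambda lv: abs(lv - x))


if __name__ == "__main__":
    for name, levels in [("uniform", uniform_levels()),
                         ("quantile", normal_quantile_levels())]:
        print(name, [round(v, 3) for v in levels])
```

Print both lists and look at the spacing: the quantile-based levels crowd around zero, where most trained weights sit, and spread out toward the ends.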

Important framing. bitsandbytes was designed for training. Primary purpose: let a small team fine-tune a model they couldn't otherwise afford by holding it at int4 during backprop instead of fp16 or bf16. Your Week Three decoder was loaded through exactly this machinery. QLoRA on a single T4 exists because bitsandbytes exists. But it was never engineered as the inference-time path you ship to production. For inference at scale the production stack uses different tools — we'll meet them shortly. Today we're deliberately using bitsandbytes at inference time. Not because that's what you'd do at work. We're doing it so you can see, with your own measurements, what happens when you use the training tool in the deployment regime. That gap between design and usage explains almost every surprising number you'll see today.

Here is the three-promise pitch you will hear about quantization from blog posts, vendor marketing, and tutorials. Promise one: accuracy is preserved — the model gets just as good answers, only slightly less certain about them, after you quantize. Promise two: memory drops — the weights are half or a quarter as big, so peak memory on the GPU drops correspondingly. Promise three: latency drops — there's less data to move into the matrix-multiplication units, so inference is faster. All three of those sound reasonable, and you've probably read all three in some form. The first one is usually true for models in the regime we care about. The other two are where it gets interesting. Your lab today is going to check all three promises, on your specific hardware, with your specific tool, on your specific models. Some of them are going to hold. Some of them are going to reverse direction. Pay attention to which.

Before we move on, do the predict-then-observe exercise. For each of the three promises, write down what you think will happen on your T4 when you load the encoder or the decoder you trained, and compare fp16 to int8. Will macro F1 stay the same, drop, or improve? Will peak VRAM drop, stay the same, or go up? Will latency drop, stay the same, or go up? Not what the blog posts predict — what you predict, with whatever priors you've got. The answers you write right now will be checked against your own measurements in the lab in about ninety minutes. The rubric for your memo explicitly rewards engaging with your own numbers and engaging with how those numbers compare to your priors. Skipping this prediction step makes that engagement shallow. Take twenty seconds. Write the three predictions.

Section two. So at this point you know what bitsandbytes is, you know what LLM.int8 is, you know what NF4 is. If the question is "what tool would I reach for at work, when I need quantized inference in production," the answer is honestly none of those. The production inference tool in twenty twenty-six is something different, and this section is about naming it. These are names you'll see on job postings, in Slack channels, and in architecture diagrams at any company running large-model inference today.

This is honestly the single most important slide for your professional trajectory in applied ML. The bar chart on the right shows inference throughput on Qwen two point five thirty-two B, running through vLLM, measured in tokens per second. Four numbers to remember. AWQ with the Marlin kernel — seven hundred forty-one tokens per second. GPTQ with Marlin — seven hundred twelve. Unquantized fp16 — four hundred sixty-one. bitsandbytes — one hundred sixty-eight. Read that gap one more time. AWQ is four point four times faster than bitsandbytes on the exact same hardware, on the exact same model. That gap is the difference between what you've learned to use today, in your lab, and what the production stack actually runs at companies that ship LLM inference at scale. One important caveat — these ratios are at thirty-two billion parameter scale on an H one hundred. They will not transfer one-to-one to your one hundred forty-nine million encoder or four hundred ninety-four million decoder. AWQ also pays kernel overhead that proportionally grows on tiny models. The qualitative direction is right — the production stack is faster than bitsandbytes — but if you write in your memo "AWQ would be four point four times faster on my one hundred forty-nine million encoder," you're factually wrong. The lesson is direction, not magnitude. These numbers should inform how you write memo section four, where you're explicitly asked to acknowledge that your T4 bitsandbytes measurements understate what a real production stack would achieve. We're using the training tool. The production tool is over there.

AWQ stands for activation-aware weight quantization. It was published by Lin and colleagues at MIT and collaborators, and it won best paper at MLSys twenty twenty-four. The insight is deceptively simple, but it took the field a while to land on it. If you just quantize every weight aggressively to four bits, accuracy drops noticeably. If you quantize most weights aggressively, but you identify the roughly one percent of weights that really matter — and here's the key — ranked by the magnitude of the activations they multiply, not by their own weight magnitude — and you protect those salient weights from aggressive quantization, your accuracy holds up. The practical impact of this paper has been enormous. Major model families now ship pre-quantized AWQ checkpoints on HuggingFace. Every serious inference server — vLLM, TensorRT-LLM, SGLang — has optimized AWQ kernels. If you go into a job in applied ML and someone says "quantize this model for deployment," AWQ is what they probably mean. This is the paper you're most likely to see cited in an architecture review at work.
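
The selection step can be sketched in a few lines, assuming saliency is scored by the mean absolute activation each input channel sees on a small calibration set. This covers only the ranking idea; how the paper then protects the selected channels is not reproduced here, and the function name and the toy data are hypothetical.

```python
def salient_channels(activations, fraction=0.01):
    """Indices of input channels with the largest mean |activation|.

    activations: calibration samples x input channels.
    fraction:    share of channels to flag as salient (at least one).
    """
    n_samples, n_channels = len(activations), len(activations[0])
    mean_abs = [sum(abs(row[j]) for row in activations) / n_samples
                for j in range(n_channels)]
    k = max(1, round(fraction * n_channels))
    ranked = sorted(range(n_channels), key=lambda j: mean_abs[j], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    calib = [[0.10, 0.20, 9.0, 0.30],
             [0.20, 0.10, 7.5, 0.10],
             [0.05, 0.30, 8.2, 0.20]]
    print(salient_channels(calib, fraction=0.25))
```

The ranking never looks at the weights themselves, which is the point of "activation-aware": a small weight that multiplies a large activation can matter more than a large weight that multiplies almost nothing.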

GPTQ came before AWQ. Frantar and colleagues from IST Austria and ETH Zurich, ICLR twenty twenty-three. One-shot post-training quantization with second-order Hessian-based error correction. Mathematically beautiful, broadly deployed, and for a year and a half it was what you reached for. In twenty twenty-six, it's being displaced. The AutoGPTQ wrapper most downstream libraries used was archived in April twenty twenty-five. vLLM's RFC thirty-nine five eighty-three, from early twenty twenty-six, proposes deprecating GPTQ and bitsandbytes from vLLM entirely on grounds of low usage versus maintenance burden. You'll still encounter GPTQ in older checkpoints and in the literature — you should be able to read the paper — but you wouldn't choose it for a new project today.

FP8 is the eight-bit floating-point format Nvidia introduced with Hopper-class hardware, starting with the H100. Kurtic and colleagues at Neural Magic published an empirical evaluation in twenty twenty-four that characterized FP8's behavior across the Llama three point one family. The headline finding is striking: FP8 in the eight-bit-weight eight-bit-activation regime is effectively lossless — accuracy stays within fractions of a percent of fp16 across every model size they tested. The catch, and it's a real catch for us, is hardware. FP8 compute paths require FP8-capable tensor cores: Hopper-class (H100, H200), Blackwell, and Ada-class GPUs (L4, RTX 40-series) all have them. Your T4s are Turing-class, compute capability seven point five, which has no hardware FP8 support at all. The operational fact for this course is simple: there is no FP8 path on T4. So when you write your memo and someone asks "why don't we just use FP8," the honest answer on your hardware is "we can't." If we had H100s in this room, it would be one of the first configurations we'd measure. We don't.

SmoothQuant is the fourth tool you need to know. Xiao Guangxuan and colleagues at MIT published this at ICML twenty twenty-three. The difference from AWQ and bitsandbytes is important and worth pausing on. Those two are weight-only quantization — they shrink the weights, but activations stay in higher precision. SmoothQuant quantizes both weights AND activations to eight bits. That's the W8A8 regime — eight-bit weights, eight-bit activations. The technical insight is that activations have certain channels with very large outliers that make direct quantization hard. So SmoothQuant applies a mathematically equivalent offline rescaling that migrates the difficulty from activations onto the weights, and then both can quantize cleanly. On modern hardware like H100, FP8 has largely displaced SmoothQuant — same regime, fewer steps. But on older hardware like your T4s, where FP8 isn't available at all, SmoothQuant remains the mainstream W8A8 path in TensorRT-LLM and in ONNX Runtime.
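The "mathematically equivalent offline rescaling" can be checked in a few lines. This is a sketch on synthetic matrices, assuming the per-channel scale from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), with alpha = 0.5. The function name and the planted outlier channel are illustrative only, and no actual int8 quantization is performed here.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Migrate activation outliers into the weights without changing X @ W.

    X: (n_samples, in_features) activations
    W: (in_features, out_features) weights
    """
    act_max = np.abs(X).max(axis=0)            # per input channel
    w_max = np.abs(W).max(axis=1)              # per input channel
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    return X / s, W * s[:, None]               # (X / s) @ (s * W) == X @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 16))
X[:, 3] *= 100.0                               # one synthetic outlier channel
W = rng.normal(size=(16, 8))

X_s, W_s = smooth(X, W)
print("outputs match:", np.allclose(X @ W, X_s @ W_s))
print("activation max before / after:", np.abs(X).max(), np.abs(X_s).max())
```

The layer output is unchanged, but the activation range that an eight-bit grid has to cover shrinks, at the cost of a somewhat wider weight range.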

One more tool, then we wrap the toolbox. GGUF is the format used by llama dot cpp, the C-plus-plus inference library originally written to run large language models on consumer CPUs. Not strictly a quantization algorithm — a container format wrapping schemes like Q4_K_M and Q5_K_S, each with their own trade-offs, all optimized for CPU. Production use case is simple: anywhere you don't have a GPU. Laptops, phones, embedded devices, Raspberry Pis. "Run this on-device without a server" — GGUF is the tool. Slower than GPU inference by a wide margin, but where no GPU exists, the trade is worth taking. Not on today's menu — all our hardware has a GPU. Know the name and what it points at.

Slide-summary of the toolbox. Left column is the constraint you face. Middle column is the tool. Right column is where it lives in the ecosystem. If your constraint is training memory — you want to fine-tune a model that barely fits in memory — you reach for bitsandbytes NF4 via PEFT. That is the QLoRA recipe you ran in Week Three. If your constraint is production int4 inference at scale, you reach for AWQ through vLLM or TensorRT-LLM or SGLang. If you're on H100 or Blackwell hardware and you want lossless compression, FP8 through TransformerEngine. GPTQ for older checkpoints you encounter. SmoothQuant when you need W8A8 and you don't have FP8. GGUF when you don't have a GPU at all. Each constraint has its tool. Today's lab uses the very first row — bitsandbytes — for both int8 and int4 inference. At work, over your career, you would touch most of the other rows.

Section three. You've now got the vocabulary, and you've got the tool landscape. Before we look at any numbers from the lab, we have to set the measurement frame — what exactly we're measuring, why these five things and not others, and how to read each of them. Three short slides on framing, then we go to the numbers.

Six configurations is what you're going to produce in the lab. Two models from Week Three — the one hundred forty-nine million parameter encoder and the four hundred ninety-four million parameter decoder. Three precisions each — fp16 as your baseline, int8 via LLM.int8, int4 via NF4. For every one of those six configurations, you take five measurements. Macro F1 tells you aggregate quality. Peak VRAM tells you how much GPU memory you actually paid. Latency tells you how fast the model served at batch size thirty-two — median of five timed batches after three warmup batches, so the number is stable. Per-tier delta accuracy tells you WHERE the quality landed — head, mid, or tail of the class distribution. ECE tells you whether the model's confidence distribution still matches its empirical accuracy. Six times five is thirty numbers. They collapse cleanly into a summary table and two plots.
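The latency protocol described here (three warmup batches, then the median of five timed batches, reported per example) could be sketched as follows. The function and parameter names are hypothetical and the lab notebook's own timing code may differ. On a GPU the callable would also need to wait for the device to finish, for example by calling torch.cuda.synchronize() inside run_batch, which is an assumption about your setup.

```python
import statistics
import time

def median_latency_ms(run_batch, batch_size=32, warmup=3, timed=5):
    """Median per-example latency in milliseconds over `timed` batches.

    run_batch: zero-argument callable that runs one full batch through the model
               and returns only after the work is actually finished.
    """
    for _ in range(warmup):          # warmup runs are executed but never timed
        run_batch()
    per_example_ms = []
    for _ in range(timed):
        start = time.perf_counter()
        run_batch()
        elapsed = time.perf_counter() - start
        per_example_ms.append(1000.0 * elapsed / batch_size)
    return statistics.median(per_example_ms)

# Stand-in workload: sleeping 32 ms per "batch" should report roughly 1 ms per example.
print(median_latency_ms(lambda: time.sleep(0.032)))
```

The median is used instead of the mean so that one slow batch on a shared instance does not move the reported number.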

The Pareto frame is how you think about a multi-axis trade-off — worth investing thirty seconds on, because we come back to it. Six points, two axes. For each pair, ask: is one strictly better than the other on both axes? If yes, the worse one is Pareto-dominated — you'd never ship it, another option beats it on everything you care about. If no, neither dominates the other, and the right choice between them depends on which axis matters most for your deployment. The points that no other point dominates form the Pareto frontier, and deployment decisions live on that frontier by definition. This is honestly the single most common conceptual frame in applied ML trade-off analysis. You'll see it for model size versus quality, accuracy versus latency, memory versus throughput. Same frame, different axes, every time.
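A minimal sketch of the dominance check, for points given as (latency in milliseconds, macro F1). The four points are invented for illustration and are not lab measurements. The check uses the usual convention that a dominating point is no worse on both axes and strictly better on at least one.

```python
def dominates(a, b):
    """True if point a dominates point b. Points are (latency_ms, macro_f1)."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]        # lower latency, higher F1
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better

def pareto_frontier(points):
    """Names of the points that no other point dominates."""
    return {
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }

# Hypothetical configurations, made up for this example.
points = {
    "A": (2.0, 0.20),   # fastest
    "B": (3.0, 0.25),   # highest F1
    "C": (4.0, 0.22),   # slower and worse than B
    "D": (2.5, 0.18),   # slower and worse than A
}
frontier = pareto_frontier(points)
print(sorted(frontier))
```

A and B both survive because each wins on a different axis. C and D each lose to some other point on both axes, so they drop out.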

The per-tier frame. Your aggregate macro F1 averages over all one hundred thirteen classes — wildly unequal classes. The top class has thirteen thousand training examples; the bottom few have under ten. Averaging a quality metric over such heterogeneous classes hides where the quality lives. So we bucket. Head is the top twenty classes by training frequency. Mid is the next forty. Tail is the bottom fifty-three. For each tier we compute the change in accuracy when we compress from fp16 to int8 or int4. The result tells you whether quantization damage is distributed evenly across tiers or concentrates somewhere. This is the core measurement of memo section two, and the rubric grades it heavily.
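The bucketing and the per-tier delta might look like the sketch below. It assumes integer class ids and prediction arrays, and it uses synthetic stand-in data; the function names are hypothetical and the lab notebook's implementation may differ. Only the tier sizes (twenty, forty, fifty-three) and the class count come from the lecture.

```python
import numpy as np

TIER_NAMES = ("head", "mid", "tail")

def assign_tiers(train_labels, n_classes, n_head=20, n_mid=40):
    """Map each class id to 0 (head), 1 (mid) or 2 (tail) by training frequency."""
    counts = np.bincount(train_labels, minlength=n_classes)
    order = np.argsort(-counts, kind="stable")      # most frequent class first
    tier = np.full(n_classes, 2)
    tier[order[:n_head]] = 0
    tier[order[n_head:n_head + n_mid]] = 1
    return tier

def tier_delta_accuracy(y_true, pred_base, pred_quant, tier):
    """Accuracy of the quantized model minus the baseline, per tier."""
    deltas = {}
    example_tier = tier[y_true]                     # tier of each validation example
    for code, name in enumerate(TIER_NAMES):
        mask = example_tier == code
        acc_base = np.mean(pred_base[mask] == y_true[mask])
        acc_quant = np.mean(pred_quant[mask] == y_true[mask])
        deltas[name] = float(acc_quant - acc_base)
    return deltas

# Synthetic stand-in: class 0 is the most frequent, class 112 the rarest.
n_classes = 113
train_labels = np.repeat(np.arange(n_classes), np.arange(n_classes, 0, -1))
tier = assign_tiers(train_labels, n_classes)

y_true = np.arange(n_classes)       # one validation example per class
pred_fp16 = y_true.copy()           # pretend the baseline is always right
pred_int4 = y_true.copy()
pred_int4[[0, 1]] = [1, 0]          # two head-class predictions go wrong
deltas = tier_delta_accuracy(y_true, pred_fp16, pred_int4, tier)
print(deltas)
```

In this toy setup all of the damage lands in the head tier, which an aggregate metric would report only as a small overall drop.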

Recall the Week Four lesson. A model can be correct fifty-seven percent of the time, and at the same time it can systematically claim to be correct ninety percent of the time. That gap between the model's claimed confidence and its empirical accuracy, averaged over confidence bins, is what the expected calibration error, ECE, measures. For deployments that gate on model confidence — for example, route high-confidence predictions to the model, low-confidence to a human reviewer — ECE matters as much as accuracy does. Temperature scaling is a beautifully simple, one-parameter post-hoc fix. You divide all the model's logits by a single scalar T, you fit that T to minimize negative log-likelihood on a calibration set, and the resulting re-softmaxed confidences usually track accuracy much better than they did before. Quantization can shift a model's confidence distribution. So your lab measures ECE before AND after temperature scaling at each precision, on both models. That's the two-by-three ECE table that goes into memo section three.
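As a refresher on the metric itself, here is a minimal ECE sketch using equal-width confidence bins. The choice of ten bins is an assumption, since the lecture does not state the bin count used in the lab. The example reuses the numbers above: a model that claims ninety percent confidence on every prediction and is right fifty-seven percent of the time.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weighted average of |accuracy - mean confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap      # weight the gap by the bin's share of examples
    return ece

confidences = np.full(100, 0.9)                       # always claims 90%
correct = np.array([1] * 57 + [0] * 43)               # right 57% of the time
ece = expected_calibration_error(confidences, correct)
print(round(ece, 2))   # 0.33
```

With every prediction in a single bin, ECE reduces to the plain gap between claimed confidence and accuracy.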

The final framing slide. Real deployment decisions are not "which configuration has the highest F1." They are constraint satisfaction problems. Given a hardware budget, a throughput requirement, a quality floor, and a calibration ceiling — which configurations meet ALL of the constraints simultaneously? The ones that do are your viable set. Within that viable set, you pick based on which you prefer, or which gives you the most headroom for growth, or which is cheapest to operate at scale. Today's hypothetical constraint envelope is on the slide. Single T4 GPU. A hundred requests per second sustained, which translates to roughly ten milliseconds per example at batch thirty-two. Macro F1 at least zero point two zero. Post-scaling ECE at most zero point zero five. Your homework checks each of the six configurations against each of the four constraints. Memo section four is where you make the actual deployment call and defend it.

Now we look at numbers. Everything on the next several slides is what the instructor verification produced on Kaggle T4 — these are the exact numbers you're going to see when you run the lab in about ninety minutes. Your bootstrap seed is fixed, so the tier delta numbers will match exactly. The latency numbers will move by five to fifteen percent run-to-run due to T4 shared-instance noise — that's expected and normal, you won't reproduce them to the millisecond, but the directions and ratios will hold.

This is the central table of the week. Six rows, one per configuration, five columns of measurement. You will see this table on your screen when you run the lab. Let me read across two of the rows so you have an anchor. Encoder fp16: macro F1 twenty-one zero, two point three one milliseconds per example, zero point four one gigabytes of VRAM, ECE post-scaling at about four percent. Decoder int8, three rows down: macro F1 twenty-four oh-one, twelve point six five milliseconds per example, one point eight seven gigabytes of VRAM, ECE post-scaling at about seven percent. Now take a few seconds and scan across all six rows. Notice anything yet? Several things in this table are immediately surprising once you start looking carefully, and we are about to spend the next five slides unpacking them one at a time.

First Pareto plot. X-axis is latency in milliseconds per example. Y-axis is macro F1. Higher is better on Y. Lower is better on X. The top-left corner is dominance. Blue points are the encoder, red points are the decoder. Circles are fp16, squares are int8, triangles are int4. Look at the encoder first. Blue circle at two point three milliseconds — that's the fastest. Blue square at five point eight milliseconds — slowest of the three. Blue triangle at three point two — middle. Within the encoder, fp16 is faster than int4, which is faster than int8. That is the exact opposite of what the folk pitch promised you. Same pattern on the decoder. Red circle fastest, red square slowest. Among all six points on this plot, nothing dominates fp16-encoder on latency at comparable F1. That is the Pareto claim you make in memo section one.

Second Pareto plot. Same Y-axis and markers; the X-axis is now peak VRAM in gigabytes. Look at the encoder. Blue circle at zero point four one — that's fp16, the LEAST memory of the three. Blue square at zero point five nine. Blue triangle at zero point six two. fp16 uses less memory than int8, which uses less than int4. The exact opposite of what the bit-count argument would predict. Same story on the decoder — fp16 at one point one five gigabytes, both int8 and int4 above one point eight. The weights in int8 are half as large; in int4, a quarter. And yet peak VRAM at inference went UP. Something else is consuming the memory — the bitsandbytes quantization state, the fp16 outlier branches, scratch space for the decomposed matmuls. At small model scale the overhead exceeds the weight savings. The second broken promise.

Pull it all together. Three promises, checked against your own numbers. For int8 via bitsandbytes on your T4, only accuracy held. Encoder F1 from twenty-one zero-nine to twenty zero-seven — basically noise. Decoder F1 from twenty-three ninety-nine to twenty-four-oh-one — unchanged to three decimal places. Promise one, kept. Promise two, VRAM drops — reversed: up forty-four percent on encoder, sixty-three percent on decoder. Promise three, latency drops — reversed: two and a half times slower on both models. Two of three promises broken at this scale on this tool. This is the toolbox thesis in a single observation: most quantization claims are true somewhere and false elsewhere. A production quantization path on modern hardware — AWQ or FP8 through vLLM or TensorRT-LLM on a larger model — may keep all three. The point isn't "same algorithm, better hardware." It's that the result depends on the tool, the model scale, and the hardware. Your measurement is specific to all three. That's the lesson, made empirical.
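The percentages on this slide follow from the fp16 and int8 readings quoted in the lecture; a short check, using only those quoted values (the helper name is hypothetical):

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

# Peak VRAM in GB, fp16 -> int8, as quoted in the lecture.
encoder_vram_up = pct_change(0.41, 0.59)
decoder_vram_up = pct_change(1.15, 1.87)
# Encoder latency in ms per example, fp16 -> int8, as quoted in the lecture.
encoder_slowdown = 5.8 / 2.31

print(round(encoder_vram_up))      # 44
print(round(decoder_vram_up))      # 63
print(round(encoder_slowdown, 2))  # 2.51
```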

Per-tier damage chart. Two panels — encoder left, decoder right. Gray bars int8, red bars int4. Y-axis is delta accuracy versus fp16, with ninety-five percent CI whiskers on every bar. On the encoder, the gray int8 bars are all essentially at zero. None of those CIs cleanly exclude zero — int8 barely moved the tier-level predictions. The red int4 bars tell a different story: head down about three points, tail looks up about three points. On the decoder, int8 again is nothing statistically distinguishable. int4 on the decoder shows real damage — head and mid both down about three points, CIs cleanly exclude zero. Those are the asterisks you'll see in your lab table. But the asterisk has a limit. Next slide.

This is an important pedagogical slide. Pause on it. Your lab's tier table is going to show an asterisk next to the encoder tail delta at int4. The number is positive. The confidence interval is narrow. It cleanly excludes zero. By the simple read of your bootstrap output, this looks like a real effect — and a student who stops at "the CI excludes zero, therefore the effect is real" is going to land in the Satisfactory band on memo section two. So let me set up the stronger reading. The tail tier has two hundred ten validation examples spread across fifty-three classes. A three-percentage-point delta on that tier corresponds to roughly six flipped predictions, total. Six. A bootstrap confidence interval measures the sampling variance of those specific six flips under resampling the val set you have right now. It tells you, given those six flips, how much they would jitter around if you bootstrapped your existing data. It does NOT prove that a different stratified split, drawn with a seed other than forty-two, would show the same six examples moving. That kind of generalization is called external validity, and bootstrap doesn't give you external validity for free. External validity requires either enough flipped predictions to make resampling robust, or actual reshuffle replication on a fresh val set. Six predictions, frankly, isn't enough. The rubric for memo section two explicitly rewards students who flag this kind of finding as directional, pending a reshuffle test.
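To see how six flips can produce an interval that excludes zero, here is a sketch of a paired percentile bootstrap on a synthetic tail tier of two hundred ten examples. The function name, the two thousand resamples, and the synthetic correctness vectors are assumptions; the lab's bootstrap may be set up differently.

```python
import numpy as np

def bootstrap_delta_ci(correct_base, correct_quant, n_boot=2000, seed=42):
    """Percentile bootstrap CI for the paired accuracy delta (quant minus base)."""
    rng = np.random.default_rng(seed)
    diff = correct_quant.astype(float) - correct_base.astype(float)   # per example
    n = diff.size
    idx = rng.integers(0, n, size=(n_boot, n))      # resample examples with replacement
    deltas = diff[idx].mean(axis=1)
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return diff.mean(), lo, hi

# Synthetic tail tier: 210 examples, and int4 turns six wrong predictions right.
n_tail = 210
correct_fp16 = np.zeros(n_tail, dtype=bool)
correct_fp16[:60] = True
correct_int4 = correct_fp16.copy()
correct_int4[60:66] = True

delta, lo, hi = bootstrap_delta_ci(correct_fp16, correct_int4)
print(f"delta={delta:.4f}  95% CI=({lo:.4f}, {hi:.4f})")
```

The interval sits above zero, yet the whole effect is six examples. The bootstrap only resamples the validation set you already have, so it cannot tell you whether a fresh split would contain six such examples.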

Section five. We've covered accuracy, per-tier damage, latency, and memory. The fifth measurement is calibration, and it has its own story. Calibration in Week Five is the payoff of Week Four's groundwork — you already know what ECE is, you know how to compute it, you know how to fix it with temperature scaling. Now what we check is what quantization does to it. Does compressing weights move the calibration picture, and does T-scaling still rescue you when it does?

Bar chart. Six configurations along the x-axis. Pre-scaling ECE in lighter shade, post-scaling in darker. Black dashed horizontal line at zero point zero five — the post-scaling ECE ceiling in today's deployment envelope. Notice two patterns. First, the encoder bars all sit low — around three to six percent pre-scaling, all three post-scaling under four percent. Second, the decoder bars. All three pre-scaling around ten percent. Post-scaling brings them to about seven — right near the ceiling but not under it. All three encoder configurations sit below the ceiling line. All three decoder configurations sit above it. The decoder fails the ECE ceiling at every precision. This finding will matter when we get to deployment in a few minutes.

Important homework exercise. In the lab you fit T on all of val and evaluated on all of val — an in-sample estimate that might overstate how much T-scaling actually helps. In the homework you fit T on half A of val and evaluate the resulting T on half B, then compare. On the decoder, T-scaling drops half B's ECE by about three and a third percentage points across all three precisions — nearly identical recovery to the full-val number. T from half A generalizes cleanly. On the encoder at int4, the recovery shrinks visibly — from about two points at fp16 down to less than one point at int4. That's a real finding for memo section three. It suggests that when the logit dynamic range is compressed by int4, a scalar temperature has less headroom to work with. The rubric rewards a specific deployment implication tied to that number.
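The split-half protocol can be sketched on synthetic logits: fit T on half A by minimizing negative log-likelihood, then score half B with and without that T. Everything here is an assumption for illustration (the grid search, the synthetic data, the function names); the homework notebook may use a different optimizer and compares ECE, where this sketch reports held-out NLL. The synthetic logits are built so that the ideal temperature is two.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    probs = softmax(logits / T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.25, 5.0, 96)):
    """Pick the scalar T on a grid that minimizes NLL on the given split."""
    return float(min(grid, key=lambda T: nll(logits, labels, T)))

rng = np.random.default_rng(0)
n, k = 2000, 10
labels = rng.integers(0, k, size=n)
z = rng.normal(size=(n, k))
z[np.arange(n), labels] += 1.5       # signal on the true class
logits = 3.0 * z                     # calibrated logits would be 1.5 * z, so ideal T is 2

half = n // 2
T = fit_temperature(logits[:half], labels[:half])          # fit on half A
nll_before = nll(logits[half:], labels[half:], 1.0)        # evaluate on half B
nll_after = nll(logits[half:], labels[half:], T)
print(T, nll_before, nll_after)
```

Because T is fit on one half and scored on the other, any improvement on half B is an out-of-sample estimate, which is the point of the homework exercise.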

Here's why this calibration work matters. Many production deployments gate on confidence. You set a threshold — say zero point eight — above which you trust the model, below which you route to human review. That design assumes calibration. Overconfident models send too few examples to review and miss errors. Underconfident models flood reviewers with low-confidence examples and waste their time. Quantization can break calibration in both directions. Three options at deployment time. One: re-fit temperature scaling for each quantized checkpoint you ship. Two: accept a shifted threshold. Three: pick a configuration whose calibration already meets your ceiling without scaling. Memo section three asks you to make this concrete — which option fits your chosen deployment, and why.

Section six. We've now measured everything we set out to measure. Now is the synthesis moment — where measurement discipline becomes actual engineering judgment. This is the work of memo section four, which is the heaviest-weighted section in the entire memo at thirty points out of a hundred. Get this section right and you've shown me you can do the job.

Four constraints, on the slide. Let me explain why each one. Hardware is a single T4 — that's a product decision about your cloud spend, given to you, not negotiated. Throughput is a hundred requests per second sustained, which at batch size thirty-two translates to ten milliseconds per example maximum latency. Quality is a macro F1 floor of zero point two zero — low enough to admit our models, high enough to rule out a truly broken configuration. Calibration is post-scaling ECE at most zero point zero five — a threshold that corresponds to being able to deploy a confidence-gated system without re-engineering the threshold logic for each model swap. For each of your six configurations, you check all four constraints. The output is a six-by-four table of true-false. Configurations that pass all four are your viable set.
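A sketch of the constraint check for the two rows read out from the central table. The dictionary layout and function name are hypothetical. Treating "single T4" as "peak VRAM fits in fifteen point six gigabytes" is an assumption, and the ECE values are the rounded figures quoted in the lecture.

```python
# Deployment envelope from the slide; the VRAM bound is an assumed reading of "single T4".
ENVELOPE = {
    "max_vram_gb": 15.6,
    "max_latency_ms": 10.0,
    "min_macro_f1": 0.20,
    "max_ece_post": 0.05,
}

def check(config):
    """Return one pass/fail flag per constraint for a measured configuration."""
    return {
        "hardware": config["peak_vram_gb"] <= ENVELOPE["max_vram_gb"],
        "throughput": config["latency_ms"] <= ENVELOPE["max_latency_ms"],
        "quality": config["macro_f1"] >= ENVELOPE["min_macro_f1"],
        "calibration": config["ece_post"] <= ENVELOPE["max_ece_post"],
    }

# The two anchor rows quoted in the lecture (ECE values rounded as spoken).
encoder_fp16 = {"macro_f1": 0.210, "latency_ms": 2.31, "peak_vram_gb": 0.41, "ece_post": 0.04}
decoder_int8 = {"macro_f1": 0.2401, "latency_ms": 12.65, "peak_vram_gb": 1.87, "ece_post": 0.07}

for name, cfg in [("encoder fp16", encoder_fp16), ("decoder int8", decoder_int8)]:
    flags = check(cfg)
    print(name, flags, "viable" if all(flags.values()) else "not viable")
```

Running the same check over all six rows gives the six-by-four table; a configuration is viable only if every flag in its row is true.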

Here is the check. Encoder fp16, encoder int8, encoder int4 — all three pass all four constraints. Decoder fp16 passes latency and F1, fails ECE. Decoder int8 additionally fails latency — twelve point six five milliseconds exceeds the ten-millisecond ceiling. Decoder int4 passes latency and F1, fails ECE. All three decoder configurations are out because of calibration. Post-scaling ECE around seven percent, ceiling zero point zero five — fail at every precision. Last week's deployment memo, the one that said "ship the decoder because it has higher F1," is reversed here. Not because the decoder got worse — its F1 is still higher. But because a constraint we weren't measuring in Week Three turns out to bind. The decoder is disqualified at every precision on an axis orthogonal to the one that made it look better a week ago.

So within the three viable configurations, which do you pick? The F1 numbers are so close — twenty-one zero, twenty zero seven, twenty-one two — that they're within measurement noise of each other. No reasonable story separates them on quality. On latency and VRAM, fp16 is the clear winner. Two point three milliseconds beats three point two and five point eight. Zero point four one gigabytes beats zero point five nine and zero point six two. Both Pareto axes point at fp16. Recommendation: encoder fp16. That's your memo section four answer, given the stated envelope. Notice what happened: the decoder's F1 advantage from Week Three disappeared under calibration scrutiny, and within the encoder set, quantization's supposed benefits reversed. The answer that survives is the simplest one — no quantization, smaller model. You earn the right to recommend the boring answer by having the measurements that show why.

Your answer is T4-plus-bitsandbytes-specific. Four axes shift it. First, if we moved to H100 with vLLM and AWQ, the AWQ int4 path becomes roughly four times faster than bitsandbytes int4 per the vLLM throughput benchmark on the production-stack slide from Act two, and more memory-efficient too — note those numbers are from a thirty-two billion parameter model, so the exact ratio doesn't transfer to our small models, but the qualitative point is that the int4 path likely becomes the dominant Pareto point, not fp16. Second, if we raise the ECE ceiling from zero point zero five up to zero point zero eight, the decoder configurations become viable, and decoder fp16 wins on F1 again. Third, if we moved to an L4 or A10 GPU, decoder int8's latency might come under the ten-millisecond ceiling, and if ECE relaxed too, decoder int8 becomes viable. Fourth, if we moved to on-device with no GPU, we'd be in GGUF territory and quantization becomes mandatory. The rubric for memo section four rewards naming at least one of these axes. Pick the one most relevant to your imagined deployment and write a counterfactual paragraph.

And a quick Week Six preview. Distillation. You've been compressing within a model family this week — same model, fewer bits per weight. Distillation compresses across families. You take a teacher — here, your decoder — and you train a smaller student model to match the teacher's predictions on a dataset. The goal is for the student to keep most of the teacher's quality at a fraction of the inference cost. It's another tool in the same toolbox — a different compression axis, but the same measurement discipline. Same per-tier checks. Same Pareto reasoning. Same calibration concern. Same deployment envelope. The mental model you build this week transfers directly to next week. We'll do it with your Week Three decoder as teacher and the encoder as student, and you'll measure what transfers, what doesn't, and where the distilled student sits on the Pareto frontier relative to both original models.

Lab logistics. Eighty minutes in the lab. T4 accelerator on, persistence on, internet on, HF token attached as a secret. The notebook installs bitsandbytes on top of pre-installed packages — no --upgrade flag, that's been a source of crashes. About twenty-five minutes is actual compute; the remaining fifty-five is reading, predicting, interpreting. For the homework, attach your lab notebook's output to the homework notebook via Kaggle's "Add Input" → "Your Work" — that makes the four output files visible without creating a separate dataset. The discovery code finds them regardless of what slug Kaggle assigns. Memo is embedded in the homework notebook, same structure as Week Four. Rubric at assessments/week5_memo_rubric.md — read it before you write. Due Wednesday morning before next week's class. Questions, then break, then lab at three thirty.

One sentence to take with you, and we're done. Quantization is not a single thing you apply to a model. It is a cabinet of named tools — and you've now met all six of them — and your job as a practitioner is to choose from that cabinet based on the constraint you're actually under, measure what your chosen tool actually does on your specific hardware and your specific model, and defend the choice with your own numbers against a specific deployment envelope. That's the week. That's the memo. That's the skill. See you in the lab at three thirty.

If we have time, three more slides as a bonus. The story of trying to live-load a fourteen billion parameter model at int4 on a single T4 — same hardware your encoder fits on six times over — and what it taught me about the tool we've been using all class. If we don't have time today, the slides are in the deck, you can read them like a blog post. Toolbox thesis applies either way.

So I wanted to give you a wow moment. The plan was simple. Live load a fourteen billion parameter model at int4 on a single T4, in front of you, while you watch. The model is twenty-nine gigabytes at fp16. Your T4 has fifteen point six gigabytes of VRAM. The model is literally almost twice as big as the GPU you'd run it on. And the demo is supposed to be — quantization makes this work. So I built the verification notebook. Uploaded it to Kaggle. The notebook tries to download the fp16 weights into the working directory. Kaggle gives you twenty point nine gigabytes of working directory. The fp16 weights are twenty-nine point four. The download died partway through with operating system error twenty-eight, no space left on device. I could not even DOWNLOAD the demonstration model on the GPU instance I was going to run it on. That's a story by itself. The constraint that bound was not VRAM at inference time. It was disk space at download time. Production constraints are weirder than the textbook prepares you for.
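The size arithmetic behind this story is worth doing once by hand. The sketch below counts weight bytes only and assumes a parameter count of fourteen point seven billion, which is the figure consistent with the twenty-nine point four gigabytes quoted here. Real checkpoints and loaded models carry extra overhead on top of this.

```python
def weight_gigabytes(n_params, bits_per_weight):
    """Storage for the weights alone, in GB (10^9 bytes), ignoring all overhead."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 14.7e9          # assumed parameter count for the "fourteen billion" model
T4_VRAM_GB = 15.6          # from the lecture
KAGGLE_DISK_GB = 20.9      # working-directory budget, from the lecture

fp16_gb = weight_gigabytes(N_PARAMS, 16)
int4_gb = weight_gigabytes(N_PARAMS, 4)

print(f"fp16 weights: {fp16_gb:.1f} GB, fits on disk: {fp16_gb <= KAGGLE_DISK_GB}")
print(f"int4 weights: {int4_gb:.2f} GB, fits in VRAM: {int4_gb <= T4_VRAM_GB}")
```

The fp16 weights exceed the disk budget before VRAM ever comes into play, which is exactly the failure described above.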

Pre-quantized variants exist for exactly this reason. The Unsloth team has already done the int4 quantization step, taken the resulting weights, and uploaded them to Hugging Face as a separate repo. The bytes on disk are already int4. The download is about ten gigabytes — well inside the working directory budget. You point your loader at the unsloth slash Qwen two point five fourteen B Instruct bnb dash four bit repo, transformers reads the config, sees that the weights are pre-quantized, loads them straight into bitsandbytes' four-bit format with no intermediate fp16 step. Twelve seconds of load time. Peak VRAM nine point nine three gigabytes on a single T4 with five point seven gigabytes of headroom. And the model works. That quote on the slide is the actual completion the model generated when I asked it to classify a credit card late fee complaint. Coherent answer. On task. A fourteen point seven billion parameter model running on the same hardware your one hundred forty-nine million parameter encoder fits on six times over. THAT is what bitsandbytes was built for. It's not the inference acceleration story I've been making fun of all class. It's the fitting-things-on-hardware-they-shouldn't-fit-on story. Today we deliberately misused the tool, in your lab, to teach you what its failure mode looks like. The other half of the picture is up here.
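Loading a pre-quantized checkpoint might look like the sketch below. The repo id is my transcription of the spoken name, and the function name is hypothetical. It assumes torch, transformers, accelerate and bitsandbytes are installed and that a CUDA GPU is available; imports sit inside the function so that defining it does not require any of them.

```python
REPO_ID = "unsloth/Qwen2.5-14B-Instruct-bnb-4bit"   # assumed spelling of the repo named in the lecture

def load_prequantized(repo_id=REPO_ID):
    """Load a bitsandbytes pre-quantized checkpoint and report peak VRAM in GB.

    The quantization settings are read from the checkpoint's own config, so no
    quantization arguments are passed here.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    return tokenizer, model, peak_gb

# Usage on a GPU instance (not executed here):
#   tokenizer, model, peak_gb = load_prequantized()
#   print(f"peak VRAM: {peak_gb:.2f} GB")
```

The peak figure you see will depend on library versions and on what else is resident on the GPU, so treat the lecture's number as a reference point rather than an exact target.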

But — and this is the closer — you wouldn't ship this. Six point three tokens per second. That is slow enough that a user typing a message into a chat box and waiting for a response would feel the lag. It is not a product. Compare to the production stack slide from Act two. AWQ with Marlin via vLLM, seven hundred forty-one tokens per second on a thirty-two billion parameter model. Different model size, different hardware, so the numbers don't line up apples to apples — but the qualitative gap is enormous, and it points at the right shape of the answer. Bitsandbytes, here on the T4, fits the model. It got the fourteen billion parameter model loaded onto fifteen gigabytes of VRAM. It did its job — the job it was actually designed for. But it did NOT make the model fast. To make a fourteen billion parameter model fast at inference time you reach for AWQ, GPTQ, FP8 — the production stack. Different tool, different job. Same toolbox. That's the thesis we opened with — quantization is a toolbox, not a technique. You don't pick a technique. You pick a tool, you measure what it does on your hardware and your model, and you defend the choice. Now go to the lab.