Blog

Notes from building this course — what worked, what broke, what surprised us.

Distillation acted like a calibration regularizer. There was a cheaper one. I’ve been excited about Week 6 of this course — distillation — since the start of the term. Six weeks of fine-tuning, comparing, diagnosing, and compressing, and the experiments kept turning up th

Your int8 Quantization Is 2.5× Slower Than fp16 The LLM.int8 paper from 2022 told you this would happen. The blog tutorials skip that part.

No magic, but first: a transformers PR The core teaching arc of my applied deep learning course at CEU is simple: there’s no magic. Every part of a modern NLP pipeline — the data mix, the tokenizer, the model weights, the training re

When Better Means Worse I applied class weighting to a 113-category text classifier. Accuracy dropped five points. Macro F1 improved. Ten classes that the model had never once predicted correctly started getting nonzero F1.

You Can’t Scale Your Way Out of a Data Problem I fine-tuned a Qwen3-8B on a 113-category consumer complaint classification task with a severe long tail using the same basic LoRA classification setup I used on the smaller models (rank 16, alpha 32,

Half a Percent: Thoughts and Results on Decoders for Text Classification I put a classification head on Qwen2.5-0.5B and trained 0.46% of its parameters with LoRA. On a 113-class consumer complaints task with a brutal long tail,

44% Accuracy Without a Single Training Example I gave Opus 4.6 a list of 113 CFPB complaint categories and asked it to classify consumer complaints. No labeled examples in the prompt. No fine-tuning.

Your T4 Training Is 4x Slower Than It Should Be torch.cuda.is_bf16_supported() returns True on a Tesla T4. It’s telling you the truth. The problem is you didn’t ask the right question.

When Your “Ready to Use” Dataset Has the Same Category Listed Twice determined-ai/consumer_complaints_medium is a good teaching dataset. Pre-split, business-realistic, multiclass, compute-friendly. 64,000 training examples. 153 issue categories.

What Do You Teach When Compression Doesn’t Produce a Speedup? The plan for Week 5 was clean: students apply quantization to their model, benchmark the results, see the speedup, analyze whether compression hurts their weakest classes. Standard compression week. E

The Questions Model Cards Don’t Answer Model cards tell you GLUE scores, parameter counts, and license terms. They don’t tell you what happens when a non-expert runs the full workflow on constrained hardware.

Picking a Model for Teaching Picking a model for a research project: check the benchmarks, try it, move on. Picking a model for a teaching lab that 30 students will run for six weeks on free hardware: different exercise entirely.

Blog

Table of contents