Your T4 Training Is 4x Slower Than It Should Be

torch.cuda.is_bf16_supported() returns True on a Tesla T4. It’s telling you the truth. The problem is you didn’t ask the right question.

The bug

You’re fine-tuning a model on a Kaggle T4 notebook. You write the standard mixed-precision setup:

AMP_DTYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

This checks if the GPU can do bfloat16. The T4 says yes. You proceed. Training runs. Everything looks normal.

Except it’s running at 1.29 seconds per batch instead of 0.30 seconds per batch.

Why it happens

The Tesla T4 is a Turing-architecture GPU (compute capability 7.5). Its tensor cores support fp16, not bf16. When you ask it to do bf16 math, PyTorch obliges, but the work runs on a slower fallback path instead of the tensor cores. The computation is still correct. It’s just about 4x slower.

PyTorch even warns you about this, buried in the inductor output:

Tesla T4 does not support bfloat16 compilation natively, skipping

But is_bf16_supported() still returns True, because the GPU can do bf16. It just can’t do it fast.

The Ampere architecture (compute capability 8.0 and up, starting with the A100) added native bf16 tensor core support. If you’re on an A100, bf16 is great. If you’re on a T4, it’s a silent 4x penalty.
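To see the mismatch on your own machine, it can help to print both answers side by side. The sketch below is one way to do that, assuming PyTorch is installed. The names describe_bf16 and report_current_gpu are made up for this example, and the "major version 8 or higher" test is simply the Ampere rule described above.

```python
def describe_bf16(capability, reported_supported):
    """Explain what a GPU's bf16 answer really means.

    capability: (major, minor) tuple, as returned by
        torch.cuda.get_device_capability().
    reported_supported: the value of torch.cuda.is_bf16_supported().
    """
    major, minor = capability
    native = major >= 8  # Ampere or newer has bf16 tensor cores
    if native:
        return f"sm_{major}{minor}: bf16 is tensor-core accelerated"
    if reported_supported:
        return f"sm_{major}{minor}: bf16 runs, but is not accelerated; prefer fp16"
    return f"sm_{major}{minor}: bf16 is not available; use fp16"


def report_current_gpu():
    """Describe the current CUDA device (hypothetical helper)."""
    # torch is imported here so describe_bf16 stays usable without it.
    import torch

    if not torch.cuda.is_available():
        return "no CUDA device visible"
    name = torch.cuda.get_device_name()
    capability = torch.cuda.get_device_capability()
    reported = torch.cuda.is_bf16_supported()
    return f"{name}: {describe_bf16(capability, reported)}"
```

Calling print(report_current_gpu()) in a notebook cell gives a one-line summary. On a T4 where is_bf16_supported() returns True, the function takes the "runs, but is not accelerated" branch.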

The fix

Don’t trust is_bf16_supported(). Check the compute capability directly:

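import torch

AMP_DTYPE = None  # no CUDA device: let torch.autocast fall back to its default dtype
if torch.cuda.is_available():
    cc = torch.cuda.get_device_capability()  # (major, minor), e.g. (7, 5) on a T4
    AMP_DTYPE = torch.bfloat16 if cc[0] >= 8 else torch.float16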

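Compute capability 8.0+ means Ampere or newer, so use bf16. Anything below that, use fp16. With fp16, keep a gradient scaler in the training loop: fp16 has a much narrower range than bf16, and small gradients can underflow without one.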
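Here is a rough sketch of how that rule might slot into a training step. It is not the code from our notebook: pick_amp_dtype_name and make_train_step are hypothetical helpers, the model, optimizer and loss_fn arguments stand in for whatever you already have, and a CUDA device is assumed. It uses torch.cuda.amp.GradScaler, which works across many PyTorch releases; newer releases prefer torch.amp.GradScaler("cuda") and emit a deprecation warning for the older spelling.

```python
def pick_amp_dtype_name(capability):
    """Return "bfloat16" on Ampere or newer (major >= 8), else "float16"."""
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"


def make_train_step(model, optimizer, loss_fn):
    """Build a CUDA mixed-precision training step (hypothetical helper)."""
    # Imported lazily so the rule above can be tested without torch.
    import torch

    capability = torch.cuda.get_device_capability()
    amp_dtype = getattr(torch, pick_amp_dtype_name(capability))

    # Loss scaling only matters for fp16; with bf16 the scaler is a no-op.
    scaler = torch.cuda.amp.GradScaler(enabled=(amp_dtype == torch.float16))

    def train_step(inputs, targets):
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type="cuda", dtype=amp_dtype):
            loss = loss_fn(model(inputs), targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # skips the update if gradients overflowed
        scaler.update()
        return loss.item()

    return train_step
```

Keeping the dtype decision in a plain function that takes a (major, minor) tuple makes it easy to check against known cards: (7, 5) for a T4 gives "float16", (8, 0) for an A100 gives "bfloat16".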

The impact

On our consumer complaints fine-tuning task (ModernBERT-base, batch_size=32, max_length=128, Kaggle T4):

Metric	bf16 (broken)	fp16 (fixed)
Batch time	1.29 s	0.30 s
2 epochs (3,616 batches)	~78 min	~18 min
Relative speed	1x	4.3x
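The epoch times and the speedup follow directly from the per-batch numbers. A quick check of the arithmetic, using only the figures in the table:

```python
BATCHES = 3616  # 2 epochs, from the table


def minutes(seconds_per_batch, batches=BATCHES):
    """Total wall-clock minutes for a run at a fixed per-batch time."""
    return seconds_per_batch * batches / 60


bf16_minutes = minutes(1.29)
fp16_minutes = minutes(0.30)
speedup = 1.29 / 0.30

print(f"bf16: {bf16_minutes:.1f} min")  # bf16: 77.7 min
print(f"fp16: {fp16_minutes:.1f} min")  # fp16: 18.1 min
print(f"speedup: {speedup:.1f}x")       # speedup: 4.3x
```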

Same model. Same data. Same GPU. Same results for practical purposes (bf16 and fp16 round differently, so they won’t match bit for bit). Just the dtype flag.

Who’s affected

Anyone fine-tuning on a Turing-architecture GPU: Kaggle T4s, Google Colab T4 instances, AWS g4dn, RTX 2080s, Quadro RTX. If your training code uses is_bf16_supported() to pick the AMP dtype, you’re leaving 4x performance on the table.
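If you’re not sure whether your setup is affected, a rough timing comparison is one way to find out. The sketch below times a plain matrix multiply in both dtypes. It is an illustration, not the benchmark behind the numbers above: seconds_per_call and compare_matmul_dtypes are hypothetical helpers, the 4096 matrix size is an arbitrary choice, and a bare matmul will not necessarily show the same ratio as a full training run.

```python
import time


def seconds_per_call(fn, sync=lambda: None, warmup=3, iters=20):
    """Average wall-clock seconds per call of fn().

    sync is called before each clock read. On CUDA, pass
    torch.cuda.synchronize, because kernel launches are asynchronous.
    """
    for _ in range(warmup):
        fn()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()
    return (time.perf_counter() - start) / iters


def compare_matmul_dtypes(size=4096):
    """Time a square matmul in fp16 and bf16 on the current CUDA device."""
    # Imported lazily; only needed for the GPU comparison.
    import torch

    timings = {}
    for dtype in (torch.float16, torch.bfloat16):
        a = torch.randn(size, size, device="cuda", dtype=dtype)
        b = torch.randn(size, size, device="cuda", dtype=dtype)
        timings[str(dtype)] = seconds_per_call(
            lambda: a @ b, sync=torch.cuda.synchronize
        )
    return timings
```

Run compare_matmul_dtypes() on the GPU you train on. If the bf16 time is several times the fp16 time, your AMP dtype choice deserves a second look.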

The lesson

“Supported” and “accelerated” are different things. Running bf16 on a T4 is like putting racing fuel in a car and leaving it in second gear. The engine handles it fine. You just never touch the hardware that makes it fast.