Your T4 Training Is 4x Slower Than It Should Be
torch.cuda.is_bf16_supported() returns True on a Tesla T4. It’s telling you the truth. The problem is you didn’t ask the right question.
The bug
You’re fine-tuning a model on a Kaggle T4 notebook. You write the standard mixed-precision setup:
AMP_DTYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
This checks if the GPU can do bfloat16. The T4 says yes. You proceed. Training runs. Everything looks normal.
Except it’s running at 1.29 seconds per batch instead of 0.30 seconds per batch.
Why it happens
The Tesla T4 is a Turing-architecture GPU (compute capability 7.5). Its tensor cores accelerate fp16, not bf16. When you ask it to do bf16 math, PyTorch obliges, but the work runs on slower fallback paths instead of the tensor cores. The computation is still correct. It's just about 4x slower because you never hit the hardware built for it.
PyTorch even warns you about this, buried in the inductor output:
Tesla T4 does not support bfloat16 compilation natively, skipping
But is_bf16_supported() still returns True, because the GPU can do bf16. It just can’t do it fast.
The Ampere architecture (A100, compute capability 8.0+) added native bf16 tensor core support. If you’re on an A100, bf16 is great. If you’re on a T4, it’s a silent 4x penalty.
The fix
Don’t trust is_bf16_supported(). Check the compute capability directly:
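if torch.cuda.is_available():
    cc = torch.cuda.get_device_capability()  # (major, minor), e.g. (7, 5) on a T4
    AMP_DTYPE = torch.bfloat16 if cc[0] >= 8 else torch.float16
else:
    AMP_DTYPE = torch.float32  # assumed fallback: no GPU, no CUDA mixed precision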
Compute capability 8.0 or higher means Ampere or newer, so use bf16. For anything below that, use fp16.
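One thing the one-liner above hides: fp16 has a much narrower exponent range than bf16, so fp16 autocast is normally paired with gradient scaling to keep small gradients from underflowing. Here is a minimal sketch of how the capability check might be wired into a training step. It is not the setup used for the numbers in this post. The helper names (`pick_amp_dtype_name`, `build_amp`, `train_step`) are hypothetical, and the model is assumed to return an object with a `.loss` attribute, as Hugging Face models do.

```python
from typing import Tuple


def pick_amp_dtype_name(capability: Tuple[int, int]) -> str:
    """Map a CUDA compute capability (major, minor) to an autocast dtype name."""
    major, _minor = capability
    # Rule from this post: Ampere (8.x) or newer gets bf16, anything older gets fp16.
    return "bfloat16" if major >= 8 else "float16"


def build_amp():
    """Return (autocast dtype, GradScaler) for the current CUDA device."""
    import torch  # lazy import keeps the pure helper above dependency-free

    if not torch.cuda.is_available():
        raise RuntimeError("this sketch assumes a CUDA device")
    name = pick_amp_dtype_name(torch.cuda.get_device_capability())
    # Loss scaling is only needed for fp16; a disabled scaler passes values through.
    # Newer PyTorch releases also expose this class as torch.amp.GradScaler.
    scaler = torch.cuda.amp.GradScaler(enabled=(name == "float16"))
    return getattr(torch, name), scaler


def train_step(model, batch, optimizer, amp_dtype, scaler):
    """Run one optimizer step under autocast (assumes model(**batch).loss exists)."""
    import torch

    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = model(**batch).loss
    scaler.scale(loss).backward()  # scales the loss only when the scaler is enabled
    scaler.step(optimizer)         # unscales gradients, then calls optimizer.step()
    scaler.update()
    return loss.item()
```

Keeping the dtype decision in a pure function means it can be checked on any machine, GPU or not: `pick_amp_dtype_name((7, 5))` is the T4 case.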
The impact
On our consumer complaints fine-tuning task (ModernBERT-base, batch_size=32, max_length=128, Kaggle T4):
| | bf16 (broken) | fp16 (fixed) |
|---|---|---|
| Batch time | 1.29s | 0.30s |
| 2 epochs (3,616 batches) | ~78 min | ~18 min |
| Speed | 1x | 4.3x faster |
Same model. Same data. Same GPU. Equivalent results, up to the rounding differences between bf16 and fp16. Just the dtype flag.
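The totals in the table follow directly from the per-batch times. A quick arithmetic check, using only the figures reported above (the function and variable names are just for illustration):

```python
def total_minutes(num_batches: int, seconds_per_batch: float) -> float:
    """Wall-clock minutes for a run of num_batches at a fixed per-batch time."""
    return num_batches * seconds_per_batch / 60.0


BATCHES = 3616   # 2 epochs, from the table
BF16_S = 1.29    # seconds per batch with bf16 on the T4
FP16_S = 0.30    # seconds per batch with fp16 on the T4

bf16_min = total_minutes(BATCHES, BF16_S)
fp16_min = total_minutes(BATCHES, FP16_S)
speedup = BF16_S / FP16_S

print(f"bf16: {bf16_min:.1f} min, fp16: {fp16_min:.1f} min, speedup: {speedup:.1f}x")
# bf16: 77.7 min, fp16: 18.1 min, speedup: 4.3x
```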
Who’s affected
Anyone fine-tuning on a Turing-architecture GPU: Kaggle T4s, Google Colab T4 instances, AWS g4dn, RTX 2080s, Quadro RTX. If your training code uses is_bf16_supported() to pick the AMP dtype on one of these, you're likely leaving about 4x performance on the table.
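If you're not sure whether your GPU is affected, one option is to time the same matrix multiply in both dtypes. The sketch below is a rough probe under stated assumptions: `seconds_per_call` and `compare_amp_dtypes` are hypothetical helpers, the 4096x4096 size is arbitrary, and a bare matmul will not necessarily reproduce the end-to-end 4.3x gap measured on a full fine-tuning run.

```python
import time
from typing import Callable, Dict


def seconds_per_call(fn: Callable[[], object], iters: int = 20, warmup: int = 3,
                     sync: Callable[[], None] = lambda: None) -> float:
    """Average wall-clock seconds per call of fn, after a few warmup calls.

    sync is called before starting and before stopping the clock; on a GPU it
    should wait for queued kernels (e.g. torch.cuda.synchronize).
    """
    for _ in range(warmup):
        fn()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()
    return (time.perf_counter() - start) / iters


def compare_amp_dtypes(size: int = 4096) -> Dict[str, float]:
    """Time a square matmul in fp16 and bf16 on the current CUDA device."""
    import torch  # lazy import: only needed when actually benchmarking

    results = {}
    for dtype in (torch.float16, torch.bfloat16):
        a = torch.randn(size, size, device="cuda", dtype=dtype)
        b = torch.randn(size, size, device="cuda", dtype=dtype)
        results[str(dtype)] = seconds_per_call(lambda: a @ b,
                                               sync=torch.cuda.synchronize)
    return results
```

Calling `compare_amp_dtypes()` in a GPU notebook returns per-call times keyed by dtype. If the bf16 number is several times the fp16 number, that's a strong hint bf16 is not hitting the tensor cores on that card.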
The lesson
“Supported” and “accelerated” are different things. Running bf16 on a T4 is like putting racing fuel in a car and leaving it in second gear. The engine handles it fine. You just never touch the hardware that makes it fast.