TL;DR
Two recent developments are converging to potentially revolutionize how we write GPU kernels:
- Luminal’s megakernel compiler - Automatically fuses entire models into single kernels, eliminating GPU bubbles
- KernelBench v3 - A robust benchmark showing LLMs struggle with genuine kernel engineering (10.6% pass rate on novel architectures)
Part 1: Why Current GPU Programming is Inefficient
- Kernel Launch Overhead Every time a kernel finishes, the GPU sits idle while the CPU launches the next one. Even with CUDA Graphs, a dummy kernel that does no work still costs ~1.3μs (vs ~2.1μs without graphs) — pure launch overhead.
- Wave Quantization When kernel work can’t be evenly distributed across streaming multiprocessors (SMs), some SMs finish early and idle while others complete the remaining work.
- Inter-Instruction Memory Bubbles Every kernel must load its weights before computing. Even with perfect load-compute overlap during execution, the initial weight loading creates dead time where tensor cores sit idle.
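The wave quantization cost is easy to put numbers on. This sketch (a hypothetical helper, not from Luminal; 132 is the H100's SM count) estimates how many "waves" of thread blocks a kernel runs and how empty the last one is:

```python
import math

def wave_stats(num_blocks: int, num_sms: int):
    """Estimate wave quantization: how many 'waves' of thread blocks run,
    and what fraction of SM slots sit idle across the whole kernel."""
    waves = math.ceil(num_blocks / num_sms)
    last_wave_blocks = num_blocks - (waves - 1) * num_sms
    idle_fraction = (num_sms - last_wave_blocks) / (waves * num_sms)
    return waves, idle_fraction
```

For example, 200 blocks on 132 SMs take 2 waves, with the second wave only ~52% full — roughly a quarter of the kernel's SM time is wasted.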
What’s the solution?
Megakernels. Instead of launching separate kernels for each operation, fuse the entire model into a single kernel. The approach was pioneered by Hazy Research, but Luminal is the first to make it automatic and compiler-driven.
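The fusion idea can be sketched in plain Python: instead of two separate "kernels" that each scan memory and materialize an intermediate buffer, a fused version does all the work in one pass. This is only a toy stand-in for what the compiler does with GPU ops:

```python
def unfused(xs):
    # Two "kernel launches": each makes a full pass and writes an intermediate.
    doubled = [x * 2.0 for x in xs]      # kernel 1: scale
    return [d + 1.0 for d in doubled]    # kernel 2: bias

def fused(xs):
    # One "megakernel": a single pass, no intermediate buffer, one launch.
    return [x * 2.0 + 1.0 for x in xs]
```

Both compute the same result, but the fused version pays one launch and touches memory once — the same trade the megakernel compiler makes across an entire forward pass.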
Three advantages:
- Zero kernel launch overhead - One launch for entire forward pass
- Eliminate wave quantization - SMs that finish early immediately start the next op
- Continuous compute - Load weights for next op during epilogue of current op
This is done with fine-grained dependencies:
- Each producer op increments its barrier counter at launch
- Executes its computation
- Decrements the barrier when complete
- Consumers wait for barrier == 0 (no in-flight producers)
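The protocol above can be modeled on the CPU with a small counter class (names are illustrative, not Luminal's actual API; in a real megakernel these counters live in GPU global memory and are updated with atomics):

```python
import threading

class OpBarrier:
    """Toy model of a fine-grained dependency counter between ops."""
    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def producer_launch(self):
        # Each producer op increments the barrier when it is scheduled.
        with self._cond:
            self._count += 1

    def producer_done(self):
        # ...and decrements it once its computation completes.
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def consumer_wait(self):
        # Consumers block until no producers are in flight (count == 0).
        with self._cond:
            self._cond.wait_for(lambda: self._count == 0)
```

Note the ordering hazard this design implies: producers must increment *before* any consumer checks the counter, otherwise a consumer can race past a barrier that is still at zero.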
Part 2: The reality check - KernelBench v3
Stanford’s original KernelBench suggested LLMs could write optimized CUDA kernels, but many submissions were exploits: calling back into torch/cuBLAS instead of writing kernels, or extracting pre-computed reference results. When those exploit solutions were dropped, the average speedup fell from 3.13× to 1.49×.
KernelBench v3 from Elliot Arledge rebuilt it with clear design principles:
- Modern architectures only - H100 (Hopper) & B200 (Blackwell)
- 41 high-quality problems instead of 250 questionable ones
- Multi-seed correctness - 5 random seeds to catch caching exploits
- Real baselines - Level 4 uses production Triton kernels, not naive PyTorch
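The multi-seed check is simple to illustrate. This sketch (illustrative only, not v3's actual harness) runs a candidate against the reference on several seeded random inputs, so a solution that memorized one reference output fails on the other seeds:

```python
import random

def check_multi_seed(candidate, reference, n_seeds=5, tol=1e-6):
    """Compare candidate vs. reference on n_seeds random inputs."""
    for seed in range(n_seeds):
        rng = random.Random(seed)
        x = [rng.gauss(0.0, 1.0) for _ in range(1024)]
        out, ref = candidate(x), reference(x)
        if any(abs(a - b) > tol for a, b in zip(out, ref)):
            return False
    return True
```

An honest kernel passes every seed; a "kernel" that returns a cached copy of the seed-0 reference output passes seed 0 and fails the rest.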
With Luminal's graph-based compilation building megakernels, and KernelBench v3 providing honest evaluation on modern architectures and algorithmically novel problems, a path toward AI-assisted production kernel engineering starts to look plausible.
Maybe expose Luminal's IR to LLMs and use KernelBench v3 as a progressive reward curriculum. A wild future ahead of us on kernels!
Resources:
Luminal:
- Blog post: Compiling Models to Megakernels - by Luminal and Joe Fioti
- GitHub: luminal-ai/luminal - Deep learning at the speed of light.
KernelBench v3:
- Blog post: Elliot Arledge
- GitHub: Infatoshi/KernelBench-v3
- Interactive Results: Elliot Arledge