Google Launches Multi-Token Prediction Drafters to Triple Gemma 4 Inference Speeds

Story Summary
Google has introduced Multi-Token Prediction (MTP) drafters for its Gemma 4 model family, enabling up to a 3x increase in inference speed without compromising output quality or reasoning capabilities. The architecture uses speculative decoding to decouple token generation from verification: a lightweight drafter predicts several tokens ahead, and the primary model verifies those drafts in a single parallel pass. This addresses the memory-bandwidth bottleneck of standard autoregressive LLM inference, which typically leaves compute resources underutilized. For developers, the change significantly reduces latency for on-device applications, coding assistants, and agentic workflows, and lets high-parameter models run efficiently on consumer-grade hardware. The MTP drafters are compatible with frameworks including LiteRT-LM, MLX, Hugging Face Transformers, and vLLM.
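To make the draft-then-verify idea concrete, here is a minimal sketch of the greedy variant of speculative decoding. It is an illustration under stated assumptions, not Google's implementation: the names `target_model` and `draft_model` are hypothetical, and each "model" is reduced to a next-token function over integer token ids.

```python
from typing import Callable, List

NextTokenFn = Callable[[List[int]], int]  # context -> next token id

def speculative_decode(
    target_model: NextTokenFn,  # expensive primary model
    draft_model: NextTokenFn,   # lightweight drafter
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,                 # tokens drafted per verification step
) -> List[int]:
    """Greedy speculative decoding: the output is identical to decoding
    with target_model alone, but needs fewer target-model passes."""
    tokens = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1. The drafter proposes k tokens autoregressively (cheap).
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))

        # 2. The target checks every drafted position. Written as a loop
        #    here, but on an accelerator this is one batched forward pass.
        step: List[int] = []
        for i in range(k):
            expected = target_model(tokens + draft[:i])
            if draft[i] == expected:
                step.append(draft[i])   # draft accepted
            else:
                step.append(expected)   # mismatch: keep the target's token
                break
        else:
            # All k drafts accepted; the same target pass yields one bonus token.
            step.append(target_model(tokens + draft))

        tokens.extend(step)             # every step commits at least one token
        produced += len(step)

    return tokens[: len(prompt) + max_new_tokens]
```

Because verification is exact, the result matches plain autoregressive decoding with the target model alone; the speedup comes from the drafter agreeing with the target often enough that each (parallel) target pass commits several tokens instead of one.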
