Google has released drafter models that use speculative decoding to roughly triple inference throughput for its open-weight Gemma 4 models.

The company’s DeepMind AI lab has shipped what it calls “multi-token prediction drafters”, which increase the inference efficiency of its latest open-weight model, Gemma 4.

Google engineer Maarten Grootendorst and product management director Olivier Lacombe said developers can expect "up to a 3x speedup without any degradation in output quality or reasoning logic."
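Speculative decoding preserves output quality because the large model still verifies every token the drafter proposes: any token the target model would not have generated itself is replaced, and the rest of the draft is discarded. The Python sketch below illustrates the general greedy draft-and-verify loop; the `draft_next` and `target_next` callables and the `draft_len` knob are illustrative stand-ins, not Gemma's or DeepMind's actual implementation.

```python
# Minimal sketch of greedy speculative decoding. A small "drafter" proposes a
# few tokens cheaply; the large target model then checks them, keeps every
# token it agrees with, and replaces the first token it disagrees with.
# Both models here are toy stand-in callables, not Gemma or its drafters.

from typing import Callable, List

def speculative_decode(
    target_next: Callable[[List[int]], int],  # target model: context -> greedy next token
    draft_next: Callable[[List[int]], int],   # drafter model: context -> greedy next token
    prompt: List[int],
    max_new_tokens: int = 32,
    draft_len: int = 4,                       # tokens proposed per drafting round (assumed knob)
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Drafter proposes a short continuation.
        draft, ctx = [], list(tokens)
        for _ in range(draft_len):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)

        # 2. Target verifies the draft. In a real system all draft positions
        #    are scored in one batched forward pass, which is where the
        #    speedup comes from; here we call the toy model per position.
        for i, t in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if expected != t:
                # First mismatch: keep the target's own token, drop the rest.
                tokens.extend(draft[:i])
                tokens.append(expected)
                break
        else:
            # Whole draft accepted: output matches what the target alone
            # would have produced, just with fewer target passes.
            tokens.extend(draft)
    return tokens[: len(prompt) + max_new_tokens]


if __name__ == "__main__":
    # Toy usage: both "models" emit a fixed repeating pattern, so every draft
    # is accepted and generation finishes in max_new_tokens / draft_len rounds.
    pattern = [1, 2, 3, 4]
    toy = lambda ctx: pattern[len(ctx) % len(pattern)]
    print(speculative_decode(toy, toy, prompt=[0], max_new_tokens=8))
```

Because the target model has the final say on every token, the accepted sequence is identical to what it would have produced on its own; the gain comes purely from verifying several drafted tokens per target pass instead of generating one at a time.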

With the cost of inference now a dominant theme in analysing AI workloads, open models offer a cheaper alternative, and the ability to serve them quickly and efficiently without sacrificing accuracy is a key differentiator between them.
