Google Found a Way to Make Local AI Up to 3x Faster—No New Hardware Required

Decrypt · Jose Antonio Lanz

In brief

  • Google released Multi-Token Prediction (MTP) drafters for Gemma 4, delivering up to a 3x speedup at inference without any degradation in output quality.
  • The technique—called speculative decoding—uses a lightweight "drafter" model to predict several tokens at once, which the main model then verifies in parallel, bypassing the one-token-at-a-time bottleneck.
  • MTP drafters are available on Hugging Face, Kaggle, and Ollama under the same Apache 2.0 license as Gemma 4, and work with tools like vLLM, MLX, and SGLang.

Running an AI model on your own computer is great—until it isn't.

The promise is privacy, no subscription fees, and no data leaving your machine. The reality, for most people, is watching a cursor blink for five seconds between sentences.

That bottleneck has a name: inference speed. And it has nothing to do with how smart the model is. It's a hardware problem. Standard AI models generate text one word fragment—called a token—at a time. The hardware has to shuttle billions of parameters from memory to its compute units just to produce each single token. It's slow by design. On consumer hardware, it's painful.
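To see why, here is a minimal sketch of that standard autoregressive loop in plain Transformers code (the small GPT-2 checkpoint stands in for any causal LM): every new token costs one full forward pass through the whole model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works the same way; gpt2 is just small enough to try anywhere.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Local AI is great until", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):
        # One forward pass over all the parameters buys exactly one token.
        # (Real servers cache past keys/values, but still pay a pass per token.)
        next_id = model(ids).logits[0, -1].argmax()
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tok.decode(ids[0]))
```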

The workaround most people reach for is running smaller, weaker models—or heavily compressed versions, called quantized models, that sacrifice some quality for speed. Neither solution is great. You get something that runs, but it's not the model you actually wanted.

Now Google has a different idea. The company just released Multi-Token Prediction (MTP) drafters for its Gemma 4 family of open models—a technique that can deliver up to a 3x speedup without touching the model's quality or reasoning ability at all.

The approach is called speculative decoding, and it's been around as a concept for years. Google researchers published the foundational paper back in 2022. The idea didn't go mainstream until now because it required the right architecture to make it work at scale.

Here's the short version of how it works. Instead of making the big, powerful model do all the work alone, you pair it with a tiny "drafter" model. The drafter is fast and cheap—it predicts several tokens at once in less time than the main model would take to produce just one. Then the big model checks all of those guesses in a single pass. If the guesses are right, you get the whole sequence for the price of one forward pass.

According to Google, "if the target model agrees with the draft, it accepts the entire sequence in a single forward pass—and even generates an additional token of its own in the process."
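As a code sketch, one draft-and-verify round looks roughly like this simplified greedy version (production implementations handle sampling with a probability-ratio acceptance test; `target` and `drafter` here are any two compatible causal LMs, not Google's actual code):

```python
import torch

def speculative_step(target, drafter, ids, k=4):
    """One round: k cheap drafter passes, then a single target verification pass."""
    draft = ids
    for _ in range(k):  # k fast, cheap forward passes through the tiny drafter
        nxt = drafter(draft).logits[0, -1].argmax().view(1, 1)
        draft = torch.cat([draft, nxt], dim=1)

    logits = target(draft).logits  # one big-model pass scores every position
    preds = logits[0, ids.shape[1] - 1 : -1].argmax(-1)  # target's own choices
    drafted = draft[0, ids.shape[1]:]

    n = int((preds == drafted).long().cumprod(0).sum())  # longest agreeing prefix
    accepted = draft[:, : ids.shape[1] + n]
    # The same verification pass also yields the target's next token for free.
    bonus = logits[0, ids.shape[1] - 1 + n].argmax().view(1, 1)
    return torch.cat([accepted, bonus], dim=1)
```

Worst case, every drafted token is rejected and you still get one token per big-model pass, exactly as in ordinary decoding; best case, you get k + 1 tokens for the price of one.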

Nothing is sacrificed: The large model—Gemma 4's 31B dense version, for example—still verifies every token, and the output quality is identical. You're just exploiting idle compute power that was sitting unused during the slow parts.

Google says the drafter models share the target model's KV cache—a memory structure that stores already-processed context—so they don't waste time recalculating things the larger model already knows. For the smaller edge models designed for phones and Raspberry Pi devices, the team even built an efficient clustering technique to further cut generation time.
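The article doesn't describe the drafter architecture beyond that, but a toy sketch shows why cache sharing matters: the expensive model encodes the long context once, and the lightweight drafter attends to those cached states instead of re-encoding the prompt itself. Every module below is made up for illustration and is not the real MTP design.

```python
import torch
import torch.nn as nn

d, vocab = 64, 100
# Stand-in for the big model: it processes the long context exactly once.
target_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2)
drafter_attn = nn.MultiheadAttention(d, 4, batch_first=True)  # lightweight drafter
drafter_out = nn.Linear(d, vocab)

prompt = torch.randn(1, 512, d)        # a long, already-embedded context
kv_cache = target_encoder(prompt)      # computed once by the big model

query = kv_cache[:, -1:, :]            # state at the last position
for _ in range(4):                     # draft 4 tokens without re-encoding anything
    ctx, _ = drafter_attn(query, kv_cache, kv_cache)  # read the shared cache
    token = drafter_out(ctx).argmax(-1)               # drafted token id
    query = ctx                                       # roll the drafter forward
```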

This isn't the only attempt the AI world has made at parallelizing text generation. Diffusion-based language models—like Mercury from Inception Labs—tried a completely different approach: Instead of predicting one token at a time, they start with noise and iteratively refine the entire output. That’s fast on paper, but diffusion LLMs have struggled to match the quality of traditional transformer models, leaving them more of a research curiosity than a practical tool.

Speculative decoding is different because it doesn't change the underlying model at all. It's a serving optimization, not an architecture replacement. The same Gemma 4 you'd already run gets faster.

The practical upside is real. A Gemma 4 26B model running on an Nvidia RTX Pro 6000 desktop GPU gets roughly twice the tokens per second with the MTP drafter enabled, according to Google's own benchmarks. On Apple Silicon, batch sizes of 4 to 8 requests unlock around 2.2x speedups. Not quite the 3x ceiling in every scenario, but still a meaningful difference between "barely usable" and "actually fast enough to work with."
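Those figures track the standard back-of-the-envelope analysis from the speculative decoding literature: if the drafter proposes k tokens and each matches the target with probability α, the expected tokens accepted per target pass is (1 − α^(k+1)) / (1 − α). The α values below are assumptions for illustration, not Google's published acceptance rates, and the ratio approximates speedup only when the drafter's own cost is negligible.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens per target forward pass (Leviathan et al., 2022)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(f"acceptance={alpha:.1f}: ~{expected_tokens_per_pass(alpha, k=4):.2f}x")
# acceptance=0.6: ~2.31x   acceptance=0.8: ~3.36x   acceptance=0.9: ~4.10x
```

An 80 percent acceptance rate with four drafted tokens already lands near the 3x ceiling Google cites.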

The context matters here. When Chinese model DeepSeek shocked the market in January 2025—wiping $600 billion from Nvidia's market cap in a single day—the core lesson was that efficiency gains can hit harder than raw compute. Running smarter beats throwing more hardware at the problem. Google's MTP drafter is another move in that direction, except aimed squarely at the consumer end of the market.

The whole AI industry right now revolves around a triangle of inference, training, and memory, and a breakthrough on any one side tends to boost or shock the entire ecosystem. DeepSeek's training approach (powerful models on lower-end hardware) was one example; Google's TurboQuant paper (shrinking AI memory without losing quality) was another. Both rattled markets as companies scrambled to work out what the gains meant.

Google says the drafter unlocks "improved responsiveness: drastically reduce latency for near real-time chat, immersive voice applications and agentic workflows"—the kind of tasks that demand low latency to feel useful at all.

Use cases snap into focus quickly: A local coding assistant that doesn't lag; a voice interface that responds before you've forgotten what you asked; an agentic workflow that doesn't make you wait three seconds between steps. All of this, on hardware you already own.

The MTP drafters are available now on Hugging Face, Kaggle, and Ollama, under the Apache 2.0 license. They work with vLLM, MLX, SGLang, and Hugging Face Transformers out of the box.
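For Hugging Face Transformers, the existing assisted-generation hook gives a feel for the workflow: pass the drafter as `assistant_model` and `generate()` switches into draft-and-verify mode. The checkpoint names below are placeholders, not the actual Gemma drafter IDs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder IDs; substitute the real Gemma target and MTP drafter checkpoints.
tok = AutoTokenizer.from_pretrained("google/target-model")
target = AutoModelForCausalLM.from_pretrained("google/target-model")
drafter = AutoModelForCausalLM.from_pretrained("google/drafter-model")

inputs = tok("Explain speculative decoding in one line.", return_tensors="pt")
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```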
