Google Releases DiffusionGemma: Shifting the Local Inference Paradigm with 4x Faster Text Generation

By swapping sequential autoregressive decoding for parallel text diffusion, Google's experimental 26B MoE model delivers over 1,000 tokens per second on local hardware.

Mariana Souza
Senior Editor · Jun 10, 2026 · 4 min read

For years, the developer community has accepted a fundamental limitation of Large Language Models (LLMs): they generate text like a typewriter. This sequential, token-by-token processing is highly efficient in high-concurrency cloud environments where thousands of requests can be batched together. However, when running models locally for a single developer, this approach leaves powerful GPUs underutilized, spending most of their cycles waiting on memory bandwidth for the next "keystroke."

Google has introduced an experimental model designed to break this bottleneck: DiffusionGemma. Released under a permissive Apache 2.0 license, this 26B Mixture of Experts (MoE) model abandons sequential autoregressive decoding in favor of text diffusion. By generating entire blocks of text simultaneously, DiffusionGemma achieves up to 4x faster text generation on dedicated GPUs, opening up new possibilities for highly interactive, local developer workflows.

Breaking the Autoregressive Bottleneck

Traditional autoregressive models generate text from left to right, predicting one token at a time. DiffusionGemma drops that ordering. Instead of a typewriter, think of it as a printing press that stamps an entire 256-token block of text onto a canvas at once.

This shift fundamentally changes how the model interacts with hardware. In local, single-user scenarios, autoregressive models are heavily memory-bandwidth bound. DiffusionGemma shifts the bottleneck from memory bandwidth to compute by giving the GPU's processor a massive chunk of parallel work to perform at once.
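A rough way to see the shift is to count how many tokens are finalized each time the active weights are streamed from memory. The sketch below is a simplified model, not a benchmark: it ignores KV-cache traffic and expert routing, and the number of refinement passes per block (`passes=16`) is a hypothetical value chosen for illustration, not a published figure.

```python
def tokens_per_weight_read(tokens_per_pass: int, passes: int = 1) -> float:
    """Tokens finalized per full read of the active weights.

    Simplified model: every forward pass streams the active weights from
    memory once, no matter how many token positions it processes.
    """
    if tokens_per_pass <= 0 or passes <= 0:
        raise ValueError("tokens_per_pass and passes must be positive")
    return tokens_per_pass / passes


# Autoregressive decoding: one token per pass.
autoregressive = tokens_per_weight_read(1)

# Block diffusion: a 256-token block, refined over a hypothetical 16 passes.
block_diffusion = tokens_per_weight_read(256, passes=16)

print(autoregressive, block_diffusion)
```

Under these assumptions each weight read yields more finished tokens, while the arithmetic per pass grows with the block size. That is the sense in which the bottleneck moves from memory bandwidth toward compute.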

The performance gains are stark:

  • NVIDIA H100: Delivers over 1,000 tokens per second.
  • NVIDIA GeForce RTX 5090: Delivers over 700 tokens per second.

Because the throughput advantage is strongest at low-to-medium batch sizes on a single accelerator, DiffusionGemma is best suited to local execution. In high-query-per-second (QPS) cloud environments, where batching already saturates compute, parallel decoding offers diminishing returns and can increase serving costs.

Under the Hood: MoE and Iterative Refinement

DiffusionGemma builds on Google's Gemma 4 family and on Gemini Diffusion research, adding a novel diffusion head designed to maximize generation speed.

To keep the hardware footprint accessible for local workstations, the model is structured as a Mixture of Experts (MoE) with 26B total parameters, of which only 3.8B are activated during inference. When quantized, the model fits within 18GB of VRAM, putting it within reach of high-end dedicated consumer GPUs.
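The footprint claim can be sanity-checked with simple arithmetic. The sketch below estimates weight storage only, for a few bit widths. The article does not state which quantization level is used, so the bit widths are assumptions, and the estimate ignores activations, caches, and quantization metadata.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count quoted in the article

# Bit widths below are illustrative assumptions, not published settings.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

In an MoE model all experts normally stay resident in memory even though only a fraction is active per token, so the total parameter count, not the 3.8B active count, drives this estimate.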

Rather than predicting words in order, the text diffusion process works similarly to how diffusion-based image generators refine visual static into a clear picture:

  1. The Canvas: The model begins with a canvas of random placeholder tokens.
  2. Iterative Refinement: It makes multiple parallel passes, locking in correct tokens and using them as context clues to refine the remaining placeholders.
  3. Final Polish: The text rapidly converges into a coherent, high-quality output.
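
The three steps above can be sketched as a toy loop. This is a minimal illustration of confidence-based iterative refinement, not DiffusionGemma's actual algorithm. `toy_predict` is a hypothetical stand-in for the model, and the canvas uses a single blank marker rather than random placeholder tokens to keep the printout readable.

```python
import random


def toy_predict(canvas, target, rng):
    """Hypothetical stand-in for the model.

    Returns a (token, confidence) proposal for every position. A real model
    would condition on the current canvas; this toy looks up a fixed target
    and attaches a random confidence score.
    """
    return [(target[i], rng.random()) for i in range(len(canvas))]


def refine(target, passes=4, seed=0):
    rng = random.Random(seed)
    canvas = [None] * len(target)          # 1. blank canvas
    per_pass = -(-len(target) // passes)   # ceiling division
    for step in range(passes):
        proposals = toy_predict(canvas, target, rng)
        open_positions = [i for i, tok in enumerate(canvas) if tok is None]
        # 2. lock in the most confident open positions on this pass
        open_positions.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in open_positions[:per_pass]:
            canvas[i] = proposals[i][0]
        shown = "".join("_" if tok is None else tok for tok in canvas)
        print(f"pass {step + 1}: {shown}")
    # 3. every position has been filled after the final pass
    return "".join(canvas)


refine("print('hello')", passes=4)
```

Each pass fills positions anywhere in the block rather than left to right, which is the behavior the list describes.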

The Superpower of Bi-Directional Attention

One of the most compelling aspects of DiffusionGemma is its use of bi-directional attention. Because the model generates 256 tokens in parallel with each forward pass, every token in that block can attend to all other tokens—both preceding and succeeding.

This is a massive departure from the causal (left-to-right) attention used in standard LLMs, and it unlocks unique advantages for non-linear domains:

  • In-line Editing and Code Infilling: The model can naturally evaluate context before and after an insertion point to generate seamless code or text.
  • Structured Data and Complex Syntax: It can handle non-linear structures such as mathematical graphs and amino acid sequences, and it can correctly close complex markdown formatting in near real time.
  • Intelligent Self-Correction: DiffusionGemma can evaluate an entire text block at once, allowing it to identify and fix mistakes iteratively during the generation process.
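
The difference between causal and bi-directional attention can be made concrete with attention masks. The sketch below covers attention within a single block only. How DiffusionGemma attends across blocks or to the prompt is not described in the article, and the helper names are illustrative.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position in the block."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print(show(causal_mask(4)))
print()
print(show(bidirectional_mask(4)))
```

In the causal mask the first position sees only itself, while in the bi-directional mask it also sees every later position. That visibility of later context is what makes infilling and self-correction natural.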

To demonstrate this capability, Unsloth fine-tuned DiffusionGemma to play Sudoku. Traditional autoregressive models notoriously struggle with this task because the correct value of one cell depends on cells that have not been written yet. DiffusionGemma's bi-directional attention makes these non-linear dependencies significantly easier to resolve.

Trade-offs and the Local Dev Ecosystem

As with any experimental architecture, DiffusionGemma comes with important trade-offs. Because it prioritizes parallel layout generation and raw speed, its overall output quality is lower than standard, autoregressive Gemma 4 models. For production applications demanding maximum output quality, Google still recommends deploying standard Gemma 4. However, developers can fine-tune DiffusionGemma to dramatically improve its performance on highly specific, speed-critical tasks.

For developers eager to experiment, the model weights are available now on Hugging Face. The ecosystem has moved quickly to support the release, with efficient serving integrations available across several popular developer tools:

  • vLLM: Supported with integration assistance from Red Hat.
  • MLX: For optimized execution on Apple Silicon.
  • Hugging Face Transformers: For standard Python-based pipelines.

For those looking to customize the model, Google has also released a fine-tuning tutorial using Hackable Diffusion, a modular JAX-based toolbox designed for composability and rapid experimentation.

While Google first demoed text diffusion concepts at its developer conference about a year ago, the release of DiffusionGemma marks a concrete, open-source milestone. It provides developers with a highly practical sandbox to explore the next generation of ultra-low-latency, local AI applications.

Sources & further reading

  1. DiffusionGemma: 4x Faster Text Generation — blog.google
  2. DiffusionGemma: 4x faster text generation — deepmind.google
  3. Google’s DiffusionGemma is 4x faster than its other Gemma models — thenewstack.io

Written by Mariana Souza · Senior Editor

Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.

Discussion 5

Gabe Morales @gpu_poor_gabe · 2 days ago

i'm stuck running models on a gpu so old it's basically a potato, but it's cool to see google pushing the boundaries with diffusiongemma - 4x faster text gen is a game changer, even if it's still out of reach for us gpu poor folks

Ken Abe @perf_obsessed_ken · 3 days ago

so what does this do to p99 latency? can we finally get under 10ms for local text gen? the 4x speedup is nice but i'm more interested in how it affects the tail of the distribution 🚀

Leo Fontaine @ai_optimist_leo · 3 days ago

@perf_obsessed_ken that's the million dollar question - i'd love to see some actual p99 numbers but if diffusion gemma can really parallelize text gen like that, it's not hard to imagine we could finally crack that 10ms barrier, the implications for interactive apps would be huge

Nina Petrova @night_owl_nina · 2 days ago

@perf_obsessed_ken that's the million dollar question, right? i mean, 4x faster is awesome but if the p99 latency is still hovering around 50ms then it's not like we're changing the game or anything, diffusiongemma needs to bring that tail in for it to be a real breakthrough 🚀

Bob Feldman @benchmark_bob · 2 days ago

@night_owl_nina exactly, and what was the baseline hardware they used for these benchmarks? was it a high end gpu or something more modest, that's gonna make a huge difference in whether this 4x speedup is actually achievable in real world usage
