
Every Language Model Writes Like a Typewriter. This One Doesn't.

May 25, 2026
machine-learning · nlp · diffusion-models · llm · transformers

Every language model you've used writes the way a typewriter works. One token at a time. Left to right. Commit to each word before you know where the sentence ends.

That's not a flaw. It's a fundamental property of autoregressive (AR) generation, the decoding scheme behind GPT, Claude, Llama, and every other major LLM today. Eight tokens means eight sequential steps through the entire model, each step a full forward pass that produces one token, and you can't start the next until the current one is done. The cost grows linearly with output length, and the steps can't run in parallel because each new token depends on everything that came before it.

NVIDIA's Nemotron-Labs-Diffusion takes a different approach: a single model that can run as a traditional LLM or as a diffusion model, switching modes at inference time by flipping one internal setting. The novel claim is that you don't have to trade accuracy for parallelism, and that for certain tasks, sequential generation is architecturally the wrong approach, not just a slow one. To test that claim concretely, I ran it against 8 NYT Connections puzzles, a task that makes left-to-right generation's weaknesses impossible to ignore.


The Cost of Going Left to Right

Let's make the AR bottleneck concrete before talking about what replaces it.

Each word requires a full run through the entire model. The cursor can't move to the next position until the current one is committed. Nothing runs in parallel. At low concurrency, you're just waiting, word after word, in sequence.

This works fine for many tasks. Left-to-right is a natural order for writing. But there's a whole class of problems where sequential commitment to each token is conceptually the wrong approach. More on that shortly.


What Diffusion Does Instead

Before getting to the model, it helps to understand what "autoregressive" actually means as a word. Breaking it down: auto means self-referencing, and regressive refers to predicting from prior values. Each new word is predicted from all the words before it. Predict word 1, feed it back, predict word 2, repeat.
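The loop described above can be sketched in a few lines. This is a toy illustration, not the model's code: `toy_next_token` is a hypothetical stand-in for one full forward pass, and the canned sentence is made up. What it shows is the dependency structure: every call needs the complete prefix, so eight tokens cost eight sequential calls.

```python
from typing import Callable, List, Tuple

# Made-up sentence used only for illustration.
CANNED = ["the", "cat", "sat", "on", "the", "mat", "today", "."]


def toy_next_token(prefix: List[str]) -> str:
    """Hypothetical stand-in for one full forward pass of a language model."""
    return CANNED[min(len(prefix), len(CANNED) - 1)]


def ar_generate(
    next_token: Callable[[List[str]], str], max_new_tokens: int
) -> Tuple[List[str], int]:
    """Generate left to right, counting how many times the 'model' runs."""
    tokens: List[str] = []
    forward_passes = 0
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # depends on every token produced so far
        forward_passes += 1
        tokens.append(tok)        # committed; never revisited
    return tokens, forward_passes


tokens, passes = ar_generate(toy_next_token, max_new_tokens=8)
print(" ".join(tokens), "| forward passes:", passes)
```

Nothing inside the loop can be spread across positions, because the input to each step is the output of the step before it.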

Diffusion language models abandon this loop entirely.

Instead of building left to right, a diffusion model starts from noise across all positions at once and iteratively denoises toward the correct output. Think about how you actually write something difficult. You don't start at word one and commit through word fifty in order. You have a rough shape, a fuzzy sense of the whole thing, then you sharpen it in passes.

That's diffusion. Three passes, all tokens in parallel. The output crystallizes simultaneously rather than accumulating sequentially.
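As a rough sketch of that idea, here is a toy parallel refinement loop. It assumes a masked style of text diffusion, where the "noise" is a sequence of mask tokens and each pass commits the positions the model is most confident about; whether Nemotron-Labs-Diffusion works exactly this way is an assumption on my part. `toy_denoiser`, `TARGET`, and the confidence values are all hypothetical.

```python
import math
from typing import List, Tuple

MASK = "<mask>"
# Hypothetical "correct" output and per-position confidences for the toy denoiser.
TARGET = ["the", "cat", "sat", "on", "the", "mat"]
CONFIDENCE = [0.9, 0.4, 0.7, 0.8, 0.5, 0.6]


def toy_denoiser(seq: List[str]) -> List[Tuple[str, float]]:
    """Hypothetical stand-in for one forward pass that proposes a token
    and a confidence for every position at once."""
    return [(TARGET[i], CONFIDENCE[i]) for i in range(len(seq))]


def diffusion_generate(length: int, steps: int) -> Tuple[List[str], List[List[str]]]:
    """Start fully masked and commit the most confident positions each pass."""
    seq = [MASK] * length
    per_step = math.ceil(length / steps)
    history = []
    for _ in range(steps):
        proposals = toy_denoiser(seq)  # one pass covers all positions
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            seq[i] = proposals[i][0]
        history.append(list(seq))
    return seq, history


final, history = diffusion_generate(length=6, steps=3)
for step, snapshot in enumerate(history, start=1):
    print(f"pass {step}:", " ".join(snapshot))
```

The number of passes is set by `steps`, not by the sequence length, which is where the parallelism comes from.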

The natural assumption: more parallel must mean less precise. If you're not committing word by word, you must be making lower-quality guesses at each position.

That turns out to be the wrong frame. The right question is whether you can train a model that does both, and whether the right architecture depends on the task, not just the speed target.


One Model, Three Modes

What NVIDIA did with Nemotron-Labs-Diffusion is train a single model jointly on both AR and diffusion objectives. Same weights. Same parameters. The mode is selected at inference time, not by loading a different model, but by changing the attention mask (a setting inside the model that controls which words it's allowed to look at when predicting each new word).
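To picture what such a mask looks like, here is a minimal sketch in plain Python. It builds the two standard shapes: a backward-only (causal) mask and a fully open mask. The function names are my own, and how the model's code actually represents and switches its masks is not something this sketch claims to reproduce.

```python
from typing import List

Mask = List[List[bool]]


def causal_mask(n: int) -> Mask:
    """Row i may attend to column j only if j is at or before i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n: int) -> Mask:
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def render(mask: Mask) -> str:
    """Draw the mask as rows of 1s (visible) and 0s (hidden)."""
    return "\n".join(" ".join("1" if ok else "0" for ok in row) for row in mask)


print("backward-only:")
print(render(causal_mask(4)))
print("open:")
print(render(full_mask(4)))
```

Each row answers one question: when predicting at this position, which other positions is the model allowed to see? The weights are untouched; only the pattern of 1s and 0s changes.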

The three modes, concretely:

Autoregressive (AR) uses a restricted attention mask: each word can only look backward at the words before it. Classical left-to-right generation.

Diffusion uses an open attention mask: all words can look at all other words simultaneously. The model denoises the entire sequence in parallel passes.

Self-Speculation is where it gets interesting. Diffusion runs first: all tokens drafted simultaneously in one pass. Then the model flips its own attention mask to the restricted (backward-only) mode and uses AR to verify the draft left to right. Accepted tokens are kept. Rejected tokens get regenerated. You get parallel draft speed with sequential verification quality, and because it's the same model doing both, there's no overhead from a separate draft network, no extra parameters.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nvidia/Nemotron-Labs-Diffusion-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

prompt = "Explain diffusion language models in one paragraph."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Mode 1: Autoregressive: left to right, one token at a time
out_ids, nfe = model.ar_generate(input_ids, max_new_tokens=128)

# Mode 2: Diffusion: all tokens in parallel, iteratively denoised
out_ids, nfe = model.generate(input_ids, steps=10)

# Mode 3: Self-Speculation: diffusion drafts, AR verifies
out_ids, nfe = model.linear_spec_generate(input_ids, max_new_tokens=128)

print(f"Model ran {nfe} times to produce {out_ids.shape[1]} tokens")
```

All three methods return the same pair: the output token IDs and nfe, the number of times the model had to run to produce them. That count is the difference that matters. Mode 3 consistently produces the lowest nfe while matching Mode 1's output quality.
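The draft-then-verify logic behind Self-Speculation can be sketched with toy functions. Everything here is hypothetical: `toy_draft` stands in for the parallel diffusion pass (with one deliberate mistake), `toy_ar_next` stands in for the AR verifier, and counting each draft and each verification as one model run is my assumption about how nfe might be tallied, not the model's actual accounting.

```python
from typing import List, Tuple

TARGET = "the quick brown fox jumps over the lazy dog .".split()


def toy_ar_next(prefix: List[str]) -> str:
    """Hypothetical verifier: the token AR mode would emit after `prefix`."""
    return TARGET[len(prefix)]


def toy_draft(prefix: List[str], k: int) -> List[str]:
    """Hypothetical drafter: proposes k tokens in one parallel pass,
    with a deliberate error at absolute position 5."""
    start = len(prefix)
    block = TARGET[start:start + k]
    return ["under" if start + i == 5 else tok for i, tok in enumerate(block)]


def self_speculate(length: int, k: int) -> Tuple[List[str], int]:
    out: List[str] = []
    nfe = 0
    while len(out) < length:
        draft = toy_draft(out, min(k, length - len(out)))
        nfe += 1  # one parallel drafting pass
        nfe += 1  # one verification pass (assumed to score all draft positions at once)
        accepted: List[str] = []
        for tok in draft:
            expected = toy_ar_next(out + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token confirmed
            else:
                accepted.append(expected)  # first rejection: take the verifier's token
                break                      # the rest of the draft is discarded
        out.extend(accepted)
    return out, nfe


out, nfe = self_speculate(length=len(TARGET), k=4)
print(" ".join(out), "| model runs:", nfe, "for", len(out), "tokens")
```

In this toy run the output is identical to what the verifier alone would have produced, but it takes fewer model runs than one per token. The saving shrinks as the drafter makes more mistakes, since every rejection throws away the remainder of that draft.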


The Numbers

From NVIDIA's technical report [1]:

  • more words generated per model run vs Qwen3-8B (Alibaba's comparable open-source model), at matched accuracy
  • real throughput on GB200 with SGLang (NVIDIA's latest data center chip, paired with an optimized serving framework)
  • 2.7× on DGX Spark (NVIDIA's desktop AI workstation)
  • 76% of theoretical throughput ceiling still untapped

That last number is the one I keep coming back to. This isn't a mature, optimized system. It's a research release at roughly a quarter of where it could go.


Why I Tested It on NYT Connections

Benchmarks tell you a model is faster. I wanted to show when a different architecture is the correct mental model for the task itself.

NYT Connections gives you 16 words and asks you to find 4 hidden groups of 4. The categories are often tricky: homophones, anagrams, words that contain hidden numbers, shared pop culture references. The key property of the task is that you cannot solve it left to right. Every word constrains every other word. The correct grouping only emerges from holistic reasoning across the entire grid.

An AR model solving Connections sequentially is architecturally mismatched to the problem. It commits to early categorizations before reasoning about the full set. A diffusion model, which sees all 16 words simultaneously, does something closer to how you'd actually approach this puzzle yourself.

I ran 8 NYT Connections puzzles, same model on both sides, and built a side-by-side visualization: AR filling tiles one at a time on the left, Self-Speculation snapping all 16 at once on the right.


Results: 8 Puzzles

| Puzzle | AR Time | Self-Spec Time | AR Correct | Self-Spec Correct |
|---|---|---|---|---|
| ___ MUSTARD / KINDS OF PIES / BUTTS / TENNIS | 6.3s | 1.25s | 16/16 | 16/16 |
| FRUIT ANAGRAMS / HOMOPHONES / RUPTURE / MLB ___S | 5.9s | 1.08s | 16/16 | 15/16 |
| "SCHOOL" MODIFIERS / CONDUIT / SWINDLE / TEA VERBS | 6.1s | 1.08s | 16/16 | 16/16 |
| BIT OF ADVICE / FORTITUDE / SPEND TIME AT / WORDS ENDING IN NUMBERS | 6.7s | 1.09s | 16/16 | 16/16 |
| CHANGE STATES / REPLACEMENT / SLANGY PROFESSIONS / MARIAH CAREY | 6.3s | 0.92s | 16/16 | 16/16 |
| BIOLOGICAL BUILDING BLOCKS / BABY PURCHASES / INSTRUMENTS / ___ TAG | 6.5s | 0.76s | 16/16 | 16/16 |
| YEARN / MAGAZINES / BOND CHARACTERS / ___POP GENRES | 6.1s | 1.08s | 16/16 | 16/16 |
| ROMANTIC RAPPORT / WEB BROWSER / SINGLE ROTATION / ___ BELL | 6.4s | 1.09s | 16/16 | 16/16 |
| Average | 6.4s | 1.05s | 16/16 | 15.9/16 |

Seven out of eight perfect. The one miss came on puzzle 2: the word PÈRE, which carries an accent mark, landed in a different category in speculative mode. Small divergences like this can show up with speculative decoding in practice: when a parallel draft is verified and corrected, borderline tokens can resolve differently than they would under pure sequential generation. That's one word out of 128, on a task that requires holistic reasoning across the full grid.
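For readers who want to score their own runs, here is one simple way to count correct words. It assumes scoring is per word, by matching each word's predicted category label against the true one, which fits the "15/16" style of the results but is my guess at the method. The four-word mini puzzle is made up for illustration.

```python
from typing import Dict


def score_assignment(predicted: Dict[str, str], truth: Dict[str, str]) -> int:
    """Count the words whose predicted category matches the true category."""
    return sum(1 for word, category in truth.items() if predicted.get(word) == category)


# Hypothetical mini puzzle: 4 words, 2 categories.
truth = {"APPLE": "FRUIT", "PEAR": "FRUIT", "OAK": "TREE", "ELM": "TREE"}
predicted = {"APPLE": "FRUIT", "PEAR": "TREE", "OAK": "TREE", "ELM": "TREE"}

score = score_assignment(predicted, truth)
print(f"{score}/{len(truth)} words placed correctly")
```

A single misplaced word costs exactly one point under this scheme, even though in the real game it would break two groups at once.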

The speed gap held consistently: Self-Speculation averaged 1.05 seconds against 6.4 seconds for AR. That's roughly a 6× gap on average, and at least 5× on every puzzle, so not a cherry-picked result.

On puzzle 6, Self-Speculation finished in 0.76 seconds. AR kept working for another 5.7 seconds.
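The arithmetic is easy to recheck from the per-puzzle times in the results table. Recomputing from those rounded values gives means of about 6.3s and 1.04s, slightly off the reported averages (which may have been computed from unrounded timings), and a ratio of about 6.0×, with per-puzzle ratios ranging from about 5.0× to 8.6×.

```python
# Per-puzzle times in seconds, copied from the results table (puzzles 1 to 8).
ar_times = [6.3, 5.9, 6.1, 6.7, 6.3, 6.5, 6.1, 6.4]
spec_times = [1.25, 1.08, 1.08, 1.09, 0.92, 0.76, 1.08, 1.09]

ar_mean = sum(ar_times) / len(ar_times)
spec_mean = sum(spec_times) / len(spec_times)
speedup = ar_mean / spec_mean

# Ratio for each puzzle individually, to check the gap is not driven by one outlier.
per_puzzle = [a / s for a, s in zip(ar_times, spec_times)]

# How much longer AR ran on puzzle 6 (index 5).
puzzle6_gap = ar_times[5] - spec_times[5]

print(f"AR mean: {ar_mean:.2f}s, Self-Spec mean: {spec_mean:.2f}s")
print(f"Average speedup: {speedup:.1f}x")
print(f"Per-puzzle speedup: {min(per_puzzle):.1f}x to {max(per_puzzle):.1f}x")
print(f"Puzzle 6 gap: {puzzle6_gap:.2f}s")
```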


Conclusion

The throughput numbers are real, but speed isn't the main story. The more interesting claim is architectural: some tasks are the wrong shape for left-to-right generation. Connections is a clean example, but the same logic applies to any problem where every answer constrains every other: grouping, structured classification, anything that requires seeing the whole before committing to any part.

Forcing those problems through a sequential pipeline isn't just slow. It's a mismatch between the structure of the problem and the structure of the computation. Diffusion, starting from the full output and sharpening in passes, is a better fit.

What Nemotron-Labs-Diffusion demonstrates is that you don't have to pick a side. One model, the same weights, and a mask flip at inference time gives you both. The 6× efficiency gain on parallel hardware and the 76% headroom still on the table suggest this is early innings.

The model is available in three sizes on Hugging Face: 3B (runs on a free Colab T4), 8B (A100 40GB), and 14B (80GB+). The Connections experiment is fully reproducible; the notebook is available here.


References

[1] NVIDIA, Nemotron-Labs-Diffusion Technical Report (2025), huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B