
Which local models stay fast at long context

Agentic coding lives at long context: the agent reads a file, runs a tool, reads the output and tries again until the window is stuffed. Most models that feel quick on a short prompt have slowed to a crawl by the time they are 150K tokens deep, so this whole thing started as a hunt for a local model that stays fast that far out on a pair of RTX PRO 6000s. It started as a Reddit post.

24 June 2026 · updated 29 June 2026 · origin story

The target was a "local Sonnet", something good enough to sit behind the agent all day without a per-token meter running. For that kind of work the number that actually matters is not tok/s on a fresh prompt but decode speed once the context is already large, because that is where the loop spends its time: tool output after tool output piling into the window while you wait on every turn.

It comes down to attention

Whether a model holds up out there comes down to how it handles attention. MiMo 2.5 stays fast on these cards because it uses the same 5-to-1 local/global sliding-window attention that Gemma 3 does: most layers only look at recent tokens while a few still read the full context, so it keeps its speed as the window grows without losing the plot. The dense-attention models re-read everything on every single token and pay for it more and more as the context climbs.
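To make the scaling concrete, here is a rough sketch that counts how many cached positions one decoded token has to attend to under each layout. The layer count (48) and window size (4,096 tokens) are illustrative assumptions, not the real MiMo 2.5 or Gemma 3 configuration; only the 5-to-1 local/global ratio comes from the text above.

```python
def kv_reads_per_token(context_len, n_layers=48, window=4096, local_per_global=5):
    """Count cached positions attended to when decoding one token.

    n_layers and window are assumed values for illustration only.
    local_per_global=5 means five sliding-window layers per full-context layer;
    local_per_global=0 models a fully dense stack.
    """
    n_global = n_layers // (local_per_global + 1)
    n_local = n_layers - n_global
    # Local layers never look further back than the window.
    local_reads = n_local * min(context_len, window)
    # Global layers read the whole context, so their cost grows linearly.
    global_reads = n_global * context_len
    return local_reads + global_reads


for ctx in (8_000, 50_000, 150_000):
    dense = kv_reads_per_token(ctx, local_per_global=0)
    hybrid = kv_reads_per_token(ctx, local_per_global=5)
    print(f"{ctx:>7} tokens: dense reads {dense / hybrid:.1f}x more than the 5-to-1 hybrid")
```

This only counts attention reads and ignores everything else that shapes real decode speed, but it shows why the gap widens with context: the local layers' cost is capped at the window while the dense stack keeps growing.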

[Chart: decode speed (tok/s, 20 to 120) against context length (0 to 180K tokens) for MiMo 2.5, Step 3.7 Flash, MiniMax M2.7 and MiniMax M3, with the 40 tok/s usable floor marked.]
Decode speed measured on the two-card box as the context fills up. The two sliding-window models, MiMo 2.5 and Step 3.7 Flash, hold at or above the ~40 tok/s usable floor all the way out to 180K, while the dense MiniMax M2.7 and M3 sink into the teens. DeepSeek V4, once it ran, sits flat and far above every line here, which is the next post. Full curves in the data.

MiMo 2.5 is still north of 60 tok/s out past 170K context, comfortably usable. MiniMax M2.7 and M3, both dense, drop under 20, which for something you are sitting in front of waiting on is the go-make-a-coffee, then lunch, water the plants, watch the grass grow range.
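For a rough feel of what those rates mean per turn, the sketch below converts decode speed into waiting time. The 1,500-token reply length is an assumed figure for illustration, and prompt processing time is ignored.

```python
def seconds_to_decode(n_tokens, tok_per_s):
    """Time spent generating n_tokens at a steady decode rate."""
    return n_tokens / tok_per_s


REPLY_TOKENS = 1_500  # assumed length of one agent turn, not a measured value

for label, rate in (("60 tok/s", 60), ("40 tok/s floor", 40), ("20 tok/s", 20)):
    print(f"{label:>15}: {seconds_to_decode(REPLY_TOKENS, rate):.0f} s per turn")
```

An agent loop makes many such turns per task, so the difference compounds quickly.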

The best models are still stuck behind software

The genuinely annoying part is that the models everyone actually wants, DeepSeek V4 and MiniMax M3, lean on custom GPU kernels that nobody has written for consumer Blackwell yet. They are built for the datacenter cards (SM100, the B200 class) and not the SM120 in an RTX PRO 6000 or a 5090. So MiniMax M3 quietly falls back to dense attention and slows to a crawl, and DeepSeek V4's ops drop to CPU and grind down to around 14 tok/s, which is a non-starter for interactive work.

Plenty of dead ends went into confirming that. SGLang and vLLM with NVFP4 quants run a little faster at baseline, but the attention still tanks the same way once the context is long, and NVFP4 on SM120 is buggy enough right now that it mostly spat out garbage on long generations anyway.

So for a while the practical answer was just the models built on the "older" attention approach. MiMo 2.5 and Step 3.7 Flash (which uses the same trick with a 3-to-1 hybrid instead of 5-to-1) hold at 40 tok/s or better even out at 178K context, which was enough to keep the agent moving.

Around Sonnet, at first

Quality was the surprising bit. In a small private coding benchmark Opus nailed the task including a nasty edge case, Sonnet got the core of it right, and the sliding-window locals that ran fast enough to bother with landed around Sonnet's level, so "local Sonnet" was genuinely on the table. MiMo solved it in about 4 minutes, the same ballpark as Opus and Sonnet, where dense MiniMax M3 took closer to 40. That first read was rosier than what held up later once the same models got hammered across the full task set, but it was enough to keep going.

Then DeepSeek actually ran

That is roughly where this was supposed to end: sliding-window models as the answer and the shiny dense ones written off as stuck behind missing software. Then sharing the numbers on r/LocalLLaMA turned up a tip that changed the whole picture. DeepSeek V4 Flash does run on these cards after all, not through llama.cpp where it fell back to CPU, but on a community vLLM build carrying the right kernels. Once it was up it held north of 170 tok/s, dead flat from nothing out past 250K tokens of context, and finished the same benchmark clean in 2 minutes 25 seconds at Sonnet quality, faster than Claude Code manages it.

So the sliding-window models got the whole thing rolling and they are still the easy answer if you just want something that works without fighting the software, but the moment DeepSeek V4 ran properly the question stopped being "what stays fast enough" and turned into "how does a local model actually stack up against the API", which is where the next post picks up.

Update: what the full benchmark showed

The decode chart up top made MiMo 2.5 look like the answer, and on raw speed at long context it is, but decode tok/s turned out to be half the story at most. Weeks later, once these same models had been through a proper benchmark of real coding tasks, every attempt graded against the fix that actually shipped and timed from start to finish, a messier picture fell out.

[Chart: distribution of each model's attempts across the four grades (Miss, Basic, Solid, Complete). Mean grade: DeepSeek V4 2.04, MiniMax M3 1.92, MiniMax M2.7 1.58, Step 3.7 1.36, MiMo 2.5 1.30.]
How each model's attempts land across the four grades; the fill breaks wherever a grade got no attempts. MiMo 2.5 splits into a lump at Miss and a lump at Complete with a clear gap between, a bullseye-or-flub model that skips Basic and Solid almost entirely. MiniMax M2.7 sits as one lump on Solid and rarely reaches Complete. MiniMax M3 lands the most Complete of any local, and DeepSeek V4, once it ran, is both strong and consistent.

The speed winner was not the quality winner. MiMo, the sliding-window model that stays quick deep into a long context, is a coin flip on quality: it either nails a task outright or flubs it entirely, with almost nothing in between. MiniMax M2.7 has the opposite temperament, mostly landing Solid but rarely getting all the way to Complete, dependable without being brilliant. The slow dense M3 quietly turned in the best quality of any local here, which would make it the obvious pick if it were not for the other axis.

[Chart: spread of wall-clock time per task on a scale from 1 to 80 minutes, one row per model, fastest on top: DeepSeek V4, MiniMax M2.7, MiMo 2.5, Step 3.7, MiniMax M3.]
Each model's spread of wall-clock time to finish a task, in real minutes, each task weighted equally so uneven run counts don't skew it; dashed line is the median, fastest on top. DeepSeek V4 clusters around 2 minutes a task, MiniMax M3, the quality leader, closer to 20.
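The equal-per-task weighting in that caption can be sketched as follows: collapse each task's runs to a single number first, then summarize across tasks, so a task that happened to get six runs counts no more than one that got a single run. The task names and timings are made up for illustration, and using a per-task median is an assumption rather than a description of the actual pipeline.

```python
from statistics import median


def per_task_summary(runs_by_task):
    """Collapse each task's runs to one value so every task counts once."""
    return {task: median(times) for task, times in runs_by_task.items()}


def task_weighted_median(runs_by_task):
    """Median over tasks, not over raw runs."""
    return median(per_task_summary(runs_by_task).values())


# Hypothetical wall-clock minutes; task_a was simply run more often.
runs = {
    "task_a": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
    "task_b": [10.0, 12.0],
    "task_c": [20.0],
}

# Pooling every run lets the over-sampled task dominate the summary.
pooled = median(t for times in runs.values() for t in times)
print("pooled over runs:", pooled)
print("weighted by task:", task_weighted_median(runs))
```

With these invented numbers the pooled median is pulled down to the over-sampled easy task, while the per-task version lands on the middle task.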

On this set of tasks the tradeoff was lopsided: the best local model on quality (MiniMax M3) took around 20 minutes a task, the fastest-decoding sliding-window one (MiMo 2.5) was a coin flip, and the only one that was both quick and good was DeepSeek V4.

What it runs on

[Chart: VRAM per model, weights plus reserved KV cache (trim with context), with reference lines at 24, 48, 64, 96, 128 and 192 GB. Totals: DeepSeek V4 186.7 GB, MiniMax M2.7 185.1 GB, MiniMax M3 177.9 GB, Step 3.7 Flash 161.1 GB, MiMo 2.5 144.2 GB.]
VRAM while serving, split into fixed weights (solid) and reserved KV cache (faint). All five models, the dense MiniMax M3 and M2.7, the sliding-window Step 3.7 Flash and MiMo 2.5, and DeepSeek V4, sit between roughly 144 and 187 GB total, so each fits on two cards. Reference lines mark common total capacities so you can map it onto your own cards.
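A quick way to map those totals onto a given box is to subtract them from its capacity. The totals below are the ones shown in the chart; the 96 GB per card figure for the RTX PRO 6000 is an assumption to check against your own hardware, and the reserved KV cache portion shrinks if you serve a shorter context.

```python
# Total VRAM while serving (weights + reserved KV cache), in GB, from the chart.
SERVING_GB = {
    "DeepSeek V4": 186.7,
    "MiniMax M2.7": 185.1,
    "MiniMax M3": 177.9,
    "Step 3.7 Flash": 161.1,
    "MiMo 2.5": 144.2,
}


def headroom_gb(total_gb, cards=2, gb_per_card=96.0):
    """Capacity left over; negative means it does not fit as configured.

    gb_per_card=96.0 is an assumed per-card capacity, not a measured value.
    """
    return cards * gb_per_card - total_gb


for name, total in SERVING_GB.items():
    print(f"{name:<15} {total:6.1f} GB used, {headroom_gb(total):5.1f} GB free")
```

Changing `cards` or `gb_per_card` gives the same comparison for other setups; a negative result means trimming the context or choosing a smaller model.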

Reproducing any of this comes down to a compose file and a serve config per model, with the odd custom Dockerfile thrown in, and that all lives in how to run it. The grading, the per-task weighting and the thin spots in the coverage are written up in the about section.