DeepSeek V4 Flash finishes coding tasks faster than Sonnet and Opus

The last post ended with DeepSeek V4 finally running fast on the two-card box. That left the obvious thing to actually check: how a local model holds up against the hosted API once both are pointed at the same real work. So the same coding tasks went through DeepSeek, Sonnet and Opus, all graded the same way and timed from start to finish. The short version is that the local box kept finishing first.

29 June 2026 · local vs API

The full story of getting there is its own post, a speed test for a local model that turned, one question at a time, into a whole benchmark. What matters here is the comparison itself: set each model up the way you would actually run it, hand them all the same bugs, and the local DeepSeek V4 not only lands in Sonnet territory on quality but finishes the work faster in wall-clock than Sonnet or Opus over the API. There are caveats, and they are left in rather than sanded off.

Is it even good enough?

Speed is worth nothing if the diffs are wrong, so quality came first and the timing waited. Every model got the same 8 real tasks, and each attempt was graded 0-3 (Miss, Basic, Solid, Complete) against the fix that actually shipped.

[Figure: each model's attempts across the four grades (Miss, Basic, Solid, Complete), with mean grade per model: Fable 5 2.53, Opus 4.8 2.51, Sonnet 5 2.22, DS V4 Max 2.21, DS V4 High 2.04, Sonnet 4.5 1.90, Qwen 27B 1.84, Qwen 35B 1.46]
How each model's attempts land across the four grades. Opus leads on Complete; the local DeepSeek V4 sits in Sonnet territory, above both local Qwens. Full detail in the data.

Opus and Fable win it, which was expected. What was not is that DeepSeek V4 landed squarely in the middle of the pack: above Sonnet 4.5, a hair under Sonnet 5, and well clear of both local Qwens. A local model doing frontier-adjacent work stopped being the thing to prove and became the thing to take for granted.

Worth noting that the difference between Solid and Complete is really the line between "this type-checks and technically works" and "this is a diff I would actually merge". A few attempts got moved down a tier for being correct but sloppy. Reasonable people would draw that line elsewhere. The 8 tasks, and what each grade means for them, are laid out in the tasks list.
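For anyone wanting to reproduce the scoring, the 0-3 scale can be sketched in a few lines of Python. This is a minimal sketch: the grade names and numbers are the ones used above, but the `mean_grade` helper and the sample attempts are made up for illustration, and it assumes a plain average over attempts, which may not be exactly how the published means were aggregated.

```python
# Grade names and the 0-3 scale come from the post; the rest is illustrative.
GRADES = {"Miss": 0, "Basic": 1, "Solid": 2, "Complete": 3}


def mean_grade(attempts):
    """Average the numeric grade over a list of grade names (hypothetical helper)."""
    if not attempts:
        raise ValueError("need at least one graded attempt")
    return sum(GRADES[name] for name in attempts) / len(attempts)


# Made-up attempts for one model, NOT results from the benchmark.
sample = ["Complete", "Solid", "Solid", "Basic"]
print(f"mean grade: {mean_grade(sample):.2f}")
```

Moving a single attempt from Complete down to Solid shifts this four-attempt mean by 0.25, which is why where the Solid/Complete line gets drawn matters for gaps as small as the ones between neighbouring models.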

Then I timed the whole task

The first post was focused on decode speed, tok/s and long context, but that doesn't predict the thing you actually sit around waiting for, which is how long the task takes to finish. So this time it is the full wall-clock that counts, the whole run across dozens of turns, tool round-trips and all, plus a network hop on every call for the hosted models.
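As a rough illustration of what "full wall-clock" means here, the timer wraps the entire agent run rather than a single generation call. The sketch below assumes a hypothetical `run_task` callable standing in for one complete agent session; it is not the harness used for these numbers.

```python
import time


def timed_task(run_task):
    """Run one whole task and return (result, elapsed_seconds).

    `run_task` is a hypothetical callable covering the full agent session:
    every turn, tool round-trip and network hop happens inside it.
    """
    start = time.perf_counter()
    result = run_task()
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_agent_run():
    # Stand-in for a real agent session, just so the sketch runs.
    time.sleep(0.01)
    return "done"


result, seconds = timed_task(fake_agent_run)
print(f"{result} in {seconds:.3f}s")
```

Timing at this level is what lets network latency and turn count show up in the number at all; a tok/s figure measured inside the server never sees them.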

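[Figure: spread of wall-clock time per task for each model, log scale from 30s to 20m. Models, top to bottom: DS V4 High, Sonnet 4.5, DS V4 Max, Qwen 35B, Opus 4.8, Fable 5, Qwen 27B, Sonnet 5]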
Each model's spread of wall-clock time to finish a task, in real minutes (log scale, each task weighted equally so uneven run counts don't skew it); dashed line is the median, fastest on top. DeepSeek is the quickest of all; Sonnet 5 the slowest.
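One way to weight each task equally is to give every run a weight of one over the number of runs for its task, then take a weighted median. The sketch below shows that idea under stated assumptions: `task_weighted_median` is a hypothetical helper, the timings are invented, and the actual aggregation behind the chart may differ in detail.

```python
from collections import defaultdict
from statistics import median


def task_weighted_median(runs):
    """Median over runs where every task carries a total weight of 1.

    `runs` is an iterable of (task_id, seconds) pairs.
    """
    by_task = defaultdict(list)
    for task_id, seconds in runs:
        by_task[task_id].append(seconds)

    # Each run in a task with n runs gets weight 1/n.
    weighted = []
    for times in by_task.values():
        weight = 1.0 / len(times)
        weighted.extend((t, weight) for t in times)
    if not weighted:
        raise ValueError("no runs given")

    weighted.sort()
    half = sum(w for _, w in weighted) / 2.0
    running = 0.0
    for seconds, weight in weighted:
        running += weight
        if running >= half:
            return seconds
    return weighted[-1][0]  # guard against float rounding


# Invented timings: task "a" was run three times, "b" and "c" once each.
runs = [("a", 100), ("a", 100), ("a", 100), ("b", 300), ("c", 500)]
print(median(t for _, t in runs))   # plain median, dominated by task "a"
print(task_weighted_median(runs))   # each task counts once
```

With these invented numbers the plain median is pulled down to the over-sampled task's time, while the task-weighted version lands on the middle task, which is the skew the caption refers to.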

It paints a different picture: once the agent starts bouncing between tool calls, the local model just keeps going, while the hosted ones spend real time waiting on the network every turn. The result that took a re-run to believe was Sonnet 5 at the bottom. It is a strong model, just genuinely slow here: it generates a lot and takes a lot of turns to land, though its results are very consistent.

Quality against speed

Put the two on one plot and the shape of the tradeoff falls out. Two things jump out.

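[Figure: scatter of quality (x-axis, 1.4 to 2.6, better to the right) against task time (y-axis, 150s to 500s, faster toward the top), with corners labelled "fast & capable" and "slow & weak". Models plotted: DS V4 High, DS V4 Max, Opus 4.8, Fable 5, Sonnet 5, Sonnet 4.5, Qwen 27B, Qwen 35B]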
Quality (right = better) against speed (up = faster). DeepSeek V4 sits high-right, fast and capable; Opus is far-right but slow; Sonnet 5 is strong but slowest.

First, DeepSeek V4 sits up in the fast-and-capable corner, next to models it has no business keeping up with. Second, Opus and Fable are out on their own to the right, the best diffs by a clear margin, just slower than the local box to get there. For the single best answer that is where to go. For the day-to-day, where you mostly want a capable agent that turns things around quickly and doesn't meter you, the local model is the one to reach for.

Where this is unfair, but realistic

This is the caveat that cost the most sleep, so, bluntly: it is not a controlled, model-only comparison. The local models run under the opencode agent on vLLM; the hosted models run under Claude Code. Different scaffolds. So part of every gap is the model and part is the harness around it, and the two can't be fully separated. That was left in on purpose: the question was never which raw model wins in a vacuum, it was what you actually get when each is set up the way people really run it.

What it runs on

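[Figure: VRAM while serving per model, split into weights and reserved KV cache (trim with context), with reference lines at 24, 48, 64, 96, 128 and 192 GB. Labelled values: DeepSeek V4 186.7 GB, Qwen 27B 87 GB, Qwen 35B 86.2 GB]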
VRAM while serving, split into fixed weights (solid) and reserved KV cache (faint). DeepSeek V4 is nearly all weights; the Qwens are mostly trim-able KV headroom sitting on top of tiny weights. Reference lines mark common total capacities so you can map it onto your own cards.
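To map those figures onto your own hardware, a quick lookup against the chart's reference capacities is enough. The sketch below assumes the labelled values (186.7, 87 and 86.2 GB) are total VRAM while serving; `smallest_fitting_capacity` is a hypothetical helper, and since the chart does not give a weights-only number, the Qwen results are upper bounds that shrink as the KV cache is trimmed.

```python
from bisect import bisect_left

# Reference capacities (GB) taken from the chart's reference lines.
CAPACITIES_GB = [24, 48, 64, 96, 128, 192]

# Labelled VRAM while serving (GB), assumed here to be weights + reserved KV cache.
SERVING_GB = {"DeepSeek V4": 186.7, "Qwen 27B": 87.0, "Qwen 35B": 86.2}


def smallest_fitting_capacity(total_gb, capacities=CAPACITIES_GB):
    """Return the smallest reference capacity >= total_gb, or None if none fits."""
    index = bisect_left(capacities, total_gb)
    return capacities[index] if index < len(capacities) else None


for model, total in SERVING_GB.items():
    print(f"{model}: {total} GB -> {smallest_fitting_capacity(total)} GB tier")
```

Because DeepSeek V4 is nearly all weights, its tier is close to a hard floor; the Qwens could drop to a smaller tier by reserving less KV cache, at the cost of shorter context.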

The exact setup for each model, the compose files, serve configs and the odd custom dockerfile, is in how to run it; the full method, the grading and the coverage, is in the about section.