The scorecard

Every model is graded on the same 8 coding tasks across 445 runs, with each attempt scored 0-3 against the fix that actually shipped. Everything here is generated straight from the benchmark data.

Scorecard

Miss · Basic · Solid · Complete

#	Model	Mean	Complete%	n
1	Fable 5 API ref	2.53	65%	17
2	Opus 4.8 API ref	2.51	74%	38
3	Sonnet 5 API ref	2.22	56%	18
4	DeepSeek V4 Flash (Think-Max) vLLM b12x	2.21	54%	24
5	DeepSeek V4 Flash (Think-High) vLLM b12x	2.04	49%	41
6	MiniMax M3 IQ3_S llama.cpp	1.92	46%	13
7	Sonnet 4.5 API ref	1.90	32%	31
8	Qwen3.6-27B INT4 vLLM	1.84	42%	38
9	MiniMax M2.7 (Q5) llama.cpp	1.58	8%	12
10	Qwen3.6-35B-A3B AWQ vLLM	1.46	21%	39
11	Step 3.7 Flash llama.cpp	1.36	29%	14
12	Huihui Qwen3.6-35B abliterated (Q8) llama.cpp	1.33	22%	18
13	MiMo-V2.5 llama.cpp	1.30	40%	40

The shape behind each score: how every model's attempts spread across the four grades, with gaps where a grade never happened.

Time per task

Spread of wall-clock time to finish a task, in real minutes, with each task weighted equally so uneven run counts don't skew it; dashed line is the median, fastest on top.
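One way to read "each task weighted equally" is as a two-stage aggregate: summarize each task's runs first, then combine the per-task summaries with equal weight. The sketch below assumes that reading. The function name and the sample durations are hypothetical, and the charts on this page may aggregate differently.

```python
from statistics import median

def task_weighted_median(minutes_by_task):
    """Median of per-task medians, so a task with many runs still counts once."""
    per_task = [median(runs) for runs in minutes_by_task.values() if runs]
    if not per_task:
        raise ValueError("no runs to aggregate")
    return median(per_task)

# Hypothetical data: task A was run five times, task B only once.
runs = {"A": [2.0, 2.0, 2.0, 2.0, 2.0], "B": [10.0]}

# Pooling every run lets the heavily run task dominate the result.
pooled = median([m for task_runs in runs.values() for m in task_runs])

# Weighting tasks equally gives A and B the same say.
weighted = task_weighted_median(runs)
```

With this toy data the pooled median sits on task A's value, while the task-weighted figure lands midway between the two tasks.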

Speed & taper

Decode speed as context fills, on the two-card box. Sliding-window and MoE models stay flat; dense models taper hard.

Qwen 27B · Qwen 35B · DS V4 High · MiMo 2.5 · M3 · HuiHui 35B · M2.7 · Step 3.7

Working style

How much context each model ingests versus how many tokens it generates to solve a task. Up-and-right = reads and writes a lot.

VRAM

What each model needs while serving, split into fixed weights (solid) and reserved KV cache (faint) that scales with how much context you keep live. Reference lines mark common total capacities, so it maps onto whatever cards you have rather than any one setup.
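The weights-plus-KV-cache split can be approximated with the usual back-of-envelope estimate for plain full attention: one key and one value vector per layer, per KV head, per token. The sketch below assumes that formula. Every dimension in it is hypothetical rather than taken from any model on this page, and sliding-window or hybrid-attention models will reserve less than it predicts.

```python
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GiB for full attention (K and V stored per layer)."""
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens
    return total_bytes / 2**30

def serving_vram_gib(weights_gib, tokens, **dims):
    """Fixed weights plus the KV cache reserved for `tokens` of live context."""
    return weights_gib + kv_cache_gib(tokens, **dims)

# Hypothetical model: 48 layers, 8 KV heads of size 128, 16-bit cache entries.
dims = dict(layers=48, kv_heads=8, head_dim=128)

cache_64k = kv_cache_gib(65536, **dims)            # cache for 64K tokens of context
total_64k = serving_vram_gib(16.0, 65536, **dims)  # plus 16 GiB of weights
```

The weights term is constant, so the total grows linearly with the context you keep live, which is why the faint portion of each bar is the one that moves.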

The tasks

Each task is a real bug from a production TypeScript/GraphQL codebase, reverted to just before its fix and generalized to an archetype.

The aim was tasks that separate models rather than flatter them, so each one had to clear a few bars:

Small diff, big search: the actual fix is only a handful of lines, but finding it takes real digging through the codebase, so it rewards understanding over typing.
More than one way to do it, not all equal: several fixes will pass, but only some are clean and complete, which is exactly what separates Solid from Complete.
An objective fix that actually shipped: each is a real bug with a known-good merged fix, so attempts get graded against ground truth, not taste.
Contamination-controlled: a clean just-before-fix checkout with history stripped, so nothing about the solution leaks into the prompt.

The set spans difficulty on purpose, from near-floor tasks almost everyone completes to discriminators only the strongest models get right.

#1 Idempotent-update guard Discriminator

A save that changes nothing still fires an expensive downstream state transition; separately, a class of legacy records silently skips an audit step. Two orthogonal defects, graded independently - and a blanket guard that also suppresses legitimate updates doesn't count as a fix.

#2 Change-detection before a side effect Discriminator

Re-saving a parent record re-runs a costly re-processing side effect on a child field that didn't actually change. The guard has to compare on normalized values (whitespace / line-ending-only differences shouldn't count as a change) and run before the side effect fires.
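The guard described here can be sketched as a normalize-then-compare step that runs before the side effect. This is an illustration in Python, not the shipped TypeScript fix: the field name `body`, the `reprocess` callback, and the exact normalization rules (unify line endings, drop trailing whitespace) are all assumptions.

```python
from typing import Callable, Optional

def normalize(text: Optional[str]) -> str:
    """Unify line endings and drop trailing whitespace so cosmetic edits compare equal."""
    if text is None:
        return ""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip()

def save_child(record: dict, new_value: str, reprocess: Callable[[dict], None]) -> bool:
    """Store new_value and run the costly side effect only on a real change.

    Returns True when reprocess was called.
    """
    # Decide BEFORE the side effect fires, on normalized values.
    changed = normalize(record.get("body")) != normalize(new_value)
    record["body"] = new_value
    if changed:
        reprocess(record)
    return changed

calls = []
rec = {"body": "line one\r\nline two  "}
first = save_child(rec, "line one\nline two\n", calls.append)   # line endings only
second = save_child(rec, "line one\nline 2", calls.append)      # real edit
```

Comparing raw strings would treat the first save as a change and fire the re-processing needlessly.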

#3 Search tokenization Discriminator

Terms with internal punctuation (initials, separators) return nothing, because the query and the index tokenize punctuation differently. The real fix aligns the two at the tokenizer; patching the single failing query string is brittle.
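Aligning the two sides at the tokenizer can be illustrated with a toy inverted index in which indexing and querying share one tokenize function. This is a minimal Python sketch, not the codebase's actual search stack: the class, the splitting rule (lowercase, split on anything that is not an ASCII letter or digit), and the sample records are hypothetical.

```python
import re
from collections import defaultdict

_SPLIT = re.compile(r"[^0-9a-z]+")

def tokenize(text):
    """Single tokenizer used for BOTH indexing and querying."""
    return [tok for tok in _SPLIT.split(text.lower()) if tok]

class TinyIndex:
    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        for tok in tokenize(text):
            self._postings[tok].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query token."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for tok in tokens[1:]:
            result &= self._postings.get(tok, set())
        return result

idx = TinyIndex()
idx.add(1, "J.R.R. Tolkien")
idx.add(2, "Jean-Luc Picard")
```

Because both paths go through `tokenize`, a query matches however the punctuation was typed; special-casing one failing query string would leave every other punctuated term broken.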

#4 Query scoping filter Floor (excluded from headline)

A status filter is scoped too broadly and surfaces records that should be excluded - re-scope the underlying query condition. Straightforward; every model solves it, so it's kept only as a floor (tiers barely separate).

#5 Over-strict invariant Near-floor

A safety assertion is too strict and fires on a legitimate edge case, blocking an operation that should succeed. The fix has to narrow the assertion to what it's actually meant to guard - without deleting the safety check.

#6 Cross-transaction atomicity Hardest discriminator

Two related writes run in separate transactions; if the second fails (timeout, contention) they're left permanently out of sync. It compiles and passes the happy path - the desync only manifests under failure, so you have to reason about the failure path, not the common case.
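The failure path is easiest to see with both writes inside a single transaction, so that a failure on the second rolls back the first. The sketch below uses Python's built-in `sqlite3` purely as an illustration. The table names, the `close_order` function, and the simulated timeout are hypothetical and say nothing about the real codebase's database layer.

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("CREATE TABLE ledger (order_id INTEGER PRIMARY KEY, state TEXT NOT NULL)")
    conn.execute("INSERT INTO orders VALUES (1, 'open')")
    conn.execute("INSERT INTO ledger VALUES (1, 'open')")
    conn.commit()
    return conn

def close_order(conn, order_id, fail_second_write=False):
    """Apply both writes in ONE transaction; if either fails, neither persists."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE orders SET status = 'closed' WHERE id = ?", (order_id,))
        if fail_second_write:
            # Stand-in for a timeout or lock contention on the second write.
            raise TimeoutError("simulated failure on the second write")
        conn.execute("UPDATE ledger SET state = 'closed' WHERE order_id = ?", (order_id,))

def snapshot(conn):
    status = conn.execute("SELECT status FROM orders WHERE id = 1").fetchone()[0]
    state = conn.execute("SELECT state FROM ledger WHERE order_id = 1").fetchone()[0]
    return status, state

conn = make_db()
try:
    close_order(conn, 1, fail_second_write=True)
except TimeoutError:
    pass
after_failure = snapshot(conn)   # both rows unchanged, still in sync
close_order(conn, 1)
after_success = snapshot(conn)   # both rows updated together
```

Had the two updates been committed separately, the simulated failure would have left `orders` closed and `ledger` open, which is the desync the task describes.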

#7 Boundary / off-by-one Dead floor (excluded)

An unclamped boundary value produces wrong behaviour right at the limit (a rounding / comparison-direction bug). A one-line boundary fix - every model localizes it, so it's kept only as a floor (core is already near-complete, no real split).

#8 Right-surface + recompute Discriminator (surface trap)

The change must land on the correct one of two similar internal surfaces (a routing trap), then trigger a follow-up recompute + refresh - a second step that's easy to miss. Putting it on the wrong surface compiles fine but is wrong.

How to run it

What it takes to stand up each model yourself. Click a row for the full setup: the weights, the serving config, and the compose file (plus a custom dockerfile, where there is one). The hosted ones just need the API.

Model	Backend	Weights / quant	Serving
Opus 4.8	API ref	(hosted API)	hosted
Fable 5	API ref	(hosted API)	hosted
Sonnet 5	API ref	(hosted API)	hosted
DeepSeek V4 Flash (Think-High)	vLLM b12x	deepseek-ai/DeepSeek-V4-Flash	MTP=1 ON (speculative, b12x)
Qwen3.6-35B-A3B AWQ	vLLM	QuantTrio/Qwen3.6-35B-A3B-AWQ	MTP OFF
Huihui Qwen3.6-35B abliterated (Q8)	llama.cpp	huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-MTP-GGUF	no speculative (MTP heads present, unused - matches AWQ baseline)
Qwen3.6-27B INT4	vLLM	Lorbus/Qwen3.6-27B-int4-AutoRound	MTP OFF (needs vLLM nightly)
MiMo-V2.5	llama.cpp	unsloth/MiMo-V2.5-GGUF (UD-IQ4_XS)	no speculative (llama.cpp)
DeepSeek V4 Flash (Think-Max)	vLLM b12x	deepseek-ai/DeepSeek-V4-Flash	MTP=1 ON (speculative)
Sonnet 4.5	API ref	(hosted API)	hosted
MiniMax M2.7 (Q5)	llama.cpp	unsloth/MiniMax-M2.7-GGUF (UD-Q5_K_XL)	no speculative (llama.cpp)
MiniMax M3 IQ3_S	llama.cpp	unsloth/MiniMax-M3-GGUF (UD-IQ3_S)	no speculative (llama.cpp)
Step 3.7 Flash	llama.cpp	unsloth/Step-3.7-Flash-GGUF (UD-Q6_K)	no speculative (llama.cpp)

About

Every model attempted the same 8 tasks, run autonomously as an agent, and each resulting diff was graded 0-3 by a strong LLM judge against the fix that actually shipped, with a small penalty for avoidable inefficiency. Local models run under the opencode agent on vLLM or llama.cpp; the hosted ones run under Claude Code. The scaffolds differ on purpose: the question is what you get when each model is set up the way people actually run it, not which raw model wins in a vacuum.

Coverage is deliberately uneven. The discriminator tasks, the ones that actually separate models, are run hardest, up to 13 times on a single model; the easy floor tasks and the slowest local models get as few as 1 run, so read the thin cells as directional rather than precise. Scores are the mean tier across a model's runs, never best-of-N. Task time is wall-clock with each task weighted equally, so a heavily run task doesn't dominate. Every chart on this page is drawn from that same benchmark data: 445 graded runs exported as JSON.
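The difference between a mean tier and best-of-N is easy to see on a handful of grades. The sketch below assumes each run carries a single integer grade from 0 to 3, with 3 as the top grade. The `summarize` helper and the sample grades are hypothetical, and the efficiency penalty is not modeled.

```python
TOP_GRADE = 3  # highest grade on the 0-3 scale

def summarize(grades):
    """Mean tier and top-grade rate across ALL runs, plus best-of-N for contrast."""
    if not grades:
        raise ValueError("no graded runs")
    return {
        "mean": round(sum(grades) / len(grades), 2),
        "complete_pct": round(100 * sum(g == TOP_GRADE for g in grades) / len(grades)),
        "n": len(grades),
        "best_of_n": max(grades),  # what the scorecard deliberately does NOT report
    }

# Hypothetical grades for one model across eight runs.
summary = summarize([3, 3, 1, 0, 2, 3, 1, 3])
```

A model that tops out only half the time still shows a perfect best-of-N, which is why the mean across every run is the more honest headline number.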

Download the dataset

scorecard.json · ~86 KB · the aggregated data behind every chart on this page