The scorecard

Every model is graded on the same 8 coding tasks across 445 runs, with each attempt scored 0-3 against the fix that actually shipped. Everything on this page is generated straight from the benchmark data.

Scorecard

Grades: Miss · Basic · Solid · Complete

#   Model                                            Mean   Complete %  n
1   Fable 5 · API ref                                2.53   65%         17
2   Opus 4.8 · API ref                               2.51   74%         38
3   Sonnet 5 · API ref                               2.22   56%         18
4   DeepSeek V4 Flash (Think-Max) · vLLM b12x        2.21   54%         24
5   DeepSeek V4 Flash (Think-High) · vLLM b12x       2.04   49%         41
6   MiniMax M3 IQ3_S · llama.cpp                     1.92   46%         13
7   Sonnet 4.5 · API ref                             1.90   32%         31
8   Qwen3.6-27B INT4 · vLLM                          1.84   42%         38
9   MiniMax M2.7 (Q5) · llama.cpp                    1.58   8%          12
10  Qwen3.6-35B-A3B AWQ · vLLM                       1.46   21%         39
11  Step 3.7 Flash · llama.cpp                       1.36   29%         14
12  Huihui Qwen3.6-35B abliterated (Q8) · llama.cpp  1.33   22%         18
13  MiMo-V2.5 · llama.cpp                            1.30   40%         40

The shape behind each score: how every model's attempts spread across the four grades, with gaps where a grade never happened.

[Chart: per-model grade distribution across Miss, Basic, Solid and Complete. Models and means: Fable 5 2.53 · Opus 4.8 2.51 · Sonnet 5 2.22 · DS V4 Max 2.21 · DS V4 High 2.04 · M3 1.92 · Sonnet 4.5 1.90 · Qwen 27B 1.84 · M2.7 1.58 · Qwen 35B 1.46 · Step 3.7 1.36 · HuiHui 35B 1.33 · MiMo 2.5 1.30]

Time per task

Spread of wall-clock time to finish a task, in real minutes, with each task weighted equally so uneven run counts don't skew it; dashed line is the median, fastest on top.

[Chart: wall-clock time per task, time axis from 30s to 80m. Models, top to bottom: DS V4 High · HuiHui 35B · Sonnet 4.5 · DS V4 Max · Qwen 35B · Opus 4.8 · Fable 5 · Qwen 27B · M2.7 · Sonnet 5 · MiMo 2.5 · Step 3.7 · M3]
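Weighting each task equally can be read as giving every task one unit of weight, split evenly across its runs. The sketch below shows that reading as a weighted median; the run times are made up for illustration and the exact aggregation behind the chart is an assumption, not something the page states.

```python
from fractions import Fraction
from statistics import median

# Hypothetical wall-clock minutes per run, keyed by task (not benchmark data).
minutes_by_task = {
    "task-a": [1, 1, 2, 1, 1, 2],  # run six times
    "task-b": [10],                # run once
    "task-c": [12],                # run once
}

def task_weighted_median(minutes_by_task):
    """Median run time when every task carries the same total weight.

    Assumes every task has at least one run.
    """
    weighted = []
    for times in minutes_by_task.values():
        share = Fraction(1, len(times))  # a task's runs split its one unit of weight
        weighted.extend((t, share) for t in times)
    weighted.sort()
    half = Fraction(len(minutes_by_task), 2)
    running = Fraction(0)
    for minutes, share in weighted:
        running += share
        if running >= half:
            return minutes

pooled = [t for times in minutes_by_task.values() for t in times]
print(median(pooled))                         # 1.5
print(task_weighted_median(minutes_by_task))  # 10
```

The pooled median is dragged toward the task that happened to be run six times; the task-weighted one is not.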

Speed & taper

Decode speed as context fills, on the two-card box. Sliding-window and MoE models stay flat; dense models taper hard.

[Chart: decode speed (tok/s, 20 to 220) against context length (tokens, 0 to 250K), with a line marking the 40 tok/s usable floor. Models plotted: Qwen 27B · Qwen 35B · DS V4 High · MiMo 2.5 · M3 · HuiHui 35B · M2.7 · Step 3.7]

Working style

How much context each model ingests versus how many tokens it generates to solve a task. Up-and-right = reads and writes a lot.

[Chart: output tokens against context ingested (tokens), one point per model. Models plotted: Fable 5 · Sonnet 5 · DS V4 High · Qwen 35B · HuiHui 35B · Qwen 27B · MiMo 2.5 · DS V4 Max · M2.7 · M3 · Step 3.7]

VRAM

What each model needs while serving, split into fixed weights (solid) and reserved KV cache (faint) that scales with how much context you keep live. Reference lines mark common total capacities, so it maps onto whatever cards you have rather than any one setup.

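[Chart: VRAM per model, split into weights and reserved KV cache (trim with context), with reference lines at 24, 48, 64, 96, 128 and 192 GB.]

Model        VRAM
DS V4 High   186.7 GB
M2.7         185.1 GB
M3           177.9 GB
Step 3.7     161.1 GB
MiMo 2.5     144.2 GB
Qwen 27B     87 GB
Qwen 35B     86.2 GB
HuiHui 35B   38 GB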
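To see why the KV reservation trims with context, here is a back-of-envelope sketch using the standard per-token KV size for ordinary (grouped-query) attention. The model dimensions and the 20 GiB weight figure are hypothetical, and models with sliding-window or compressed attention will not follow this formula, so treat it as a rough guide rather than how the chart was produced.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: a key and a value per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def max_context_tokens(budget_bytes, weight_bytes, per_token_bytes):
    """Largest context whose KV cache fits in what the weights leave free."""
    free = budget_bytes - weight_bytes
    return max(free, 0) // per_token_bytes

# Hypothetical model: 48 layers, 8 KV heads of dimension 128, fp16 cache.
per_token = kv_bytes_per_token(layers=48, kv_heads=8, head_dim=128)
print(per_token)                 # 196608 bytes, i.e. 192 KiB per token
print(per_token * 131072 / GIB)  # 24.0 GiB reserved for a 128K-token context

# Hypothetical 20 GiB of weights on a 48 GiB total budget.
print(max_context_tokens(48 * GIB, 20 * GIB, per_token))  # 152917
```

Halving the live context halves the reservation, which is the lever the faint part of each bar represents.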

The tasks

Each task is a real bug from a production TypeScript/GraphQL codebase, reverted to just before its fix and generalized to an archetype.

The aim was tasks that separate models rather than flatter them, so each one had to clear a few bars:

  • Small diff, big search: the actual fix is only a handful of lines, but finding it takes real digging through the codebase, so it rewards understanding over typing.
  • More than one way to do it, not all equal: several fixes will pass, but only some are clean and complete, which is exactly what separates Solid from Complete.
  • An objective fix that actually shipped: each is a real bug with a known-good merged fix, so attempts get graded against ground truth, not taste.
  • Contamination-controlled: a clean just-before-fix checkout with history stripped, so nothing about the solution leaks into the prompt.

The set spans difficulty on purpose, from near-floor tasks almost everyone completes to discriminators only the strongest models get right.

#1 Idempotent-update guard · Discriminator

A save that changes nothing still fires an expensive downstream state transition; separately, a class of legacy records silently skips an audit step. Two orthogonal defects, graded independently - and a blanket guard that also suppresses legitimate updates doesn't count as a fix.

#2 Change-detection before a side effect · Discriminator

Re-saving a parent record re-runs a costly re-processing side effect on a child field that didn't actually change. The guard has to compare on normalized values (whitespace / line-ending-only differences shouldn't count as a change) and run before the side effect fires.
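A minimal sketch of that archetype, in Python for brevity even though the original codebase is TypeScript. The record shape, the field name `body`, the `reprocess` callback and the exact normalization rules are all hypothetical; the shipped fix is not shown on this page.

```python
def normalize(text):
    """Reduce text to the form used for comparison only (assumed rules)."""
    unified = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    return "\n".join(line.rstrip() for line in unified.split("\n")).strip()

def save_child_field(record, new_value, reprocess):
    """Store new_value; run the costly side effect only if it really changed."""
    changed = normalize(record["body"]) != normalize(new_value)  # guard runs first
    record["body"] = new_value
    if changed:
        reprocess(record)
    return changed

calls = []
record = {"body": "line one\nline two"}
save_child_field(record, "line one\r\nline two  \n", calls.append)  # formatting only
save_child_field(record, "line one\nline 2", calls.append)          # real edit
print(len(calls))  # 1
```

The comparison happens on normalized copies while the raw value is still stored as submitted; whether to store the raw or normalized form is a separate design choice.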

#3 Search tokenization · Discriminator

Terms with internal punctuation (initials, separators) return nothing, because the query and the index tokenize punctuation differently. The real fix aligns the two at the tokenizer; patching the single failing query string is brittle.
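The mismatch is easy to reproduce with a toy inverted index. This is an illustrative Python sketch, not the benchmark codebase: the tokenizer rule (lowercase, split on anything non-alphanumeric) and the sample documents are assumptions.

```python
import re
from collections import defaultdict

def tokenize(text):
    """One tokenizer for both sides: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]

def whitespace_only(text):
    """A mismatched query-side tokenizer that leaves punctuation attached."""
    return text.lower().split()

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok].add(doc_id)
    return index

def search(index, query, tokenizer=tokenize):
    """Return ids of documents containing every query token."""
    tokens = tokenizer(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(tok, set()) for tok in tokens))

index = build_index({1: "J.R. Smith", 2: "Jane Smith"})
print(search(index, "J.R.", tokenizer=whitespace_only))  # set()
print(search(index, "J.R."))                             # {1}
```

Rewriting the one failing query string would fix the first call only; sharing the tokenizer fixes every term with internal punctuation.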

#4 Query scoping filter · Floor (excluded from headline)

A status filter is scoped too broadly and surfaces records that should be excluded - re-scope the underlying query condition. Straightforward; every model solves it, so it's kept only as a floor (tiers barely separate).

#5 Over-strict invariant · Near-floor

A safety assertion is too strict and fires on a legitimate edge case, blocking an operation that should succeed. The fix has to narrow the assertion to what it's actually meant to guard - without deleting the safety check.

#6 Cross-transaction atomicity · Hardest discriminator

Two related writes run in separate transactions; if the second fails (timeout, contention) they're left permanently out of sync. It compiles and passes the happy path - the desync only manifests under failure, so you have to reason about the failure path, not the common case.
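The failure path can be shown with an in-memory SQLite database. This is a sketch of one common remedy (putting both writes in a single transaction), not necessarily the fix that shipped; the tables are hypothetical, and a constraint violation stands in for the timeout or contention.

```python
import sqlite3

def fresh_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("CREATE TABLE ledger (order_id INTEGER, entry TEXT NOT NULL)")
    conn.execute("INSERT INTO orders VALUES (1, 'open')")
    conn.commit()
    return conn

def order_status(conn):
    return conn.execute("SELECT status FROM orders WHERE id = 1").fetchone()[0]

def close_order_separately(conn, entry):
    """Buggy shape: the first write is committed before the second is attempted."""
    conn.execute("UPDATE orders SET status = 'closed' WHERE id = 1")
    conn.commit()
    conn.execute("INSERT INTO ledger VALUES (1, ?)", (entry,))  # may fail
    conn.commit()

def close_order_atomically(conn, entry):
    """Both writes share one transaction; a failure rolls back the first too."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE orders SET status = 'closed' WHERE id = 1")
        conn.execute("INSERT INTO ledger VALUES (1, ?)", (entry,))

for close_order in (close_order_separately, close_order_atomically):
    conn = fresh_db()
    try:
        close_order(conn, None)  # None violates NOT NULL, simulating a failed second write
    except sqlite3.IntegrityError:
        pass
    print(close_order.__name__, order_status(conn))
# close_order_separately closed
# close_order_atomically open
```

With a valid entry both versions behave identically, which is why the happy path gives no hint of the bug.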

#7 Boundary / off-by-one · Dead floor (excluded)

An unclamped boundary value produces wrong behaviour right at the limit (a rounding / comparison-direction bug). A one-line boundary fix - every model localizes it, so it's kept only as a floor: scores are already near-complete, so it produces no real split.

#8 Right-surface + recompute · Discriminator (surface trap)

The change must land on the correct one of two similar internal surfaces (a routing trap), then trigger a follow-up recompute + refresh - a second step that's easy to miss. Putting it on the wrong surface compiles fine but is wrong.

How to run it

What it takes to stand up each model yourself. Click a row for the full setup: the weights, the serving config, and the compose file (plus a custom Dockerfile, where there is one). The hosted ones just need the API.

Model
Opus 4.8
Fable 5
Sonnet 5
DeepSeek V4 Flash (Think-High)
Qwen3.6-35B-A3B AWQ
Huihui Qwen3.6-35B abliterated (Q8)
Qwen3.6-27B INT4
MiMo-V2.5
DeepSeek V4 Flash (Think-Max)
Sonnet 4.5
MiniMax M2.7 (Q5)
MiniMax M3 IQ3_S
Step 3.7 Flash

About

Every model solved the same 8 tasks, run autonomously as an agent, and each resulting diff was graded 0-3 by a strong LLM judge against the fix that actually shipped, with a small penalty for avoidable inefficiency. Local models run under the opencode agent on vLLM or llama.cpp; the hosted ones run under Claude Code. Different scaffolds on purpose: the question is what you get when each model is set up the way people actually run it, not which raw model wins in a vacuum.

Coverage is deliberately uneven. The discriminator tasks, the ones that actually separate models, are run hardest, up to 13 times on a single model; the easy floor tasks and the slowest local models get as few as 1 run, so read the thin cells as directional rather than precise. Scores are the mean tier across a model's runs, never best-of-N. Task time is wall-clock with each task weighted equally, so uneven run counts don't skew it. Every chart on this page is drawn from that same benchmark data, 445 graded runs exported as JSON.
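The gap between mean tier and best-of-N is worth seeing on numbers. The grades below are invented for illustration, the pooling across tasks is an assumed reading of "mean tier across a model's runs", and the inefficiency penalty is left out.

```python
from statistics import mean

# Hypothetical 0-3 grades for one model's runs, keyed by task (not benchmark data).
grades_by_task = {
    "task-1": [3, 1, 2, 3],
    "task-2": [0, 3],
    "task-3": [2],
}

def all_grades(grades_by_task):
    return [g for grades in grades_by_task.values() for g in grades]

def mean_tier(grades_by_task):
    """Mean grade over every run: what the scorecard reports."""
    return mean(all_grades(grades_by_task))

def complete_share(grades_by_task):
    """Fraction of runs that reached the top grade, 3 (assumed meaning of Complete %)."""
    grades = all_grades(grades_by_task)
    return sum(g == 3 for g in grades) / len(grades)

def best_of_n(grades_by_task):
    """Best grade per task, then averaged: the flattering number the scorecard avoids."""
    return mean(max(grades) for grades in grades_by_task.values())

print(mean_tier(grades_by_task))                 # 2
print(round(complete_share(grades_by_task), 2))  # 0.43
print(round(best_of_n(grades_by_task), 2))       # 2.67
```

Best-of-N hides the miss on task-2 entirely, while the mean keeps every failed attempt in the score.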

Download the dataset

scorecard.json · ~86 KB · the aggregated data behind every chart on this page