How Far Have Open-Weight LLMs Come? Benchmarking Red- and Green-Flag Extraction on SEC 8-Ks
We run red-flag extraction over SEC filings as a product feature: a model reads an 8-K and returns the material warning signs an analyst should look at first. We run the mirror too, green-flag extraction, pulling the positive signals a filing carries: earnings beats, buybacks, contract wins, regulatory approvals.
What we wanted to test is how far the open-weight models have actually come, GLM especially, given all the talk of them catching the closed frontier. So we ran 5 frontier LLMs plus a deterministic keyword baseline over real 8-Ks, scored without a gold answer key, 3 runs each, twice: once for red flags, once for green.
The gap depends on the task. On green flags the closed-weight lead is about 0.10 in recall, too small for 3 runs to separate from noise. On red flags it's +0.17, positive under all 3 graders, with every interval clear of zero.
The second finding splits the open camp. The newest GLM roughly doubled its green-flag score over the slightly older GLM-5 and gained 0.27 on red, while Kimi moved a little on red and stayed flat on green. How fast the open frontier is closing depends on which open model you mean.
The task and the setup
A red flag in an 8-K is something an investor would act on: a going-concern warning, a debt default, a toxic convertible, an auditor walking out. The job is to read the full filing and return those, each with a severity, a category, and a verbatim quote as evidence. A green flag is the positive mirror, so the green run doubles as a replication check on the red one.
There's no labeled answer key for this at scale, and hand-labeling thousands of filings is the work we're automating in the first place. So the eval is reference-free: outputs get judged against the filing itself and against a frozen pooled consensus, 3 different ways.
The contestants are Opus 4.8, Sonnet 4.6, GPT-5.5, GLM-5.2, and Kimi K2.6, plus slightly older versions of both open models (GLM-5, Kimi K2.5) to measure how fast each family is moving, plus a deterministic keyword baseline: a no-LLM grep that emits the matched span as verbatim evidence and costs nothing.
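A baseline of this kind fits in a few lines. The sketch below is illustrative only: the two patterns, the category names, and the `keyword_flags` function are assumptions for the example, not the production rule set. What it shows is why the evidence is verbatim by construction: the quote is a slice of the filing text.

```python
import re

# Hypothetical patterns for illustration; not the production rule set.
PATTERNS = {
    "going_concern": re.compile(r"substantial doubt[^.]*going concern", re.IGNORECASE),
    "debt_default": re.compile(r"event of default", re.IGNORECASE),
}


def keyword_flags(filing_text: str) -> list[dict]:
    """Return one flag per pattern match, with the matched span as evidence."""
    flags = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(filing_text):
            # match.group(0) is copied straight out of the filing text.
            flags.append({"category": category, "evidence": match.group(0)})
    return flags


# Made-up filing text, used only to exercise the function.
sample = (
    "These conditions raise substantial doubt about the Company's ability "
    "to continue as a going concern. The lender declared an Event of Default."
)
flags = keyword_flags(sample)
```

Anything the patterns don't name is invisible to it, which is what makes it a floor rather than a contender.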
Each filing runs through our production screener prompts (financial, governance, and general; the matching positive-signal set for green), findings merged and deduped by category, full text, no truncation. Every model runs every filing 3 times, with per-call cost recorded.
The red corpus is 23 real 8-Ks: 16 from companies our production detector had already flagged heavily, 7 from a random recent sample. The green corpus is 22 filings: 20 real 8-Ks picked the same way plus 2 clean controls we constructed. Under the consensus reference, 13 red filings carry 39 material flags and 10 come back clean; on green it's 16 filings with 34 flags and 6 clean. A clean filing is a negative control by definition: the right answer on it is silence.
Three ways to grade with no gold labels
evidence_grounding is deterministic: every cited quote must be a verbatim substring of the filing. It catches the most dangerous failure, a fabricated quote that makes a wrong finding look sourced.
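The check itself is a substring test. A minimal sketch, assuming each finding is a dict with a `quote` field (a hypothetical schema) and that matching is exact, with no whitespace or case normalization:

```python
def evidence_grounding(filing_text: str, findings: list[dict]) -> float:
    """Fraction of findings whose quote is a verbatim substring of the filing."""
    if not findings:
        # Assumption: a run that cites nothing has fabricated nothing.
        return 1.0
    grounded = sum(
        1 for f in findings if f.get("quote") and f["quote"] in filing_text
    )
    return grounded / len(findings)


# Made-up filing text and findings, for illustration only.
filing = "The Company's auditor resigned effective March 1. No successor has been engaged."
findings = [
    {"category": "auditor_change", "quote": "auditor resigned effective March 1"},
    {"category": "auditor_change", "quote": "the auditor was dismissed for cause"},
]
score = evidence_grounding(filing, findings)
```

The second quote reads plausibly but appears nowhere in the text, so only half of the findings are grounded.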
An LLM judge (Gemini 3.1 Pro, deliberately not a contestant) scores free-form coverage and relevance.
The workhorse is pooled_coverage. Gemini 3.1 Pro reads the full filing plus the union of every contestant's material findings and synthesizes one deduplicated list of the material flags the filing actually supports. The list is built once and frozen; every run of every model is graded on the fraction of it found, and the baseline is graded against it without ever feeding it. The synthesizer can also add a flag it reads in the filing that no contestant reported; those penalize every contestant equally, and the pool-bias checks below drop them.
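Once a run's findings have been matched to reference flags, the score is a set ratio. The sketch below assumes that matching step has already happened and produced flag IDs; the IDs and the `pooled_coverage` signature are hypothetical.

```python
from typing import Optional


def pooled_coverage(reference_ids: frozenset, found_ids: set) -> Optional[float]:
    """Fraction of the frozen reference flags that one run recovered."""
    if not reference_ids:
        # Clean filing: coverage is undefined, so it is scored as a
        # negative control (any material finding is an alarm) instead.
        return None
    return len(reference_ids & found_ids) / len(reference_ids)


# Hypothetical flag IDs for one filing.
reference = frozenset({"going_concern", "debt_default", "auditor_resignation"})
run_findings = {"going_concern", "auditor_resignation", "unlisted_extra"}
coverage = pooled_coverage(reference, run_findings)
```

Findings outside the reference earn no credit here and cost nothing; over-flagging is measured separately, on the clean filings.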
The honest caveat: the model that writes the reference also grades against it. The stress tests below attack that from 3 sides: an open-weight grader from a different family, a no-LLM lexical grader, and a manual audit of all 39 red flags against the filing text.
Three runs and error bars
One run of a stochastic model is a draw, and the draws differ: Kimi K2.6 pulled 0.46, 0.41, and 0.28 on 3 identical red runs. So everything here reports as a mean and an interval over 3 runs.
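As a worked example on those 3 draws: the sketch below assumes a two-sided 95% Student-t interval on the mean, which is one common choice for 3 runs and not necessarily the interval method behind the numbers in this post. The critical value for 2 degrees of freedom is hard-coded to keep the example dependency-free.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 2 degrees of freedom (n = 3).
T_CRIT_95_DF2 = 4.303


def mean_and_interval(scores: list[float]) -> tuple[float, float, float]:
    """Mean and an assumed 95% t-interval over exactly 3 runs."""
    if len(scores) != 3:
        raise ValueError("the hard-coded critical value assumes exactly 3 runs")
    mean = statistics.mean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))
    half_width = T_CRIT_95_DF2 * std_err
    return mean, mean - half_width, mean + half_width


kimi_red = [0.46, 0.41, 0.28]  # the 3 Kimi K2.6 red runs quoted above
mean, low, high = mean_and_interval(kimi_red)
```

The mean lands near 0.38, and under this assumption the band around it is wide, which is the point: with 3 runs, a single draw says little.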
Every grader is positive on both tasks; on red every interval clears zero, on green every interval crosses it. The lexical grader carries the most weight: it credits a hit only when the words overlap, has no model in the loop, and still puts the closed models ahead.
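The corpus is fixed and the reference frozen, so the intervals are run-to-run reproducibility bands: they capture model stochasticity, not how the result would vary on a different set of filings. We'd call the red result reproducible and directionally solid rather than settled.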
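For a sense of what a no-model lexical grader looks like, here is a minimal sketch of the idea of crediting a hit only when the words overlap. The tokenizer, the overlap measure, and the 0.5 threshold are all assumptions for illustration, not the grader used for the numbers in this post.

```python
import re


def tokens(text: str) -> set:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_hit(reference_flag: str, finding: str, threshold: float = 0.5) -> bool:
    """Credit a hit when enough of the reference flag's words appear in the finding."""
    ref = tokens(reference_flag)
    if not ref:
        return False
    overlap = len(ref & tokens(finding)) / len(ref)
    return overlap >= threshold


# Made-up strings for illustration.
reference_flag = "substantial doubt about going concern"
hit = lexical_hit(reference_flag, "Auditor raised going concern doubt")
miss = lexical_hit(reference_flag, "CEO's son hired as vendor")
```

A grader like this is blunt and misses paraphrases, but it has no model in the loop and so cannot share a model's biases.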
The leaderboard on both tasks
Opus and Sonnet top red flags, and GLM-5.2 ties GPT-5.5 at 0.68. Green reshuffles the order: GLM-5.2 climbs to 2nd while Sonnet falls to 4th. Green is also harder for the whole field.
The flag matrix is the detailed view: every consensus flag as a column, cell intensity showing in how many of the 3 runs each model found it, switchable between tasks and sortable by difficulty or severity.
The right tail is where the gap lives: the tax receivable agreement, shareholder illiquidity, and distressed PIK terms get caught almost exclusively by the closed trio. Opus covers 28 or 29 of the 39 in every run, 21 of them the same flags each time; Kimi K2.6 swings between 11 and 18.
The alarms column is the trigger story. On red, Sonnet alone fires on clean filings: 12 material findings over 3 runs against 0 for everyone else. Half of those filings come from the detector-flagged stratum, so some of the 12 may be real flags the conservative consensus dropped; we read it as an upper bound on over-flagging, and it matters wherever a false alarm costs analyst time. Toggled to green, the mirror flips: there it's GLM-5.2 (4 findings) and the keyword grep (3) firing on 1 consensus-clean filing.
Recall per dollar
The grep prices the floor: $0 a run, it catches the obvious going-concern and default language, misses everything subtle (nepotism, related-party structures, dilution in a financing footnote), and its evidence is verbatim by construction. The frontier models recover 3 to 6 times as much of the consensus, so the real question is what that margin costs.
On cost, the open pick is hard to argue with. GLM-5.2 ties GPT-5.5 on red at $0.34 a run against $1.91, and beats it outright on green at $0.41 against $2.41. The recall crown costs $2.27 a run at Opus.
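The arithmetic behind that comparison, using only the figures quoted above; the dictionary layout and variable names are just for the example.

```python
# Red-flag recall and per-run cost as quoted in this post.
red = {
    "GLM-5.2": {"recall": 0.68, "cost_usd": 0.34},
    "GPT-5.5": {"recall": 0.68, "cost_usd": 1.91},
}

# Recall bought per dollar of inference spend on one run.
recall_per_dollar = {
    model: stats["recall"] / stats["cost_usd"] for model, stats in red.items()
}

# How many times more GPT-5.5 costs than GLM-5.2 per run.
red_cost_ratio = red["GPT-5.5"]["cost_usd"] / red["GLM-5.2"]["cost_usd"]
green_cost_ratio = 2.41 / 0.41
```

At equal recall, GLM-5.2 returns about 2.0 recall per dollar on red against roughly 0.36 for GPT-5.5, and the cost ratio comes out near 5.6 on red and 5.9 on green.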
How fast the open frontier is moving
GLM gained +0.27 on red between GLM-5 and GLM-5.2, more than the entire closed-minus-open gap, and doubled on green, crossing GPT-5.5 on the way up. Kimi gained +0.11 on red and stayed flat on green.
The open frontier moves at 2 speeds, so benchmark the specific model you plan to ship; category labels tell you very little.
Stress-testing the gap
We tried to break the red gap 4 ways; it survived all 4.
Graders: the open-weight GLM grader, from a different family than the judge, reads the gap at +0.18, wider than Gemini's own +0.17; the no-LLM lexical grader reads +0.10.
Pool composition: counting only flags at least 2 models found gives +0.22, and leave-one-out scoring gives +0.19; both also remove the synthesizer-added flags no model covered.
Severity calibration: closed and open rate severity within 0.01 of each other, so the material-pool filter favors neither side.
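The two pool-composition checks can be sketched as filters over a map from each reference flag to the set of models that found it. The data below is a toy example with hypothetical model and flag names, and the leave-one-out rule shown (score a model only on flags at least one other model found) is one reading of the check, not a transcript of our code.

```python
from typing import Optional

# Toy data: which contestants found each reference flag.
found_by = {
    "flag_a": {"closed_1", "closed_2", "open_1"},
    "flag_b": {"closed_1"},
    "flag_c": set(),  # added by the synthesizer, reported by no contestant
    "flag_d": {"closed_2", "open_1"},
}


def min_support_pool(found_by: dict, k: int = 2) -> set:
    """Keep only flags that at least k models found."""
    return {flag for flag, models in found_by.items() if len(models) >= k}


def leave_one_out_pool(found_by: dict, model: str) -> set:
    """Keep only flags found by at least one model other than the one scored."""
    return {flag for flag, models in found_by.items() if models - {model}}


def coverage(found_by: dict, pool: set, model: str) -> Optional[float]:
    """Fraction of the filtered pool that the given model found."""
    if not pool:
        return None
    return sum(1 for flag in pool if model in found_by[flag]) / len(pool)


support_pool = min_support_pool(found_by)
loo_pool = leave_one_out_pool(found_by, "closed_1")
loo_score = coverage(found_by, loo_pool, "closed_1")
```

Both filters drop `flag_c`, the flag no contestant covered, and the leave-one-out pool also drops `flag_b`, so a model gets no credit for a flag that only it contributed to the pool.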
The manual audit: we read all 13 flagged red filings end to end and checked every one of the 39 consensus flags against the text. All 39 evidence quotes are verbatim, 34 flags verified clean, 4 carry a debatable severity, and 1 is mislabeled; dropping the mislabeled flag moves scores by under 0.03.
The audit also found 2 material flags the consensus missed. A pooled reference is a floor, so a coverage of 1.0 would still mean "found everything the pool surfaced" rather than "found everything."
What we'd actually conclude
Which model: on green flags, GLM-5.2, 2nd on recall at about a sixth of GPT-5.5's cost. On red flags the closed models lead, GLM-5.2 ties GPT-5.5 at about a sixth of its cost, and Opus holds the recall crown at $2.27 a run. Sonnet ties Opus on red recall but is the only model that fired on red's clean filings; that trade works where false alarms are cheap and hurts where analysts chase every one.
Closed versus open: the gap is real where the task is hard. Green shows an edge too small for 3 runs to confirm; red shows +0.17 under every grader, surviving every control above. The green run doubles as replication: same direction, smaller size, which is the difficulty-dependence a single benchmark would have hidden.
Set against that, GLM gained +0.27 in a jump between 2 versions, more than the entire gap. We read the picture as a real closed lead with a short half-life, for one open family at least.
The limits, stated plainly: 39 red flags on 13 filings, 34 green on 16; the intervals measure run-to-run reproducibility on fixed corpora of 23 red and 22 green filings; the reference is LLM-written, frozen, lexically cross-checked, and hand-audited, holding up with 1 mislabel in 39.
Contamination is handled by construction. The eval is reference-free, so there's no gold label a model could have memorized, and the full filing sits in the prompt at inference, so having seen it in training confers no edge.
The lesson for anyone standing up an LLM eval: put a deterministic baseline in the lineup, benchmark on real data rather than synthetic fixtures (planted flags are easy for every model, and an easy task hides gaps), build the reference with a strong model and freeze it, run the stochastic models more than once, and draw the error bars before you draw conclusions. Where you can, run a mirror task; a second benchmark that should replicate is the cheapest way to catch a finding that only holds on one dataset.
Where we sit
We build Bollwerk for the second line of defense: risk, compliance, and financial-crime teams. Flag extraction over filings, on the warning side and the positive side, is one of the model-backed features inside that product, which is why the model choice gets an eval rather than a vibe.
The methodology here is the same one we'd want behind any model-risk decision: a deterministic floor, repeated runs, a mirror task, and error bars that decide what's real before the leaderboard does. If your team is making model-selection calls for compliance workloads and would find it useful to compare notes, write to hello@bollwerk.ai.