How Far Have Open-Weight LLMs Come? Benchmarking Red- and Green-Flag Extraction on SEC 8-Ks

June 28, 2026 • 8 min read • Arkadij Kummer


We run red-flag extraction over SEC filings as a product feature: a model reads an 8-K and returns the material warning signs an analyst should look at first. We run the mirror too, green-flag extraction, pulling the positive signals a filing carries: earnings beats, buybacks, contract wins, regulatory approvals.

What we wanted to test is how far the open-weight models have actually come, GLM especially, given all the talk of them catching up with the closed frontier. So we ran 5 current frontier LLMs, slightly older versions of the 2 open-weight ones, and a deterministic keyword baseline over real 8-Ks, scored without a gold answer key, 3 runs each, twice: once for red flags, once for green.

The gap depends on the task. On green flags the closed-weight lead is about 0.10 in recall, too small for 3 runs to separate from noise. On red flags it's +0.17, positive under all 3 graders, with every interval clear of zero.

The second finding splits the open camp. The newest GLM roughly doubled its green-flag score over the slightly older GLM-5 and gained 0.27 on red, while Kimi moved a little on red and stayed flat on green. How fast the open frontier is closing depends on which open model you mean.

The task and the setup

A red flag in an 8-K is something an investor would act on: a going-concern warning, a debt default, a toxic convertible, an auditor walking out. The job is to read the full filing and return those, each with a severity, a category, and a verbatim quote as evidence. A green flag is the positive mirror, so the green run doubles as a replication check on the red one.

There's no labeled answer key for this at scale, and hand-labeling thousands of filings is the work we're automating in the first place. So the eval is reference-free: outputs get judged against the filing itself and against a frozen pooled consensus, 3 different ways.

The contestants are Opus 4.8, Sonnet 4.6, GPT-5.5, GLM-5.2, and Kimi K2.6, plus slightly older versions of both open models (GLM-5, Kimi K2.5) to measure how fast each family is moving, plus a deterministic keyword baseline: a no-LLM grep that emits the matched span as verbatim evidence and costs nothing.

Each filing runs through our production screener prompts (financial, governance, and general; the matching positive-signal set for green), findings merged and deduped by category, full text, no truncation. Every model runs every filing 3 times, with per-call cost recorded.
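The merge-and-dedupe step can be sketched as below. The article says only that findings are "merged and deduped by category"; keeping the highest-severity finding per category is an assumption made for illustration, and the dict fields and `merge_findings` name are hypothetical.

```python
# Severity labels as they appear in the flag matrix legend, lowest to highest.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def merge_findings(*screener_outputs: list[dict]) -> list[dict]:
    """Merge findings from several screener prompts, one finding per category.

    Assumption: when two prompts report the same category, the more severe
    finding wins. The production tie-break rule is not stated in the article.
    """
    best: dict[str, dict] = {}
    for output in screener_outputs:
        for finding in output:
            current = best.get(finding["category"])
            if current is None or (
                SEVERITY_RANK[finding["severity"]] > SEVERITY_RANK[current["severity"]]
            ):
                best[finding["category"]] = finding
    return list(best.values())


# Hypothetical outputs from the financial, governance, and general prompts.
financial = [{"category": "going_concern", "severity": "high", "quote": "q1"}]
governance = [{"category": "auditor_change", "severity": "medium", "quote": "q2"}]
general = [{"category": "going_concern", "severity": "critical", "quote": "q3"}]

merged = merge_findings(financial, governance, general)
```

Here `merged` holds 2 findings: the critical going-concern finding from the general prompt replaces the high one from the financial prompt.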

The red corpus is 23 real 8-Ks: 16 from companies our production detector had already flagged heavily, 7 from a random recent sample. The green corpus is 20 real 8-Ks picked the same way plus 2 constructed clean controls. Under the consensus reference, 13 red filings carry 39 material flags and 10 come back clean; on green it's 16 filings with 34 flags and 6 clean. A clean filing is a negative control by definition: the right answer on it is silence.

Three ways to grade with no gold labels

evidence_grounding is deterministic: every cited quote must be a verbatim substring of the filing. It catches the most dangerous failure, a fabricated quote that makes a wrong finding look sourced.
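A minimal sketch of such a check, assuming findings arrive as simple records with a `quote` field. The `Finding` class, the function names, and the choice to report a fraction are illustrative, not the production implementation.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    category: str
    severity: str
    quote: str  # evidence the model claims to have copied from the filing


def is_grounded(finding: Finding, filing_text: str) -> bool:
    """True only if the quote appears character-for-character in the filing."""
    return bool(finding.quote) and finding.quote in filing_text


def evidence_grounding(findings: list[Finding], filing_text: str) -> float:
    """Fraction of findings whose quote is a verbatim substring of the filing.

    Assumption: an empty findings list scores 1.0, since nothing was fabricated.
    """
    if not findings:
        return 1.0
    return sum(is_grounded(f, filing_text) for f in findings) / len(findings)


filing = "The Company has substantial doubt about its ability to continue as a going concern."
findings = [
    Finding("going_concern", "critical",
            "substantial doubt about its ability to continue as a going concern"),
    Finding("debt_default", "high", "the Company defaulted on its senior notes"),
]
score = evidence_grounding(findings, filing)
```

The second quote never appears in the filing, so `score` is 0.5. The check is deliberately strict: no whitespace normalization and no fuzzy matching, so a paraphrased quote fails.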

An LLM judge (Gemini 3.1 Pro, deliberately not a contestant) scores free-form coverage and relevance.

The workhorse is pooled_coverage. Gemini 3.1 Pro reads the full filing plus the union of every contestant's material findings and synthesizes one deduplicated list of the material flags the filing actually supports. The list is built once and frozen; every run of every model is graded on the fraction of it found, and the baseline is graded against it without ever feeding it. The synthesizer can also add a flag it reads in the filing that no analyst reported; those penalize every contestant equally, and the pool-bias checks below drop them.
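The scoring half of this can be sketched as below, assuming the grader has already mapped each run's findings to reference flag ids. The counts reported later (28 or 29 of 39 flags for a score near 0.73) are consistent with pooling flags across filings, so the sketch does that; the function name and data layout are hypothetical.

```python
def pooled_coverage(reference: dict[str, set[str]], covered: dict[str, set[str]]) -> float:
    """Fraction of frozen reference flags that one run covered, pooled over filings.

    reference: filing id -> ids of the material flags in the frozen list
    covered:   filing id -> ids of the reference flags the grader matched
    Assumption: matching findings to flag ids happens upstream, in the grader.
    """
    total = sum(len(flags) for flags in reference.values())
    if total == 0:
        raise ValueError("reference contains no material flags")
    hits = sum(
        len(flags & covered.get(filing_id, set()))
        for filing_id, flags in reference.items()
    )
    return hits / total


# Hypothetical toy reference: two flagged filings and one consensus-clean filing.
reference = {"A": {"a1", "a2", "a3"}, "B": {"b1"}, "C": set()}
covered = {"A": {"a1", "a3", "zz"}, "B": set(), "C": {"c9"}}

coverage = pooled_coverage(reference, covered)
```

Ids outside the frozen list (`zz`, `c9`) earn nothing, and the clean filing `C` contributes no flags to the denominator, so `coverage` is 2 of 4, or 0.5.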

The honest caveat: the model that writes the reference also grades against it. The stress tests below attack that from 3 sides: an open-weight grader from a different family, a no-LLM lexical grader, and a manual audit of all 39 red flags against the filing text.

Three runs and error bars

One run of a stochastic model is a draw, and the draws differ: Kimi K2.6 pulled 0.46, 0.41, and 0.28 on 3 identical red runs. So everything here reports as a mean and an interval over 3 runs.
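As a worked example, here are the 3 Kimi K2.6 draws turned into a mean and an interval. The article does not say how its 95% intervals are computed; a Student's t interval with 2 degrees of freedom is an assumption here, and `mean_and_interval` is a hypothetical helper.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 2 degrees of freedom (N=3).
T_CRIT_95_DF2 = 4.303


def mean_and_interval(scores: list[float]) -> tuple[float, float, float]:
    """Return (mean, lower, upper) for exactly 3 runs, using a t interval."""
    if len(scores) != 3:
        raise ValueError("the hard-coded critical value assumes exactly 3 runs")
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    half_width = T_CRIT_95_DF2 * standard_error
    return mean, mean - half_width, mean + half_width


kimi_red_runs = [0.46, 0.41, 0.28]
mean, low, high = mean_and_interval(kimi_red_runs)
print(f"mean={mean:.2f}  95% CI=[{low:.2f}, {high:.2f}]")
```

With only 3 runs the critical value is large, so even a modest spread between draws produces a wide band. That is the reason to plot the interval rather than a single number.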

Figure: Closed-weight minus open-weight gap in pooled_coverage by grader, for red flags and green flags. Point estimate and 95% confidence interval across 3 runs (N=3) against each task's frozen reference. On red flags all 3 graders sit at +0.10 to +0.18 and every interval clears zero, including the no-LLM lexical one. On green flags the edge stays near +0.10 and every interval crosses zero.

Every grader is positive on both tasks; on red every interval clears zero, on green every interval crosses it. The lexical grader carries the most weight: it credits a hit only when the words overlap, has no model in the loop, and still puts the closed models ahead.
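A no-LLM lexical grader of this kind could look like the sketch below. The article says only that a hit requires word overlap; the tokenizer, the stopword list, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical stopword list; the real grader's vocabulary handling is not described.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "on", "for", "by", "with", "its", "is"})


def content_tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def lexical_hit(reference_flag: str, finding: str, threshold: float = 0.5) -> bool:
    """Credit a hit when enough of the reference flag's words appear in the finding."""
    ref = content_tokens(reference_flag)
    if not ref:
        return False
    return len(ref & content_tokens(finding)) / len(ref) >= threshold


def lexical_coverage(reference_flags: list[str], findings: list[str], threshold: float = 0.5) -> float:
    """Fraction of reference flags matched by at least one finding."""
    if not reference_flags:
        raise ValueError("no reference flags to grade against")
    hits = sum(
        any(lexical_hit(flag, finding, threshold) for finding in findings)
        for flag in reference_flags
    )
    return hits / len(reference_flags)


reference_flags = [
    "Substantial doubt about ability to continue as going concern",
    "Auditor resignation",
]
findings = [
    "Going concern: substantial doubt about the company's ability to continue",
    "CEO departure",
]
lexical_score = lexical_coverage(reference_flags, findings)
```

The first flag shares 7 of its 8 content words with the first finding and counts as a hit; the second shares none, so `lexical_score` is 0.5. A grader like this under-credits paraphrases, which is why a positive gap under it is hard to blame on model bias.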

The corpus is fixed and the reference frozen, so these are run-to-run reproducibility bands. We'd call the red result reproducible and directionally solid rather than settled.
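The red point estimate can be reproduced from the mean scores quoted in this article. The grouping below (3 closed models against the 2 current open ones, older versions excluded, each side averaged with equal weight) is an assumption that happens to match the reported +0.17; the per-run data behind the intervals is not published here.

```python
from statistics import mean

# Mean red-flag pooled_coverage per model, as quoted in this article.
# Kimi K2.6 is the mean of its 3 reported runs.
red_coverage = {
    "Opus 4.8": 0.73,
    "Sonnet 4.6": 0.70,
    "GPT-5.5": 0.68,
    "GLM-5.2": 0.68,
    "Kimi K2.6": mean([0.46, 0.41, 0.28]),
}

# Assumed grouping: current models only, older open versions left out.
CLOSED = ("Opus 4.8", "Sonnet 4.6", "GPT-5.5")
OPEN = ("GLM-5.2", "Kimi K2.6")


def closed_minus_open(coverage: dict[str, float]) -> float:
    """Unweighted mean of the closed models minus unweighted mean of the open ones."""
    closed = mean(coverage[m] for m in CLOSED)
    open_ = mean(coverage[m] for m in OPEN)
    return closed - open_


gap = closed_minus_open(red_coverage)
print(f"{gap:+.2f}")  # prints +0.17
```

Most of the gap comes from Kimi: GLM-5.2 alone sits level with GPT-5.5, which is the split inside the open camp described at the top.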

The leaderboard on both tasks

Figure: pooled_coverage per model on each task, mean ± standard deviation across 3 runs (N=3) against the frozen reference. Legend: closed, open, older, baseline. Both panels share the model order, and the tabs sort it by red-flag or green-flag rank. Opus (0.73) and Sonnet (0.70) top red flags with GLM-5.2 and GPT-5.5 tied at 0.68; on green flags GLM-5.2 moves to 2nd while Sonnet and both Kimi versions drop sharply. The keyword baseline is the floor on each (0.13 red, 0.09 green).

Opus and Sonnet top red flags, GLM-5.2 ties GPT-5.5 at 0.68, and green reshuffles the order: GLM-5.2 climbs to 2nd while Sonnet falls to 4th. Green runs harder for the whole field.

The flag matrix is the detailed view: every consensus flag as a column, cell intensity showing in how many of the 3 runs each model found it, switchable between tasks and sortable by difficulty or severity.

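Figure: Flag matrix. Every consensus flag, with cell intensity showing runs found out of 3 (0 to 3). Tabs switch between red flags and green flags and sort by difficulty or by severity. Severity legend: critical, high, medium, low.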

Red flags: the left half is common ground (the grep catches 5 of the first 6); the right tail is caught almost only by the closed trio, and the last column escapes every model (a flag the synthesizer took straight from the filing text; no analyst reported it). Sonnet's 12 alarms split 5, 4, 3 per run. Per-model totals reproduce the frozen scores except 1 borderline match (GPT-5.5, run 1).

Green flags: 3 consensus flags escape every model (synthesizer-added from the text, or reworded past their source finding), and Opus alone catches the deal fine print (reverse termination fees, voting agreements). GLM-5.2 (4) and the grep (3) both fired on the same consensus-clean filing, Sleep Number, a detector-flagged pick whose consensus came back empty; the 2 constructed controls stayed silent for all 8 contestants (7 LLMs plus the grep). Totals reproduce the frozen scores except 3 borderline matches.

Every consensus flag as a column (ticker and short name); each cell shows in how many of the 3 runs the model covered it. The strip on top marks severity, and the right column counts material findings on the consensus-clean filings, where the right answer is silence. Switch tasks and sort order with the tabs.

The right tail is where the gap lives: the tax receivable agreement, shareholder illiquidity, and distressed PIK terms get caught almost exclusively by the closed trio. Opus covers 28 or 29 of the 39 in every run, 21 of them the same flags each time; Kimi K2.6 swings between 11 and 18.
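The matrix cells and the "same flags each time" count come from simple set arithmetic over the runs. A sketch with hypothetical flag ids (the real ids and per-run sets are not published here):

```python
from collections import Counter


def runs_found(runs: list[set[str]]) -> Counter:
    """For each flag id, count how many runs covered it (one matrix row)."""
    counts: Counter = Counter()
    for covered in runs:
        counts.update(covered)  # each flag in a run's set adds 1
    return counts


def stable_core(runs: list[set[str]]) -> set[str]:
    """Flags covered in every run: the reproducible part of a model's recall."""
    return set.intersection(*runs) if runs else set()


# Hypothetical covered-flag sets for one model across its 3 runs.
runs = [{"f1", "f2", "f3"}, {"f1", "f3"}, {"f1", "f2", "f4"}]

per_flag = runs_found(runs)
core = stable_core(runs)
```

Here `per_flag` gives f1 a 3, f2 and f3 a 2, and f4 a 1, while `core` is just `{"f1"}`. A model whose core is close to its per-run total, as with Opus at 21 of 28 or 29, is finding the same flags each time; a small core under a similar mean signals run-to-run churn.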

The alarms column is the trigger story. On red, Sonnet alone fires on clean filings: 12 material findings over 3 runs against 0 for everyone else. Half of those filings come from the detector-flagged stratum, so some of the 12 may be real flags the conservative consensus dropped; we read it as an upper bound on over-flagging, and it matters wherever a false alarm costs analyst time. Toggled to green, the mirror flips: there it's GLM-5.2 (4 findings) and the keyword grep (3) firing on 1 consensus-clean filing.

Recall per dollar

Figure: Extraction cost per run against pooled_coverage, with run-to-run error bars (N=3), toggled between the red-flag and green-flag task. Extraction cost only (screener calls; the fixed judge overhead is excluded, so the baseline is $0). Faded points are the previous open generation. On red, GLM-5.2 ($0.34) ties GPT-5.5 (closed, $1.91) on recall at about a sixth of the cost; on green, GLM-5.2 ($0.41) beats GPT-5.5 ($2.41) on recall at about a sixth of the cost.

The grep prices the floor: $0 a run, it catches the obvious going-concern and default language, misses everything subtle (nepotism, related-party structures, dilution in a financing footnote), and its evidence is verbatim by construction. The frontier models recover 3 to 6 times as much of the consensus, so the real question is what that margin costs.
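A keyword baseline of this shape fits in a few lines. The patterns below are hypothetical stand-ins, since the production keyword set is not listed in this article; the point is that the emitted quote is the matched span itself, so it cannot fail a verbatim check.

```python
import re

# Hypothetical patterns for illustration only; not the production keyword set.
PATTERNS = {
    "going_concern": re.compile(r"substantial doubt[^.]*going concern", re.IGNORECASE),
    "debt_default": re.compile(r"event of default|notice of default", re.IGNORECASE),
}


def keyword_baseline(filing_text: str) -> list[dict]:
    """Return one finding per matching category, quoting the matched span."""
    findings = []
    for category, pattern in PATTERNS.items():
        match = pattern.search(filing_text)
        if match:
            # match.group(0) is a slice of filing_text, so it is verbatim by construction.
            findings.append({"category": category, "quote": match.group(0)})
    return findings


filing = (
    "There is substantial doubt about our ability to continue as a going concern. "
    "The lease with an entity controlled by the CEO's brother was renewed."
)
baseline_findings = keyword_baseline(filing)
```

The baseline catches the going-concern sentence and stays silent on the related-party lease, which carries no trigger phrase. That is the pattern described above: obvious language caught, subtle structures missed.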

On cost, the open pick is hard to argue with. GLM-5.2 ties GPT-5.5 on red at $0.34 a run against $1.91, and beats it outright on green at $0.41 against $2.41. The recall crown costs $2.27 a run at Opus.
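The red-flag trade-off is easy to put in numbers using the scores and costs quoted above. The helper name and the coverage-per-dollar framing are illustrative; only the input figures come from the benchmark.

```python
# (mean red pooled_coverage, extraction cost in USD per run), as quoted in this article.
red = {
    "GLM-5.2": (0.68, 0.34),
    "GPT-5.5": (0.68, 1.91),
    "Opus 4.8": (0.73, 2.27),
}


def coverage_per_dollar(coverage: float, cost: float) -> float:
    """Pooled coverage bought per dollar of extraction spend."""
    return coverage / cost


per_dollar = {name: coverage_per_dollar(*pair) for name, pair in red.items()}

# How many times more GPT-5.5 costs than GLM-5.2 for the same red recall.
cost_ratio = red["GPT-5.5"][1] / red["GLM-5.2"][1]

# What the step from GLM-5.2 up to Opus buys and costs per run.
extra_coverage = red["Opus 4.8"][0] - red["GLM-5.2"][0]
extra_cost = red["Opus 4.8"][1] - red["GLM-5.2"][1]

for name, value in per_dollar.items():
    print(f"{name}: {value:.2f} coverage per dollar")
print(f"GPT-5.5 costs {cost_ratio:.1f}x GLM-5.2")
print(f"Opus adds {extra_coverage:+.2f} coverage for ${extra_cost:.2f} more per run")
```

On these figures the cost ratio is about 5.6, which is where "about a sixth" comes from, and the last 0.05 of coverage costs roughly $1.93 more per run. Whether that is worth paying depends on what a missed red flag costs downstream.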

How fast the open frontier is moving

Figure: Jump from a slightly older version, red vs green, for GLM and Kimi. pooled_coverage of each open model against a slightly older version of itself, red panel and green panel on the same scale; the dashed line marks GPT-5.5 on each task. GLM gained +0.27 on red flags (0.40 to 0.68) and doubled on green (+0.30, from below GPT-5.5 to well above it). Kimi gained +0.11 on red and did not move on green (K2.6 equals K2.5 at 0.23).

GLM gained +0.27 on red between GLM-5 and GLM-5.2, more than the entire closed-minus-open gap, and doubled on green, crossing GPT-5.5 on the way up. Kimi gained +0.11 on red and stayed flat on green.

The open frontier moves at 2 speeds, so benchmark the specific model you plan to ship; category labels tell you very little.

Stress-testing the gap

We tried to break the red gap 4 ways; it survived all 4.

Graders: the open-weight GLM grader, from a different family than the judge, reads the gap at +0.18, wider than Gemini's own +0.17; the no-LLM lexical grader reads +0.10. Pool composition: counting only flags at least 2 models found gives +0.22, and leave-one-out scoring gives +0.19; both also remove the synthesizer-added flags no model covered. Severity calibration: closed and open rate severity within 0.01 of each other, so the material-pool filter favors neither side.
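The two pool-composition checks amount to shrinking the reference before rescoring. A sketch, assuming a map from each reference flag to the set of models that covered it; the exact production definitions, especially of leave-one-out, are not given here, so both functions are one plausible reading.

```python
def restrict_to_supported(
    reference: set[str], found_by: dict[str, set[str]], min_models: int = 2
) -> set[str]:
    """Keep only reference flags that at least `min_models` models covered."""
    return {flag for flag in reference if len(found_by.get(flag, set())) >= min_models}


def leave_one_out_reference(
    reference: set[str], found_by: dict[str, set[str]], model: str
) -> set[str]:
    """Reference for scoring `model`: flags that some other model covered.

    Assumption: leave-one-out means a model gets no credit for flags that
    only it contributed to the pool.
    """
    return {flag for flag in reference if found_by.get(flag, set()) - {model}}


# Hypothetical toy pool: f3 was added by the synthesizer and covered by no model.
reference = {"f1", "f2", "f3", "f4"}
found_by = {
    "f1": {"opus", "glm"},
    "f2": {"opus"},
    "f3": set(),
    "f4": {"glm", "kimi", "opus"},
}

supported = restrict_to_supported(reference, found_by)
loo_opus = leave_one_out_reference(reference, found_by, "opus")
loo_glm = leave_one_out_reference(reference, found_by, "glm")
```

Both filters drop `f3`, the flag no model covered. Leave-one-out also removes `f2` from the reference used to score `opus`, so a model cannot lift its own score through flags only it put in the pool.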

The manual audit: we read all 13 flagged red filings end to end and checked every one of the 39 consensus flags against the text. All 39 evidence quotes are verbatim, 34 flags verified clean, 4 carry a debatable severity, and 1 is mislabeled; dropping the mislabeled flag moves scores by under 0.03.

The audit also found 2 material flags the consensus missed. A pooled reference is a floor, so a coverage of 1.0 would still mean "found everything the pool surfaced" rather than "found everything."

What we'd actually conclude

Which model: on green flags GLM-5.2, 2nd on recall at a sixth of GPT-5.5's cost. On red flags the closed models lead, GLM-5.2 ties GPT-5.5 at about a sixth of its cost, and Opus holds the recall crown at $2.27 a run. Sonnet sits within 0.03 of Opus on red recall (0.70 against 0.73) but is the only model that fired on red's clean filings; that trade works where false alarms are cheap and hurts where analysts chase every one.

Closed versus open: the gap is real where the task is hard. Green shows an edge too small for 3 runs to confirm; red shows +0.17 under the primary grader and stays positive under every grader (+0.10 to +0.18), surviving every control above. The green run doubles as replication: same direction, smaller size, which is the difficulty-dependence a single benchmark would have hidden.

Set against that, GLM gained +0.27 in a jump between 2 versions, more than the entire gap. We read the picture as a real closed lead with a short half-life, for one open family at least.

The limits, stated plainly: 39 red flags on 13 filings, 34 green on 16; the intervals measure run-to-run reproducibility on fixed corpora of 23 red and 22 green filings; the reference is LLM-written, frozen, lexically cross-checked, and hand-audited, holding up with 1 mislabel in 39.

Contamination is handled by construction. The eval is reference-free, so there's no gold label a model could have memorized, and the full filing sits in the prompt at inference, so having seen it in training confers no edge.

The lesson for anyone standing up an LLM eval: put a deterministic baseline in the lineup, benchmark on real data rather than synthetic fixtures (planted flags are easy for every model, and an easy task hides gaps), build the reference with a strong model and freeze it, run the stochastic models more than once, and draw the error bars before you draw conclusions. Where you can, run a mirror task; a second benchmark that should replicate is the cheapest way to catch a finding that only holds on one dataset.

Where we sit

We build Bollwerk for the second line of defense: risk, compliance, and financial-crime teams. Flag extraction over filings, on the warning side and the positive side, is one of the model-backed features inside that product, which is why the model choice gets an eval rather than a vibe.

The methodology here is the same one we'd want behind any model-risk decision: a deterministic floor, repeated runs, a mirror task, and error bars that decide what's real before the leaderboard does. If your team is making model-selection calls for compliance workloads and would find it useful to compare notes, write to hello@bollwerk.ai.