Werkstatt Is Open Source: Skill-based Discipline for AI Coding Agents
We've open-sourced Werkstatt, the coding harness we use internally at Bollwerk. It's a set of skills, hooks, and workflow defaults that take an off-the-shelf coding agent — Claude Code, Codex, OpenCode, Cursor — and make it behave more like a careful engineer: brainstorm before coding, plan before executing, write tests first, review work between steps, verify before claiming completion. The repository is MIT-licensed and lives at Bollwerkio/werkstatt.
A skill, in this context, is a small, named bundle of instructions that the agent loads automatically when a trigger condition matches a piece of work. Skills override the model's defaults and make engineering practices — test-driven development, design-before-implementation, evidence-based completion claims — mandatory at the points where they actually matter. The practical effect is that the agent slows down in exactly the places where slowing down pays off, and stays out of the way for the rest.
Werkstatt began as a fork of Jesse Vincent's Superpowers (also MIT-licensed). We've kept the core idea — auto-triggered skills as a workflow substrate — and continued tuning it for the kind of harder engineering work we run into: stronger default model selection, stricter review gates, clearer platform mappings across coding-agent harnesses, and fork-owned installation paths. Credit for the original architecture and a substantial amount of the skill content goes to Jesse and the team at Prime Radiant.
Overview
- The gap we kept hitting
- What a skill actually is
- Skills versus plugins
- The default workflow
- The skill catalogue
- When not to use it
- Platforms
- Why we open-sourced it
- Where to start
The gap we kept hitting
Off-the-shelf coding agents are, by default, very willing. You ask one to implement a feature and it starts implementing. That responsiveness is a real strength on tasks where the right answer is obvious, the surface area is small, and the cost of being wrong is low. On harder work, it is a liability. The failure modes we kept seeing:
- Implementation before specification. The agent writes code that solves a plausible interpretation of the request — not the actual one — and surfaces the misunderstanding in a working pull request, after the work is done and the rollback cost is real.
- No real test discipline. Tests get written after the implementation, often as confirmation rather than verification. They pass on the first run, which usually means they were shaped to match the code rather than the spec.
- Drift across long sessions. A reasonable plan in turn three becomes a different plan by turn forty, with no checkpoint that distinguishes the two.
- Premature completion claims. "Fixed" turns out to mean "I made a change that I believe should fix it," not "I ran the failing case and watched it succeed." The verification step is implied, then skipped.
- Speculative refactoring. The agent tidies adjacent code that wasn't part of the task, which expands the diff, lengthens review, and occasionally introduces regressions in code that was working fine.
These are not exotic problems. They are the same things you would coach a junior engineer through. We wanted a way to make those coaching points part of the system, rather than something that had to be re-prompted in every session.
A typical unharnessed run:
- Reads the request once: plausible interpretation, never confirmed.
- Writes the implementation: no spec, no test plan, no edge cases.
- Adds tests after the fact: tests pass on the first run because they were shaped to match the code.
- Refactors adjacent code: the diff doubles in size; review takes longer.
- "Done": no verification command run; "done" means "I think it works."

The same request under Werkstatt:
- Brainstorm fires first: asks which columns, which formats, what about deleted rows? (caught here)
- Plan written, approved: three tasks, ~5 min each, file paths spelled out.
- Failing test before code: red → green → refactor on every task.
- Reviewer agent compares to plan: off-plan refactors flagged and reverted. (caught here)
- Verified before merging: test suite output captured; "passing" backed by evidence. (caught here)
What a skill actually is
Each skill is a small file that ships with the harness. It has a name, a description, a trigger condition, and a body of instructions; typically a checklist or a process the agent must follow. When the user's request matches the trigger, the harness loads the skill into the conversation. From that point on, the skill's contents take precedence over the agent's defaults wherever they conflict.
The triggers are deliberately broad. Any creative work — adding a feature, modifying behaviour, building a component — fires the brainstorming skill. Any encounter with a bug, test failure, or unexpected behaviour fires systematic-debugging. The intent is that the agent does not have to decide whether to be careful. The harness decides for it, and gets it right by default.
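The mechanics can be sketched as a simple matcher. This is a hypothetical illustration, not Werkstatt's actual loader: the skill names are real, but the trigger keywords and the `Skill`/`match_skills` helpers are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """A named bundle of instructions with a trigger condition."""
    name: str
    triggers: tuple[str, ...]  # deliberately broad keywords that fire the skill
    body: str                  # checklist/process injected into the conversation

# Illustrative catalogue entries; the trigger keywords are invented for this sketch.
CATALOGUE = [
    Skill("brainstorming", ("add", "implement", "build", "modify"),
          "Refine the request into a spec before writing code."),
    Skill("systematic-debugging", ("bug", "failure", "unexpected"),
          "Four-phase root-cause process."),
]

def match_skills(request: str) -> list[Skill]:
    """The harness, not the agent, decides which skills load for a request."""
    words = request.lower()
    return [s for s in CATALOGUE if any(t in words for t in s.triggers)]

loaded = match_skills("Implement CSV export for the reports page")
print([s.name for s in loaded])  # → ['brainstorming']
```

The point of the sketch is the control flow: the decision to be careful happens before the agent sees the request, so there is no step where the model can talk itself out of the discipline.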
Skills come in two flavours. Some are rigid and explicitly tell the agent not to adapt them away under pressure. Others are flexible: patterns to be applied with judgment. Each skill states which type it is, so the agent knows how strictly to follow its contents.
Crucially, skills override the model's defaults but yield to the user. If a project's CLAUDE.md, AGENTS.md, or GEMINI.md says "do not use TDD here," the agent follows the user's instructions. The harness is opinionated; it is not authoritarian.
Skills versus plugins
The two terms come up close together in the documentation, and they're easy to confuse, so it's worth separating them.
A plugin is the unit of distribution. It's the package the harness installs, updates, and removes — a manifest, a versioned bundle, the wiring that exposes content to the agent. Werkstatt itself ships as one plugin per harness. When you run /plugin install werkstatt@werkstatt-marketplace, you're installing a plugin.
A skill is the unit of behaviour. It's a single file with a name, a trigger condition, and a body of instructions that the agent loads when the trigger matches. Plugins are made up of skills; one plugin typically contains many. Werkstatt's plugin contains the dozen-or-so skills listed below.
Put simply: plugins are how the content gets onto your machine; skills are what actually changes how the agent behaves once it's there. Updating the Werkstatt plugin updates all of the skills inside it; disabling the plugin disables all of them at once.
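As a rough picture, a plugin is a manifest plus a directory of skill files. The layout below is illustrative, not the repository's verbatim structure (the manifest filename varies by harness and is left as a placeholder; the repository does ship a skills/ directory):

```
werkstatt/                              # the plugin: unit of distribution
├── <manifest>                          # name, version, harness wiring
└── skills/                             # the skills: units of behaviour
    ├── brainstorming.md
    ├── test-driven-development.md
    └── verification-before-completion.md
```

Install, update, and disable operate on the top level; the behavioural changes all come from the files under skills/.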
The default workflow
The intended path through Werkstatt for a non-trivial piece of work is roughly seven steps. None of them are optional in a fresh session, but each one terminates with a checkpoint that the user can short-circuit if the situation doesn't warrant it.
1. Brainstorm (brainstorming): refine the rough request into a real spec before writing any code.
2. Isolate (using-git-worktrees): fresh branch in a clean worktree, baseline tests verified green.
3. Plan (writing-plans): break the spec into 2–5 minute tasks with file paths and verification steps.
4. Execute (subagent-driven-development): a fresh subagent per task, two-stage review built in.
5. Test-driven (test-driven-development): red → green → refactor at every step. Code without a test gets discarded.
6. Review (requesting-code-review): reviewer agent compares output against the plan; severity-rated findings.
7. Finish (finishing-a-development-branch): verify tests, present merge / PR / keep / discard, clean up the worktree.
1. Brainstorm. Before writing any code, the agent runs the brainstorming skill: a Socratic refinement loop that interrogates the rough request until it becomes a usable specification. It asks the questions a careful engineer would ask — what's the actual user-visible behaviour, what's already in the codebase, where are the edge cases — and presents the spec back in chunks the user can read and approve. The skill exists because implementation before specification was the most expensive failure mode we observed; this is the gate that prevents it.
2. Set up isolation. The using-git-worktrees skill creates a dedicated git worktree on a fresh branch, runs the project's setup, and verifies that the test suite is green before any new code is added. Tests that were already failing on the parent branch get caught here rather than blamed on the new work.
3. Plan. The writing-plans skill turns the approved spec into a detailed implementation plan — bite-sized tasks of two to five minutes each, with exact file paths, expected code, and verification steps. The plan is written to be executable by a fresh agent that has no prior context. That constraint matters: it means the plan has to encode the design decisions explicitly, not rely on conversation history that a subagent will not see.
4. Execute. The agent runs the plan via either subagent-driven-development (a fresh subagent per task, with two-stage review built in) or executing-plans (batch execution with explicit human checkpoints). A subagent is a child agent dispatched with a self-contained brief and no shared memory; the parent agent reviews the result and moves on. The two-stage review is spec-compliance first ("did the subagent build what the plan asked for?") and code-quality second ("is it well-written?"). Order matters: a subagent that built the wrong thing well is worse than one that built the right thing badly.
5. Test-driven development throughout. The TDD skill enforces red-green-refactor at the granularity of every change: write a failing test, watch it fail, write the minimum code to make it pass, watch it pass, then refactor. Code written before its corresponding test gets discarded. This is the rigid-skill end of the catalogue — it does not adapt.
6. Code review between tasks. Once a task is implemented, the requesting-code-review skill runs the work past a reviewer agent that compares it against the plan and the project's conventions. Issues are reported by severity, and critical ones block progress. The receiving-code-review skill, when feedback comes back, requires the agent to verify each suggestion before applying it — performative agreement is explicitly called out as a failure mode.
7. Finish the branch. When all tasks are complete, the finishing-a-development-branch skill runs the test suite, presents the user with structured options (merge, open a PR, keep the branch open, discard), and cleans up the worktree. The verification-before-completion skill is loaded throughout: any "done", "fixed", or "passing" claim has to be backed by an actual command run and its actual output.
The whole sequence is designed so that the agent can be left alone for long stretches. In practice, on the right kind of work, we get autonomous runs of a couple of hours without the agent drifting from the plan.
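The red-green-refactor loop in step 5 is the classic TDD cycle. A minimal sketch in Python (the slugify function and its test are invented for illustration, not taken from the repository):

```python
# RED: write the failing test first and watch it fail
# (running it before slugify exists fails with NameError, i.e. red).
def test_slugify_collapses_spaces():
    assert slugify("Hello  World") == "hello-world"

# GREEN: the minimum implementation that makes the test pass.
def slugify(text: str) -> str:
    return "-".join(text.lower().split())

# REFACTOR: with the test green, restructure freely; the test guards behaviour.
test_slugify_collapses_spaces()
print("green")  # → green
```

The rigid part of the skill is the ordering: the test exists and fails before the implementation is written, which is what makes a first-run pass meaningful.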
The skill catalogue
The repository ships about a dozen skills, organised by what they are for:
| Category | Skill | What it does |
|---|---|---|
| Process | brainstorming | Socratic spec refinement before any creative work |
| Process | writing-plans | Implementation plans detailed enough for a junior engineer |
| Process | executing-plans | Batch execution with human checkpoints |
| Process | subagent-driven-development | Per-task fresh subagents with two-stage review |
| Process | dispatching-parallel-agents | Concurrent subagents when tasks are independent |
| Testing | test-driven-development | Red-green-refactor, anti-patterns reference included |
| Debugging | systematic-debugging | Four-phase root-cause process |
| Quality | requesting-code-review | Pre-review checklist and reviewer dispatch |
| Quality | receiving-code-review | Verifying feedback before applying it |
| Quality | verification-before-completion | Evidence required for any completion claim |
| Workflow | using-git-worktrees | Isolated branches, clean test baselines |
| Workflow | finishing-a-development-branch | Merge, PR, or cleanup decision flow |
| Meta | writing-skills | How to author and test new skills |
| Meta | using-werkstatt | Bootstrapping skill that establishes the rules |
Each skill is a single markdown file with a stable contract: a frontmatter block declaring its name, description, and trigger conditions; a body that contains the instructions and any required checklists. New skills can be added by forking the repository and following the writing-skills meta-skill.
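A sketch of what such a file looks like. The field names and trigger wording here are illustrative, not Werkstatt's exact frontmatter schema:

```markdown
---
name: verification-before-completion
description: Require evidence before any completion claim.
triggers: claiming something is done, fixed, or passing
---

# Verification before completion

Before saying "done", "fixed", or "passing":

1. Run the actual verification command (test suite, failing case, build).
2. Capture its output.
3. Claim completion only if the output shows success.
```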
When not to use it
Werkstatt is not a free upgrade for every session. It is built for high-stakes, complex work — features that touch multiple files, refactors with reasoning behind them, debugging that has resisted a quick fix. For small edits, single-file bug fixes, or trivial scaffolding, the full workflow tends to add needless friction. The brainstorming step in particular feels heavy when the request is something like "rename this variable across the file."
The harness is also token-hungry. Every session loads skill content; every plan dispatches subagents; every code-review pass runs a second model. None of that is free. For lightweight sessions, we either disable the plugin or skip the skill prefix and re-enable Werkstatt when we are back on something that warrants the discipline.
The rule of thumb we use: if the cost of getting it subtly wrong is higher than the cost of running an extra few thousand tokens, Werkstatt earns its keep. If the cost of getting it wrong is "I'll just edit it again," it does not.
Platforms
Werkstatt installs into several coding-agent harnesses. The Claude Code path is the most polished, because that is where most of our internal use lives:
- Claude Code — register the marketplace with `/plugin marketplace add Bollwerkio/werkstatt`, then `/plugin install werkstatt@werkstatt-marketplace`.
- Codex (app and CLI) — the app supports adding the repository directly; the CLI installs by telling Codex to fetch and follow `.codex/INSTALL.md` from the raw GitHub URL. Detailed steps are in `docs/README.codex.md`.
- OpenCode — same pattern as the Codex CLI; tell OpenCode to fetch and follow `.opencode/INSTALL.md` from the raw GitHub URL. Platform-specific docs are in `docs/README.opencode.md`.
- GitHub Copilot CLI — `copilot plugin marketplace add Bollwerkio/werkstatt`, then install the plugin.
- Cursor — a plugin manifest is included in the repository. Once Werkstatt is published to the public Cursor Marketplace it should install via `/add-plugin werkstatt`; until then, the Claude Code, Codex, or OpenCode paths are the supported routes.
- Gemini CLI — metadata is included for experimentation, but we do not currently consider Gemini a suitable harness for serious Werkstatt sessions. The skills depend on reliable multi-step development and exact tool mapping, and Gemini's harness does not yet meet that bar. We will revisit as it matures.
The platform differences mostly show up in tool names — Bash versus shell, WebFetch versus the Codex equivalent — and the repository ships small reference files (references/copilot-tools.md, references/codex-tools.md) that map between them.
Why we open-sourced it
Most of the value in Werkstatt is the skill content, and most of the skill content is generic — TDD, brainstorming, code review, debugging discipline are practices that apply far beyond our domain. Keeping them private would have been a strange call: we did not invent them, the original Superpowers framework was MIT-licensed, and the skill files are easier to refine when other people use them on different problems and report back.
We also don't sell coding harnesses or tools.
Where to start
The repository is at github.com/Bollwerkio/werkstatt. The README has installation paths for each supported harness; the skills/ directory has the full set of skill files; docs/ has the platform-specific guides. If you have a coding agent set up already, the fastest way to feel the difference is to install the plugin, start a session, and ask for something that warrants the discipline — a feature, a bug, a refactor — and watch the brainstorming skill fire before any code gets written.
If you build a skill we should know about, or find a sharp edge in one of the existing skills, the contributing guide in the repository covers the workflow. The writing-skills meta-skill also walks an agent through authoring a new skill end to end, which is the path we use ourselves when adding to the catalogue.