A stateful workflow CLI that turns ClawBio bioinformatics skills into DAG-ordered, reproducible runs. Same "the LLM orchestrates but does not improvise" pitch as ClawBio 1.x, but this time a tiny Python CLI enforces it instead of prose asking nicely.
ClawBio 1.x is a library of 64 skills across GWAS, variant calling, scRNA-seq, pharmacogenomics, and more. 2.0 keeps the skills and rewires the spine.
The default pattern for Claude Code Agent Skills today is: write a long SKILL.md, drop it into the agent's context, hope it follows the steps in order. For one-shot skills ("format this code", "draft this email") that's fine. For workflows where step 3 eats step 2's output and skipping step 1 silently corrupts the result, it isn't.
1.x was 64 carefully written `SKILL.md` files, but prose-only skills are prone to the following failures:
- The agent skips prereqs. Instructions are suggestions; it re-reads, reorders, or drops steps based on what it decides the user really wants.
- The agent hallucinates outputs. When a step is awkward, it invents a plausible-looking result and moves on. Downstream steps consume the fiction.
- Same skill, same inputs, different commands. "Reproducibility" was a sentence in the prose, not a property of the system.
- `/compact` and `/clear` destroy progress. The skill's instructions survive (they get re-loaded), but where the agent was in the workflow doesn't.
I tried out several things (better prompting, LangGraph, XState, RAG-on-the-workflow, ...). None of them addressed the actual problem, which is that prose isn't a contract. You can't enforce a DAG by asking nicely in markdown.
So 2.0 moves the workflow out of the prompt and into a small external state machine:
- Each skill ships a `pipeline.yaml` declaring steps, prereqs, inputs, outputs.
- The CLI is the only legal way to advance: `start → next → open → done → bundle`.
- `open` refuses if prereqs aren't met. `done` refuses if declared outputs are missing. `bundle` refuses if any step is unfinished.
- A Claude Code `Stop` hook refuses to let the session end without a bundle.
- `SKILL.md` collapses from hundreds of lines of prose to ~30 lines of "call `clawbio next`". The agent stops trying to remember a workflow and starts asking what's next.
Given a clean spec, generating the next command is something LLMs are genuinely good at, and they stop bluffing through a workflow half-remembered from a markdown file. Tighter boundaries do more for output quality than any amount of prompt-wrangling I tried before.
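To make the three refusals concrete, here is a minimal in-memory sketch of the idea. It is not the actual `clawbio` implementation: the `Run` class, the `Refused` exception, and the method signatures are hypothetical names chosen for illustration, and the real CLI persists its state to disk rather than holding it in an object.

```python
class Refused(Exception):
    """Raised when a verb is called before its preconditions hold."""


class Run:
    """Hypothetical in-memory stand-in for the run state."""

    def __init__(self, steps):
        # steps maps step id -> {"prereqs": [...], "emits": [...]}
        self.steps = steps
        self.status = {step_id: "pending" for step_id in steps}
        self.outputs = {}

    def open(self, step_id):
        # Refuse unless every prerequisite step has finished.
        unmet = [p for p in self.steps[step_id].get("prereqs", [])
                 if self.status[p] != "done"]
        if unmet:
            raise Refused(f"open {step_id}: unmet prereqs {unmet}")
        self.status[step_id] = "in_progress"

    def done(self, step_id, outputs):
        # Refuse unless every declared output was reported.
        missing = [name for name in self.steps[step_id].get("emits", [])
                   if name not in outputs]
        if missing:
            raise Refused(f"done {step_id}: missing outputs {missing}")
        self.outputs.update(outputs)
        self.status[step_id] = "done"

    def bundle(self):
        # Refuse while any step is still pending or in progress.
        unfinished = [s for s, state in self.status.items() if state != "done"]
        if unfinished:
            raise Refused(f"bundle: unfinished steps {unfinished}")
        return dict(self.outputs)


run = Run({
    "normalize": {"emits": ["normalized_vcf"]},
    "annotate": {"prereqs": ["normalize"], "emits": ["annotated_vcf"]},
})
try:
    run.open("annotate")  # refused: normalize has not finished
except Refused as err:
    print(err)
```

The point of the design is that the agent never holds the ordering in its head: every out-of-order call fails with a message naming what is missing.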
Bioinformatics is the testbed because dependent steps hurt most here: a hallucinated VCF normalization quietly poisons every downstream analysis. But the pattern generalizes to anywhere you have prereqs, side effects, and a need for reproducibility. Data engineering, security audits, ML pipelines, build/release.
Each skill ships a pipeline.yaml that declares its steps, their prereqs, and the inputs/outputs each step needs. The CLI:
- validates the pipeline statically at `start` time (unknown placeholders, forward refs, undeclared `requires`, duplicate ids all fail loudly),
- holds a single `.clawbio-run/state.json` plus an append-only event log,
- refuses to `open` a step whose `prereqs` or `requires` are unmet,
- hashes every declared output at `done` time,
- emits a reproducibility bundle (`commands.sh`, `checksums.sha256`, `environment.yml`) that mirrors actual execution order.
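The static checks in the first bullet can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the code in `pipeline.py`: the `validate` function is a hypothetical name, the pipeline is passed as an already-parsed dict, and `{out}` is assumed to be a built-in placeholder for the step's working directory.

```python
from string import Formatter


def validate(pipeline):
    """Return a list of error strings; an empty list means the checks passed."""
    errors = []
    inputs = {item["name"] for item in pipeline.get("inputs", [])}
    seen_ids = set()   # step ids declared so far, in file order
    emitted = set()    # output names emitted by earlier steps

    for step in pipeline["steps"]:
        sid = step["id"]
        if sid in seen_ids:
            errors.append(f"duplicate step id: {sid}")

        # Forward refs: a prereq must name a step declared earlier.
        for prereq in step.get("prereqs", []):
            if prereq not in seen_ids:
                errors.append(f"{sid}: prereq {prereq!r} is not an earlier step")

        # Undeclared requires: must name an output some earlier step emits.
        for name in step.get("requires", {}):
            if name not in emitted:
                errors.append(f"{sid}: requires undeclared output {name!r}")

        # Unknown placeholders: every {name} must be an input, an earlier
        # output, or the assumed built-in {out}.
        known = inputs | emitted | {"out"}
        template = step.get("command_template", "")
        for _, field, _, _ in Formatter().parse(template):
            if field and field not in known:
                errors.append(f"{sid}: unknown placeholder {{{field}}}")

        seen_ids.add(sid)
        emitted.update(step.get("emits", []))
    return errors
```

Walking the steps once in file order is what makes forward references detectable: anything a step mentions must already have been seen.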
The same state.json serves three audiences: the agent (what's next?), the human reviewer (was it reproducible?), and the benchmark harness (did the agent follow the DAG?). One file, three jobs.
```bash
pip install -e .
```

That gives you the `clawbio` script. Python ≥ 3.9; the only runtime dependency is `pyyaml`.
The repo ships a placeholder skill in example/variant_annotation/ that uses cp and echo (no bcftools or VEP required), so you can exercise the CLI end-to-end without installing the bio stack.
```bash
# 1. start a run from a skill directory
clawbio start example/variant_annotation \
  --input vcf_path=/tmp/x.vcf \
  --input reference=/tmp/ref.fa

# 2. ask what's ready
clawbio next   # -> normalize

# 3. open it (renders the command, marks in_progress)
clawbio open normalize

# 4. run the command yourself, then report outputs
clawbio done normalize --output normalized_vcf=.clawbio-run/work/normalize/normalized.vcf

# 5. repeat until `clawbio next` says nothing is ready, then bundle
clawbio bundle
```

`clawbio status` prints a self-sufficient human-readable summary, written so it can fully re-orient an agent after `/compact` or `/clear`.
For the full driver guide (every verb, every refusal, the contract an agent needs to follow), see USAGE.md. Each skill's SKILL.md links there rather than re-explaining the CLI.
| Command | What it does |
|---|---|
| `start <skill-dir> --input k=v ...` | Load `pipeline.yaml`, initialize `.clawbio-run/state.json`. |
| `status [--check-bundle]` | Print run state. With `--check-bundle`, exits 2 if a finished run is unbundled. |
| `next` | Print the id of the next step whose prereqs are satisfied. |
| `open <step-id>` | Render `{placeholders}` in `command_template`, mark the step `in_progress`. |
| `done <step-id> --output k=path ...` | Hash outputs, advance state, recompute downstream blocks. |
| `bundle` | Emit `commands.sh`, `checksums.sha256`, `environment.yml`. Refuses if any step is unfinished. |
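The "hash outputs" part of `done` and the `checksums.sha256` file written by `bundle` could look roughly like the sketch below. The function names are hypothetical, and the line format (digest, two spaces, path, as read by `sha256sum -c`) is an assumption about what the bundle contains rather than something taken from `bundle.py`.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large outputs are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_lines(outputs):
    """outputs maps output name -> file path; returns one line per file.

    Assumed format: "<hex digest>  <path>", sorted by output name.
    """
    return [f"{sha256_of(path)}  {path}" for _, path in sorted(outputs.items())]
```

Calling `checksum_lines({"normalized_vcf": ".clawbio-run/work/normalize/normalized.vcf"})` would yield one line for that file, provided it exists on disk.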
```yaml
skill: variant_annotation
version: 1
inputs:
  - name: vcf_path
    required: true
  - name: reference
    required: true
steps:
  - id: normalize
    summary: "bcftools-norm placeholder"
    requires_inputs: [vcf_path, reference]
    emits: [normalized_vcf]
    command_template: "cp {vcf_path} {out}/normalized.vcf"
  - id: annotate
    prereqs: [normalize]
    requires: {normalized_vcf: present}
    emits: [annotated_vcf]
    command_template: "cp {normalized_vcf} {out}/annotated.vcf"
```

The schema is deliberately small: no `if`/`when`, no loops, no fan-out, no templating beyond `{var}` substitution. If a step needs branching, write a Python script and make that script the step body. The machine-readable draft-07 JSON Schema lives at `pipeline.schema.json`.
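Because the schema stops at `{var}` substitution, both `next` and the rendering half of `open` reduce to very little logic. The sketch below runs them against the example pipeline, written as a Python dict so it needs no YAML parser. `next_ready` and `render` are hypothetical helper names, and the value used for `{out}` is inferred from the quickstart path, not taken from the CLI source.

```python
PIPELINE = {
    "steps": [
        {"id": "normalize",
         "emits": ["normalized_vcf"],
         "command_template": "cp {vcf_path} {out}/normalized.vcf"},
        {"id": "annotate",
         "prereqs": ["normalize"],
         "emits": ["annotated_vcf"],
         "command_template": "cp {normalized_vcf} {out}/annotated.vcf"},
    ],
}


def next_ready(pipeline, finished):
    """Return the first unfinished step whose prereqs are all finished."""
    for step in pipeline["steps"]:
        if step["id"] in finished:
            continue
        if all(p in finished for p in step.get("prereqs", [])):
            return step["id"]
    return None  # nothing is ready: time to bundle


def render(step, values):
    """Plain {var} substitution; unused values are simply ignored."""
    return step["command_template"].format(**values)


print(next_ready(PIPELINE, set()))           # normalize
print(next_ready(PIPELINE, {"normalize"}))   # annotate
print(render(PIPELINE["steps"][0], {
    "vcf_path": "/tmp/x.vcf",
    "reference": "/tmp/ref.fa",
    "out": ".clawbio-run/work/normalize",    # assumed per-step work directory
}))
# cp /tmp/x.vcf .clawbio-run/work/normalize/normalized.vcf
```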
Drop hooks/stop-bundle.json into your ~/.claude/settings.json (or a project-level settings.json) so Claude Code refuses to end a turn while a finished run is still unbundled.
- `clawbio/` - the v0 CLI (~500 lines: `cli.py`, `commands.py`, `pipeline.py`, `state.py`, `bundle.py`).
- `skills/` - ported ClawBio 1.x skills with `pipeline.yaml`. Currently: `gwas_lookup` (11-step federated variant lookup).
- `example/` - runnable demos that exercise the CLI without external tools. Currently: `variant_annotation` (`cp`/`echo` placeholder, not a real port).
- `hooks/` - Claude Code hook snippets.
- `tests/` - pytest harness covering pipeline loading, schema conformance, and an end-to-end run. Real filesystem, real `state.json`, no mocking: 1.x got burned by mock/prod divergence.
- `USAGE.md` - agent-facing driver guide for the CLI.
- `pipeline-schema.md` - authoritative `pipeline.yaml` reference (for skill authors).
- `pipeline.schema.json` - JSON Schema draft-07 artifact.
ClawBio 2.0 is in progress. Not a finished product.
- The CLI (v0) works. Exercised end-to-end against two demo skills (`variant_annotation`, `gwas_lookup`) covering the full `start → next → open → done → bundle` cycle, schema validation, block recomputation, and the reproducibility bundle.
- Porting 1.x skills is the current focus. All 64 ClawBio 1.x skills need a `pipeline.yaml` and a trimmed `SKILL.md`. scRNA-seq will stress the schema hardest and is the next real port after `variant_annotation`.
- Benchmarking is required, not optional. Every claim in "Why I rewrote it" above (fewer skipped prereqs, fewer hallucinated outputs, real reproducibility, recovery from `/compact`) is a hypothesis until the numbers say otherwise. The two-axis benchmark plan (scientific correctness via the existing `clawbio_bench` harness + new behavioral metrics queried directly from `state.json`) is sketched in bench-pipeline-brainstorm.md. Running it head-to-head against 1.x (same skills, same inputs, same model) is what will tell me whether 2.0 is actually the win it looks like on paper. BixBench (current SOTA 52.2%, Biomni Lab) is the external comparison target.
- Expect breaking changes. `pipeline.yaml` is at v1 but not frozen; CLI flags and `state.json` shape may shift as real skills surface edge cases the demos miss.
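As one illustration of what a behavioral metric could look like, the sketch below counts steps that were opened before their prereqs had finished, given an ordered list of events. Everything about the shape is an assumption: the event dicts with `event` and `step` keys, the `dag_violations` name, and the idea of replaying a trace this way are hypothetical, not the format of `state.json` or the plan in bench-pipeline-brainstorm.md.

```python
def dag_violations(events, prereqs):
    """Replay an ordered event list and report opens that jumped the DAG.

    events:  [{"event": "open" | "done", "step": step_id}, ...]  (assumed shape)
    prereqs: {step_id: [prerequisite step ids]}
    Returns a list of (step_id, missing_prereqs) tuples.
    """
    finished = set()
    violations = []
    for event in events:
        step = event["step"]
        if event["event"] == "open":
            missing = [p for p in prereqs.get(step, []) if p not in finished]
            if missing:
                violations.append((step, missing))
        elif event["event"] == "done":
            finished.add(step)
    return violations


PREREQS = {"normalize": [], "annotate": ["normalize"]}
trace = [
    {"event": "open", "step": "annotate"},   # opened too early
    {"event": "open", "step": "normalize"},
    {"event": "done", "step": "normalize"},
    {"event": "open", "step": "annotate"},
    {"event": "done", "step": "annotate"},
]
print(dag_violations(trace, PREREQS))  # [('annotate', ['normalize'])]
```

A count like this is comparable across 1.x and 2.0 traces only if both are reduced to the same event shape, which is itself part of what the benchmark design would have to settle.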