Abrar-Abir/clawbio2.0

ClawBio 2.0

A stateful workflow CLI that turns ClawBio bioinformatics skills into DAG-ordered, reproducible runs. Same "the LLM orchestrates but does not improvise" pitch as ClawBio 1.x, but this time a tiny Python CLI enforces it instead of prose asking nicely.

ClawBio 1.x is a library of 64 skills across GWAS, variant calling, scRNA-seq, pharmacogenomics, and more. 2.0 keeps the skills and rewires the spine.

Why I rewrote it

The default pattern for Claude Code Agent Skills today is: write a long SKILL.md, drop it into the agent's context, hope it follows the steps in order. For one-shot skills ("format this code", "draft this email") that's fine. For workflows where step 3 eats step 2's output and skipping step 1 silently corrupts the result, it isn't.

1.x was 64 carefully written SKILL.md files. Prose-only skills like these are prone to four failure modes:

  • The agent skips prereqs. Instructions are suggestions; it re-reads, reorders, or drops steps based on what it decides the user really wants.
  • The agent hallucinates outputs. When a step is awkward, it invents a plausible-looking result and moves on. Downstream steps consume the fiction.
  • Same skill, same inputs, different commands. "Reproducibility" was a sentence in the prose, not a property of the system.
  • /compact and /clear destroy progress. The skill's instructions survive (they get re-loaded), but where the agent was in the workflow doesn't.

I tried out several things (better prompting, LangGraph, XState, RAG-on-the-workflow, ...). None of them addressed the actual problem, which is that prose isn't a contract. You can't enforce a DAG by asking nicely in markdown.

So 2.0 moves the workflow out of the prompt and into a small external state machine:

  • Each skill ships a pipeline.yaml declaring steps, prereqs, inputs, outputs.
  • The CLI is the only legal way to advance: start → next → open → done → bundle.
  • open refuses if prereqs aren't met. done refuses if declared outputs are missing. bundle refuses if any step is unfinished. A Claude Code Stop hook refuses to let the session end without a bundle.
  • SKILL.md collapses from hundreds of lines of prose to ~30 lines of "call clawbio next". The agent stops trying to remember a workflow and starts asking what's next.
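The refusals listed above can be sketched as a tiny in-memory state machine. This is a minimal illustration under stated assumptions, not the actual clawbio code: the names `Run`, `Step`, and `Refused` are hypothetical, and the real CLI persists its state to disk rather than holding it in memory.

```python
from dataclasses import dataclass, field
from pathlib import Path


class Refused(Exception):
    """Raised when a transition would violate the workflow contract."""


@dataclass
class Step:
    id: str
    prereqs: list = field(default_factory=list)
    emits: list = field(default_factory=list)
    status: str = "pending"  # pending -> in_progress -> done


class Run:
    def __init__(self, steps):
        # dicts preserve insertion order, so declaration order is kept
        self.steps = {s.id: s for s in steps}

    def next(self):
        """Ids of pending steps whose prereqs are all done."""
        return [
            s.id for s in self.steps.values()
            if s.status == "pending"
            and all(self.steps[p].status == "done" for p in s.prereqs)
        ]

    def open(self, step_id):
        step = self.steps[step_id]
        if step.status != "pending":
            raise Refused(f"{step_id}: already {step.status}")
        unmet = [p for p in step.prereqs if self.steps[p].status != "done"]
        if unmet:
            raise Refused(f"{step_id}: unmet prereqs {unmet}")
        step.status = "in_progress"

    def done(self, step_id, outputs):
        """outputs maps each declared output name to a file path."""
        step = self.steps[step_id]
        if step.status != "in_progress":
            raise Refused(f"{step_id}: not open")
        missing = [
            name for name in step.emits
            if name not in outputs or not Path(outputs[name]).is_file()
        ]
        if missing:
            raise Refused(f"{step_id}: missing declared outputs {missing}")
        step.status = "done"

    def bundle(self):
        unfinished = [s.id for s in self.steps.values() if s.status != "done"]
        if unfinished:
            raise Refused(f"unfinished steps {unfinished}")
        return list(self.steps)
```

The point of the design is that every guard lives in code the agent cannot talk its way around: a skipped prereq or an invented output raises instead of silently advancing.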

Given a clean spec, generating the next command is something LLMs are genuinely good at, and they stop bluffing through a workflow half-remembered from a markdown file. Tighter boundaries do more for output quality than any amount of prompt-wrangling I tried before.

Bioinformatics is the testbed because dependent steps hurt most here: a hallucinated VCF normalization quietly poisons every downstream analysis. But the pattern generalizes to anywhere you have prereqs, side effects, and a need for reproducibility: data engineering, security audits, ML pipelines, build/release.

What it does

Each skill ships a pipeline.yaml that declares its steps, their prereqs, and the inputs/outputs each step needs. The CLI:

  • validates the pipeline statically at start time (unknown placeholders, forward refs, undeclared requires, duplicate ids all fail loudly),
  • holds a single .clawbio-run/state.json plus an append-only event log,
  • refuses to open a step whose prereqs or requires are unmet,
  • hashes every declared output at done time,
  • emits a reproducibility bundle (commands.sh, checksums.sha256, environment.yml) that mirrors actual execution order.
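To make the static checks concrete, here is a rough sketch of what start-time validation could look like for an already-parsed pipeline dict. The function name `validate` and the exact rules are assumptions for illustration: it treats a "forward ref" as a prereq that names a step not declared earlier, and it allows a placeholder only if it is a declared input, an output emitted by an earlier step, or the `out` directory. The real validator may differ.

```python
from string import Formatter


def validate(pipeline):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    inputs = {i["name"] for i in pipeline.get("inputs", [])}
    seen_ids = set()   # step ids declared so far
    emitted = set()    # output names emitted by earlier steps

    for step in pipeline.get("steps", []):
        sid = step["id"]
        if sid in seen_ids:
            errors.append(f"duplicate step id: {sid}")

        # Prereqs must point backwards, which also rules out cycles.
        for p in step.get("prereqs", []):
            if p not in seen_ids:
                errors.append(f"{sid}: prereq {p!r} is not an earlier step")

        # Every required artifact must be emitted by some earlier step.
        for name in step.get("requires", {}):
            if name not in emitted:
                errors.append(f"{sid}: requires undeclared output {name!r}")

        # Every {placeholder} in the command must resolve to something known.
        known = inputs | emitted | {"out"}
        template = step.get("command_template", "")
        for _, field_name, _, _ in Formatter().parse(template):
            if field_name and field_name not in known:
                errors.append(f"{sid}: unknown placeholder {{{field_name}}}")

        seen_ids.add(sid)
        emitted.update(step.get("emits", []))

    return errors
```

Collecting every error instead of stopping at the first one lets a skill author fix a broken pipeline.yaml in a single pass.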

The same state.json serves three audiences: the agent (what's next?), the human reviewer (was it reproducible?), and the benchmark harness (did the agent follow the DAG?). One file, three jobs.

Install

pip install -e .

That gives you the clawbio script. Requires Python ≥ 3.9; the only runtime dependency is pyyaml.

Quickstart

The repo ships a placeholder skill in example/variant_annotation/ that uses cp and echo (no bcftools or VEP required), so you can exercise the CLI end-to-end without installing the bio stack.

# 1. start a run from a skill directory
clawbio start example/variant_annotation \
  --input vcf_path=/tmp/x.vcf \
  --input reference=/tmp/ref.fa

# 2. ask what's ready
clawbio next                   # -> normalize

# 3. open it (renders the command, marks in_progress)
clawbio open normalize

# 4. run the command yourself, then report outputs
clawbio done normalize --output normalized_vcf=.clawbio-run/work/normalize/normalized.vcf

# 5. repeat until `clawbio next` says nothing is ready, then bundle
clawbio bundle

clawbio status prints a self-sufficient human-readable summary, written so it can fully re-orient an agent after /compact or /clear.

For the full driver guide (every verb, every refusal, the contract an agent needs to follow), see USAGE.md. Each skill's SKILL.md links there rather than re-explaining the CLI.

Subcommands

| Command | What it does |
| --- | --- |
| `start <skill-dir> --input k=v ...` | Load pipeline.yaml, initialize .clawbio-run/state.json. |
| `status [--check-bundle]` | Print run state. With --check-bundle, exits 2 if a finished run is unbundled. |
| `next` | Print the id of the next step whose prereqs are satisfied. |
| `open <step-id>` | Render {placeholders} in command_template, mark the step in_progress. |
| `done <step-id> --output k=path ...` | Hash outputs, advance state, recompute downstream blocks. |
| `bundle` | Emit commands.sh, checksums.sha256, environment.yml. Refuses if any step is unfinished. |
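As an illustration of the hashing that done performs and the checksums that bundle emits, the sketch below hashes files with SHA-256 and formats lines the way the `sha256sum` tool does (hex digest, two spaces, path). Both helper names are hypothetical, and whether clawbio writes checksums.sha256 in exactly this layout is an assumption.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large outputs never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_lines(outputs):
    """outputs maps output name -> path; returns sha256sum-style lines."""
    return [f"{sha256_file(path)}  {path}" for path in outputs.values()]
```

If the file does follow this layout, a reviewer could verify a bundle with `sha256sum -c checksums.sha256` without any clawbio tooling.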

pipeline.yaml in one glance

skill: variant_annotation
version: 1

inputs:
  - name: vcf_path
    required: true
  - name: reference
    required: true

steps:
  - id: normalize
    summary: "bcftools-norm placeholder"
    requires_inputs: [vcf_path, reference]
    emits: [normalized_vcf]
    command_template: "cp {vcf_path} {out}/normalized.vcf"

  - id: annotate
    prereqs: [normalize]
    requires: {normalized_vcf: present}
    emits: [annotated_vcf]
    command_template: "cp {normalized_vcf} {out}/annotated.vcf"

The schema is deliberately small: no if/when, no loops, no fan-out, no templating beyond {var} substitution. If a step needs branching, write a Python script and make that script the step body. The machine-readable draft-07 JSON Schema lives at pipeline.schema.json.
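A renderer limited to plain {var} substitution can be sketched with the standard library's `string.Formatter`. The `render` function below is a hypothetical helper, not clawbio's own: it rejects format specs, conversions, and attribute or index lookups so that a template can never do more than splice in a named value.

```python
from string import Formatter


def render(template, values):
    """Substitute {name} placeholders; refuse anything fancier."""
    parts = []
    for literal, field_name, format_spec, conversion in Formatter().parse(template):
        parts.append(literal)
        if field_name is None:
            continue  # trailing literal text with no placeholder after it
        if format_spec or conversion or not field_name.isidentifier():
            raise ValueError(f"only plain {{var}} placeholders are supported: {field_name!r}")
        if field_name not in values:
            raise KeyError(f"unknown placeholder: {field_name}")
        parts.append(str(values[field_name]))
    return "".join(parts)
```

For example, rendering the normalize step's template with `vcf_path=/tmp/x.vcf` and `out=.clawbio-run/work/normalize` yields `cp /tmp/x.vcf .clawbio-run/work/normalize/normalized.vcf`.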

Claude Code integration

Merge the contents of hooks/stop-bundle.json into your ~/.claude/settings.json (or a project-level settings.json) so Claude Code refuses to end a turn while a finished run is still unbundled.
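For readers who would rather script the merge, here is a hedged sketch. The hook entry shown is an assumption about what hooks/stop-bundle.json contains, based on Claude Code's settings format for Stop hooks and its convention that a hook exiting with code 2 blocks the stop (which lines up with `clawbio status --check-bundle` exiting 2). Check the shipped file before relying on it; `merge_stop_hook` is a hypothetical helper.

```python
import json

# Assumed shape of the hook entry; the real hooks/stop-bundle.json may differ.
STOP_HOOK = {
    "hooks": [
        {"type": "command", "command": "clawbio status --check-bundle"}
    ]
}


def merge_stop_hook(settings, entry=STOP_HOOK):
    """Return a copy of settings with entry under hooks.Stop, added at most once."""
    merged = json.loads(json.dumps(settings))  # deep copy of JSON-compatible data
    stop_hooks = merged.setdefault("hooks", {}).setdefault("Stop", [])
    if entry not in stop_hooks:
        stop_hooks.append(json.loads(json.dumps(entry)))
    return merged
```

Because the merge is idempotent and leaves unrelated keys untouched, it should be safe to re-run against a settings file that already has other hooks configured.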

Repo layout

  • clawbio/ - the v0 CLI (~500 lines: cli.py, commands.py, pipeline.py, state.py, bundle.py).
  • skills/ - ported ClawBio 1.x skills with pipeline.yaml. Currently: gwas_lookup (11-step federated variant lookup).
  • example/ - runnable demos that exercise the CLI without external tools. Currently: variant_annotation (cp/echo placeholder, not a real port).
  • hooks/ - Claude Code hook snippets.
  • tests/ - pytest harness covering pipeline loading, schema conformance, and an end-to-end run. Real filesystem, real state.json, no mocking: 1.x got burned by mock/prod divergence.
  • USAGE.md - agent-facing driver guide for the CLI.
  • pipeline-schema.md - authoritative pipeline.yaml reference (for skill authors).
  • pipeline.schema.json - JSON Schema draft-07 artifact.

Status & ongoing work

ClawBio 2.0 is in progress. Not a finished product.

  • The CLI (v0) works. Exercised end-to-end against two demo skills (variant_annotation, gwas_lookup) covering the full start → next → open → done → bundle cycle, schema validation, block recomputation, and the reproducibility bundle.
  • Porting 1.x skills is the current focus. All 64 ClawBio 1.x skills need a pipeline.yaml and a trimmed SKILL.md. scRNA-seq will stress the schema hardest and is the next real port after variant_annotation.
  • Benchmarking is required, not optional. Every claim in "Why I rewrote it" above (fewer skipped prereqs, fewer hallucinated outputs, real reproducibility, recovery from /compact) is a hypothesis until the numbers say otherwise. The two-axis benchmark plan (scientific correctness via the existing clawbio_bench harness + new behavioral metrics queried directly from state.json) is sketched in bench-pipeline-brainstorm.md. Running it head-to-head against 1.x (same skills, same inputs, same model) is what will tell me whether 2.0 is actually the win it looks like on paper. BixBench (current SOTA 52.2%, Biomni Lab) is the external comparison target.
  • Expect breaking changes. pipeline.yaml is at v1 but not frozen; CLI flags and state.json shape may shift as real skills surface edge cases the demos miss.
