Reinforcement learning agent for Slay the Spire — targeting Watcher Ascension 20 with >96% winrate.

| Component | Location | Description |
|---|---|---|
| Engine | `packages/engine/` | Pure Python game engine, 100% Java parity (RNG, damage, enemies, maps, shops, cards, relics, events) |
| Engine (Rust) | `packages/engine-rs/` | Rust CombatEngine + PyO3 bindings (scaffold, 63 tests) |
| Training | `packages/training/` | PPO + GAE pipeline with action-encoded observations, MLX inference, multi-turn combat solver |
| Dashboard | `packages/viz/` | React 19 + Vite live dashboard with floor curves, episode analysis |
| Native App | `packages/viz/macos/` | Swift/WKWebView macOS wrapper |
| Server | `packages/server/` | WebSocket bridge for dashboard |
| Parity | `packages/parity/` | Seed catalog + parity verification |

```bash
# Install
uv sync

# Run tests (6100+ tests)
uv run pytest tests/ -q

# Rust engine tests
export PATH="$HOME/.cargo/bin:$PATH" PYO3_PYTHON=.venv/bin/python3
cargo test --lib --manifest-path packages/engine-rs/Cargo.toml

# Start training
./scripts/training.sh start

# Dashboard (WebSocket + Vite)
./scripts/services.sh start  # localhost:5174

# Native macOS app
./scripts/app.sh build && ./scripts/app.sh run
```

```python
from packages.engine import GameRunner, GamePhase

runner = GameRunner(seed="SEED", ascension=20)
while not runner.game_over:
    # Always take the first available action until the run ends.
    actions = runner.get_available_action_dicts()
    runner.take_action_dict(actions[0])
```

- Model: StrategicNet (3M params, hidden=768, 4 transformer blocks)
- Pipeline: COLLECT 100 games -> TRAIN PPO epochs -> SYNC -> repeat
- Inference: Centralized MLX batch server (M4 Mac Mini, 10 cores)
- Observations: 260-dim state + 512-dim action encoding (available actions mask)
- Combat: TurnSolver (30ms) with multi-turn lookahead
- Rewards: Floor milestones, combat outcomes, HP preservation, PBRS shaping (hot-reloadable)
- Engine parity: 100% across all core mechanics (13 RNG streams, 66 enemies, 51 events, 168 powers, 172 relics)
- Tests: 6100+ passing (pytest) + 63 Rust
- Training: Active — PPO with action encoding, mixed exploit/explore temperature
- Best trajectories: Floor 16 (200 saved), iterating toward Act 2+
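
The available-actions mask and the exploit/explore temperature mentioned above can be illustrated with a small sampling sketch. This is not the project's code: the function names `masked_softmax` and `sample_action`, the toy three-action logits, and the temperature values are all assumptions made for illustration, and the real pipeline runs inference through MLX rather than plain Python.

```python
import math
import random
from typing import List, Sequence


def masked_softmax(logits: Sequence[float], mask: Sequence[bool],
                   temperature: float = 1.0) -> List[float]:
    """Softmax over legal actions only; illegal actions get probability 0.

    Hypothetical helper. Lower temperature concentrates probability on the
    highest logit (exploit); higher temperature flattens it (explore).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not any(mask):
        raise ValueError("at least one action must be available")
    scaled = [l / temperature for l in logits]
    # Subtract the largest legal logit for numerical stability.
    peak = max(s for s, m in zip(scaled, mask) if m)
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scaled, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits: Sequence[float], mask: Sequence[bool],
                  temperature: float, rng: random.Random) -> int:
    """Sample an action index from the masked, temperature-scaled policy."""
    probs = masked_softmax(logits, mask, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [1.0, 2.0, 3.0]
mask = [True, False, True]  # action 1 is not currently available
action = sample_action(logits, mask, temperature=0.5, rng=rng)
```

Masking before normalization, rather than after sampling, guarantees that an unavailable action can never be drawn.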
- bottled_ai — 52% Watcher A0 baseline
- CommunicationMod — Bot communication protocol
- StSRLSolver — Prior RL solver attempt
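
For readers unfamiliar with the PBRS shaping and GAE named in the training pipeline, the following is a minimal sketch of the standard formulas, not the project's implementation. The potential values, `gamma`, `lam`, and the function names `shaped_rewards` and `gae_advantages` are assumptions; the source does not state which potential function or hyperparameters are used.

```python
from typing import List, Sequence


def shaped_rewards(rewards: Sequence[float], potentials: Sequence[float],
                   gamma: float) -> List[float]:
    """Potential-based reward shaping: r'_t = r_t + gamma * phi(s_{t+1}) - phi(s_t).

    `potentials` holds phi for every visited state, so it has one more entry
    than `rewards`. The terminal potential is conventionally 0.
    """
    if len(potentials) != len(rewards) + 1:
        raise ValueError("need one potential per state (len(rewards) + 1)")
    return [r + gamma * potentials[t + 1] - potentials[t]
            for t, r in enumerate(rewards)]


def gae_advantages(rewards: Sequence[float], values: Sequence[float],
                   gamma: float, lam: float) -> List[float]:
    """Generalized Advantage Estimation over one trajectory.

    `values` has one more entry than `rewards`; the last entry is the
    bootstrap value (0.0 if the episode terminated).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("need one value per state (len(rewards) + 1)")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then an exponentially weighted backward sum.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy trajectory: a hypothetical potential that rises with progress.
raw = [0.0, 0.0, 1.0]
phi = [0.0, 0.5, 1.0, 0.0]
shaped = shaped_rewards(raw, phi, gamma=1.0)
adv = gae_advantages(shaped, values=[0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
```

With `gamma=1.0` and a terminal potential of 0, the shaped rewards sum to the same total as the raw rewards, so the shaping term redistributes credit along the trajectory without changing the return.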