Conversation
- Semi-formal reasoning templates per cognitive frame
- Memory quality gates for learn_fact and record_decision
- Complexity gate to skip scaffolds for trivial interactions
- Future: DB-driven templates and CSTP extension
- Inspired by Paper #11 (Semi-Formal Reasoning for Code Analysis)
Changes:
- Drop conversation + initiation scaffolds (keep natural)
- Rename Research Frame → Research Subtask Scaffold (inject via build_subtask_prefix)
- Soften OPTIONS from '2 alternatives required' to 'enumerate if viable'
- Add turn-aware scaffold compression (full → 1-line reminder)
- Merge Phase 2 into existing Brain validation (no parallel prompt gates)
- Fix complexity gate: skip only conversation/initiation, structured frames always scaffold
- Fix Phase 3 schema: remove agents FK, define concrete domain values
- Add run_python interaction section
- Add token efficiency metric
- Update revision history with all changes
🏗️ Architecture Review — F014: Frame Reasoning Scaffolds

Overall Assessment: ✅ Sound design, a few structural concerns to address

The phased approach is well thought out and the spec demonstrates strong understanding of the existing architecture. The core idea — injecting structured reasoning scaffolds per frame — is architecturally clean and aligns with Nous's frame-centric design. Below are my detailed findings.

1. Architectural Fit ✅

Good alignment with existing patterns:
2. Coupling & Separation of Concerns
| Component | Current | With Scaffold (Turn 1) | With Scaffold (Turn 2+) |
|---|---|---|---|
| Frame (budgeted) | ~80 tok | ~80 tok | ~80 tok |
| Tool instructions (unbudgeted) | ~100 tok | ~100 tok | ~100 tok |
| Scaffold (unbudgeted) | 0 tok | ~250 tok | ~40 tok |
| Total frame-related | ~180 tok | ~430 tok | ~220 tok |
The Turn 1 cost (+250 tokens) is a ~138% increase in frame-related prompt overhead. For the decision frame with a 12K total budget, this is ~2% — acceptable. For conversation (3K budget, no scaffold) — N/A. For question (6K budget) — ~4% — fine.
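The percentages above can be sanity-checked directly from the table; a minimal arithmetic sketch (token counts from the table, budgets from the prose):

```python
# Quick check of the overhead figures above.
baseline = 180          # frame-related tokens without scaffold
turn1 = 430             # frame-related tokens with full scaffold on turn 1
scaffold_turn1 = 250    # scaffold-only cost on turn 1

increase = (turn1 - baseline) / baseline      # ~1.39, i.e. the ~138% increase
decision_share = scaffold_turn1 / 12_000      # ~2% of the 12K decision budget
question_share = scaffold_turn1 / 6_000       # ~4% of the 6K question budget
```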
The compression strategy (65% reduction by turn 2) is the right call. Multi-turn cost stays manageable.
One gap: The spec doesn't address what happens when budget.frame truncation in ContextEngine cuts the frame description short. If truncation removes questions_to_ask, the scaffold might reference reasoning steps that conflict with truncated frame guidance. This is unlikely with current 500-token frame budgets but worth a comment.
5. Scaffold ↔ Deliberation ↔ run_python Interaction ✅
Well-defined boundaries:
- Scaffolds → Deliberation: The spec correctly positions Phase 2 enhancements as extensions to the existing _should_deliberate() gate (which already checks _DELIBERATION_FRAMES = {"decision", "debug"}). The scaffold doesn't create a parallel deliberation system — it enriches the existing one.
- Scaffolds → run_python: The spec explicitly states scaffolds influence what to compute in run_python, not how the sandbox executes. The run_python sandbox (in tools.py) has its own learn_fact wrapper with write caps — scaffolds don't bypass that. Clean boundary.
- Scaffolds → record_decision: The existing _check_safety_net() in runner.py already logs warnings when decision frames don't call record_decision. Scaffolds reinforce this at the prompt level (step 7: DECIDE → record_decision). Good layered defense.
One suggestion: Phase 2 proposes "check if scaffold reasoning steps were followed" in _should_deliberate(). Be careful here — _should_deliberate() currently returns a simple bool based on frame_id. Adding scaffold-step verification would change its contract from "should we deliberate?" to "is the deliberation well-formed?". Consider a separate validate_deliberation() method to keep should_deliberate clean.
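The suggested split could look like the sketch below; the class shape and the specific well-formedness checks are illustrative — only _DELIBERATION_FRAMES and the two method names come from the review:

```python
_DELIBERATION_FRAMES = {"decision", "debug"}

class DeliberationGate:
    def should_deliberate(self, frame_id: str) -> bool:
        # Keeps the original contract: a simple frame-based bool.
        return frame_id in _DELIBERATION_FRAMES

    def validate_deliberation(self, deliberation_text: str) -> list[str]:
        # Separate, advisory question: is the deliberation well-formed?
        # These checks are hypothetical examples, not spec requirements.
        warnings = []
        if "OPTIONS" not in deliberation_text:
            warnings.append("no explicit options enumerated")
        if len(deliberation_text) < 80:
            warnings.append("deliberation unusually short")
        return warnings
```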
6. Phase 3 Schema Review ⚠️
The reasoning_templates schema is reasonable, with two concerns:
- No compressed_form column. Turn 2+ compression currently relies on generating a 1-line summary from step names. If Phase 3 allows custom templates with long step names, auto-compression might produce oversized reminders. Add an optional compressed_form TEXT column so template authors can specify their own compressed version.
- Domain selection mechanism undefined. The spec says "domain-specific template selection based on intent signals" but doesn't specify what intent signals are or where they come from. Currently, FrameSelection has default_category and default_stakes — are these the intent signals? Clarifying this now prevents Phase 3 from requiring a new classification layer.
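A minimal sketch of what the reasoning_templates table could look like with the suggested compressed_form column; apart from compressed_form, the column names here are assumptions, not from the spec:

```python
import sqlite3

# Hypothetical shape of the Phase 3 reasoning_templates table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE reasoning_templates (
        id INTEGER PRIMARY KEY,
        frame_id TEXT NOT NULL,
        domain TEXT,               -- concrete domain value, not an agents FK
        steps TEXT NOT NULL,       -- full scaffold text (turn 1)
        compressed_form TEXT       -- optional author-supplied 1-line reminder
    )
    """
)
conn.execute(
    "INSERT INTO reasoning_templates (frame_id, domain, steps, compressed_form) "
    "VALUES (?, ?, ?, ?)",
    ("decision", "engineering", "1. RECALL ... 7. DECIDE", "Reminder: RECALL ... DECIDE."),
)
row = conn.execute(
    "SELECT compressed_form FROM reasoning_templates WHERE frame_id = ?",
    ("decision",),
).fetchone()
```

If compressed_form is NULL, the runtime can fall back to auto-compression from step names.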
7. Open Questions — My Input
- Should scaffolds be visible in the agent's response, or purely internal?
Purely internal (thinking block). Visible scaffolds would make responses feel robotic. The value is in structured thinking, not structured output.
- Template inheritance — should subtasks inherit the parent's scaffold?
Own scaffold based on frame_type. Parent scaffold context may not apply to a child subtask in a different frame. The current build_subtask_prefix() already takes frame_type independently — keep that pattern.
- Should Phase 2 fact quality checks become blocking?
Advisory for at least 2 release cycles, then evaluate based on false-positive rate. Blocking without tuning will cause frustrating fact storage failures.
- How should scaffold compression handle frame switches?
Reset turn counter on frame change (as the spec already proposes). If the user switches from conversation → decision → conversation → decision, the second decision entry gets the full scaffold again. This is correct — context may have shifted.
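The advisory-first approach to fact quality checks suggested above could be sketched as follows; the check names, thresholds, and the learn_fact_advisory helper are illustrative, not spec API:

```python
import logging

logger = logging.getLogger("nous.memory")

def quality_warnings(fact: str) -> list[str]:
    # Hypothetical quality heuristics; real checks would be tuned first.
    warnings = []
    if len(fact) < 15:
        warnings.append("fact may be too short to be useful")
    if fact.strip().endswith("?"):
        warnings.append("fact looks like a question, not a statement")
    return warnings

def learn_fact_advisory(fact: str, store: list) -> None:
    # Advisory period: log warnings, never block the write.
    for w in quality_warnings(fact):
        logger.warning("fact quality: %s (%r)", w, fact)
    store.append(fact)  # always stored until blocking is justified by data
```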
Summary
| Area | Rating | Key Finding |
|---|---|---|
| Architectural fit | ✅ Good | Clean extension of existing patterns |
| Coupling | ⚠️ | Two frame injection paths; scaffolds bypass budget system |
| Phase 1→3 path | ✅ Good | No throwaway work; clean migration |
| Token budget | ⚠️ | Compression strategy is sound; document unbudgeted overhead |
| Component interaction | ✅ Good | Well-defined boundaries between scaffolds, deliberation, run_python |
| Phase 3 schema | ⚠️ | Add compressed_form column; clarify domain selection signals |
Verdict: Ready to proceed to implementation with the above items addressed (or documented as intentional). No blocking architectural issues.
🏗️ ArchitectureReviewer
🔍 CorrectnessReviewer — Technical Accuracy Review

Reviewed the F014 spec against the current codebase.

✅ Correct Assumptions
🔴 P1 — Must Fix Before Implementation

1. Research scaffold won't inject

Research is not a cognitive frame, so the frame system can't inject its scaffold.

Fix options:
2.
Three viable options:
Option (a) is cleanest.

3. Heart file path wrong in Affected Files

The spec lists the wrong path for the Heart; the correct file is nous/nous/heart/heart.py.

🟡 P2 — Should Fix

4. The spec says Phase 2 will "check if the scaffold's reasoning steps were followed" in _should_deliberate(). The spec should specify which approach to take.

5. Fact quality "NOVEL" check has performance implications

The spec proposes a
6. Complexity gate is effectively a no-op

The gate says: skip scaffold when ALL of (frame is conversation/initiation) AND (message < 20 chars). But the design already says conversation and initiation frames get NO scaffold. So the message-length check never triggers for frames that would have scaffolds. Either:
7. Phase 3 schema uses an agents foreign key

Most of the codebase identifies agents by UUID.

🟢 P3 — Nice to Have

8. Open Question #4 is already answered by the design

The spec asks how compression handles frame switches, but the proposed design already resets the turn counter on frame change.

9. Config flag

The spec mentions gating behind a config flag.

10. Scaffold text module location not specified

Step 1.1 says "Create shared scaffold definitions" — but where? Options: a new dedicated module, or an existing file.

Summary
The spec is well-structured, and the core approach (appending scaffolds to frame instructions) is sound.
😈 Devil's Advocate Review — F014: Frame Reasoning Scaffolds

I'm here to poke holes. Some of these may have answers — that's the point. If they don't survive scrutiny, they shouldn't ship.

🎭 1. The Cargo Cult Risk (Critical)

LLMs are pattern-completion engines. When you inject a scaffold like
Question for the authors: How do you distinguish genuine scaffold-aided reasoning from mechanical form-filling? Phase 2's quality checks claim to detect this, but see concern #5.

📊 2. The Complexity Gate Is Miscalibrated (High)

The current gate logic:
This means:
The gate trusts the frame classifier to be perfect. But frame classification is probabilistic. Every misclassification now carries a heavier penalty — instead of just wrong tool availability, you get wrong tool availability AND an inappropriate reasoning scaffold inflating the response. Suggestion: Add a complexity heuristic within structured frames, not just at the frame boundary. Short, direct requests in any frame should get compressed or skipped scaffolds.
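Such an intra-frame heuristic could be quite small; the 30-char threshold, frame names, and return values in this sketch are assumptions:

```python
# Illustrative intra-frame complexity heuristic: even structured frames
# compress the scaffold for short, direct requests.
def scaffold_mode(frame_id: str, message: str) -> str:
    if frame_id in {"conversation", "initiation"}:
        return "none"          # these frames never get a scaffold
    if len(message) < 30 and "\n" not in message:
        return "compressed"    # short single-line request: 1-line reminder
    return "full"
```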
| # | Concern | Severity | TL;DR |
|---|---|---|---|
| 1 | Cargo cult reasoning | 🔴 Critical | Models will fill scaffold forms mechanically |
| 2 | Complexity gate too permissive | 🟠 High | Simple tasks get full scaffolds |
| 3 | Paper #10 warning unaddressed | 🟠 High | No model-capability or task-complexity adaptation |
| 4 | Phase 2 quality checks unfeasible | 🟠 High | Can't verify reasoning quality without creating perverse incentives |
| 5 | Redundant instructions | 🟡 Medium | Tool nudges + scaffold steps overlap |
| 6 | Self-referential metrics | 🟡 Medium | Measuring compliance, not quality |
| 7 | Token budget underestimate | 🟡 Medium | Output inflation not accounted for |
| 8 | Compression edge cases | 🟡 Low-Med | Topic changes within same frame |
| 9 | Phase 3 premature abstraction | 🟢 Low | Don't build DB until Phase 1 proves out |
| 10 | No A/B or rollback plan | 🟢 Low | Can't measure negative impact |
Constructive Recommendations
- Start with ONE frame (decision) as a pilot. Measure calibration improvement over 2 weeks before expanding.
- Merge tool nudges into scaffold steps instead of appending — eliminates redundancy.
- Add intra-frame complexity sensing — even in structured frames, trivial requests should skip/compress scaffolds.
- Define "success" before shipping — what specific calibration improvement justifies the token cost?
- Measure output token inflation — track average response length before/after scaffolds per frame.
- Drop Phase 2 quality checks or redesign them as advisory logging (which v1 already does), not enforcement. The cure is worse than the disease.
Signed: 😈 Devil's Advocate
To be clear — the core idea of reasoning scaffolds is sound. Paper #11 results are real. But the gap between "scaffolds help on benchmarks" and "scaffolds help an agentic system in production" is where this review lives. Ship the concept, but ship it carefully.
…ction, compliance removal

Changes from review:
- Must Fix 1: Research scaffold → build_subtask_prefix() (not frame system)
- Must Fix 2: Scaffolds REPLACE tool nudges, not layer on top
- Must Fix 3: Fixed file path to nous/nous/heart/heart.py
- Recommend 4: Decision-only pilot for 2 weeks with success criteria
- Recommend 5: Removed Phase 2 compliance checking (perverse incentives)
- Nice 6: Extracted to nous/nous/cognitive/scaffolds.py from day 1
- Nice 7: Added intra-frame complexity gate (question < 30 chars)
😈 Devil's Advocate (Round 2) — F014 v3 Re-Review

The v3 revision was written in response to Round 1 concerns. Let me score the revisions, then probe for new risks.

Scorecard: Were Round 1 Concerns Addressed?
Score: 2 fully addressed, 4 partially, 4 not addressed (1 regressed)

Deep Dive: What's Still Broken

🔴 Cargo Cult Risk — Better Mitigated, Not Eliminated

v3 makes the right moves:
But the scaffold also says: "don't skip RECALL or CALIBRATE"

This creates a targeted cargo cult for exactly these two steps. The model will always call them, needed or not.

The pilot will not catch this. Why? Because the success metrics don't measure whether RECALL and CALIBRATE steps were genuinely useful on a per-decision basis. They measure aggregate patterns (avg reason count, calibration error). A model that dutifully calls both steps every turn will look compliant on those metrics regardless.

Suggestion: Remove "don't skip RECALL or CALIBRATE." Let the model decide which steps to skip based on the situation. If you must mandate something, mandate the output artifact ("you must call record_decision"), not the intermediate steps.

🟠 Complexity Gate Still Only Covers One Frame

v3 adds:

```python
_ALWAYS_SCAFFOLD = {"decision", "debug", "task"}
_COMPLEXITY_GATED = {"question"}
```

So conversation/initiation get nothing, decision/debug/task are always scaffolded, and only question is length-gated. Short questions are simple, direct requests that don't need structured reasoning, yet the complexity gate only gatekeeps this easiest frame. The heaviest frames (decision=8 steps, debug=8 steps) are ungated.

Frame misclassification makes this worse. If the classifier puts a casual message in the task frame (which happens — the classifier isn't perfect), you get full scaffolding for a casual message.

Suggestion: Apply the 30-char gate to ALL frames, not just question. Or better: let the turn-1 scaffold include a preamble like "If this is a simple/direct request, skip the scaffold and respond directly." The model is smart enough to self-gate if given permission.

🟠 Output Token Inflation: The Elephant Not In the Room

v3 token budget:
This counts input only. But the scaffold instructs 8 explicit steps. The model will now write:
That's easily 400-800 output tokens per scaffolded decision, vs maybe 150-250 for an unscaffolded one. The output cost could be 2-3x the input scaffold cost. The success criterion says "No increase in scaffold-related token costs > 15% per decision turn" — but does this measure input+output? Just input? Just the scaffold injection? This needs to be explicit, and it should include output tokens.

🆕 New Risks Introduced by v3

NEW 1: No Kill Switch (Regression from v2)

v2 mentioned a config flag; v3's code path has none:

```python
def _get_frame_instructions(self, turn_context):
    scaffold = get_scaffold(frame_id, turn_in_frame, message_length)
    if scaffold:
        return scaffold  # No way to disable this except removing the code
```

If scaffolds cause problems in production, the only way to turn them off is a code change + deploy. That's a 30-minute recovery vs a 30-second env var change.

Suggestion: Re-add the config flag.

NEW 2: Phase 1.5 Expands 3 Frames With No Independent Success Criteria

v3 says:
Good gate. But then Phase 1.5 adds debug, task, and question scaffolds with zero independent measurement criteria. How do you know the debug scaffold works? The task scaffold? You're measuring decision-frame calibration, then extrapolating to debug-frame performance. Debug frame success looks different from decision frame success. You need: root cause identification speed, fix success rate, regression rate. Task frame success: completion rate, accuracy, user satisfaction. These aren't in the spec.

Suggestion: Each frame scaffold needs its own success criteria before expansion. At minimum: token cost delta + one quality metric specific to that frame's purpose.

NEW 3: Research Scaffold Omits Memory Recall

The Research subtask scaffold says:
Notice what's missing: no RECALL step. The decision scaffold has RECALL; the research scaffold doesn't. This is exactly the kind of subtle omission that scaffolds are supposed to prevent.

Suggestion: Add a RECALL step (step 1.5) before SEARCH.

NEW 4: No Failure Criteria (What Kills the Project?)

The success criteria say what good looks like. But there's no definition of what bad looks like:
Without explicit failure criteria, the pilot can always be declared a success by cherry-picking the metric that improved. Suggestion: Add explicit failure criteria:
What v3 Got Right (Credit Where Due)
Remaining Action Items
Bottom Line

v3 is a meaningful improvement over v2. The decision-only pilot, replace-not-layer approach, and compliance check removal directly address the most actionable Round 1 concerns. The spec demonstrates genuine engagement with the criticism.

But the harder problems — cargo cult detection, output inflation measurement, kill-switch infrastructure, and failure criteria — were either partially addressed or not addressed. These are harder because they're harder, not because they were ignored. Some (like cargo cult detection) may not have clean solutions.

My recommendation: Ship the decision-frame pilot with the 🔴 items above fixed. The two high-priority fixes are small (re-add config flag, soften the RECALL/CALIBRATE mandate). Then let the 2-week pilot generate real data before over-engineering solutions to theoretical problems.

The spec is 80% there. That last 20% is the difference between "we shipped scaffolds" and "we shipped scaffolds that we can actually evaluate and roll back."

Signed: 😈 Devil's Advocate (Round 2)

The best part of this revision is what was removed (compliance checking). Knowing what NOT to build is an underrated skill.
🔍 CorrectnessReviewer (Round 2) — v3 Spec Re-Review

Re-reviewed the v3 spec against the current codebase.

Round 1 P1 Resolution Status
Round 1 P2 Resolution Status
🟡 NEW P2 Issues in v3

P2-1: Affected files table references the wrong file for TurnContext

The affected files table points at context.py, but TurnContext is defined in schemas.py. As-is, the table will send an implementer to the wrong file.

P2-2: The spec's integration code computes:

message_length=len(turn_context.user_message or ""),

But turn_context has no user_message attribute.

Fix options:

Option (b) is cleanest.

P2-3: v3 says to add per-frame turn tracking but doesn't specify which component owns it.

Concrete suggestion (runner.py owns both fields):
Concrete suggestion: Add if frame.frame_id != meta.current_frame_id:
meta.current_frame_id = frame.frame_id
meta.frame_turn_count = 1
else:
meta.frame_turn_count += 1Then pass 🟢 P3 — Minor / InformationalP3-1: Phase 1 shows a complete P3-2: Scaffold REPLACE semantics are clean v3's approach of scaffolds replacing P3-3: Open Question 3 answer is implied by P2-3 fix Open Question 3 ("How should scaffold compression handle frame switches?") proposes "Reset turn count per frame switch." The P2-3 implementation above naturally handles this — Summary
Overall: v3 is a significant improvement over v2. The major architectural concerns (research injection, compliance checking, complexity gate) are all properly resolved. The remaining P2 issues are implementation-path clarity items — they won't cause silent failures, but an implementer will need to make decisions the spec doesn't currently specify. The P2-3 fix suggestion above (~6 lines) would close the last open P1 from Round 1.

Recommendation: Address P2-1 (wrong file reference) and P2-2 (message_length integration).
🏗️ Architecture Review (Round 2) — F014 v3: Frame Reasoning Scaffolds

Overall Assessment: ✅ Ready for implementation

The v3 revision addressed the critical concerns from Round 1 cleanly. The decision to pilot Decision frame first, remove compliance checking, and merge tool nudges into scaffolds (rather than layering) shows good architectural judgment. Two minor items remain for documentation; no blockers.

Round 1 Concern Resolution Scorecard
Score: 8/11 fully resolved, 2/11 mostly resolved, 0 unresolved. Strong revision.

§1. Remaining: Budget Documentation (Minor)

The dual injection path still exists — scaffolds flow through _get_frame_instructions(), outside the budget system.

Recommendation: Add a one-line note at the injection site:

```python
# Scaffolds are fixed overhead (~280 tok turn 1, ~45 tok turn 2+),
# intentionally outside ContextBudget. See F014 token budget analysis.
```

This is a documentation task, not an architectural concern. The actual numbers are well within bounds (2-4% of frame budget).

§2. Remaining: Output Token Inflation (Minor)

The token budget section counts input costs only. The Devil's Advocate flagged that scaffold-structured responses will be longer (model writes through each step). This is inherent to the approach and hard to estimate upfront, but the spec should acknowledge it:

Recommendation: Add a line to the Token Budget section:
§3.
| Area | Round 1 | Round 2 | Notes |
|---|---|---|---|
| Architectural fit | ✅ | ✅ | Clean extension point, proper separation |
| Coupling / injection paths | ⚠️ | ✅ (with doc note) | Replace semantics resolve overlap; document budget bypass |
| Phase progression | ✅ | ✅ | Even cleaner with Decision-only pilot |
| Token budget | ⚠️ | ✅ (with doc note) | Input costs well-analyzed; acknowledge output inflation |
| Component interaction | ✅ | ✅ | Compliance checking removal was the right call |
| Schema (Phase 2) | ⚠️ | ✅ | compressed_form column added, domain values concrete |
| Pilot/rollout | N/A | ✅ | Measure-first approach; collect baseline before starting |
Verdict: No architectural blockers. Three minor documentation items (§1, §2, §3) and one pre-pilot task (baseline metrics). All can be addressed during implementation without spec revision.
🏗️ ArchitectureReviewer (Round 2)
…re criteria

Must fixes:
- Re-added NOUS_REASONING_SCAFFOLDS env var kill switch (regression from v2)
- Fixed turn_context.user_message → message_length (user_message doesn't exist)

Should fixes:
- Affected files: context.py → schemas.py for TurnContext
- Turn tracking site clarified: runner.py owns both fields
- Research scaffold: added RECALL step before SEARCH

Nice to have:
- Failure criteria alongside success metrics
- Baseline metrics promoted to concrete pre-pilot task
- Open question #4 removed (now addressed in spec)
Summary
Add structured reasoning templates (semi-formal scaffolds) to each of Nous's cognitive frames.
Motivation
Paper #11 showed structured reasoning templates improve accuracy by up to 30%. Each frame should supply a reasoning scaffold — not telling the agent what to conclude, but what steps to take.
Phases
CSTP Extension
Research Basis
Papers #10, #11 from the LLM agent memory/cognition collection
Files