Conversation
- Semi-formal reasoning templates per cognitive frame
- Memory quality gates for learn_fact and record_decision
- Complexity gate to skip scaffolds for trivial interactions
- Future: DB-driven templates and CSTP extension
- Inspired by Paper #11 (Semi-Formal Reasoning for Code Analysis)
Changes:
- Drop conversation + initiation scaffolds (keep natural)
- Rename Research Frame → Research Subtask Scaffold (inject via build_subtask_prefix)
- Soften OPTIONS from '2 alternatives required' to 'enumerate if viable'
- Add turn-aware scaffold compression (full → 1-line reminder)
- Merge Phase 2 into existing Brain validation (no parallel prompt gates)
- Fix complexity gate: skip only conversation/initiation, structured frames always scaffold
- Fix Phase 3 schema: remove agents FK, define concrete domain values
- Add run_python interaction section
- Add token efficiency metric
- Update revision history with all changes
🏗️ Architecture Review — F014: Frame Reasoning Scaffolds

Overall Assessment: ✅ Sound design, a few structural concerns to address

The phased approach is well thought out and the spec demonstrates strong understanding of the existing architecture. The core idea — injecting structured reasoning scaffolds per frame — is architecturally clean and aligns with Nous's frame-centric design. Below are my detailed findings.

1. Architectural Fit ✅

Good alignment with existing patterns:
2. Coupling & Separation of Concerns
| Component | Current | With Scaffold (Turn 1) | With Scaffold (Turn 2+) |
|---|---|---|---|
| Frame (budgeted) | ~80 tok | ~80 tok | ~80 tok |
| Tool instructions (unbudgeted) | ~100 tok | ~100 tok | ~100 tok |
| Scaffold (unbudgeted) | 0 tok | ~250 tok | ~40 tok |
| Total frame-related | ~180 tok | ~430 tok | ~220 tok |
The Turn 1 cost (+250 tokens) is a ~138% increase in frame-related prompt overhead. For the decision frame with a 12K total budget, this is ~2% — acceptable. For conversation (3K budget, no scaffold) — N/A. For question (6K budget) — ~4% — fine.
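The percentages above can be sanity-checked directly from the table; a minimal arithmetic sketch (token counts from the table, budgets from the prose):

```python
# Quick check of the overhead figures above.
baseline = 180          # frame-related tokens without scaffold
turn1 = 430             # frame-related tokens with full scaffold on turn 1
scaffold_turn1 = 250    # scaffold-only cost on turn 1

increase = (turn1 - baseline) / baseline      # ~1.39, i.e. the ~138% increase
decision_share = scaffold_turn1 / 12_000      # ~2% of the 12K decision budget
question_share = scaffold_turn1 / 6_000       # ~4% of the 6K question budget
```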
The compression strategy (65% reduction by turn 2) is the right call. Multi-turn cost stays manageable.
One gap: The spec doesn't address what happens when budget.frame truncation in ContextEngine cuts the frame description short. If truncation removes questions_to_ask, the scaffold might reference reasoning steps that conflict with truncated frame guidance. This is unlikely with current 500-token frame budgets but worth a comment.
5. Scaffold ↔ Deliberation ↔ run_python Interaction ✅
Well-defined boundaries:
- Scaffolds → Deliberation: The spec correctly positions Phase 2 enhancements as extensions to the existing _should_deliberate() gate (which already checks _DELIBERATION_FRAMES = {"decision", "debug"}). The scaffold doesn't create a parallel deliberation system — it enriches the existing one.
- Scaffolds → run_python: The spec explicitly states scaffolds influence what to compute in run_python, not how the sandbox executes. The run_python sandbox (in tools.py) has its own learn_fact wrapper with write caps — scaffolds don't bypass that. Clean boundary.
- Scaffolds → record_decision: The existing _check_safety_net() in runner.py already logs warnings when decision frames don't call record_decision. Scaffolds reinforce this at the prompt level (step 7: DECIDE → record_decision). Good layered defense.
One suggestion: Phase 2 proposes "check if scaffold reasoning steps were followed" in _should_deliberate(). Be careful here — _should_deliberate() currently returns a simple bool based on frame_id. Adding scaffold-step verification would change its contract from "should we deliberate?" to "is the deliberation well-formed?". Consider a separate validate_deliberation() method to keep should_deliberate clean.
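The suggested split could look like the sketch below; the class shape and the specific well-formedness checks are illustrative — only _DELIBERATION_FRAMES and the two method names come from the review:

```python
_DELIBERATION_FRAMES = {"decision", "debug"}

class DeliberationGate:
    def should_deliberate(self, frame_id: str) -> bool:
        # Keeps the original contract: a simple frame-based bool.
        return frame_id in _DELIBERATION_FRAMES

    def validate_deliberation(self, deliberation_text: str) -> list[str]:
        # Separate, advisory question: is the deliberation well-formed?
        # These checks are hypothetical examples, not spec requirements.
        warnings = []
        if "OPTIONS" not in deliberation_text:
            warnings.append("no explicit options enumerated")
        if len(deliberation_text) < 80:
            warnings.append("deliberation unusually short")
        return warnings
```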
6. Phase 3 Schema Review ⚠️
The reasoning_templates schema is reasonable, with two concerns:
- No compressed_form column. Turn 2+ compression currently relies on generating a 1-line summary from step names. If Phase 3 allows custom templates with long step names, auto-compression might produce oversized reminders. Add an optional compressed_form TEXT column so template authors can specify their own compressed version.
- Domain selection mechanism undefined. The spec says "domain-specific template selection based on intent signals" but doesn't specify what intent signals are or where they come from. Currently, FrameSelection has default_category and default_stakes — are these the intent signals? Clarifying this now prevents Phase 3 from requiring a new classification layer.
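A minimal sketch of what the reasoning_templates table could look like with the suggested compressed_form column; apart from compressed_form, the column names here are assumptions, not from the spec:

```python
import sqlite3

# Hypothetical shape of the Phase 3 reasoning_templates table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE reasoning_templates (
        id INTEGER PRIMARY KEY,
        frame_id TEXT NOT NULL,
        domain TEXT,               -- concrete domain value, not an agents FK
        steps TEXT NOT NULL,       -- full scaffold text (turn 1)
        compressed_form TEXT       -- optional author-supplied 1-line reminder
    )
    """
)
conn.execute(
    "INSERT INTO reasoning_templates (frame_id, domain, steps, compressed_form) "
    "VALUES (?, ?, ?, ?)",
    ("decision", "engineering", "1. RECALL ... 7. DECIDE", "Reminder: RECALL ... DECIDE."),
)
row = conn.execute(
    "SELECT compressed_form FROM reasoning_templates WHERE frame_id = ?",
    ("decision",),
).fetchone()
```

If compressed_form is NULL, the runtime can fall back to auto-compression from step names.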
7. Open Questions — My Input
- Should scaffolds be visible in the agent's response, or purely internal?
Purely internal (thinking block). Visible scaffolds would make responses feel robotic. The value is in structured thinking, not structured output.
- Template inheritance — should subtasks inherit the parent's scaffold?
Own scaffold based on frame_type. Parent scaffold context may not apply to a child subtask in a different frame. The current build_subtask_prefix() already takes frame_type independently — keep that pattern.
- Should Phase 2 fact quality checks become blocking?
Advisory for at least 2 release cycles, then evaluate based on false-positive rate. Blocking without tuning will cause frustrating fact storage failures.
- How should scaffold compression handle frame switches?
Reset turn counter on frame change (as the spec already proposes). If the user switches from conversation → decision → conversation → decision, the second decision entry gets the full scaffold again. This is correct — context may have shifted.
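The advisory-first approach to fact quality checks suggested above could be sketched as follows; the check names, thresholds, and the learn_fact_advisory helper are illustrative, not spec API:

```python
import logging

logger = logging.getLogger("nous.memory")

def quality_warnings(fact: str) -> list[str]:
    # Hypothetical quality heuristics; real checks would be tuned first.
    warnings = []
    if len(fact) < 15:
        warnings.append("fact may be too short to be useful")
    if fact.strip().endswith("?"):
        warnings.append("fact looks like a question, not a statement")
    return warnings

def learn_fact_advisory(fact: str, store: list) -> None:
    # Advisory period: log warnings, never block the write.
    for w in quality_warnings(fact):
        logger.warning("fact quality: %s (%r)", w, fact)
    store.append(fact)  # always stored until blocking is justified by data
```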
Summary
| Area | Rating | Key Finding |
|---|---|---|
| Architectural fit | ✅ Good | Clean extension of existing patterns |
| Coupling | ⚠️ | Two frame injection paths; scaffolds bypass budget system |
| Phase 1→3 path | ✅ Good | No throwaway work; clean migration |
| Token budget | ⚠️ | Compression strategy is sound; document unbudgeted overhead |
| Component interaction | ✅ Good | Well-defined boundaries between scaffolds, deliberation, run_python |
| Phase 3 schema | ⚠️ | Add compressed_form column; clarify domain selection signals |
Verdict: Ready to proceed to implementation with the above items addressed (or documented as intentional). No blocking architectural issues.
🏗️ ArchitectureReviewer
🔍 CorrectnessReviewer — Technical Accuracy Review

Reviewed the F014 spec against the current codebase.

✅ Correct Assumptions
🔴 P1 — Must Fix Before Implementation

1. Research scaffold won't inject

Research is not a cognitive frame, so the frame system can't inject its scaffold.

Fix options:
2.
Three viable options:
Option (a) is cleanest.

3. Heart file path wrong in Affected Files

The spec lists the wrong path for the Heart; the correct file is nous/nous/heart/heart.py.

🟡 P2 — Should Fix

4. The spec says Phase 2 will "check if the scaffold's reasoning steps were followed" in _should_deliberate(). The spec should specify which approach to take.

5. Fact quality "NOVEL" check has performance implications

The spec proposes a
6. Complexity gate is effectively a no-op

The gate says: skip scaffold when ALL of (frame is conversation/initiation) AND (message < 20 chars). But the design already says conversation and initiation frames get NO scaffold. So the message-length check never triggers for frames that would have scaffolds. Either:
7. Phase 3 schema uses an agents foreign key

Most of the codebase identifies agents by UUID.

🟢 P3 — Nice to Have

8. Open Question #4 is already answered by the design

The spec asks how compression handles frame switches, but the proposed design already resets the turn counter on frame change.

9. Config flag

The spec mentions gating behind a config flag.

10. Scaffold text module location not specified

Step 1.1 says "Create shared scaffold definitions" — but where? Options: a new dedicated module, or an existing file.

Summary
The spec is well-structured, and the core approach (appending scaffolds to frame instructions) is sound.
😈 Devil's Advocate Review — F014: Frame Reasoning Scaffolds

I'm here to poke holes. Some of these may have answers — that's the point. If they don't survive scrutiny, they shouldn't ship.

🎭 1. The Cargo Cult Risk (Critical)

LLMs are pattern-completion engines. When you inject a scaffold like
Question for the authors: How do you distinguish genuine scaffold-aided reasoning from mechanical form-filling? Phase 2's quality checks claim to detect this, but see concern #5.

📊 2. The Complexity Gate Is Miscalibrated (High)

The current gate logic:
This means:
The gate trusts the frame classifier to be perfect. But frame classification is probabilistic. Every misclassification now carries a heavier penalty — instead of just wrong tool availability, you get wrong tool availability AND an inappropriate reasoning scaffold inflating the response. Suggestion: Add a complexity heuristic within structured frames, not just at the frame boundary. Short, direct requests in any frame should get compressed or skipped scaffolds.
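Such an intra-frame heuristic could be quite small; the 30-char threshold, frame names, and return values in this sketch are assumptions:

```python
# Illustrative intra-frame complexity heuristic: even structured frames
# compress the scaffold for short, direct requests.
def scaffold_mode(frame_id: str, message: str) -> str:
    if frame_id in {"conversation", "initiation"}:
        return "none"          # these frames never get a scaffold
    if len(message) < 30 and "\n" not in message:
        return "compressed"    # short single-line request: 1-line reminder
    return "full"
```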
| # | Concern | Severity | TL;DR |
|---|---|---|---|
| 1 | Cargo cult reasoning | 🔴 Critical | Models will fill scaffold forms mechanically |
| 2 | Complexity gate too permissive | 🟠 High | Simple tasks get full scaffolds |
| 3 | Paper #10 warning unaddressed | 🟠 High | No model-capability or task-complexity adaptation |
| 4 | Phase 2 quality checks unfeasible | 🟠 High | Can't verify reasoning quality without creating perverse incentives |
| 5 | Redundant instructions | 🟡 Medium | Tool nudges + scaffold steps overlap |
| 6 | Self-referential metrics | 🟡 Medium | Measuring compliance, not quality |
| 7 | Token budget underestimate | 🟡 Medium | Output inflation not accounted for |
| 8 | Compression edge cases | 🟡 Low-Med | Topic changes within same frame |
| 9 | Phase 3 premature abstraction | 🟢 Low | Don't build DB until Phase 1 proves out |
| 10 | No A/B or rollback plan | 🟢 Low | Can't measure negative impact |
Constructive Recommendations
- Start with ONE frame (decision) as a pilot. Measure calibration improvement over 2 weeks before expanding.
- Merge tool nudges into scaffold steps instead of appending — eliminates redundancy.
- Add intra-frame complexity sensing — even in structured frames, trivial requests should skip/compress scaffolds.
- Define "success" before shipping — what specific calibration improvement justifies the token cost?
- Measure output token inflation — track average response length before/after scaffolds per frame.
- Drop Phase 2 quality checks or redesign them as advisory logging (which v1 already does), not enforcement. The cure is worse than the disease.
Signed: 😈 Devil's Advocate
To be clear — the core idea of reasoning scaffolds is sound. Paper #11 results are real. But the gap between "scaffolds help on benchmarks" and "scaffolds help an agentic system in production" is where this review lives. Ship the concept, but ship it carefully.
…ction, compliance removal

Changes from review:
- Must Fix 1: Research scaffold → build_subtask_prefix() (not frame system)
- Must Fix 2: Scaffolds REPLACE tool nudges, not layer on top
- Must Fix 3: Fixed file path to nous/nous/heart/heart.py
- Recommend 4: Decision-only pilot for 2 weeks with success criteria
- Recommend 5: Removed Phase 2 compliance checking (perverse incentives)
- Nice 6: Extracted to nous/nous/cognitive/scaffolds.py from day 1
- Nice 7: Added intra-frame complexity gate (question < 30 chars)
😈 Devil's Advocate (Round 2) — F014 v3 Re-Review

The v3 revision was written in response to Round 1 concerns. Let me score the revisions, then probe for new risks.

Scorecard: Were Round 1 Concerns Addressed?
Score: 2 fully addressed, 4 partially, 4 not addressed (1 regressed)

Deep Dive: What's Still Broken

🔴 Cargo Cult Risk — Better Mitigated, Not Eliminated

v3 makes the right moves:
But the scaffold also says: "don't skip RECALL or CALIBRATE"

This creates a targeted cargo cult for exactly these two steps. The model will always call them, needed or not.

The pilot will not catch this. Why? Because the success metrics don't measure whether RECALL and CALIBRATE steps were genuinely useful on a per-decision basis. They measure aggregate patterns (avg reason count, calibration error). A model that dutifully calls both steps every turn will look compliant on those metrics regardless.

Suggestion: Remove "don't skip RECALL or CALIBRATE." Let the model decide which steps to skip based on the situation. If you must mandate something, mandate the output artifact ("you must call record_decision"), not the intermediate steps.

🟠 Complexity Gate Still Only Covers One Frame

v3 adds:

```python
_ALWAYS_SCAFFOLD = {"decision", "debug", "task"}
_COMPLEXITY_GATED = {"question"}
```

So conversation/initiation get nothing, decision/debug/task are always scaffolded, and only question is length-gated. Short questions are simple, direct requests that don't need structured reasoning, yet the complexity gate only gatekeeps this easiest frame. The heaviest frames (decision=8 steps, debug=8 steps) are ungated.

Frame misclassification makes this worse. If the classifier puts a casual message in the task frame (which happens — the classifier isn't perfect), you get full scaffolding for a casual message.

Suggestion: Apply the 30-char gate to ALL frames, not just question. Or better: let the turn-1 scaffold include a preamble like "If this is a simple/direct request, skip the scaffold and respond directly." The model is smart enough to self-gate if given permission.

🟠 Output Token Inflation: The Elephant Not In the Room

v3 token budget:
This counts input only. But the scaffold instructs 8 explicit steps. The model will now write:
That's easily 400-800 output tokens per scaffolded decision, vs maybe 150-250 for an unscaffolded one. The output cost could be 2-3x the input scaffold cost. The success criterion says "No increase in scaffold-related token costs > 15% per decision turn" — but does this measure input+output? Just input? Just the scaffold injection? This needs to be explicit, and it should include output tokens.

🆕 New Risks Introduced by v3

NEW 1: No Kill Switch (Regression from v2)

v2 mentioned a config flag; v3's code path has none:

```python
def _get_frame_instructions(self, turn_context):
    scaffold = get_scaffold(frame_id, turn_in_frame, message_length)
    if scaffold:
        return scaffold  # No way to disable this except removing the code
```

If scaffolds cause problems in production, the only way to turn them off is a code change + deploy. That's a 30-minute recovery vs a 30-second env var change.

Suggestion: Re-add the config flag.

NEW 2: Phase 1.5 Expands 3 Frames With No Independent Success Criteria

v3 says:
Good gate. But then Phase 1.5 adds debug, task, and question scaffolds with zero independent measurement criteria. How do you know the debug scaffold works? The task scaffold? You're measuring decision-frame calibration, then extrapolating to debug-frame performance. Debug frame success looks different from decision frame success. You need: root cause identification speed, fix success rate, regression rate. Task frame success: completion rate, accuracy, user satisfaction. These aren't in the spec.

Suggestion: Each frame scaffold needs its own success criteria before expansion. At minimum: token cost delta + one quality metric specific to that frame's purpose.

NEW 3: Research Scaffold Omits Memory Recall

The Research subtask scaffold says:
Notice what's missing: no RECALL step. The decision scaffold has RECALL; the research scaffold doesn't. This is exactly the kind of subtle omission that scaffolds are supposed to prevent.

Suggestion: Add a RECALL step (step 1.5) before SEARCH.

NEW 4: No Failure Criteria (What Kills the Project?)

The success criteria say what good looks like. But there's no definition of what bad looks like:
Without explicit failure criteria, the pilot can always be declared a success by cherry-picking the metric that improved. Suggestion: Add explicit failure criteria:
What v3 Got Right (Credit Where Due)
Remaining Action Items
Bottom Line

v3 is a meaningful improvement over v2. The decision-only pilot, replace-not-layer approach, and compliance check removal directly address the most actionable Round 1 concerns. The spec demonstrates genuine engagement with the criticism.

But the harder problems — cargo cult detection, output inflation measurement, kill-switch infrastructure, and failure criteria — were either partially addressed or not addressed. These are harder because they're harder, not because they were ignored. Some (like cargo cult detection) may not have clean solutions.

My recommendation: Ship the decision-frame pilot with the 🔴 items above fixed. The two high-priority fixes are small (re-add config flag, soften the RECALL/CALIBRATE mandate). Then let the 2-week pilot generate real data before over-engineering solutions to theoretical problems.

The spec is 80% there. That last 20% is the difference between "we shipped scaffolds" and "we shipped scaffolds that we can actually evaluate and roll back."

Signed: 😈 Devil's Advocate (Round 2)

The best part of this revision is what was removed (compliance checking). Knowing what NOT to build is an underrated skill.
🔍 CorrectnessReviewer (Round 2) — v3 Spec Re-Review

Re-reviewed the v3 spec against the current codebase.

Round 1 P1 Resolution Status
Round 1 P2 Resolution Status
🟡 NEW P2 Issues in v3

P2-1: Affected files table references the wrong file for TurnContext

The affected files table points at context.py, but TurnContext is defined in schemas.py. As-is, the table will send an implementer to the wrong file.

P2-2: The spec's integration code computes:

message_length=len(turn_context.user_message or ""),

But turn_context has no user_message attribute.

Fix options:

Option (b) is cleanest.

P2-3: v3 says to add per-frame turn tracking but doesn't specify which component owns it.

Concrete suggestion (runner.py owns both fields):
Concrete suggestion: Add if frame.frame_id != meta.current_frame_id:
meta.current_frame_id = frame.frame_id
meta.frame_turn_count = 1
else:
meta.frame_turn_count += 1Then pass 🟢 P3 — Minor / InformationalP3-1: Phase 1 shows a complete P3-2: Scaffold REPLACE semantics are clean v3's approach of scaffolds replacing P3-3: Open Question 3 answer is implied by P2-3 fix Open Question 3 ("How should scaffold compression handle frame switches?") proposes "Reset turn count per frame switch." The P2-3 implementation above naturally handles this — Summary
Overall: v3 is a significant improvement over v2. The major architectural concerns (research injection, compliance checking, complexity gate) are all properly resolved. The remaining P2 issues are implementation-path clarity items — they won't cause silent failures, but an implementer will need to make decisions the spec doesn't currently specify. The P2-3 fix suggestion above (~6 lines) would close the last open P1 from Round 1.

Recommendation: Address P2-1 (wrong file reference) and P2-2 (message_length integration).
🏗️ Architecture Review (Round 2) — F014 v3: Frame Reasoning Scaffolds

Overall Assessment: ✅ Ready for implementation

The v3 revision addressed the critical concerns from Round 1 cleanly. The decision to pilot Decision frame first, remove compliance checking, and merge tool nudges into scaffolds (rather than layering) shows good architectural judgment. Two minor items remain for documentation; no blockers.

Round 1 Concern Resolution Scorecard
Score: 8/11 fully resolved, 2/11 mostly resolved, 0 unresolved. Strong revision.

§1. Remaining: Budget Documentation (Minor)

The dual injection path still exists — scaffolds flow through _get_frame_instructions(), outside the budget system.

Recommendation: Add a one-line note at the injection site:

```python
# Scaffolds are fixed overhead (~280 tok turn 1, ~45 tok turn 2+),
# intentionally outside ContextBudget. See F014 token budget analysis.
```

This is a documentation task, not an architectural concern. The actual numbers are well within bounds (2-4% of frame budget).

§2. Remaining: Output Token Inflation (Minor)

The token budget section counts input costs only. The Devil's Advocate flagged that scaffold-structured responses will be longer (model writes through each step). This is inherent to the approach and hard to estimate upfront, but the spec should acknowledge it:

Recommendation: Add a line to the Token Budget section:
§3.
| Area | Round 1 | Round 2 | Notes |
|---|---|---|---|
| Architectural fit | ✅ | ✅ | Clean extension point, proper separation |
| Coupling / injection paths | ⚠️ | ✅ (with doc note) | Replace semantics resolve overlap; document budget bypass |
| Phase progression | ✅ | ✅ | Even cleaner with Decision-only pilot |
| Token budget | ⚠️ | ✅ (with doc note) | Input costs well-analyzed; acknowledge output inflation |
| Component interaction | ✅ | ✅ | Compliance checking removal was the right call |
| Schema (Phase 2) | ⚠️ | ✅ | compressed_form column added, domain values concrete |
| Pilot/rollout | N/A | ✅ | Measure-first approach; collect baseline before starting |
Verdict: No architectural blockers. Three minor documentation items (§1, §2, §3) and one pre-pilot task (baseline metrics). All can be addressed during implementation without spec revision.
🏗️ ArchitectureReviewer (Round 2)
…re criteria

Must fixes:
- Re-added NOUS_REASONING_SCAFFOLDS env var kill switch (regression from v2)
- Fixed turn_context.user_message → message_length (user_message doesn't exist)

Should fixes:
- Affected files: context.py → schemas.py for TurnContext
- Turn tracking site clarified: runner.py owns both fields
- Research scaffold: added RECALL step before SEARCH

Nice to have:
- Failure criteria alongside success metrics
- Baseline metrics promoted to concrete pre-pilot task
- Open question #4 removed (now addressed in spec)
Summary
Add structured reasoning templates (semi-formal scaffolds) to each of Nous's cognitive frames.
Motivation
Paper #11 showed structured reasoning templates improve accuracy by up to 30%. Each frame should supply a reasoning scaffold — not telling the agent what to conclude, but what steps to take.
Phases
CSTP Extension
Research Basis
Papers #10, #11 from the LLM agent memory/cognition collection
Files