Skip to content

chore: sync public OSS from private main (2026-04-03)#13

Closed
abbudjoe wants to merge 1 commit intomainfrom
chore/release-sync-2026-04-03
Closed

chore: sync public OSS from private main (2026-04-03)#13
abbudjoe wants to merge 1 commit intomainfrom
chore/release-sync-2026-04-03

Conversation

@abbudjoe
Copy link
Copy Markdown
Contributor

@abbudjoe abbudjoe commented Apr 3, 2026

Syncs 164 commits of changes from the private downstream repo.

Highlights

  • Step 13 refactor (slices 13.1-13.4b): streaming bridge, retry policy, request builder, compaction extraction
  • Detached headless stabilization: replay-safe tool history, structured evidence, workspace-root rebinding
  • Skill lifecycle release fixes: sign command, built/installed/server-loaded semantics, docs alignment, e2e verification
  • Chat attachments support
  • Tool error surfacing + loop hardening
  • Backlog cleanup: roadmap reconciliation

Cleanup

  • llama-cpp-sys removed from workspace (private-only crate)
  • cfg(feature = "llama-cpp") gates stripped from fx-llm
  • Author metadata normalized
  • docs/architecture excluded (proprietary)

Syncs 164 commits worth of changes from private main:
- Step 13 refactor (streaming bridge, retry policy, request builder, compaction extraction)
- Detached headless stabilization
- Skill lifecycle release fixes (sign command, semantics, docs, e2e verification)
- Chat attachments support
- Tool error surfacing + loop hardening
- Backlog cleanup and roadmap reconciliation
@abbudjoe abbudjoe force-pushed the chore/release-sync-2026-04-03 branch from ab19708 to 0b458bb Compare April 3, 2026 03:02
@abbudjoe abbudjoe closed this Apr 3, 2026
@abbudjoe abbudjoe deleted the chore/release-sync-2026-04-03 branch April 3, 2026 03:04
abbudjoe added a commit that referenced this pull request Apr 8, 2026
Critical fixes:
- Replace unwrap() with expect() in create_user_input() helper (issue #1)
- Add comprehensive documentation that WASM tests verify infrastructure only, not execution (issue #2)
- Add TODO comments referencing PR #179 for real WASM runtime (issue #2)
- Add timeout and multi_thread flavor to concurrent audit test to prevent deadlocks (issue #3)
- Fix tautological assertions in edge case tests - now verify specific behavior (issue #4)

High priority fixes:
- Extend MockLlmProvider with 4 error types: ServiceUnavailable, RateLimitExceeded, Timeout, MalformedResponse (issue #5)
- Add comprehensive doc comments to MockLlmProvider explaining matching strategy and thread-safety (issue #12)
- Strengthen prompt injection test to require IntentCategory::Conversation (issue #6)
- Add test_audit_hash_chain_tampering_detection test (issue #7)
- Add test_skill_network_capability_denied test for runtime capability enforcement (issue #8)

Medium priority fixes:
- Extract all magic number encryption keys to named constants (issue #9)
- Add comment in Cargo.toml explaining intentional E2E test dependencies (issue #10)
- Make retry backoff timing test more robust with 80ms threshold instead of 100ms (issue #11)

Low priority improvements:
- Rename test_policy_allow_with_confirmation → test_policy_requires_confirmation_for_destructive_actions (issue #13)
- Add task IDs to concurrent audit test events for better debugging (issue #14)
- Remove unused test_storage_round_trip helper (was addressing issue #15 but simplified instead)

All tests pass (28/28), clippy clean with -D warnings, formatted with rustfmt.
abbudjoe added a commit that referenced this pull request Apr 8, 2026
* docs: implementation spec for #1102 — kernel/loadable boundary tests

* docs: implementation spec for #1104 — budget attack surface analysis

* fix(docs): add issue links to #1104 spec exclusions

* fix(docs): address R1 review comments on #1104 spec

* feat(security): update security model with budget attack surface (#1104)

* fix(security): add surface #13 defense posture row, remove out-of-scope spec (#1104)
abbudjoe added a commit that referenced this pull request Apr 8, 2026
* feat: wire kernel streaming to TUI for token-by-token rendering (#930) (#1095)

* feat: wire kernel streaming to TUI for token-by-token rendering (#930)

Add StreamRenderer component that subscribes to EventBus streaming events
and renders LLM output token-by-token as it arrives. The existing batch
rendering path continues to work for providers that don't support streaming.

Changes:
- StreamRenderer struct with handle_started/handle_delta/handle_finished
- run_with_streaming_display replaces the thinking spinner when streaming
  events arrive (spinner shows until first StreamDelta, then switches to
  token-by-token output)
- EventBus created in build_loop_engine_with_config and wired through
  TuiApp constructor
- handle_message returns Option<String> — None when streaming rendered
  inline, Some(rendered) for batch display
- Multi-phase support (Reason → Synthesize) with single prefix per cycle
- Interrupt handling: [interrupted] marker on UserStopped during streaming
- format_loop_metadata_for_result extracts metadata line for streaming mode
- 5 unit tests: prefix_once_per_cycle, multi_phase, flush_on_each_delta,
  interrupted_marker, resets_between_turns

All existing tests pass (208 fx-cli, 369 fx-kernel).

* fix: address review findings for PR #1095 (TUI streaming)

- Guard stderr flush in stop_spinner_if_active (only when spinner active)
- Extract print_interrupted_marker as free function (stateless, no self)
- Rename stream_renderer_flush_on_each_delta to match actual assertion
- Rename stream_interrupted_shows_marker to _does_not_panic
- Log tracing::warn on broadcast Lagged with skipped event count
- Document cycle_active spec divergence in wave2-b spec

* refactor: introduce DisplayLoopState struct + add async display loop test

- Extract mutable loop state (renderer, spinner_active, frame_index,
  streamed) into DisplayLoopState struct, reducing dispatch_streaming_event
  from 6 parameters to 3 (ENGINEERING.md §2 compliance).
- Add display_loop_processes_streaming_events_end_to_end integration test
  exercising the async wiring: EventBus → dispatch_streaming_event →
  StreamRenderer state transitions → stop signal exit.

* fix: wire RouterLoopLlmProvider to real complete_stream

Root cause: RouterLoopLlmProvider did not override complete_stream(),
so the kernel LlmProvider trait default kicked in — calling complete()
then wrapping the full response in stream::once(). This produced a single
chunk with the entire response instead of real streaming deltas.

Fix: delegate complete_stream() to self.router.complete_stream(), matching
how generate() already uses the router streaming path.

Also fix CompletionTestProvider::complete_stream to return a proper
stream chunk from the completion response (was returning empty stream).

* fix: Anthropic streaming drops tool calls — content_block_stop not handled

Root cause: The Anthropic streaming SSE parser never emitted
arguments_done=true. It handled content_block_start (id/name) and
content_block_delta (partial_json) but silently ignored
content_block_stop. The kernel's finalize_stream_tool_calls filters
out any tool call where arguments_done is false, so ALL streaming
tool calls from Anthropic were silently dropped.

Symptom: model calls list_directory but the tool call vanishes,
decide() sees only text content → Decision::Respond with stale
conversation context (context bleed between iterations).

Fix: emit a ToolUseDelta with arguments_done=true on
content_block_stop events. For text blocks this creates a harmless
entry (no id/name → filtered by finalize). For tool_use blocks it
correctly marks the accumulated arguments as complete.

Regression test: content_block_stop_marks_tool_arguments_done

* fix: reprint streamed output with markdown formatting on completion

* fix: address streaming review findings (NB1, NB2, NTH1)

* fix: add carriage return before ANSI clear in streaming finalize

The ANSI escape \x1b[{n}A moves cursor up N lines but stays at the
same column. Without returning to column 0 first, \x1b[J clears
from mid-line, leaving the original 'assistant ›' prefix visible —
resulting in a double prefix on the reprinted line.

Fix: insert \r between cursor-up and clear-to-end.

* fix: inject tool ID into Anthropic streaming argument deltas for multi-tool calls

* fix: drop cursor-based markdown reprint, defer to incremental (#1097)

The buffer+reprint approach (clear raw output via ANSI cursor-up,
reprint with render_markdown) is fundamentally unreliable:
terminal scrollback, varying widths, and shared prefix lines make
line estimation fragile, causing double-prefix artifacts.

Replace with a simple trailing newline. Raw streamed text is kept
as-is — readable but unformatted. Full markdown formatting during
streaming is deferred to incremental rendering (#1097).

* debug: add SSE event tracing for multi-tool diagnosis (temporary)

* fix: remove dead estimate_rendered_lines (reprint dropped)

* Changes to be committed:
	new file:   docs/specs/budget-soft-ceiling.md

* fix: remove debug traces, fix clippy blockers, clean up scope

- Remove temporary SSE debug eprintln! tracing (multi-tool diagnosis complete)
- Extract LoopEngineBundle struct from 4-tuple return (clippy type-complexity)
- Remove dead estimate_rendered_lines method and its test (reprint dropped)
- Remove budget-soft-ceiling.md (moved to feat/loop-resilience branch, PR #1098)
- All tests passing, clippy clean with -D warnings

* feat: loop resilience — budget soft-ceiling, fan-out cap, tool result truncation (#1098)

* docs: add loop resilience spec (budget soft-ceiling, fan-out cap, tool result truncation)

Three independent fixes for production loop failures:
1. Budget soft-ceiling at 80% with BudgetState enum and wrap-up mode
2. Fan-out cap (default 4) to prevent context overflow from parallel tools
3. Tool result truncation (default 16KB) to bound per-result context cost

Addresses Fawx self-diagnosed improvements from 2026-03-03 smoke testing.

* docs(spec): address all review findings on loop-resilience spec

Blocking fixes:
- B1: Corrected all file paths to actual crate locations (fx-kernel/src/budget.rs, fx-kernel/src/loop_engine.rs). No loop-engine crate exists.
- B2: Specified cost_cents as primary trigger, llm_calls as secondary. Wall time excluded. Rationale documented.
- B3: Reduced BudgetState to Normal + Low only. Exhausted already handled by BudgetExceeded. Critical removed (YAGNI — no distinct behavior).
- B4: Clarified wrap-up is a new LLM call via perceive() directive injection. 20% headroom at default 0.80 fraction is sufficient. Decision::Respond routing explained.

Non-blocking fixes:
- NB1: Wrap-up directive injected in perceive() — documented as cleanest injection point.
- NB2: Monotonic scope clarified to single run_cycle() call.
- NB3: soft_ceiling_fraction added to BudgetConfig (configurable, consistent with existing pattern).
- NB4: Split test case 2 into unit-testable parts (directive presence, tool blocking) vs smoke test (coherent output).
- NB5: Line estimates adjusted to ~400, risk assessment kept.

Nice-to-haves:
- NTH1: Using existing SignalKind::Performance instead of new variant.
- NTH2: Added section documenting ConversationBudget vs BudgetTracker orthogonality.
- NTH3: Per-resource thresholds noted as future design space in 'What This Does NOT Cover'.

* feat: loop resilience — budget soft-ceiling, fan-out cap, tool result truncation

Implements three independent resilience fixes for the loop engine:

1. Budget Soft-Ceiling (BudgetState enum):
   - BudgetState::Normal / BudgetState::Low based on cost_cents and llm_calls
   - Configurable soft_ceiling_fraction (default 0.80)
   - When Low: tools blocked, decomposition blocked, wrap-up directive injected
   - Performance signal emitted on Normal→Low transition

2. Fan-Out Cap:
   - max_fan_out (default 4) limits tool calls per LLM response
   - Excess calls deferred with descriptive message
   - Friction signal emitted when cap is hit

3. Tool Result Truncation:
   - max_tool_result_bytes (default 16384) caps individual results
   - UTF-8 char boundary safe truncation
   - Marker: [truncated — N bytes omitted, M total]

All 16 test cases from the spec implemented plus additional edge cases.
392 tests passing, clippy clean.

Closes #1098

* fix: address all review findings on PR #1098

NB1: Deferred tool messages were dead code — append_deferred_message pushed
to continuation_messages but synthesize_tool_fallback builds from
all_tool_results. Fix: inject deferred tools as synthetic ToolResult entries
into all_tool_results immediately after fan-out cap, ensuring they appear
in every return path (finalize_tool_response and synthesize_tool_fallback).

NB2: Performance signal lacked exactly-once guard — emitted on every
perceive() call while budget Low. Fix: add budget_low_signaled bool to
LoopEngine, set on first emission, reset in prepare_cycle().

NB3: Deferred message used Message::assistant(). Fix: replaced with
synthetic ToolResult entries (combined with NB1 fix).

NB4: Fan-out cap not applied to continuation tool calls. Fix: apply
apply_fan_out_cap() when state.current_calls is reassigned from
response.tool_calls in the continuation loop.

NB5: Eq trait removed from BudgetConfig/AllocationPlan/BudgetTracker by
f64 field. Fix: replace soft_ceiling_fraction: f64 with
soft_ceiling_percent: u8 (default 80), convert to f64 only in state().
Restores Eq derivation on all three types.

NTH1: Added budget_state_monotonicity_through_thresholds test — records
cost to 50% (Normal), 81% (Low), checks Low stays Low.

NTH2: Added budget_state_boundary_at_exactly_80_percent test — verifies
80/100 = Normal (exceeds_fraction uses >) and 81/100 = Low.

NTH3: Wrap-up directive changed from Message::user() to Message::system()
in perceive() — it's an engine directive, not user input.

NTH4: truncate_tool_result returns Cow<str> instead of String to avoid
allocation when no truncation needed. Callers use .into_owned().

All tests: cargo test -p fx-kernel — 400 passed, 0 failed.
Clippy: cargo clippy -p fx-kernel -- -D warnings — clean.

* fix(kernel): inject deferred tool results for continuation rounds, avoid needless alloc in truncation

NB1: Continuation-round fan-out now calls append_deferred_tool_results()
for capped tool calls, matching the initial-round behavior. The model
receives synthetic ToolResult entries for deferred calls so it knows
which calls were not executed.

NTH1: truncate_tool_results() now guards with a length check before
calling truncate_tool_result(), avoiding the Cow::Borrowed -> into_owned()
allocation when no truncation is needed.

Updated continuation_tool_calls_capped_by_fan_out test to verify deferred
results are present and correctly marked.

* docs: add specs for decompose gate (#1100), per-tool retry (#1101), memory decay (#1103)

* fix: remove temporary SSE/TRACK debug traces from anthropic streaming

Multi-tool streaming confirmed working via live test (6 read_file calls,
fan-out cap correctly limited to 4, model re-requested deferred 2).
Debug eprintln! calls were interleaving with stdout making response
text unreadable. Marked 'TEMPORARY DEBUG — remove after multi-tool
diagnosis' in the code.

* fix: address R4 review findings — NB1 done marker, NTH1-3 comments and cohesion

NB1: content_block_stop now emits arguments_delta: None instead of
Some("") — the done marker is a completion signal, not a delta. Updated
is_done_marker predicate to check arguments_delta.is_none() and added
regression test assertion.

NB2: estimate_rendered_lines was already removed in d3556e37 — N/A.

NTH1: Added inline comment noting future [...{n} events skipped...]
marker for broadcast::Receiver lagged events.

NTH2: Added comment on finalize() noting threshold-based skip as
future enhancement for long response reprint flash.

NTH3: Moved print_interrupted_marker() from free function to
associated function on StreamRenderer for cohesion.

B1-B3: Already resolved in prior commits (estimate_rendered_lines
removed in d3556e37, LoopEngineBundle struct in ee4da96b).

* fix(kernel): check budget soft-ceiling within tool round loop (#1105) (#1108)

The budget soft-ceiling (BudgetState::Low at 80%) was only checked at
entry to act_with_tools(), not within the inner tool round loop. When
budget crossed 80% mid-loop (e.g., reading 49 files batched at 4/round),
rounds continued until hard exhaustion, producing truncated output.

Changes:
- Record tool execution and LLM continuation costs incrementally in
  execute_tool_round() so budget.state() reflects actual consumption
- Check budget.state() at the top of each loop iteration in
  act_with_tools(); break to synthesis fallback when Low
- Check budget.state() before request_tool_continuation() in
  execute_tool_round(); skip LLM call when Low (BudgetLow variant)
- Adjust execute_action_and_finalize() to avoid double-recording costs
  for tool actions (already recorded incrementally)
- Extract compact_tool_continuation(), record_tool_execution_cost(),
  record_continuation_cost(), emit_budget_low_break_signal() helpers

Regression test verifies: budget at 76%, 3 tool calls, after round 1
budget crosses 80%, loop breaks, falls through to synthesis.

* feat(tui): incremental markdown rendering during streaming (#1097) (#1107)

* feat(tui): incremental markdown rendering during streaming (#1097)

Add MarkdownRenderer struct that maintains parse state across streaming
deltas, supporting headers, bold, italic, inline code, code blocks, and
lists with ANSI formatting. Wire into StreamRenderer::handle_delta() so
each token gets markdown-aware rendering. On finalize(), flush pending
state.

Graceful fallback: unclosed markers pass through as raw text. Never
crashes or swallows content.

* fix(tui): address PR #1107 review findings (NB1, NB2, NTH1-3)

NB1: Replace raw ANSI escape codes for bold/italic with crossterm Stylize
trait for consistent formatting throughout.
NB2: Use header level parameter - h1 gets underline+bright, h2 bold+medium,
h3 bold+dim for visual differentiation.
NTH1: Add + list prefix support alongside - and *.
NTH2: Tighten code fence detection - only accept triple backtick followed
by valid language identifier or nothing.
NTH3: Add tests for multi-format lines, flush inside code block, nested
formatting edge cases, plus prefix, fence rejection, header levels.

* fix(llm): use per-block index tracking to prevent streaming tool argument concatenation (#1106) (#1109)

* fix(llm): use per-block index tracking to prevent streaming tool argument concatenation (#1106)

Replace single current_tool_use_id with HashMap<usize, String> keyed by
Anthropic's content-block index. This prevents argument concatenation
when multiple tool_use blocks stream concurrently — deltas for tool A
arriving after tool B's content_block_start no longer get stamped with
the wrong ID.

Changes:
- AnthropicSseState: tool_ids_by_index replaces tool_use_active +
  current_tool_use_id
- parse_sse_data: accepts &mut tool_ids_by_index, dispatches to focused
  handler functions (handle_block_start/delta/stop/message_start/delta)
- track_tool_use_state removed — index-based routing makes it obsolete
- content_block_stop done markers now carry the correct tool ID
- All call sites updated (stream_from_sse, parse_sse_payload)

Tests:
- interleaved_tool_deltas_route_to_correct_tool_id: tool A delta after
  tool B start still carries tool A's ID
- four_concurrent_tool_blocks_all_arguments_clean: 4 blocks with
  out-of-order deltas all route correctly
- Existing tests updated with index fields to match real API

* fix: remove duplicate #[test] attribute

* test(kernel): loop engine error path coverage (#1099) (#1112)

* test(kernel): loop engine error path coverage (#1099)

Add 11 tests covering 5 critical error paths in the loop engine:

1. Budget exhaustion mid-tool-call (2 tests)
   - budget_exhaustion_mid_tool_execution_returns_budget_exhausted
   - budget_exhaustion_preserves_partial_response

2. Decomposition depth >2 integration (3 tests)
   - decompose_depth_three_chain_respects_depth_cap
   - decompose_at_max_depth_returns_fallback
   - decompose_depth_cap_prevents_infinite_recursion_end_to_end

3. Tool friction → escalation (2 tests)
   - repeated_tool_failures_synthesize_without_infinite_retry
   - tool_friction_caps_at_max_iterations

4. Context overflow during tool round (2 tests)
   - context_overflow_during_tool_round_returns_error
   - context_overflow_mid_tool_round_is_recoverable

5. Cancellation during decomposition (2 tests)
   - cancellation_during_decomposition_returns_user_stopped
   - cancellation_during_slow_tool_in_decomposition_is_clean

* fix(kernel): address PR #1112 review findings on error-path tests

NB1: Replace catch-all Ok(_)/{} arms in context overflow tests with
explicit variant matches (Complete, Error, BudgetExhausted) and content
assertions — no silent acceptance of unexpected results.

NB2: Fix budget_exhaustion_preserves_partial_response to allow 1 tool
invocation before exhaustion, then assert partial_response.is_some()
with non-empty content.

NB3: Rename decompose_depth_three_chain_respects_depth_cap to
decompose_at_depth_zero_with_cap_three_completes to match actual
behavior (single-level decomposition at depth 0).

NB4: Remove BudgetExhausted acceptance from cancellation test — with
20 LLM calls of budget, only UserStopped or Complete are valid.

NTH1: Extract ~150 lines of shared test helpers (ScriptedLlm,
StubToolExecutor, CancelAfterNthCallLlm, factory functions) into
test_fixtures module to reduce duplication.

NTH2: Tighten budget_exhaustion_mid_tool test to use max_llm_calls: 1
for deterministic budget exhaustion.

NTH3: Remove unexercised scripted LLM responses from
tool_friction_caps_at_max_iterations and other tests.

* docs: spec for synthesis context guard (#1110) (#1111)

* docs: spec for synthesis context guard (#1110)

* feat(kernel): synthesis context guard — three-layer defense against context overflow (#1110)

Layer 1: Round loop aggregate check — accumulated result bytes tracked in
BudgetTracker and fed into BudgetState::Low, which already triggers the
round loop break in act_with_tools().

Layer 2: Eviction before synthesis — evict_oldest_results() in
synthesize_tool_fallback() replaces oldest results with stubs (preserving
tool_call_id and tool_name) until under max_synthesis_tokens. Single
oversized results are truncated via existing truncate_tool_result().

Layer 3: Budget state trigger — accumulated_result_bytes field on
BudgetTracker, recorded per round in execute_tool_round(). Exceeding
max_aggregate_result_bytes triggers BudgetState::Low.

New BudgetConfig fields:
- max_aggregate_result_bytes (default 400KB): Layer 3 threshold
- max_synthesis_tokens (default 100K): Layer 2 eviction limit

Tests: eviction reduces token count, oldest evicted first, stubs include
tool_name, single oversized result truncated, no eviction when under limit,
budget state trigger, reset clears accumulated bytes.

* fix(fx-kernel): address review findings on synthesis context guard

NB1: Clamp max_synthesis_tokens to floor of 1000 so zero config doesn't
     evict all results, leaving nothing for synthesis.

NB2: Add comment explaining inverse token estimate undershoot is
     intentional (conservative eviction preferred over over-eviction).

NB3: Add comment documenting that evicted stubs in ActionResult are
     intentional (synthesis-only path, results not used downstream).

NTH1: Add tracing::debug! for total accumulated bytes when eviction
      is NOT triggered (aids 'why didn't it evict?' debugging).

NTH2: Add future-proofing comment on max_aggregate_result_bytes
      noting it should derive from model context window.

Includes regression test for zero-max-tokens floor clamping.
All 412 fx-kernel tests pass, clippy clean.

* fix: lower default max_synthesis_tokens from 100K to 50K

100K of tool results + prompt overhead exceeded 200K context window.
50K gives comfortable headroom for synthesis prompt + instructions.

* feat(kernel): per-tool retry budget with blocked escalation (#1101) (#1117)

* feat(memory): time-based memory decay and pruning (#1103) (#1116)

* feat(memory): time-based memory decay and pruning (#1103)

* fix(memory): address R1 review — remove duplicate decay_config, dead code, visibility (#1103)

* fix(memory): stabilize snapshot_returns_all test — sort before comparing (#1103)

* feat(kernel): per-tool retry budget with blocked escalation (#1101)

* fix(kernel): address R1 review — clippy, test coverage, reorder perf (#1101)

* chore(docs): move deprecated android/ios/h2/sprint docs to docs/deprecated/ (#1127)

* docs: spec for #1104 — budget manipulation attack surface (#1120)

* docs: implementation spec for #1102 — kernel/loadable boundary tests

* docs: implementation spec for #1104 — budget attack surface analysis

* fix(docs): add issue links to #1104 spec exclusions

* fix(docs): address R1 review comments on #1104 spec

* feat(security): update security model with budget attack surface (#1104)

* fix(security): add surface #13 defense posture row, remove out-of-scope spec (#1104)

* feat(kernel): decompose complexity gate — batch detection, floor, cost (#1100) (#1121)

* feat(kernel): decompose complexity gate — batch detection, floor, cost (#1100)

* fix(kernel): address R1 review — extract helper, remove clone, fix noop, boundary test (#1100)

* fix(kernel): update test 6 for == 1 tool requirement in complexity floor (#1100)

* fix(ci): remove obsolete H2 spec contract jobs, fix clippy field_reassign_with_default

* feat: add GitHub skill for PR management (#1129)

* feat(security): kernel/loadable boundary tests (#1102)

Rebased onto current staging after #1121 (decompose gate) and #1120
(budget attack surface) merged. Clean cherry-pick of test additions
from c10928f6 — no production code changes.

Tests cover:
- Kernel isolation: tool results cannot influence budget/depth decisions
- Budget immutability: loadable layer cannot modify kernel budget state
- Context integrity: tool outputs are truncated, not trusted as instructions
- Recursion depth enforcement: tools cannot bypass depth limits
- Registry immutability: concurrent registration, duplicate rejection
- Skill trait boundaries: lifecycle isolation, error containment
- Memory validation: key format enforcement, size limits, path traversal

* feat(loadable): GitHubSkill with create_pr and comment_pr (#1128)

* fix: remove unrelated #1102 boundary tests and unused kv_set

- Restore engine crate files (fx-kernel, fx-loadable, fx-memory) to
  staging state — boundary test code belongs in PR #1119, not here
- Remove unused kv_set function and its FFI import (YAGNI)
- Branch now contains ONLY github-skill files

Fixes R2 review findings B1 and NB1 on PR #1129

* fix(clippy): allow field_reassign_with_default in test code, fix unused vars

* feat(security): kernel/loadable boundary tests (#1102) (#1119)

Rebased onto current staging after #1121 (decompose gate) and #1120
(budget attack surface) merged. Clean cherry-pick of test additions
from c10928f6 — no production code changes.

Tests cover:
- Kernel isolation: tool results cannot influence budget/depth decisions
- Budget immutability: loadable layer cannot modify kernel budget state
- Context integrity: tool outputs are truncated, not trusted as instructions
- Recursion depth enforcement: tools cannot bypass depth limits
- Registry immutability: concurrent registration, duplicate rejection
- Skill trait boundaries: lifecycle isolation, error containment
- Memory validation: key format enforcement, size limits, path traversal
abbudjoe added a commit that referenced this pull request Apr 8, 2026
Address all 13 items from the review comment:

Blocking:
- #1: Replace serde_json::Value metadata with typed ThoughtMetadata enum
- #2: Edge classification uses explicit caller declaration (add_edge vs
  add_back_edge) with index-order sanity checks, not index arithmetic
- #3: Define Generate partial failure as all-or-nothing per parent

Non-blocking:
- #4: ThoughtIdAllocator uses plain u64, not AtomicU64
- #5: Remove created_at ghost field (Instant not serializable)
- #6: Replace raw usize with GraphNodeId wrapper throughout
- #7: Fix line count estimate to ~1,900 (was ~1,100)
- #8: Replace (usize, usize, bool) tuple with named EdgeSpec struct
- #9: GoT + sub_goals combination returns error instead of silent ignore

Nice-to-have:
- #10: LLM score parsing uses regex extraction with fallback
- #11: Each operation emits tracing::info_span! with node/op/cycle
- #12: Single budget mechanism (session-level), removed max_total_tokens
- #13: Document refine() sets last_node to final internal node

Step 6 gains 4 new test cases (12-15) covering partial failure,
parameter conflict rejection, refine wiring, and score parsing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant