Skip to content

fix: fix Gemini runner — Vertex auth, system prompt, custom commands, feedback MCP#803

Merged
Gkrumbach07 merged 10 commits intomainfrom
bug/fix-gemini-runner
Mar 4, 2026
Merged

fix: fix Gemini runner — Vertex auth, system prompt, custom commands, feedback MCP#803
Gkrumbach07 merged 10 commits intomainfrom
bug/fix-gemini-runner

Conversation

@Gkrumbach07
Copy link
Contributor

Summary

  • Vertex AI auth fixed: map platform env vars (ANTHROPIC_VERTEX_PROJECT_IDGOOGLE_CLOUD_PROJECT, CLOUD_ML_REGIONGOOGLE_CLOUD_LOCATION) in Gemini CLI subprocess env; validate credentials file exists at startup with a clear error
  • System prompt: write .gemini/system.md using ${SubAgents}/${AgentSkills}/${AvailableTools} variable substitutions to preserve Gemini's default instructions, then append platform context (workspace paths, repos, git, workflow, rubric/correction command hints)
  • Custom commands: /ambient:evaluate-rubric and /ambient:log-correction written to .gemini/commands/ambient/ at session setup
  • Feedback MCP server: minimal stdlib stdio MCP server (feedback_server.py) exposing evaluate_rubric and log_correction tools backed by existing Langfuse logging; registered in .gemini/settings.json with Langfuse credentials injected via env block (bypasses the Gemini CLI subprocess env blocklist)
  • add_dirs bug fixed: resolve_workspace_paths return value was silently dropped; now seeded into include_directories

Platform refactoring

Moved shared code out of individual bridges into the base layer:

  • platform/auth.py: validate_vertex_credentials_file() — shared Vertex credential validation used by both Claude and Gemini auth modules
  • bridge.py (base class): set_context(), _ensure_ready(), _setup_platform(), _refresh_credentials_if_stale() — identical in both bridges, now inherited; bridges only contain framework-specific code
  • bridge.py: _async_safe_manager_shutdown() — the async-safe fire-and-forget manager shutdown pattern shared by both mark_dirty() implementations

Test plan

  • Create a Gemini session with Vertex AI enabled — should connect and respond
  • Verify /ambient:evaluate-rubric and /ambient:log-correction commands appear in Gemini CLI
  • Run /ambient:log-correction — should log to Langfuse without "credentials missing" error
  • Verify Claude sessions still work (no regression from base class refactor)

🤖 Generated with Claude Code

Gkrumbach07 and others added 2 commits March 4, 2026 14:23
…nd feedback MCP server

## Fixes
- Map ANTHROPIC_VERTEX_PROJECT_ID/CLOUD_ML_REGION to GOOGLE_CLOUD_PROJECT/GOOGLE_CLOUD_LOCATION
  in subprocess env so Gemini CLI can connect to Vertex AI
- Validate GOOGLE_APPLICATION_CREDENTIALS file existence at startup (fail fast with clear error)
- Wire add_dirs from resolve_workspace_paths into include_directories (was silently dropped)

## New Gemini features
- System prompt: write .gemini/system.md using ${SubAgents}/${AgentSkills}/${AvailableTools}
  variable substitutions to preserve Gemini's default instructions, then append platform context
  (workspace paths, repos, git push instructions, workflow, rubric/correction hints)
- Custom commands: /ambient:evaluate-rubric and /ambient:log-correction written to
  .gemini/commands/ambient/ at session setup
- Feedback MCP server: minimal stdlib stdio MCP server exposing evaluate_rubric and
  log_correction tools backed by existing Langfuse logging; injected into .gemini/settings.json
  with Langfuse credentials in env block (bypasses Gemini CLI blocklist)

## Platform refactoring
- Extract validate_vertex_credentials_file() to platform/auth.py (shared by both bridges)
- Extract set_context(), _ensure_ready(), _setup_platform(), _refresh_credentials_if_stale()
  to PlatformBridge base class — bridges now only contain framework-specific code
- Extract _async_safe_manager_shutdown() to bridge.py as shared helper for mark_dirty()

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replaces mutable :latest tags with immutable SHA digests so operator
pods always pull a known-good image regardless of pull policy.
:latest with IfNotPresent is a known source of stale-image bugs (as
seen with the Gemini hydrate.sh fix earlier in this branch).

Digests pinned to current main build:
- vteam_claude_runner@sha256:5363f4bb...
- vteam_state_sync@sha256:4541a831...

CI should update these digests in the ConfigMap after each successful
image build on main.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions

This comment has been minimized.

Gkrumbach07 and others added 2 commits March 4, 2026 14:35
After building and pushing runner/state-sync images, patch the
ambient-agent-registry ConfigMap with the same SHA tag used for the
operator env vars. This ensures new sessions always use the image that
was just built rather than whatever :latest happened to be cached on
the node.

- On push to main: uses github.sha tag (or 'stage' if image unchanged)
- On workflow_dispatch: uses 'stage' tag

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions

This comment has been minimized.

:stage is a mutable tag — nodes with imagePullPolicy: IfNotPresent
won't re-pull it even when a new image is pushed. Use the immutable
git SHA tag instead so the operator always spawns pods with the exact
image built in that CI run.

Also replace the :stage fallback (for unchanged images) with :latest,
which is semantically correct: if an image wasn't rebuilt, the node
already has the right content and doesn't need to pull.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions

This comment has been minimized.

Mirrors the same fix from components-build-deploy.yml — patch the
ambient-agent-registry ConfigMap with the release version tag so new
sessions use the exact release image rather than whatever :latest is
cached on the node.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions

This comment has been minimized.

…th :latest

Previous approach fetched the ConfigMap from the repo file and applied
the full thing, which reset unchanged images back to :latest.

New approach:
1. Fetch the live JSON from the cluster (which has the last good SHA)
2. Only replace images that were actually rebuilt in this run using a
   greedy sed pattern ([^"]*) that matches any current tag/digest
3. oc patch just the data key — unchanged images keep their current tag

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions

This comment has been minimized.

- Use jq -Rs . instead of python3 for JSON escaping (simpler, already
  on ubuntu-latest, no extra process)
- Fix sed pattern [@:][^"]* to handle both :tag and @sha256:digest refs
  (previous [^"]* only matched after : so digest refs were skipped)
- Fix dispatch job operator env vars to use github.sha not :stage
  (ConfigMap used sha but operator still pointed to mutable :stage tag)
- Add concurrency: group to prod-release-deploy.yaml to prevent
  parallel releases racing on the ConfigMap read-modify-write

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

Base class methods (_ensure_ready, _refresh_credentials_if_stale,
set_context) were accessing self._context, self._ready, and
self._last_creds_refresh via # type: ignore[attr-defined] because
those attributes were only declared in subclasses. Any future bridge
that forgot to initialise them would crash with AttributeError.

- Add PlatformBridge.__init__ declaring the three shared attributes
  with correct types
- Both bridges call super().__init__() and drop their own redundant
  declarations of the same three attrs
- Remove all seven # type: ignore[attr-defined] annotations

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions

This comment has been minimized.

… exception

The Gemini feedback_server was importing private functions (_log_to_langfuse,
_log_correction_to_langfuse, _get_session_context) directly from
bridges/claude/, creating an illegal cross-bridge dependency on internals.

- Create platform/feedback.py with public API: log_rubric_score(),
  log_correction(), get_session_context() — backed by the existing
  Claude bridge implementations but accessible to any bridge
- feedback_server.py now imports only from platform.feedback (no
  bridges.claude.* imports remain)
- Session context caching moved to platform/feedback.py so it's shared
  regardless of which caller uses get_session_context()

Also fix silent exception in system_prompt.py: bare `except Exception: pass`
replaced with `logger.warning(...)` so filesystem errors are visible.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Contributor

github-actions bot commented Mar 4, 2026

Claude Code Review

Summary

PR #803 fixes Gemini CLI Vertex auth, adds system prompt support, custom slash commands, and a feedback MCP server. It also includes a solid refactor moving shared bridge code (credential refresh, lazy setup, async shutdown helper) into the `PlatformBridge` base class. The changes are well-structured and the logic is correct. Two issues stand out: an inverted dependency in the new platform layer and Langfuse credentials written to a file accessible to the agent.

Issues by Severity

Blocker Issues

None.

Critical Issues

None.

Major Issues

1. Inverted dependency: `platform/feedback.py` imports private symbols from a bridge-specific module

  • File: `components/runners/ambient-runner/ambient_runner/platform/feedback.py:747-755`
  • What: The platform layer (`platform/`) imports private implementation functions from `bridges/claude/`:
    ```python
    from ambient_runner.bridges.claude.corrections import (
    _get_session_context, # private
    _log_correction_to_langfuse, # private
    ...
    )
    from ambient_runner.bridges.claude.tools import _log_to_langfuse # private
    ```
  • Which standard: Architecture violation — the platform layer should not depend on bridge-specific modules. `platform/` is the shared base that bridges build on top of; the dependency arrow should never point the other way.
  • Impact: Any non-Claude bridge that uses `platform/feedback.py` silently pulls in Claude-specific code. Adding a third bridge without Claude installed would break feedback logging entirely.
  • Fix: Move `_get_session_context`, `_log_correction_to_langfuse`, `_log_to_langfuse`, `CORRECTION_SOURCES`, and `CORRECTION_TYPES` into `platform/` (e.g. `platform/corrections.py`). Have the Claude bridge import from there instead of the reverse.

2. `LANGFUSE_SECRET_KEY` written in plaintext to workspace-accessible `.gemini/settings.json`

  • File: `components/runners/ambient-runner/ambient_runner/bridges/gemini_cli/mcp.py:364-396`
  • What: `_build_feedback_server_entry()` injects `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` into the MCP server `env` block, which lands in `{cwd_path}/.gemini/settings.json`. The Gemini CLI agent has file-reading tools and can access the entire workspace including this file.
  • Which standard: Security standards — minimal credential exposure; the Gemini CLI's own subprocess env blocklist exists specifically to prevent the agent from seeing `LANGFUSE_*` keys. This workaround circumvents that intent by writing them to a file the agent can read.
  • Impact: An adversarial or prompt-injected interaction could read `.gemini/settings.json` and exfiltrate `LANGFUSE_SECRET_KEY`.
  • Fix (short-term): After `write_gemini_settings` returns, set `chmod 0o600` on the settings file:
    ```python
    settings_path = write_gemini_settings(cwd_path, settings)
    try:
    Path(settings_path).chmod(0o600)
    except OSError:
    pass
    ```
    Fix (long-term): Move the credentials file outside the agent-accessible workspace (e.g. `/run/secrets/ambient-feedback/`) and have the feedback server read from there, or inject only `LANGFUSE_ENABLED` into the settings env block and pass secrets via a separate mechanism.

Minor Issues

3. Return type annotation mismatch on `setup_gemini_mcp`

  • File: `components/runners/ambient-runner/ambient_runner/bridges/gemini_cli/mcp.py` (function signature)
  • What: The function now always registers `ambient-feedback` and calls `write_gemini_settings`, so it always returns a non-`None` path. The annotation `-> Optional[str]` is misleading.
  • Fix: Change to `-> str`.

4. Deferred import inconsistency in `_build_system_prompt`

  • File: `components/runners/ambient-runner/ambient_runner/bridges/gemini_cli/system_prompt.py:688`
  • What: `from ambient_runner.observability import is_langfuse_enabled` is a lazy import at the bottom of the function while all other platform imports in the same function are grouped at the top.
  • Fix: Hoist it to the function-level import block alongside the other lazy imports.

5. `deploy-with-dispatch` always pins both images to `github.sha`

  • File: `.github/workflows/components-build-deploy.yml` (new ConfigMap step in `deploy-with-dispatch`)
  • What: Unlike the conditional step earlier in the workflow (which guards each sed substitution with `detect-changes` output checks), the new ConfigMap patch in `deploy-with-dispatch` always replaces both `vteam_claude_runner` and `vteam_state_sync` with the current SHA — even if only one was actually rebuilt. This mirrors the pre-existing `oc set env` pattern above it.
  • Fix: Either guard each sed substitution with the same `detect-changes` conditionals, or add a comment documenting why unconditional patching is acceptable for the dispatch flow.

Positive Highlights

  • Excellent base-class refactor: Consolidating `set_context`, `_ensure_ready`, `_setup_platform`, `_refresh_credentials_if_stale`, and shared state into `PlatformBridge.init` eliminates ~80 lines of duplicated code across two bridges and gives any future bridge a correct lifecycle for free.

  • `_async_safe_manager_shutdown` consolidation: Single, well-documented implementation replacing identical copy-paste in both `mark_dirty()` methods. Using `create_task` + done-callback is the correct pattern for the within-loop case.

  • `validate_vertex_credentials_file` consolidation: Consistent, early credential validation with the same error messages shared between Claude and Gemini auth modules — good for operators debugging misconfigured credentials.

  • Env-var fallback in `session.py` is safe: The `if not env.get("GOOGLE_CLOUD_PROJECT")` guard correctly respects explicitly-set values and only fills in the alias when the var is absent.

  • CI concurrency guard on prod deployment: `cancel-in-progress: false` in `prod-release-deploy.yaml` is the right setting — prod deployments must never be cancelled mid-flight.

  • MCP server correctly routes logs to `stderr`: `logging.basicConfig(stream=sys.stderr)` in `feedback_server.py` is easy to get wrong and correctly done here since `stdout` is the JSON-RPC channel.

Recommendations

  1. (Major — before merge) Refactor `platform/feedback.py` to eliminate the reverse-dependency on `bridges/claude/`. Move shared correction/logging internals to `platform/corrections.py` and have the Claude bridge import from there.

  2. (Major — before merge or tracked as follow-up) Address the Langfuse credentials in `.gemini/settings.json`. At minimum, `chmod 0o600` the file immediately after writing it.

  3. (Minor) Fix the `Optional[str]` return type on `setup_gemini_mcp`.

  4. (Minor) Hoist the `is_langfuse_enabled` import to the function-level import block in `_build_system_prompt`.

  5. (Minor) Document or conditionally guard the unconditional image pinning in the `deploy-with-dispatch` ConfigMap step.


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@Gkrumbach07 Gkrumbach07 merged commit 047314a into main Mar 4, 2026
13 checks passed
@Gkrumbach07 Gkrumbach07 deleted the bug/fix-gemini-runner branch March 4, 2026 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant