bug(ansible): provision-fleet-roles.yml has no pre-flight git pull — stale code_source causes old Ansible roles to run #3561
Description
Summary
provision-fleet-roles.yml runs Ansible roles directly from code_source at /opt/autobot/code_source without first pulling the latest code from GitHub. If code_source is stale (e.g., after a PR merge that hasn't been picked up yet), the old role code executes and can cause hard-to-diagnose failures.
Observed in Production
Provisioning run on 2026-04-06 failed at Phase 5a (AI Stack) on 00-SLM-Manager:
TASK [ai-stack : Install AI Python packages from requirements-ai.txt]
fatal: [00-SLM-Manager]: FAILED!
ERROR: Could not find a version that satisfies the requirement numpy<3.0.0,>=2.4.3
ERROR: No matching distribution found for numpy<3.0.0,>=2.4.3
Root cause chain:
- PR #3536, "fix(ansible): ai-stack role uses python3.12 for venv + fix pip cache (#3534)" (merged), added stale-venv detection to the ai-stack role: it detects a Python 3.10 venv, deletes it, and recreates it with Python 3.12
- code_source on the SLM manager was not updated after #3536 merged
- The stale ai-stack role (no stale-venv detection) ran against a host that had a pre-existing Python 3.10 venv
- numpy>=2.4.3 requires Python >=3.11 → install fails
Evidence that old code was deployed — task name in Ansible output:
TASK [ai-stack : Create Python virtual environment] ← old (no #3534 suffix)
Current code in repo:
- name: Create Python virtual environment (#3534) ← new
The stale-venv detection tasks and pip cache dir task were entirely absent from the verbose output.
Existing Pattern
update-all-nodes.yml already does this correctly (lines 61–68):
- name: "[PRE-FLIGHT] Sync latest code from GitHub (if reachable)"
  git:
    repo: "{{ github_repo }}"
    dest: "{{ git_repo_root }}"
    version: "{{ deploy_ref }}"
    update: yes
  ignore_errors: yes
Proposed Fix
Add the same pre-flight git pull play to provision-fleet-roles.yml (or to setup_wizard.py before invoking it), so the Ansible roles always reflect the latest merged code before a provision run.
The ignore_errors: yes pattern from update-all-nodes.yml is appropriate — if the SLM has no internet access, fall back to whatever is in code_source with a warning (same behavior as today, but explicit).
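A minimal sketch of what such a pre-flight play could look like in provision-fleet-roles.yml. It assumes the play runs on the controller (localhost), that github_repo, git_repo_root and deploy_ref are defined the same way as in update-all-nodes.yml, and that git_repo_root resolves to the code_source checkout. The play name, the registered variable preflight_sync, and the warning text are hypothetical.

```yaml
# Hypothetical pre-flight play; place it before the provisioning plays.
# Assumes github_repo, git_repo_root and deploy_ref are already defined.
- name: "[PRE-FLIGHT] Refresh code_source before provisioning"
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: "[PRE-FLIGHT] Sync latest code from GitHub (if reachable)"
      ansible.builtin.git:
        repo: "{{ github_repo }}"
        dest: "{{ git_repo_root }}"
        version: "{{ deploy_ref }}"
        update: true
      register: preflight_sync
      # Keep going when GitHub is unreachable, as update-all-nodes.yml does.
      ignore_errors: true

    - name: "[PRE-FLIGHT] Warn when falling back to the existing checkout"
      ansible.builtin.debug:
        msg: >-
          Could not sync from GitHub; provisioning with the existing
          checkout at {{ git_repo_root }}.
      when: preflight_sync is failed
```

One point worth verifying before settling on this placement: roles referenced statically by later plays may already have been parsed when the playbook was loaded, in which case a pull inside the same playbook run would not change the role code that executes. If that turns out to be the case, running the sync from setup_wizard.py before ansible-playbook starts may be the safer option.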
Impact
- Any provision run after a PR merge that touches Ansible roles may use stale role code
- Failures are non-obvious: the old task names in the verbose output are the only clue
- Affects all roles, not just ai-stack
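To make stale code easier to spot than by comparing task names, a provision run could print the commit it is running from. The tasks below are a sketch only: the task names and the deployed_commit variable are hypothetical, and the path is the code_source location given in the Summary.

```yaml
# Hypothetical diagnostic tasks; run them at the start of a provision run.
- name: Read the commit currently checked out in code_source
  ansible.builtin.command:
    cmd: git rev-parse --short HEAD
    chdir: /opt/autobot/code_source
  register: deployed_commit
  changed_when: false   # read-only command, never reports a change

- name: Show which commit the roles are running from
  ansible.builtin.debug:
    msg: "Provisioning from commit {{ deployed_commit.stdout }}"
```

Comparing the printed hash with the head of the deployed branch on GitHub would show immediately whether a merged PR has reached the host.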
Labels
- bug, ansible, infra, medium