bug(ansible): provision-fleet-roles.yml has no pre-flight git pull — stale code_source causes old Ansible roles to run #3561

@mrveiss

Summary

provision-fleet-roles.yml runs Ansible roles directly from code_source at /opt/autobot/code_source without first pulling the latest code from GitHub. If code_source is stale (e.g., after a PR merge that hasn't been picked up yet), the old role code executes and can cause hard-to-diagnose failures.

Observed in Production

Provisioning run on 2026-04-06 failed at Phase 5a (AI Stack) on 00-SLM-Manager:

TASK [ai-stack : Install AI Python packages from requirements-ai.txt]
fatal: [00-SLM-Manager]: FAILED!
ERROR: Could not find a version that satisfies the requirement numpy<3.0.0,>=2.4.3
ERROR: No matching distribution found for numpy<3.0.0,>=2.4.3

Root cause chain:

  1. PR #3536, "fix(ansible): ai-stack role uses python3.12 for venv + fix pip cache (#3534)" (merged), added stale-venv detection to the ai-stack role: it detects a Python 3.10 venv, deletes it, and recreates it with Python 3.12
  2. code_source on the SLM manager was not updated after #3536 merged
  3. The stale ai-stack role (no stale-venv detection) ran against a host that had a pre-existing Python 3.10 venv
  4. numpy>=2.4.3 requires Python >=3.11 → install fails
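
For readers unfamiliar with what stale-venv detection involves, here is a rough sketch of the idea. It is not the actual code from #3536, which is not reproduced in this issue. The variable ai_venv_path is hypothetical, and the 3.11 threshold comes from step 4 above.

```yaml
# Illustrative only: NOT the tasks from #3536.
# ai_venv_path is a hypothetical variable pointing at the ai-stack venv.
- name: Look for an existing venv interpreter
  ansible.builtin.stat:
    path: "{{ ai_venv_path }}/bin/python"
  register: venv_python_bin

- name: Read the major.minor Python version of the existing venv
  ansible.builtin.command:
    argv:
      - "{{ ai_venv_path }}/bin/python"
      - "-c"
      - "import sys; print('%d.%d' % sys.version_info[:2])"
  register: venv_python_version
  changed_when: false
  when: venv_python_bin.stat.exists

- name: Remove a venv built on a Python older than 3.11
  ansible.builtin.file:
    path: "{{ ai_venv_path }}"
    state: absent
  when:
    - venv_python_bin.stat.exists
    - venv_python_version.stdout is version('3.11', '<')
```

With the old venv removed, a later task can recreate it with Python 3.12. The stale copy of the role had no equivalent of these steps, so the Python 3.10 venv was reused as-is.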

Evidence that old code was deployed — task name in Ansible output:

TASK [ai-stack : Create Python virtual environment]        ← old (no #3534 suffix)

Current code in repo:

- name: Create Python virtual environment (#3534)          ← new

The stale-venv detection tasks and pip cache dir task were entirely absent from the verbose output.

Existing Pattern

update-all-nodes.yml already does this correctly (lines 61–68):

- name: "[PRE-FLIGHT] Sync latest code from GitHub (if reachable)"
  git:
    repo: "{{ github_repo }}"
    dest: "{{ git_repo_root }}"
    version: "{{ deploy_ref }}"
    update: yes
  ignore_errors: yes

Proposed Fix

Add the same pre-flight git pull play to provision-fleet-roles.yml (or to setup_wizard.py before invoking it), so the Ansible roles always reflect the latest merged code before a provision run.

The ignore_errors: yes pattern from update-all-nodes.yml is appropriate — if the SLM has no internet access, fall back to whatever is in code_source with a warning (same behavior as today, but explicit).
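
A minimal sketch of what the pre-flight play could look like at the top of provision-fleet-roles.yml. It reuses the variable names from the update-all-nodes.yml task quoted above. The play header (hosts: localhost, connection: local) and the register name preflight_sync are assumptions, not taken from the repo.

```yaml
# Hypothetical pre-flight play; adjust hosts to wherever code_source lives.
- name: "[PRE-FLIGHT] Refresh code_source before provisioning"
  hosts: localhost          # assumption: the playbook is launched on the SLM manager
  connection: local
  gather_facts: false
  tasks:
    - name: "[PRE-FLIGHT] Sync latest code from GitHub (if reachable)"
      ansible.builtin.git:
        repo: "{{ github_repo }}"
        dest: "{{ git_repo_root }}"
        version: "{{ deploy_ref }}"
        update: true
      register: preflight_sync
      ignore_errors: true

    # Makes the fallback explicit instead of silently using a stale checkout.
    - name: "[PRE-FLIGHT] Warn when falling back to the existing code_source"
      ansible.builtin.debug:
        msg: "WARNING: git sync failed; provisioning with the existing checkout at {{ git_repo_root }}"
      when: preflight_sync is failed
```

One thing worth checking before relying on the in-playbook variant: Ansible appears to parse statically listed roles when the playbook is loaded, so a refresh performed by an earlier play in the same run may not change role tasks that were already loaded. If that applies here, performing the sync from setup_wizard.py before invoking the playbook would avoid the issue.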

Impact

  • Any provision run after a PR merge that touches Ansible roles may use stale role code
  • Failures are non-obvious: the old task names in the verbose output are the only clue
  • Affects all roles, not just ai-stack
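
Since old task names are currently the only clue, the provision run could also print the commit it is running from. The sketch below assumes git_repo_root points at the code_source checkout and that git is installed on the host; the register name code_source_head is hypothetical.

```yaml
# Hypothetical diagnostic tasks; they only read state and change nothing.
- name: "[PRE-FLIGHT] Read the commit currently checked out in code_source"
  ansible.builtin.command:
    cmd: git rev-parse --short HEAD
    chdir: "{{ git_repo_root }}"
  register: code_source_head
  changed_when: false

- name: "[PRE-FLIGHT] Show which commit the roles come from"
  ansible.builtin.debug:
    msg: "Provisioning from code_source at commit {{ code_source_head.stdout }}"
```

Comparing the printed commit with the merge commit of a PR makes a stale checkout visible without reading task names.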

Labels

  • bug, ansible, infra, medium
