chore: enable vrf converged topo#24963

Open
auspham wants to merge 3 commits into sonic-net:master from auspham:austinpham/37879355-enable-vrf-t0

@auspham auspham commented May 29, 2026

Contributor

Description of PR

Enable the converged (multi-VRF) cEOS peer model across the virtual testbeds and add the framework changes needed so existing tests pass unchanged on every converged topology — one solution that fits all cEOS topologies, with no per-test-case edits.

In converged mode a single cEOS VM hosts every logical neighbor as a VRF under one BGP process, and each logical neighbor's interface is renamed on the shared VM. This breaks several assumptions baked into the framework: (1) EOS config writes target router bgp <asn> in the default VRF; (2) tests reference per-logical interface names from minigraph; (3) the legacy untagged backplane reachability (PTF bp ↔ cEOS ↔ DUT) is split into per-VRF VLAN sub-interfaces; and (4) bash <cmd> invocations on the cEOS run in the default Linux netns, while all data routes live in per-VRF ns-<vrf> namespaces.

Type of change

  • Testbed and Framework (improvement)

Back port request

  • 202311
  • 202405
  • 202411
  • 202505
  • 202511
  • 202512
  • 202605

Approach

Problem

  1. EosHost.eos_config() calls land in the default VRF on converged VMs → BGP config writes (shutdown, no_shutdown_bgp, neighbor tweaks) silently miss.
  2. EosHost.eos_command() / eos_config() calls reference per-logical interface names (e.g. Ethernet1) that don't exist on the shared converged VM.
  3. PTF backplane interface has no IP on the parent device in converged mode, so untagged backplane traffic (used by e.g. BGP monitor / link-flap tests) has no return path.
  4. ceos_topo_converger dropped disabled_host_interfaces from the converged topo, turning previously-admin-down ports into active ports and breaking buffer/QoS deployment.
  5. bash <cmd> invocations on the cEOS (e.g. snmp/test_snmp_loopback.py) run in the route-less default Linux netns on converged, so they must be re-scoped into the per-VRF ns-<vrf> namespace that carries the route to the DUT.
  6. Several per-role startup-config templates never grew a converged branch (t0-64-32-leaf, t2-leaf, t2-core, dpu-tor, t1-8-lag-spine, t1-8-lag-tor). On a converged topology they rendered only the prime's own config and silently dropped every merged sub-peer's interface IP and BGP neighbor — so the DUT's sessions to those sub-peers never established (pre-test BGP sanity failed), and the per-VRF Linux namespace ns-<vrf> was never created (so the netns-scoped snmpget failed with "Cannot open network namespace").
  7. The converger selected one prime per (role, ASN), which exploded the prime count on real multi-ASN fabrics (e.g. t1-lag → 17, t1-lag-vpp → 17, t2 → 49). Each converged cEOS reserves significant memory, so this exhausted testbed RAM and failed add-topo.
  8. Convergence also ran for non-cEOS neighbor deployments (SONiC-VS / cisco / csonic), which have no converged render path. Reshaping the topology there produced a DUT minigraph whose BGP neighbors the unconverged VS peers could not answer → "Not all bgp sessions are established".
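
The prime-count blow-up in item 7 comes down to the grouping key. The following is a minimal sketch, not the converger's actual code: the `PEERS` tuple layout and the `group_primes` helper are hypothetical, the two T0 peers use the ASNs quoted for the dpu topology, and the T2 peers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical peer records: (name, role, asn). The real converger reads a
# topology definition; this flat tuple layout is an assumption.
PEERS = [
    ("ARISTA01T0", "T0", 65200),
    ("ARISTA02T0", "T0", 65201),
    ("ARISTA01T2", "T2", 65100),  # invented peer
    ("ARISTA02T2", "T2", 65100),  # invented peer
]


def group_primes(peers, key_fn):
    """Group peers into converged primes: one prime per distinct key."""
    groups = defaultdict(list)
    for name, role, asn in peers:
        groups[key_fn(role, asn)].append(name)
    return dict(groups)


# Keying on (role, ASN) creates a prime for every distinct ASN in a role.
per_role_asn = group_primes(PEERS, lambda role, asn: (role, asn))
# Keying on role alone keeps one prime per role; each merged sub-peer
# then carries its own ASN inside its VRF.
per_role = group_primes(PEERS, lambda role, asn: role)

print(len(per_role_asn))  # 3
print(len(per_role))      # 2
```

With many distinct ASNs per role, the first key grows with the number of ASNs while the second stays at the number of roles.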

Solution

All changes are common-code and gated so non-converged / non-cEOS paths are byte-identical:

| File | Change | Gate |
| --- | --- | --- |
| tests/common/devices/eos.py | Wrap eos_config() / eos_command() to rewrite router bgp <asn> parents into router bgp <prime_asn> / vrf <vrf>, translate interface <name> tokens via intf_map, and rewrite bash <cmd> into bash sudo ip netns exec ns-<vrf> <cmd>. VRF-aware get_route(). | self.bgp_vrf / self.intf_map (both None on stock) |
| tests/conftest.py | Populate bgp_vrf, bgp_prime_asn, intf_map on each EosHost from multi_vrf_data in nbrhosts; only converge the topology when the neighbor type is cEOS so SONiC-VS / cisco / csonic runs keep historical behavior. | if multi_vrf_peer: / neighbor_type in (eos, ceos) |
| ansible/roles/eos/templates/ceos_converged.j2 (new) | Single shared converged startup-config, rendered as a pure function of the converger output (convergence_data) plus per-peer configuration, independent of base topo / role. | topo_is_multi_vrf |
| ansible/roles/eos/templates/{t0-64-32-leaf, t2-leaf, t2-core, dpu-tor, t1-8-lag-spine, t1-8-lag-tor}.j2 | Include ceos_converged.j2 when converged; the existing stock body is preserved unchanged in the {% else %} branch. | when: topo_is_multi_vrf |
| ansible/ceos_topo_converger.py | Select one prime per role (not per role+ASN). Each merged sub-peer is a VRF with its own local-as, so a single prime correctly serves sub-peers with different ASNs and the prime count stays minimal (e.g. dpu → 1, t1-lag/t1-lag-vpp/t2 → 2). Also preserve disabled_host_interfaces, and size max_fp_num to the converged front-panel count. | n/a (converger only runs in converged mode) |
| ansible/roles/eos/tasks/ceos_config.yml + ansible/roles/eos/templates/ceos_bp_compat.j2 (new) | Append a stock-compat backplane shim (untagged VLAN 1 on the trunk, Vlan1 SVI in the prime VRF carrying the host bp_interface IP, advertise that subnet via BGP). | when: topo_is_multi_vrf and configuration[hostname]['bp_interface'] is defined |
| ansible/roles/vm_set/library/vm_topology.py | Assign the legacy backplane IP (ptf_bp_ip[v6]_addr) to the PTF parent backplane in addition to the per-VRF sub-interfaces. | Only the if is_multi_vrf: branch of add_bp_port_with_vlans_to_docker() |
| ansible/testbed-cli.sh | Only converge the topology when deploying with -k ceos; non-cEOS vm types skip convergence. | if vm_type == ceos |
| ansible/vtestbed.yaml | Enable use_converged_peers: True on the virtual testbeds; convergence then applies only to their cEOS deployments via the gates above. | n/a |
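
As a rough illustration of the three rewrites listed for tests/common/devices/eos.py, the sketch below applies them to plain lists of strings. It is not the code in this PR: the function name `rewrite_for_converged`, its signature, the VRF name `VRF1`, and the whitespace-based token matching are all assumptions.

```python
import re


def rewrite_for_converged(lines, parents, bgp_vrf, prime_asn, intf_map):
    """Return (lines, parents) rewritten for a converged peer.

    Hypothetical helper. When bgp_vrf is None (stock topology) the
    inputs are returned unchanged.
    """
    if bgp_vrf is None:
        return list(lines), list(parents)
    intf_map = intf_map or {}

    def map_tokens(text):
        # Swap whole whitespace-delimited tokens only, so "Ethernet1"
        # never matches inside "Ethernet10".
        return " ".join(intf_map.get(tok, tok) for tok in text.split(" "))

    new_parents = []
    for parent in parents:
        if re.fullmatch(r"router bgp \d+", parent.strip()):
            # Enter the prime's BGP process, then the logical peer's VRF.
            new_parents += [f"router bgp {prime_asn}", f"vrf {bgp_vrf}"]
        else:
            new_parents.append(map_tokens(parent))

    new_lines = []
    for line in lines:
        if line.startswith("bash "):
            # Run the shell command inside the per-VRF Linux namespace.
            new_lines.append(f"bash sudo ip netns exec ns-{bgp_vrf} {line[5:]}")
        else:
            new_lines.append(map_tokens(line))
    return new_lines, new_parents


lines, parents = rewrite_for_converged(
    ["neighbor 10.0.0.57 shutdown"], ["router bgp 65200"],
    bgp_vrf="VRF1", prime_asn=65100, intf_map={"Ethernet1": "Ethernet5"},
)
print(parents)  # ['router bgp 65100', 'vrf VRF1']
```

Returning the inputs untouched when `bgp_vrf` is None mirrors the gate in the table: stock topologies never enter the rewrite path.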

Verification

Validated on virtual testbeds (cEOS neighbors), confirming the converged render and the gating:

  • dpu — converges to a single prime hosting both ToR VRFs with per-VRF local-as (ASN 65200 and 65201). add-topo + deploy-mg succeed; the DUT establishes both BGP sessions and exchanges routes; bgp/test_bgp_fact.py passes including the pre-test BGP sanity check that previously failed.
  • multi-asic (t1-8-lag) — the shared template now renders the per-VRF config, so the ns-<vrf> Linux namespaces are created on the prime (the missing namespace that previously made snmp/test_snmp_loopback.py fail with "Cannot open network namespace").
  • t1-lag / t1-lag-vpp / t2 — one-prime-per-role keeps the prime count low so add-topo no longer exhausts testbed memory.
  • Non-cEOS deployments (SONiC-VS / cisco / csonic) skip convergence entirely (gated in both testbed-cli.sh and conftest.py) and remain byte-identical to historical behavior.
  • Stock topology paths are unchanged when topo_is_multi_vrf / is_multi_vrf is false — verified by code inspection of the gating conditions and by Jinja-parsing every wrapped template.

Any platform specific information?

Converged peer model only; cEOS neighbors only. Non-cEOS neighbor types (SONiC-VS / cisco / csonic) are explicitly excluded from convergence at both the deploy (testbed-cli.sh) and test (conftest.py) layers, so they keep their existing behavior.
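
The exclusion described above amounts to a two-condition check. A minimal sketch, assuming the flag and neighbor type are available as plain values; the helper name `should_converge` and the non-cEOS type strings used in the examples are hypothetical, while the accepted values follow the gate quoted for tests/conftest.py.

```python
# Accepted neighbor types, taken from the gate "neighbor_type in (eos, ceos)".
CEOS_NEIGHBOR_TYPES = ("eos", "ceos")


def should_converge(use_converged_peers, neighbor_type):
    """Converge only when the testbed opts in AND the neighbors are cEOS."""
    return bool(use_converged_peers) and neighbor_type in CEOS_NEIGHBOR_TYPES


print(should_converge(True, "ceos"))    # True
print(should_converge(True, "csonic"))  # False
print(should_converge(False, "ceos"))   # False
```

Applying the same predicate at both the deploy and test layers keeps the DUT minigraph and the neighbor configuration in agreement about whether the topology was reshaped.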

@github-actions github-actions Bot requested review from r12f, sdszhang and wangxin May 29, 2026 03:09
@auspham auspham marked this pull request as ready for review May 29, 2026 03:53
@yijingyan2

Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@auspham auspham force-pushed the austinpham/37879355-enable-vrf-t0 branch from af8d2ad to b75adaa Compare May 29, 2026 07:48
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@github-actions github-actions Bot requested a review from xwjiang-ms May 29, 2026 07:48
@auspham auspham force-pushed the austinpham/37879355-enable-vrf-t0 branch from b75adaa to aa72861 Compare June 1, 2026 02:05
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@auspham auspham force-pushed the austinpham/37879355-enable-vrf-t0 branch from aa72861 to fd7ddbb Compare June 1, 2026 06:56
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@auspham

auspham commented Jun 1, 2026

Contributor Author

/azpw run

@mssonicbld

Collaborator

⚠️ Notice: /azpw run only runs failed jobs now. If you want to trigger a whole pipeline run, please rebase your branch or close and reopen the PR.
💡 Tip: You can also use /azpw retry to retry failed jobs directly.

Retrying failed (or canceled) jobs...

@mssonicbld

Collaborator

No Azure DevOps builds found for #24963.

@auspham auspham closed this Jun 1, 2026
@auspham auspham reopened this Jun 1, 2026
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@auspham auspham force-pushed the austinpham/37879355-enable-vrf-t0 branch from fd7ddbb to 981d518 Compare June 1, 2026 09:16
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@auspham auspham force-pushed the austinpham/37879355-enable-vrf-t0 branch from 981d518 to 1551602 Compare June 2, 2026 01:44
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@auspham

auspham commented Jun 2, 2026

Contributor Author

1st stage: all passed with only t0 multi_vrf enabled.
[image]

Will toggle it on for the other topologies.

@auspham auspham force-pushed the austinpham/37879355-enable-vrf-t0 branch from 1551602 to b5829c3 Compare June 2, 2026 07:45
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@auspham auspham force-pushed the austinpham/37879355-enable-vrf-t0 branch from 3706d50 to fda59e0 Compare June 3, 2026 03:37
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

auspham added 3 commits June 9, 2026 17:19
Signed-off-by: Austin (Ngoc Thang) Pham <austinpham@microsoft.com>
Fixes the framework gaps that broke the 6 PR test plans when
use_converged_peers=True is set on a vtestbed:

1. ansible/ceos_topo_converger.py
   - Key prime selection on (label, ASN) instead of label alone. This was
     the root cause of the dpu BGP sanity failure: ARISTA01T0 (asn 65200)
     and ARISTA02T0 (asn 65201) collapsed into a single prime running
     'router bgp 65200', leaving the DUT's second neighbor permanently
     stuck. Topologies whose peers under a role already share an ASN
     (kvm-t0, t1-8-lag) keep the same prime count.
   - Emit 'max_fp_num_provided' at the converged-topo root, sized to
     max(default, min(max_vlans_per_prime, CEOSLAB_INTF_LIMIT)).
     roles/vm_set/tasks/start.yml already honors this override via
     set_fact, so create_bridges + ceos_network now allocate enough
     br-<vm>-N bridges and FP veth pairs for the merged sub-peers. This
     unblocks the 'Too many vlans' prepare assertion for t0-64-32,
     t1-lag, t1-lag-vpp, t1-64-lag (multi-asic-t1) and t2. Only emitted
     when above the default 4, so stock-shape topologies (kvm-t0, dpu,
     t1-8-lag) stay byte-identical.

2. tests/common/devices/eos.py
   - EosHost.get_route: when the caller passes no explicit vrf= and
     self.bgp_vrf is populated (set by the nbrhosts fixture from
     multi_vrf_data), auto-scope the 'show ip bgp <prefix>' command to
     'vrf <bgp_vrf>' and surface the returned per-VRF block under
     vrfs/default. This fixes test_bgp_bbr.py on multi-asic-t1 (and the
     other legacy readers in tests/filterleaf and tests/vlan that
     hardcode vrfs/default) without touching the test cases. Stock topos
     have self.bgp_vrf = None and the path is byte-identical. Callers
     that pass an explicit vrf= (e.g. tests/bgp/bgp_helpers.py) are not
     affected.
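
The get_route behavior in item 2 can be sketched as two small steps: pick the effective VRF, then re-key the per-VRF block under vrfs/default for legacy readers. This is an illustrative sketch only; the helper names, the exact command form, and the sample output shape are assumptions rather than the code in this commit.

```python
def scope_route_query(prefix, bgp_vrf=None, vrf=None):
    """Build the show command; an explicit vrf wins over the auto-scoped bgp_vrf."""
    effective = vrf if vrf is not None else bgp_vrf
    cmd = f"show ip bgp {prefix}"
    if effective is not None:
        cmd += f" vrf {effective}"  # assumed placement of the vrf keyword
    return cmd, effective


def normalize_route_output(output, bgp_vrf=None, vrf=None):
    """Surface an auto-scoped per-VRF block under vrfs/default.

    Only applies when the caller passed no explicit vrf and bgp_vrf is set;
    otherwise the output is returned as-is.
    """
    vrfs = output.get("vrfs", {})
    if vrf is None and bgp_vrf is not None and bgp_vrf in vrfs:
        return dict(output, vrfs={"default": vrfs[bgp_vrf]})
    return output


# Hypothetical device output for a converged peer whose VRF is "VRF1".
raw = {"vrfs": {"VRF1": {"example": 1}}}
print(scope_route_query("10.0.0.0/24", bgp_vrf="VRF1")[0])
# show ip bgp 10.0.0.0/24 vrf VRF1
print(normalize_route_output(raw, bgp_vrf="VRF1"))
# {'vrfs': {'default': {'example': 1}}}
```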

Verified via offline sims on every PR-impacted topology (sim output
attached to the PR description). Per-topo prime counts:
  kvm-t0=1, t0-64-32=1, dpu=2, t1-8-lag=2, t1-lag=17, t1-lag-vpp=17,
  t1-64-lag=21, t2=49.

Known follow-up (not in this commit): several non-converged-aware
templates (dpu-tor.j2, dpu-1-tor.j2, t0-64-32-leaf.j2, t2-leaf.j2,
t2-core.j2) still render only the prime's own interface config and
silently drop merged sub-peers' Ethernet IPs. Per-ASN keying mitigates
the worst case but doesn't eliminate it. Track separately.

Signed-off-by: Austin (Ngoc Thang) Pham <austinpham@microsoft.com>
The converged (multi-VRF) peer model only worked on the per-role templates
that had grown a converged branch, and the converger created one prime VM per
(role, ASN). Several virtual topologies broke when use_converged_peers was set:

- Per-role templates without a converged branch (t0-64-32-leaf, t2-leaf,
  t2-core, dpu-tor, t1-8-lag-spine, t1-8-lag-tor) rendered only the prime's own
  config and silently dropped every merged sub-peer's interface IP and BGP
  neighbor. The DUT's sessions to those sub-peers never came up, and the
  per-VRF Linux namespace ns-<vrf> was never created, so the netns-scoped
  snmpget in snmp/test_snmp_loopback.py failed with "Cannot open network
  namespace".
- Keying primes on (role, ASN) exploded the prime count on multi-ASN fabrics
  (t1-lag 17, t1-lag-vpp 17, t2 49); each converged cEOS reserves significant
  memory, exhausting testbed RAM and failing add-topo.
- Convergence also ran for non-cEOS (SONiC-VS/cisco/csonic) deployments, which
  have no converged render path, producing a DUT minigraph the unconverged VS
  peers could not answer.

Fixes (common code only, no test-case changes):

- ansible/roles/eos/templates/ceos_converged.j2 (new): a single shared
  converged startup-config rendered purely from the converger output. The
  per-role templates that lacked a converged branch now include it under
  topo_is_multi_vrf; their stock body is unchanged for non-converged topologies.
- ansible/ceos_topo_converger.py: select one prime per role. Each merged
  sub-peer is a VRF with its own local-as, so a single prime serves sub-peers
  with different ASNs (dpu -> 1 prime; t1-lag/t1-lag-vpp/t2 -> 2 primes).
- ansible/testbed-cli.sh and tests/conftest.py: only converge when the neighbor
  type is cEOS, so SONiC-VS/cisco/csonic deployments keep their historical,
  byte-identical behavior.

Signed-off-by: Austin (Ngoc Thang) Pham <austinpham@microsoft.com>
@auspham auspham force-pushed the austinpham/37879355-enable-vrf-t0 branch from 9a8db84 to c74f2f8 Compare June 9, 2026 07:19
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@yijingyan2

Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
