chore: enable vrf converged topo#24963
Open
auspham wants to merge 3 commits into
Signed-off-by: Austin (Ngoc Thang) Pham <austinpham@microsoft.com>
Fixes the framework gaps that broke the 6 PR test plans when
use_converged_peers=True is set on a vtestbed:
1. ansible/ceos_topo_converger.py
- Key prime selection on (label, ASN) instead of label alone. This was
the root cause of the dpu BGP sanity failure: ARISTA01T0 (asn 65200)
and ARISTA02T0 (asn 65201) collapsed into a single prime running
'router bgp 65200', leaving the DUT's second neighbor permanently
stuck. Topologies whose peers under a role already share an ASN
(kvm-t0, t1-8-lag) keep the same prime count.
- Emit 'max_fp_num_provided' at the converged-topo root, sized to
max(default, min(max_vlans_per_prime, CEOSLAB_INTF_LIMIT)).
roles/vm_set/tasks/start.yml already honors this override via
set_fact, so create_bridges + ceos_network now allocate enough
br-<vm>-N bridges and FP veth pairs for the merged sub-peers. This
unblocks the 'Too many vlans' prepare assertion for t0-64-32,
t1-lag, t1-lag-vpp, t1-64-lag (multi-asic-t1) and t2. Only emitted
when above the default 4, so stock-shape topologies (kvm-t0, dpu,
t1-8-lag) stay byte-identical.
2. tests/common/devices/eos.py
- EosHost.get_route: when the caller passes no explicit vrf= and
self.bgp_vrf is populated (set by the nbrhosts fixture from
multi_vrf_data), auto-scope the 'show ip bgp <prefix>' command to
'vrf <bgp_vrf>' and surface the returned per-VRF block under
vrfs/default. This fixes test_bgp_bbr.py on multi-asic-t1 (and the
other legacy readers in tests/filterleaf and tests/vlan that
hardcode vrfs/default) without touching the test cases. Stock topos
have self.bgp_vrf = None and the path is byte-identical. Callers
that pass an explicit vrf= (e.g. tests/bgp/bgp_helpers.py) are not
affected.
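The get_route scoping described in item 2 can be sketched roughly as follows. This is a minimal illustration, not the code in tests/common/devices/eos.py: the helper names build_bgp_route_command and surface_under_default are hypothetical, the exact placement of the vrf keyword in the command is an assumption, and the output is assumed to be a dict keyed by vrfs/<name>.

```python
def build_bgp_route_command(prefix, vrf=None, bgp_vrf=None):
    """Return (command, effective_vrf) for a BGP route lookup.

    Hypothetical helper: an explicit vrf= always wins; otherwise fall back
    to bgp_vrf, which is None on stock (non-converged) topologies.
    """
    effective_vrf = vrf if vrf is not None else bgp_vrf
    cmd = "show ip bgp {}".format(prefix)
    if effective_vrf:
        # Assumed command form; the commit only says the command is
        # scoped to 'vrf <bgp_vrf>'.
        cmd += " vrf {}".format(effective_vrf)
    return cmd, effective_vrf


def surface_under_default(output, vrf=None, bgp_vrf=None):
    """Expose the per-VRF block under vrfs/default for legacy readers.

    Applies only to the auto-scoped case (no explicit vrf=, bgp_vrf set);
    every other call gets the output back untouched.
    """
    if vrf is not None or not bgp_vrf:
        return output
    vrfs = output.get("vrfs", {})
    if bgp_vrf not in vrfs:
        return output
    result = dict(output)
    result["vrfs"] = dict(vrfs)
    result["vrfs"]["default"] = vrfs[bgp_vrf]
    return result
```

With bgp_vrf unset, both helpers are pass-through, which mirrors the claim that stock topologies see no change; a caller that passes vrf= explicitly also gets the raw per-VRF output.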
Verified via offline sims on every PR-impacted topology (sim output
attached to the PR description). Per-topo prime counts:
kvm-t0=1, t0-64-32=1, dpu=2, t1-8-lag=2, t1-lag=17, t1-lag-vpp=17,
t1-64-lag=21, t2=49.
Known follow-up (not in this commit): several non-converged-aware
templates (dpu-tor.j2, dpu-1-tor.j2, t0-64-32-leaf.j2, t2-leaf.j2,
t2-core.j2) still render only the prime's own interface config and
silently drop merged sub-peers' Ethernet IPs. Per-ASN keying mitigates
the worst case but doesn't eliminate it. Track separately.
Signed-off-by: Austin (Ngoc Thang) Pham <austinpham@microsoft.com>
The converged (multi-VRF) peer model only worked on the per-role templates that had grown a converged branch, and the converger created one prime VM per (role, ASN). Several virtual topologies broke when use_converged_peers was set:
- Per-role templates without a converged branch (t0-64-32-leaf, t2-leaf, t2-core, dpu-tor, t1-8-lag-spine, t1-8-lag-tor) rendered only the prime's own config and silently dropped every merged sub-peer's interface IP and BGP neighbor. The DUT's sessions to those sub-peers never came up, and the per-VRF Linux namespace ns-<vrf> was never created, so the netns-scoped snmpget in snmp/test_snmp_loopback.py failed with "Cannot open network namespace".
- Keying primes on (role, ASN) exploded the prime count on multi-ASN fabrics (t1-lag 17, t1-lag-vpp 17, t2 49); each converged cEOS reserves significant memory, exhausting testbed RAM and failing add-topo.
- Convergence also ran for non-cEOS (SONiC-VS/cisco/csonic) deployments, which have no converged render path, producing a DUT minigraph the unconverged VS peers could not answer.
Fixes (common code only, no test-case changes):
- ansible/roles/eos/templates/ceos_converged.j2 (new): a single shared converged startup-config rendered purely from the converger output. The per-role templates that lacked a converged branch now include it under topo_is_multi_vrf; their stock body is unchanged for non-converged topologies.
- ansible/ceos_topo_converger.py: select one prime per role. Each merged sub-peer is a VRF with its own local-as, so a single prime serves sub-peers with different ASNs (dpu -> 1 prime; t1-lag/t1-lag-vpp/t2 -> 2 primes).
- ansible/testbed-cli.sh and tests/conftest.py: only converge when the neighbor type is cEOS, so SONiC-VS/cisco/csonic deployments keep their historical, byte-identical behavior.
Signed-off-by: Austin (Ngoc Thang) Pham <austinpham@microsoft.com>
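The effect of the keying choice on the prime count can be illustrated with a small grouping sketch. It is not the logic in ansible/ceos_topo_converger.py: select_primes is a hypothetical helper, and the role value "T0" is an assumption inferred from the peer names. The two peers and their ASNs are the dpu example from the earlier commit message.

```python
from collections import OrderedDict


def select_primes(peers, key_fields):
    """Group peers by the given key; each group maps to one prime VM.

    Hypothetical helper: returns an ordered mapping of key -> member names.
    """
    groups = OrderedDict()
    for peer in peers:
        key = tuple(peer[field] for field in key_fields)
        groups.setdefault(key, []).append(peer["name"])
    return groups


# dpu example: two ToR peers under the same role but with different ASNs.
dpu_peers = [
    {"name": "ARISTA01T0", "role": "T0", "asn": 65200},
    {"name": "ARISTA02T0", "role": "T0", "asn": 65201},
]

per_role_asn = select_primes(dpu_peers, ("role", "asn"))
per_role = select_primes(dpu_peers, ("role",))

print(len(per_role_asn), len(per_role))  # 2 1
```

Keying on role alone is only safe because each merged sub-peer is a VRF with its own local-as, as the commit message states; without that, the two peers would again collapse onto a single ASN.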

Description of PR
Enable the converged (multi-VRF) cEOS peer model across the virtual testbeds and add the framework changes needed so existing tests pass unchanged on every converged topology — one solution that fits all cEOS topologies, with no per-test-case edits.
In converged mode a single cEOS VM hosts every logical neighbor as a VRF under one BGP process, and each logical neighbor's interface is renamed on the shared VM. This breaks several assumptions baked into the framework: (1) EOS config writes target router bgp <asn> in the default VRF; (2) tests reference per-logical interface names from minigraph; (3) the legacy untagged backplane reachability (PTF bp ↔ cEOS ↔ DUT) is split into per-VRF VLAN sub-interfaces; and (4) bash <cmd> invocations on the cEOS run in the default Linux netns, while all data routes live in per-VRF ns-<vrf> namespaces.
Type of change
Back port request
Approach
Problem
- EosHost.eos_config() calls land in the default VRF on converged VMs → BGP config writes (shutdown, no_shutdown_bgp, neighbor tweaks) silently miss.
- EosHost.eos_command() / eos_config() calls reference per-logical interface names (e.g. Ethernet1) that don't exist on the shared converged VM.
- ceos_topo_converger dropped disabled_host_interfaces from the converged topo, turning previously admin-down ports into active ports and breaking buffer/QoS deployment.
- bash <cmd> invocations on the cEOS (e.g. snmp/test_snmp_loopback.py) run in the route-less default Linux netns on converged, so they must be re-scoped into the per-VRF ns-<vrf> namespace that carries the route to the DUT.
- Several per-role templates had no converged branch (t0-64-32-leaf, t2-leaf, t2-core, dpu-tor, t1-8-lag-spine, t1-8-lag-tor). On a converged topology they rendered only the prime's own config and silently dropped every merged sub-peer's interface IP and BGP neighbor, so the DUT's sessions to those sub-peers never established (pre-test BGP sanity failed), and the per-VRF Linux namespace ns-<vrf> was never created (so the netns-scoped snmpget failed with "Cannot open network namespace").
- The converger keyed primes on (role, ASN), which exploded the prime count on real multi-ASN fabrics (e.g. t1-lag → 17, t1-lag-vpp → 17, t2 → 49). Each converged cEOS reserves significant memory, so this exhausted testbed RAM and failed add-topo.
Solution
All changes are common-code and gated so non-converged / non-cEOS paths are byte-identical:
- tests/common/devices/eos.py: extend eos_config() / eos_command() to rewrite router bgp <asn> parents into router bgp <prime_asn> / vrf <vrf>, translate interface <name> tokens via intf_map, and rewrite bash <cmd> into bash sudo ip netns exec ns-<vrf> <cmd>. VRF-aware get_route(). Gate: self.bgp_vrf / self.intf_map (both None on stock topologies).
- tests/conftest.py: set bgp_vrf, bgp_prime_asn, intf_map on each EosHost from multi_vrf_data in nbrhosts; only converge the topology when the neighbor type is cEOS so SONiC-VS / cisco / csonic runs keep historical behavior. Gate: if multi_vrf_peer: / neighbor_type in (eos, ceos).
- ansible/roles/eos/templates/ceos_converged.j2 (new): a single shared converged startup-config rendered from the converger output (convergence_data) plus per-peer configuration, independent of base topo / role. Gate: topo_is_multi_vrf.
- ansible/roles/eos/templates/{t0-64-32-leaf, t2-leaf, t2-core, dpu-tor, t1-8-lag-spine, t1-8-lag-tor}.j2: include ceos_converged.j2 when converged; the existing stock body is preserved unchanged in the {% else %} branch. Gate: when: topo_is_multi_vrf.
- ansible/ceos_topo_converger.py: select one prime per role. Each merged sub-peer is a VRF with its own local-as, so a single prime correctly serves sub-peers with different ASNs and the prime count stays minimal (e.g. dpu → 1, t1-lag / t1-lag-vpp / t2 → 2). Also preserve disabled_host_interfaces, and size max_fp_num to the converged front-panel count.
- ansible/roles/eos/tasks/ceos_config.yml + ansible/roles/eos/templates/ceos_bp_compat.j2 (new): […] (Vlan1 SVI in the prime VRF carrying the host bp_interface IP, advertise that subnet via BGP). Gate: when: topo_is_multi_vrf and configuration[hostname]['bp_interface'] is defined.
- ansible/roles/vm_set/library/vm_topology.py: […] (ptf_bp_ip[v6]_addr) to the PTF parent backplane in addition to the per-VRF sub-interfaces. Gate: if is_multi_vrf: branch of add_bp_port_with_vlans_to_docker().
- ansible/testbed-cli.sh: converge only with -k ceos; non-cEOS vm types skip convergence. Gate: if vm_type == ceos.
- ansible/vtestbed.yaml: set use_converged_peers: True on the virtual testbeds; convergence then applies only to their cEOS deployments via the gates above.
Verification
Validated on virtual testbeds (cEOS neighbors), confirming the converged render and the gating:
- dpu: converges to a single prime hosting both ToR VRFs with per-VRF local-as (ASN 65200 and 65201). add-topo + deploy-mg succeed; the DUT establishes both BGP sessions and exchanges routes; bgp/test_bgp_fact.py passes, including the pre-test BGP sanity check that previously failed.
- multi-asic (t1-8-lag): the shared template now renders the per-VRF config, so the ns-<vrf> Linux namespaces are created on the prime (the missing namespace previously made snmp/test_snmp_loopback.py fail with "Cannot open network namespace").
- t1-lag / t1-lag-vpp / t2: one-prime-per-role keeps the prime count low so add-topo no longer exhausts testbed memory.
- Non-cEOS neighbor types (SONiC-VS / cisco / csonic) skip convergence (gated in testbed-cli.sh and conftest.py) and remain byte-identical to historical behavior.
- Non-converged paths are unchanged when topo_is_multi_vrf / is_multi_vrf is false, verified by code inspection of the gating conditions and by Jinja-parsing every wrapped template.
Any platform specific information?
Converged peer model only; cEOS neighbors only. Non-cEOS neighbor types (SONiC-VS / cisco / csonic) are explicitly excluded from convergence at both the deploy (testbed-cli.sh) and test (conftest.py) layers, so they keep their existing behavior.
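The command rewrites listed under Solution for tests/common/devices/eos.py can be sketched as below. This is an illustrative approximation, not the PR's implementation: rewrite_for_converged is a hypothetical function, the regexes are simplified to whole-line matches, and the ASN, VRF name and interface mapping used in the usage lines are made-up values.

```python
import re


def rewrite_for_converged(line, bgp_vrf=None, prime_asn=None, intf_map=None):
    """Rewrite one EOS command line for a converged (multi-VRF) peer.

    Hypothetical helper. Returns a list of lines, because a BGP parent
    expands into two lines. With bgp_vrf unset (stock topology) the line
    is returned unchanged.
    """
    if not bgp_vrf:
        return [line]

    stripped = line.strip()

    # router bgp <asn>  ->  router bgp <prime_asn> / vrf <vrf>
    if prime_asn is not None and re.match(r"^router bgp \d+$", stripped):
        return ["router bgp {}".format(prime_asn), "vrf {}".format(bgp_vrf)]

    # interface <name>  ->  interface <mapped name>, when a mapping exists
    match = re.match(r"^interface (\S+)$", stripped)
    if match and intf_map:
        name = match.group(1)
        return ["interface {}".format(intf_map.get(name, name))]

    # bash <cmd>  ->  bash sudo ip netns exec ns-<vrf> <cmd>
    match = re.match(r"^bash (.+)$", stripped)
    if match:
        return ["bash sudo ip netns exec ns-{} {}".format(bgp_vrf, match.group(1))]

    return [line]


# Usage with illustrative values only.
print(rewrite_for_converged("router bgp 65201", bgp_vrf="ARISTA02T0", prime_asn=65200))
print(rewrite_for_converged("interface Ethernet1", bgp_vrf="ARISTA02T0",
                            intf_map={"Ethernet1": "Ethernet2"}))
print(rewrite_for_converged("bash ip route", bgp_vrf="ARISTA02T0"))
```

Returning the input untouched when bgp_vrf is unset is what keeps the stock path identical, matching the gate described for this file.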