
Releases: containers/nri-plugins

v0.10.1

19 Aug 10:02

This is a new minor release of NRI Reference Plugins. It updates dependencies, enables RDT 'discovery mode', brings a few documentation improvements, and fixes a startup failure on machines with an asymmetric NUMA distance matrix.

What's Changed

  • build(deps): bump golang.org/x/oauth2 from 0.21.0 to 0.27.0 by @dependabot[bot] in #548
  • docs: update documentation for RDT monitoring/metrics. by @klihub in #549
  • rdt,resmgr: allow running RDT in discovery mode. by @klihub in #551
  • docs: update balloons debugging guidance and add examples by @askervin in #552
  • go.mod,Makefile: bump golang to latest 1.24.x. by @klihub in #555
  • sysfs: patch up asymmetric NUMA distances. by @klihub in #554

Full Changelog: v0.10.0...v0.10.1

v0.10.0

02 Jul 14:30

This new release of NRI Reference Plugins brings a new NRI plugin, new features in the resource policy plugins, a number of bug fixes, new end-to-end tests, and a few new use cases in the documentation.

What's New

Balloons Policy

  • Composite balloons enable allocating a diverse set of CPUs for containers with complex CPU requirements, for example "allocate an equal number of CPUs from both NUMA nodes on CPU socket 0". This enables efficient parallelism inside an AI inference engine container that runs inference on CPU, while still isolating inference engines from each other.

    balloonTypes:
    - name: balance-pkg0-nodes
      components:
      - balloonType: node0
      - balloonType: node1
    - name: node0
      preferCloseToDevices:
      - /sys/devices/system/node/node0
    - name: node1
      preferCloseToDevices:
      - /sys/devices/system/node/node1
  • Documentation includes recipes for preventing creation of certain containers on a worker node, and for resetting the CPU and memory pinning of all containers in a cluster.

Topology Aware Policy

  • Pick CPU and Memory by Topology Hints

    Normally, topology hints are only used to pick the assigned pool for a workload. Once a pool is selected, the available resources within the pool are considered equally good for satisfying the topology hints. When the policy allocates exclusive CPUs and picks pinned memory for the workload, only other potential criteria and attributes are considered when picking the individual resources.

    When multiple devices are allocated to a single container, this default assumption of all resources within the pool being topologically equal may not hold: the container may be allocated misaligned devices, that is, devices with different memory or CPU locality. To overcome this, containers can now be annotated to prefer hint-based selection and pinning of CPU and memory resources using the pick-resources-by-hints.resource-policy.nri.io annotation. For example,

    apiVersion: v1
    kind: Pod
    metadata:
      name: data-pump
      annotations:
        k8s.v1.cni.cncf.io/networks: sriov-net1
        prefer-isolated-cpus.resource-policy.nri.io/container.ctr0: "true"
        pick-resources-by-hints.resource-policy.nri.io/container.ctr0: "true"
    spec:
      containers:
      - name: ctr0
        image: dpdk-pump
        imagePullPolicy: Always
        resources:
          requests:
            cpu: 2
            memory: 100M
            vendor.com/sriov_netdevice_A: '1'
            vendor.com/sriov_netdevice_B: '1'
          limits:
            vendor.com/sriov_netdevice_A: '1'
            vendor.com/sriov_netdevice_B: '1'
            cpu: 2
            memory: 100M

    When annotated like that, the policy will try to pick one exclusive isolated CPU with locality to one device and another with locality to the other. It will also try to pick and pin to memory aligned with these devices.

Common Policy Improvements

These are improvements to common infrastructure and as such are available for the balloons and topology-aware policy plugins, as well as for the wireframe template policy plugin.

  • Cache Allocation

    Plugins can be configured to exercise class-based control over the L2 and L3 cache allocated to containers' processes. In practice, containers are assigned to classes, and each class has a corresponding cache allocation configuration. This configuration is applied to all containers in the class and subsequently to all processes started in those containers. To enable cache control, use the control.rdt.enable option, which defaults to false.

    Plugins can be configured to assign containers by default to a cache class named after the Pod QoS class of the container: one of BestEffort, Burstable, and Guaranteed. The configuration setting controlling this behavior is control.rdt.usePodQoSAsDefaultClass and it defaults to false.

    Additionally, containers can be explicitly annotated to be assigned to a class using the rdtclass.resource-policy.nri.io annotation key. For instance:

    apiVersion: v1
    kind: Pod
    metadata:
      name: test-pod
      annotations:
        rdtclass.resource-policy.nri.io/pod: poddefaultclass
        rdtclass.resource-policy.nri.io/container.special-container: specialclass
    ...

    This will assign the container named special-container within the pod to the specialclass RDT class and any other container within the pod to the poddefaultclass RDT class. Effectively these containers' processes will be assigned to the RDT CLOSes corresponding to those classes.

    Cache Class/Partitioning Configuration

    RDT configuration is supplied as part of the control.rdt configuration block. Here is a sample snippet, as a Helm chart value, which assigns 33%, 66%, and 100% of cache lines to BestEffort, Burstable, and Guaranteed Pod QoS class containers, respectively:

    config:
      control:
        rdt:
          enable: true
          usePodQoSAsDefaultClass: true
          options:
            l2:
              optional: true
            l3:
              optional: true
            mb:
              optional: true
          partitions:
            fullCache:
              l2Allocation:
                all:
                  unified: 100%
              l3Allocation:
                all:
                  unified: 100%
              classes:
                BestEffort:
                  l2Allocation:
                    all:
                      unified: 33%
                  l3Allocation:
                    all:
                      unified: 33%
                Burstable:
                  l2Allocation:
                    all:
                      unified: 66%
                  l3Allocation:
                    all:
                      unified: 66%
                Guaranteed:
                  l2Allocation:
                    all:
                      unified: 100%
                  l3Allocation:
                    all:
                      unified: 100%

    Cache Allocation Prerequisites

    Note that for cache allocation control to work, you must have

    • a hardware platform that supports cache allocation
    • the resctrlfs pseudo-filesystem enabled in your kernel, and loaded if it is built as a module
    • the resctrlfs filesystem mounted (possibly with extra options for your platform)

New plugin: nri-memory-policy

  • The NRI memory policy plugin sets Linux memory policy for new containers.
  • For instance, the plugin can advise the kernel to interleave memory pages of a container across all NUMA nodes in the system, or across all NUMA nodes attached to the socket where the container's allowed CPUs are located.
  • The plugin works both stand-alone and together with NRI resource policy plugins and Kubernetes resource managers. It recognizes CPU and memory pinning set by resource management components. The memory policy plugin should be placed after the resource policy plugins in the NRI plugin chain.
  • Memory policy for a container is defined in pod annotations.
  • At the time of this release, the latest released containerd and CRI-O do not support NRI Linux memory policy adjustments, nor the NRI container command line adjustments that can serve as a workaround. Using this plugin requires a container runtime built with an NRI version that includes command line adjustments (NRI version > 0.9.0).
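Since the plugin is configured through pod annotations, here is a purely hypothetical sketch of what such an annotation could look like. The annotation key and value below are illustrative assumptions, not the plugin's confirmed API; consult the plugin's documentation for the actual keys:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: interleaved-workload
  annotations:
    # hypothetical key and value, for illustration only
    memory-policy.nri.io/container.ctr0: interleave
spec:
  containers:
  - name: ctr0
    image: my-app    # hypothetical image
```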

What's Changed

  • resmgr,config: allow configuring cache allocation via goresctrl. by @klihub in #541
  • resmgr: expose RDT metrics. by @klihub in #543
  • Balloons with components by @askervin in #526
  • topology-aware: try picking resources by hints first by @klihub in #545
  • memory-policy: NRI plugin for setting memory policy by @askervin in #517
  • mempolicy: go interface for set_mempolicy and get_mempolicy syscalls by @askervin in #514
  • mpolset: get/set memory policy and exec a command by @askervin in #515
  • topology-aware: fix format of container-exported memsets. by @klihub in #532
  • resmgr: update container-exported resource data. by @klihub in #537
  • sysfs: add a helper for gathering whatever IDs related to CPUs by @askervin in #513
  • sysfs: fix CPU.GetCaches() to not return empty slice. by @klihub in #533
  • sysfs: export CPUFreq.{Min,Max}. by @klihub in #534
  • helm: add Chart for memory-policy deployment by @askervin in #519
  • go.{mod,sum}: use new goresctrl tag v0.9.0. by @klihub in #544
  • Drop tools.go in favor of native tool directive support in go 1.24 by @fmuyassarov in #535
  • golang: bump go version to 1.24[.3]. by @klihub in #528

Full Changelog: v0.9.4...v0.10.0

v0.9.4

14 Apr 07:43

This is a new minor release of NRI Reference Plugins. It fixes incorrect caching of Pod Resource API query results, which in some cases could result in incorrectly generated topology hints.

What's Changed

  • resmgr: purge cached pod resource list upon pod stop/removal. by @klihub in #507
  • github: explicitly ensure contents-only copying by @fmuyassarov in #508

Full Changelog: v0.9.3...v0.9.4

v0.9.3

02 Apr 11:28

This is a new minor release of NRI Reference Plugins. It brings several new features, a number of bug fixes, new end-to-end tests, and improved test coverage.

What's New

Balloons Policy

  • Cluster-level visibility into CPU affinity. The configuration option agent.nodeResourceTopology: true enables observing balloons as zones in NodeResourceTopology custom resources. Furthermore, if showContainersInNrt: true is set, information on each container, including its CPU affinity, is shown as a subzone of its balloon.

    Example configuration:

    showContainersInNrt: true
    agent:
      nodeResourceTopology: true

    This enables listing balloons and their cpusets on node K8SNODE with

    kubectl get noderesourcetopology K8SNODE -o json | jq '.zones[] | select(.type=="balloon") | {"balloon":.name, "cpuset":(.attributes[]|select(.name=="cpuset").value)}'

    and containers with their cpusets on the same node:

    kubectl get noderesourcetopology K8SNODE -o json | jq '.zones[] | select(.type=="allocation for container") | {"container":.name, "cpuset":(.attributes[]|select(.name=="cpuset").value)}'
  • System load balancing. Even if two containers run on disjoint sets of logical CPUs, they may nevertheless affect each other's performance. This happens, for instance, if two memory-intensive containers share the same level 2 cache, or if they are compute-intensive, use the same compute resources of a physical CPU core, and run on two hyperthreads of the same core.

    The new system load balancing in the balloons policy is based on classifying the loads generated by containers using the new loadClasses configuration option. Based on the load classes associated with balloonTypes via loads, the policy allocates CPUs to new and existing balloons so that it avoids overloading level 2 caches or physical CPU cores.

    Example: the policy prefers selecting CPUs for all "inference-engine" and "computational-fluid-dynamics" balloons within separate level 2 cache blocks, to prevent cache thrashing by any two containers in these balloons.

    balloonTypes:
    - name: inference-engine
      loads:
      - memory-intensive
      ...
    - name: computational-fluid-dynamics
      loads:
      - memory-intensive
    ...
    loadClasses:
    - name: memory-intensive
      level: l2cache

Topology Aware Policy

  • Improved topology hint control: the topologyhints.resource-policy.nri.io annotation key can be used to enable or disable topology hint generation for one or more containers altogether, or selectively for the mounts, devices, and pod resources hint types.
    For example:
metadata:
  annotations:
    # disable topology hint generation for all containers by default
    topologyhints.resource-policy.nri.io/pod: none
    # disable other than mount-based hints for the 'diskwriter' container
    topologyhints.resource-policy.nri.io/container.diskwriter: mounts
    # disable other than device-based hints for the 'videoencoder' container
    topologyhints.resource-policy.nri.io/container.videoencoder: devices
    # disable other than pod resource-based hints for the 'dpdk' container
    topologyhints.resource-policy.nri.io/container.dpdk: pod-resources
    # enable device and pod resource-based hints for 'networkpump' container
    topologyhints.resource-policy.nri.io/container.networkpump: devices,pod-resources

It is also possible to enable and disable topology hint generation based on mount or device path, using allow and deny lists. See the updated documentation for more details.

  • relaxed system topology restrictions: the policy no longer refuses to start up if a NUMA node is shared by more than one pool at the same topology hierarchy level. In particular, a single NUMA node shared by all sockets no longer prevents startup.

  • improved Burstable QoS class container handling: the policy now allocates memory to burstable QoS class containers based on memory request estimates. This should lower the probability for unexpected allocation failures when burstable containers are used to allocate a node close to full capacity.

  • better global shared allocation preference: a preferSharedCPUs: true global configuration option now applies to all containers, unless they are annotated to opt out using the prefer-shared-cpus.resource-policy.nri.io annotation.
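The global option and the per-container opt-out described above could be combined roughly as follows; the option and annotation names come from the text, while the container name lowlatency is made up for illustration:

```yaml
# Global topology-aware policy configuration: prefer shared CPUs for all containers
preferSharedCPUs: true
---
# Per-pod opt-out for one container (container name is hypothetical)
metadata:
  annotations:
    prefer-shared-cpus.resource-policy.nri.io/container.lowlatency: "false"
```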

Common Policy Improvements

  • Container cache and memory bandwidth allocation enables class-based management of system L2 and L3 cache and memory bandwidth. These are modeled as class-based, uncountable, shareable resources. Containers can be assigned to predefined classes of service (CLOSes), or RDT classes for short. Each class defines a specific configuration for cache and memory bandwidth allocation, which is applied to all containers within that class. The assigned container class is resolved and mapped to a CLOS in the runtime using the goresctrl library. RDT control must be enabled in the runtime and the assigned classes must be defined in the runtime configuration; otherwise the runtime might fail to create containers that are assigned to an RDT class. Refer to the containerd, cri-o, and goresctrl documentation for more details about configuration.

A container can be assigned either to an RDT class matching its pod's QoS class (BestEffort, Burstable or Guaranteed), or it can be assigned to an arbitrary class using the rdtclass.resource-policy.nri.io annotation. To enable QoS-class based default assignment you can use a configuration fragment similar to this:

apiVersion: config.nri/v1alpha1
kind: TopologyAwarePolicy # or 'BalloonsPolicy' for the 'balloons' policy
metadata:
  name: default
spec:
  ...
  control:
    rdt:
      enable: true
      usePodQoSAsDefaultClass: true

RDT class assignment is also possible using annotations. For instance, the following assigns the packetpump container to the highprio class and the scheduler container to the midprio class; any other container in the pod is assigned to the class matching its pod's QoS class:

metadata:
  annotations:
    rdtclass.resource-policy.nri.io/container.packetpump: highprio
    rdtclass.resource-policy.nri.io/container.scheduler: midprio
  • Container block I/O prioritization allows class-based control of block I/O prioritization and throttling. Containers can be assigned to predefined block I/O classes. Each class defines a specific configuration of prioritization and throttling parameters, which is applied to all containers assigned to the class. The assigned container class is resolved and mapped to actual parameters in the runtime using the goresctrl library. Block I/O control must be enabled in the runtime and the classes must be defined in the runtime configuration; otherwise the runtime fails to create containers that are assigned to a block I/O class. Refer to the containerd, cri-o, and goresctrl documentation for more details about configuration.

A container can be assigned either to a Block I/O class matching its pod's QoS class (BestEffort, Burstable or Guaranteed), or it can be assigned to an arbitrary class using the blockioclass.resource-policy.nri.io annotation. To enable QoS-class based default assignment you can use a configuration fragment similar to this:

apiVersion: config.nri/v1alpha1
kind: TopologyAwarePolicy
metadata:
  name: default
spec:
  ...
  control:
    blockio:
      enable: true
      usePodQoSAsDefaultClass: true

Class assignment is also possible using annotations. For instance, the following assigns the database container to the highprio class and the logger container to the lowprio class; any other container in the pod is assigned to the class matching its pod's QoS class:

metadata:
  annotations:
    blockioclass.resource-policy.nri.io/container.database: highprio
    blockioclass.resource-policy.nri.io/container.logger: lowprio

What's Changed

  • balloons: do not require minFreq and maxFreq in CPU classes by @askervin in #455
  • balloons: expose balloons and optionally containers with affinity in NRT by @askervin in #469
  • balloons: introduce loadClasses for avoiding unwanted overloading in critical locations by @askervin in #493
  • topology-aware: exclude isolated CPUs from policy-picked reserved cpusets. by @klihub in #474
  • topology-aware: rework building the topology pool tree. by @klihub in #477
  • topology-aware: allocate burstable container memory by requests. by @klihub in #491
  • topology-aware: better semantics for globally configured shared CPU preference. by @klihub in #498
  • topology-aware: more consistent setup error handling. by @klihub in #502
  • memtierd: allow overriding go version for image build. by @klihub in #456
  • resmgr: improve annotated topology hint control. by @klihub in #499
  • resmgr: eliminate extra container state 'overlay'. by @klihub in https://github.com/containers/nr...

v0.8.0

18 Dec 08:15

This is a new major release of NRI Reference Plugins. It brings several new features, a number of bug fixes, improvements to the build system, to CI, end-to-end tests, and test coverage.

What's New

Balloons Policy

  • New preserve policy option enables matching containers whose CPU
    and memory affinity must not be modified by the resource policy.

    This allows selected containers to access all CPUs and
    memories. For example, to allow pcm-sensor-server
    to access MSRs on every CPU for low-level metrics:

    preserve:
      matchExpressions:
        - key: pod/labels/app.kubernetes.io/name
          operator: In
          values:
            - pcm-sensor-server
    

    Earlier this required cpu.preserve.resource-policy.nri.io and
    memory.preserve.resource-policy.nri.io pod annotations.

  • New freqGovernor CPU class option enables setting CPU frequency
    governor based on the CPU class of a balloon. Example:

    balloonTypes:
    - name: powersaving
      cpuClass: mypowersave
    control:
      cpu:
        classes:
          mypowersave:
            freqGovernor: powersave
    
  • New memoryTypes balloon type option specifies required memory
    types when setting memory affinity. For example, containers in
    high-memory-bandwidth balloons will use only HBM when configured as:

    balloonTypes:
    - name: high-memory-bandwidth
      memoryTypes:
      - HBM
    
  • Support memory-type.resource-policy.nri.io pod annotation for
    setting memory affinity into closest HBM, DRAM, PMEM, or any
    combination. This annotation is a pod level override to the
    memoryTypes balloon type option.
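Following the container-scoped annotation format used elsewhere in these notes, such an override might look like this (the container name ctr0 is hypothetical):

```yaml
metadata:
  annotations:
    # pin memory of container ctr0 to the closest HBM and DRAM nodes
    memory-type.resource-policy.nri.io/container.ctr0: hbm,dram
```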

  • L2-cache group aware CPU allocation and sharing. For example,
    containers in a balloon can be allowed to burst on idle
    (unallocated) CPUs that share the same L2 cache as CPUs allocated to
    the balloon.

    balloonTypes:
    - name: l2burst
      shareIdleCPUsInSame: l2cache
    
  • The pinMemory policy option can now be overridden at the balloon
    type level. This enables setting memory affinity of containers only
    in certain balloons while leaving others unset, and vice versa. Example:

    pinMemory: false
    balloonTypes:
    - name: latency-sensitive
      pinMemory: true
      preferIsolCpus: true
      preferNewBalloons: true
    
  • New default configuration runs Guaranteed containers on dedicated
    CPUs while BestEffort and Burstable containers are allowed to share
    remaining CPUs on the same socket, but not cross socket boundaries.

  • Balance BestEffort containers between balloons with an equal amount
    of available resources.

  • Smaller risk of OOMs with pinMemory: true, as memory affinity was
    refactored to use the smarter libmem library.

Topology Aware Policy

The Topology Aware policy can now export Prometheus metrics per topology zone. Exported metrics include the pool CPU set and memory set; the shared CPU subpool's total capacity, allocations, and available capacity; the memory total capacity, allocations, and available amount; and the number of assigned containers and containers in the shared subpool.

To enable exporting these metrics, make sure that you are running with the latest policy configuration custom resource definition and you have policy included in the spec/instrumentation/metrics/enabled slice, like this:

...
spec:
...
  instrumentation:
  ...
    metrics:
      enabled:
      - policy
...

The Topology Aware policy can now use data from the kubelet's Pod Resource API to generate extra topology hints for resource allocation and alignment. These hints are disabled in the default configuration installed by Helm charts. To enable them, make sure that you are running with the latest policy configuration custom resource definition and you have spec/agent/podResourceAPI set to true in the configuration, like this:

spec:
  agent:
    ...
    podResourceAPI: true
...
  • Support memory-type.resource-policy.nri.io pod annotation for
    setting memory affinity into closest HBM, DRAM or PMEM, or any
    combination.

What's Changed

Balloons Policy Fixes and Improvements

  • balloons: add "preserve" option to match containers whose pinning must not be modified by @askervin in #368
  • balloons: add support for cpu frequency governor tuning by @fmuyassarov in #374
  • balloons: set frequency scaling governor only when requested by @fmuyassarov in #379
  • balloons: improve handling of containers with no CPU requests by @askervin in #386
  • balloons: add debug logging to selecting a balloon type by @askervin in #396
  • balloons: support for L2 cache cluster allocation by @askervin in #384
  • balloons: add memoryTypes to balloon types by @askervin in #395
  • Add balloon type specific pinMemory option by @askervin in #451

Topology Aware Policy Fixes and Improvements

  • metrics: add topology-aware policy metrics collection. by @klihub in #406
  • topology-aware: correctly reconfigure implicit affinities for configuration changes. by @klihub in #394
  • fixes: copy assigned memory zone in grant clone. by @klihub in #413

New Policy Agnostic Metrics, Common De Facto Exporters

  • metrics: cleanup metrics registration, collection and gathering. by @klihub in #403
  • metrics: add de-facto standard collectors. by @klihub in #404
  • metrics: simplify policy/backend metrics collection interface. by @klihub in #408
  • metrics: add policy system collector. by @klihub in #405

Topology Hints Based on Pod Resource API

  • podresapi: agent,config,helm: make agent runtime configurable. by @klihub in #418
  • podresapi: resmgr,agent: generate topology hints from Pod Resource API. by @klihub in #419
  • podresapi: topology-aware: use Pod Resource API hints if present. by @klihub in #420
  • agent,resmgr: merge PodResources{List,Map}, cache last List() result. by @klihub in #423

Common Resource Management Fixes and Improvements

  • resmgr: fix "qosclass" in policy expressions by @askervin in #387
  • resmgr,agent: propagate startup config error back to CR. by @klihub in #416
  • libmem: implement policy-agnostic memory allocation/accounting. by @klihub in #332
  • libmem: typo and thinko fixes. by @klihub in #381
  • sysfs: enable faking CPU cache configurations using OVERRIDE_SYS_CACHES by @askervin in #383
  • cpuallocator, plugins: handle priority as an option. by @klihub in #414
  • Fix typos in expression code doc and matchExpression yamls by @askervin in #370

Helm Chart and Configuration Fixes and Improvements

  • helm: enable prometheus autodiscovery by @klihub in #393
  • helm: new balloons default configuration by @askervin in #391
  • apis/config: use consistent assignment in +kubebuilder:validation tags. by @klihub in #397
  • sample-configs: fix a copy-pasted comment thinko. by @klihub in #402

End-to-end Testing Fixes and Improvements

  • e2e: pull and save runtime logs after each test. by @klihub in #367
  • e2e: adjust metrics test for updated PrettyName(). by @klihub in #366
  • e2e: switch default test distro to fedora/40-cloud-base. by @klihub in #375
  • e2e: fix provisioning for Ubuntu cloud image. by @klihub in #377
  • e2e: enable vagrant debugging. by @klihub in #376
  • e2e: adjust $VM_HOSTNAME for policy node config usage. by @klihub in #378
  • e2e: skip long running tests by default. by @klihub in #373
  • e2e: fix command filenames in test output directories by @askervin in #390
  • e2e: containerd 2.0.0. provisioning fixup. by @klihub in #400
  • e2e/balloons: remove unknown/unused helm-launch argument. by @klihub in #407

Build Environment Fixes and Improvements

  • build: enable building debug binaries and images by @askervin in #388
  • build: update controller-tools to v0.16.5. by @klihub in #398
  • build: enable race-detector in DEBUG=1 builds. by @klihub in #409
  • build: enable race-detector in image build, too. by @klihub in #410
  • d...

v0.7.1

23 Sep 08:21

This release of NRI Reference Plugins brings new features, a few bug fixes, and updates to the documentation.

Highlights

  • balloons policy now supports assigning kernel-isolated CPU cores to balloons when available. To prefer isolated CPU cores for a balloon, use the new preferIsolCpus boolean configuration option. For instance,
balloonTypes:
  - name: high-prio-physical-core
    minCPUs: 2
    maxCPUs: 2
    preferNewBalloons: true
    preferIsolCpus: true
    hideHyperthreads: true
...
  • balloons policy now supports assigning performance optimized or energy efficient CPU cores to balloons when available. For instance, to define a balloon with energy efficient core preference and another one with performance core preference use the new preferCoreType configuration option like this:
balloonTypes:
  - name: low-prio
    namespaces:
      - logging
      - monitoring
    preferCoreType: efficient
...
  - name: high-prio
    preferCoreType: performance
...
  • Topology-aware policy now allocates CPU cores in clusters sharing the last-level cache. Whenever this yields a different grouping than the rest of the topology, for instance with hyperthreads, the CPU allocator now divides cores into groups defined by shared last-level cache. The topology-aware policy tries to allocate as few LLC groups to a container as possible, and tries to avoid sharing an LLC group among multiple containers.

What's New

  • balloons: add support for isolated cpus. by @fmuyassarov in #344
  • balloons: add support for power efficient & high performance cores by @fmuyassarov in #354
  • cpuallocator: implement clustered allocation based on cache groups. by @klihub in #343

What Changed

Resource assignment policies should now try harder to detect when a new container is a restarted instance of an existing container which has just exited or crashed. This should fix problems where a crashing container could not be restarted on a nearly fully allocated node.

  • deps: bump NRT dependencies to v0.1.2. by @fmuyassarov in #348
  • topology-aware: add missing SingleThreadForCPUs() to mockSysfs. by @klihub in #349
  • balloons: add support for isolated cpus. by @fmuyassarov in #344
  • cpuallocator: implement clustered allocation based on cache groups. by @klihub in #343
  • fixes: fix host-wait-vm-ssh-server, improve vm-reboot. by @klihub in #350
  • fix: clean up plugin at the beginning/end of tests. by @klihub in #351
  • doc: add availableResources in the balloons policy documentation by @askervin in #355
  • build: allow building a single plugin image. by @klihub in #357
  • balloons: add support for power efficient & high performance cores by @fmuyassarov in #354
  • e2e: fix cni_plugin=bridge in provisioning a vm by @askervin in #359
  • e2e: bridge CNI setup fixes for Fedora/containerd. by @klihub in #361
  • e2e: use bridge CNI plugin by default. by @klihub in #362
  • CI: verify in smaller steps, verify binary builds. by @klihub in #364
  • resmgr: lifecycle overlap detection and workaround. by @klihub in #358

Full Changelog: v0.7.0...v0.7.1

v0.7.0

03 Jul 07:30

This release of NRI Reference Plugins brings in new features and important bug fixes.

Highlights

  • Topology-aware and balloons resource policies now support soft-disabling of hyperthreads per container. This improves the performance of some classes of workloads. Both policies support a new pod annotation:
    hide-hyperthreads.resource-policy.nri.io/container.<CONTAINER-NAME>: "true"
    
    and the balloons policy has a new balloon-type option, hideHyperthreads, that soft-disables hyperthreads for all containers assigned to a balloon of this type.
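A minimal balloon-type sketch using this option might look like the following; the balloon name is illustrative:

```yaml
balloonTypes:
- name: exclusive-cores
  hideHyperthreads: true
```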
  • The topology-aware policy supports pinning containers to high-bandwidth memory (HBM), or both HBM and DRAM, when pods are annotated with
    memory-type.resource-policy.nri.io/container.<CONTAINER-NAME>: hbm
    memory-type.resource-policy.nri.io/container.<CONTAINER-NAME>: hbm,dram
    
  • Automatic hardware topology hint generation has been fixed in the topology-aware policy. For instance, if a container uses a PCI device, the policy prefers pinning the container to CPUs and memory that are close to the device.

What's New

  • balloons: hideHyperthreads balloon type option and annotation by @askervin in #338
  • topology-aware: add support for hide-hyperthreads annotation. by @askervin in #331

What Changed

  • topology-aware: don't ignore HBM memory nodes without close CPUs. by @klihub in #329
  • topology-aware: relax NUMA node topology checks. by @klihub in #336
  • resmgr: exit when ttrpc connection goes down. by @klihub in #319
  • cpuallocator: don't filter based on single CoreKind. by @klihub in #345
  • sysfs,cpuallocator: fix CPU cluster discovery. by @klihub in #337
  • sysfs: survive NUMA nodes without memory. by @klihub in #339
  • sysfs: allow non-uniform thread count. by @klihub in #340
  • helm: flip podPriorityClassNodeCritical to true. by @klihub in #312
  • config-manager: allow configuring NRI timeouts. by @klihub in #318

New Contributors

Full Changelog: v0.5.0...v0.7.0

v0.5.1

29 Mar 15:54

This release of the NRI Reference Plugins brings a few improvements to hardware topology detection and resource assignment.

What's New

  • cpuallocator: topology discovery fixes and improvements. by @klihub in #206
  • cpuallocator: add support for hybrid core discovery, preferred allocation. by @klihub in #295
  • topology-aware: configurable allocation priority by @klihub in #282
  • resmgr: enable opentelemetry tracing (span propagation) over the NRI ttrpc connection. by @klihub in #293

Updates, Fixes, and Other Improvements

  • sysfs: dump system discovery results in a more predictable order. by @klihub in #294
  • github: package and publish interim unstable Helm charts from the main and release branches by @marquiz, @klihub in #303

Full Changelog: v0.4.1...v0.5.1

v0.4.1

16 Mar 14:48
659d042

This major release of the NRI Reference Plugins brings new features to a few plugins, numerous other smaller improvements, and several bug fixes.

Highlights

  • balloons policy: add groupBy balloon type option
    Group containers into the same balloon instance if their groupBy expressions evaluate to the same group. For example, the following expression prefers assigning all containers in a pod to a balloon that already contains containers from the same namespace with the same nsballoon pod label value:
  ...
  balloonTypes:
    - name: my-pods
      groupBy: ${pod/namespace}-${pod/labels/nsballoon}
  ...

If there is no such balloon, or if such instances do not have enough CPUs, finding a suitable balloon continues as before: assign to some other existing balloon, or create a new balloon if that is preferred.

  • balloons policy: add balloon matchExpressions option
    Assign containers to balloon instances using balloon match expressions, similar to the affinity expressions of the topology-aware policy. Expressions are evaluated for containers that are not explicitly assigned to any balloon by an annotation. If an expression matches a container, the container is assigned to an instance of the corresponding balloon. For instance, the following matchExpressions entry assigns all containers with matching pod names to the associated balloon:
  ...
  balloonTypes:
    - name: my-pods
      matchExpressions:
        - key: pod/name
          operator: MatchesAny
          values: [ myPod*, nginx* ]
  ...

What's New

  • balloons: implement groupBy option by @askervin in #278
  • balloons: allow assigning containers to balloons by runtime-evaluated expressions by @klihub in #260
  • balloons policy: more regular built-in balloons, treat them much like user-defined ones
    Built-in reserved and default balloon types are no longer special cases. They can be configured with the same parameters as user-defined balloons.
  • balloons: support preserving CPU and memory pinning by @askervin in #257
  • topology-aware: support preserving CPU and memory pinning by @askervin in #258
  • feat(helm): Introduce priorityClassName system-node-critical by @ffuerste in #220
  • helm: allow setting NRI plugin index via values by @klihub in #227
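The "more regular built-in balloons" item above means the built-in reserved and default balloon types can now be configured with the same parameters as user-defined types. A minimal sketch, reusing only options shown elsewhere in these notes (the chosen option values are illustrative):

```yaml
# Sketch: the built-in "reserved" and "default" balloon types configured
# with regular balloon-type parameters. The values are illustrative.
balloonTypes:
  - name: reserved
    hideHyperthreads: true
  - name: default
    groupBy: ${pod/namespace}
```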

Updates, Fixes, and Other Improvements

  • balloons: fix the order of assigning containers into balloons by @askervin in #273
  • balloons: fix logged balloon name by @klihub in #259
  • balloons & topology-aware policies: better handling of UpdateContainer[Resources] requests
    Fill in missing bits in partial container resource updates from the current resource assignment. Filter out redundant resource updates without invoking the policy.
  • memtierd: update the nri-memtierd plugin to use memtierd v0.1.1 by @askervin in #287
  • operator: ensure to kustomize operator manifests before local deployment by @fmuyassarov in #240
  • resmgr: better expression validation, cleaner key resolution by @klihub in #256
  • resmgr: inject mount before container state update by @klihub in #223
  • resmgr: log containers by pretty name during startup by @klihub in #245
  • instrumentation: fix resource creation, use parent-based sampler by @klihub in #233
  • instrumentation: allow proper reconfiguration of tracing by @klihub in #234
  • cache: support annotations to preserve CPU and memory pinning by @askervin in #249
  • cache, resmgr: expose key evaluation, implement key substitution by @klihub in #277
  • cache: fix generated pod scope and simple affinity expressions by @klihub in #285
  • cache: store creation time of pod and containers cache objects by @askervin in #272
  • topology-aware: log resource operations at info level by @klihub in #252
  • doc: clarify selecting balloon type by @askervin in #281
  • doc: more consistent terminology in balloons documentation by @askervin in #269
  • fixes: rename default config group label, support/fall back to deprecated labels. by @klihub in #231

New Contributors

Full List of Merged PRs

For a full list of changes see v0.3.2...v0.4.1

v0.3.2

29 Dec 10:28

This patch release fixes image versioning for the operator.

What's New

  • operator: point containerImage tag to the latest release (v0.3.2) by @fmuyassarov in #218