Pod requesting device through Akri, gets rejected with admission error #722

Open
ruzko opened this issue Nov 11, 2024 · 3 comments
Labels
bug Something isn't working

Comments

ruzko commented Nov 11, 2024

Describe the bug
Pods requesting an Akri instance fail to start. The kubelet rejects them with the admission error `Unable to claim slot` (pod status reason `UnexpectedAdmissionError`).

Output of kubectl get pods,akrii,akric -o wide

kubectl get pods,akric,akrii,services -o wide -n akri

NAME                                              READY   STATUS    RESTARTS   AGE   IP           NODE    NOMINATED NODE   READINESS GATES
pod/akri-agent-daemonset-q5fbr                    1/1     Running   0          24h   10.42.0.69   nixos   <none>           <none>
pod/akri-controller-deployment-54ff6d5c6c-lpx98   1/1     Running   0          24h   10.42.0.67   nixos   <none>           <none>
pod/akri-udev-discovery-daemonset-5fj55           1/1     Running   0          24h   10.42.0.68   nixos   <none>           <none>
pod/akri-webhook-configuration-8d684cc56-2vndm    1/1     Running   0          24h   10.42.0.70   nixos   <none>           <none>

NAME                               CAPACITY   AGE
configuration.akri.sh/akri-onvif   1          24h
configuration.akri.sh/akri-udev    1          24h

NAME                                CONFIG      SHARED   NODES       AGE
instance.akri.sh/akri-udev-e4c669   akri-udev   false    ["nixos"]   4h29m
instance.akri.sh/akri-udev-fdb118   akri-udev   false    ["nixos"]   24h

NAME                                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE   SELECTOR
service/akri-webhook-configuration   ClusterIP   10.43.228.183   <none>        443/TCP   24h   app.kubernetes.io/instance=akri,app.kubernetes.io/name=akri-webhook-configuration,app.kubernetes.io/part-of=akri-dev

Kubernetes Version:

k3s version v1.30.3+k3s1 (f6466040)

To Reproduce
Steps to reproduce the behavior:

  1. Create a cluster on NixOS using:
    systemd.services.k3s = {
      enable = true;
      description = "kubernetes server";
      unitConfig = {
        Type = "notify";
        Delegate = "yes";
        KillMode = "process";
        LimitNOFILE = "1048576";
        LimitNPROC = "infinity";
        LimitCORE = "infinity";
        TasksMax = "infinity";

      };
      serviceConfig = {
        ExecStartPre = [ "-/run/current-system/sw/bin/modprobe br_netfilter"
          "-/run/current-system/sw/bin/modprobe overlay"
        ];

        ExecStart = "/etc/profiles/per-user/jacob/bin/k3s server --data-dir /storage/ocean/pool_1/kubernetes/";
      };

      wantedBy = [ "multi-user.target" ];
      wants = [ "network-online.target" ];
      after = [ "network-online.target" ];
    };
  2. Install Akri with this Flux CD HelmRelease YAML:
apiVersion: v1
kind: Namespace
metadata:
  name: akri
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: akri
  namespace: akri
spec:
  interval: 24h
  url: https://project-akri.github.io/akri
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: akri
  namespace: akri
spec:
  interval: 10h
  timeout: 5m
  chart:
    spec:
      chart: akri-dev
      version: '0.*'
      sourceRef:
        kind: HelmRepository
        name: akri
      interval: 5m
  releaseName: akri
  install:
    remediation:
      retries: 3
  upgrade:
    remediation:
      retries: 3
  values:
    prometheus:
      enabled: false
    agent:
      host:
        udev: /run/udev
    onvif:
      configuration:
        enabled: true
    opcua:
      configuration:
        enabled: false
    udev:
      discovery:
        enabled: true
      configuration:
        enabled: true
        discoveryDetails:
          groupRecursive: false
          udevRules:
            - 'ATTR{idVendor}=="1d50", ATTR{idProduct}=="614e"'
            - 'ATTR{idVendor}=="10c4", ATTR{idProduct}=="ea60"'
  3. Deploy an app requesting the Akri resource:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.42
        ports:
        - containerPort: 80
        resources:
          limits:
            akri.sh/akri-udev-fdb118: "1"
          requests:
            akri.sh/akri-udev-fdb118: "1"

Expected behavior
Pods requesting an Akri instance are admitted and start with access to that instance.

Logs (please share snips of applicable logs)

kubectl get pod nginx-fcb89c6f8-xh7cz -oyaml

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-11-11T19:01:07Z"
  generateName: nginx-fcb89c6f8-
  labels:
    app: nginx
    pod-template-hash: fcb89c6f8
  name: nginx-fcb89c6f8-xh7cz
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: nginx-fcb89c6f8
    uid: 3d9fc188-35f9-4d83-86de-ac47dee04b74
  resourceVersion: "17503651"
  uid: 87e66030-5128-4910-adc0-748411080296
spec:
  containers:
  - image: nginx:1.42
    imagePullPolicy: IfNotPresent
    name: nginx
    ports:
    - containerPort: 80
      protocol: TCP
    resources:
      limits:
        akri.sh/akri-udev-fdb118: "1"
      requests:
        akri.sh/akri-udev-fdb118: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-tn469
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: nixos
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-tn469
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  message: 'Pod was rejected: Allocate failed due to rpc error: code = Unknown desc
    = Unable to claim slot, which is unexpected'
  phase: Failed
  reason: UnexpectedAdmissionError
  startTime: "2024-11-11T19:01:37Z"

Logs from the same app aren't retrievable (even with --previous)

kubectl logs -n akri ds/akri-agent-daemonset

...
...
...
[2024-11-11T15:52:45Z TRACE agent::plugin_manager::device_plugin_instance_controller] Plugin Manager: Reconciling akri-udev-fdb118
[2024-11-11T15:52:48Z TRACE agent::plugin_manager::device_plugin_slot_reclaimer] reclaiming unused slots - start
[2024-11-11T15:52:48Z TRACE agent::plugin_manager::device_plugin_slot_reclaimer] register - before call to register with the kubelet at socket /var/lib/kubelet/pod-resources/kubelet.sock
[2024-11-11T15:52:58Z TRACE agent::plugin_manager::device_plugin_slot_reclaimer] reclaiming unused slots - start
[2024-11-11T15:52:58Z TRACE agent::plugin_manager::device_plugin_slot_reclaimer] register - before call to register with the kubelet at socket /var/lib/kubelet/pod-resources/kubelet.sock
[2024-11-11T15:53:08Z TRACE agent::plugin_manager::device_plugin_slot_reclaimer] reclaiming unused slots - start
[2024-11-11T15:53:08Z TRACE agent::plugin_manager::device_plugin_slot_reclaimer] register - before call to register with the kubelet at socket /var/lib/kubelet/pod-resources/kubelet.sock
[2024-11-11T15:53:15Z WARN  agent::plugin_manager::device_plugin_instance_controller] Error during reconciliation of Instance Some("akri")::akri-udev-fdb118, retrying in 16s: Other(HyperError: error trying to connect: deadline has elapsed

    Caused by:
        0: error trying to connect: deadline has elapsed
        1: deadline has elapsed)

kubectl logs -n akri deploy/akri-webhook-configuration

Started Webhook server: 0.0.0.0:8443
Handler invoked
Handler received: AdmissionRequest
Validating Configuration
validate_configuration - deserialized Configuration: Object {"apiVersion": String("akri.sh/v0"), "kind": String("Configuration"), "metadata": Object {"annotations": Object {"meta.helm.sh/release-name": String("akri"), "meta.helm.sh/release-namespace": String("akri")}, "creationTimestamp": String("2024-11-10T18:48:36Z"), "finalizers": Array [String("nixos")], "generation": Number(1), "labels": Object {"app.kubernetes.io/managed-by": String("Helm"), "helm.toolkit.fluxcd.io/name": String("akri"), "helm.toolkit.fluxcd.io/namespace": String("akri")}, "managedFields": Array [Object {"apiVersion": String("akri.sh/v0"), "fieldsType": String("FieldsV1"), "fieldsV1": Object {"f:metadata": Object {"f:finalizers": Object {"v:\"nixos\"": Object {}}}}, "manager": String("nixos-fin"), "operation": String("Apply"), "time": String("2024-11-10T18:48:46Z")}, Object {"apiVersion": String("akri.sh/v0"), "fieldsType": String("FieldsV1"), "fieldsV1": Object {"f:metadata": Object {"f:annotations": Object {".": Object {}, "f:meta.helm.sh/release-name": Object {}, "f:meta.helm.sh/release-namespace": Object {}}, "f:labels": Object {".": Object {}, "f:app.kubernetes.io/managed-by": Object {}, "f:helm.toolkit.fluxcd.io/name": Object {}, "f:helm.toolkit.fluxcd.io/namespace": Object {}}}, "f:spec": Object {".": Object {}, "f:brokerProperties": Object {}, "f:capacity": Object {}, "f:discoveryHandler": Object {".": Object {}, "f:discoveryDetails": Object {}, "f:name": Object {}}}}, "manager": String("helm-controller"), "operation": String("Update"), "time": String("2024-11-10T18:48:36Z")}], "name": String("akri-onvif"), "namespace": String("akri"), "resourceVersion": String("17036990"), "uid": String("a706603b-b942-448f-acc4-b964dea31b8f")}, "spec": Object {"brokerProperties": Object {}, "capacity": Number(1), "discoveryHandler": Object {"discoveryDetails": String("ipAddresses: \n  action: Exclude\n  items: []\nmacAddresses:\n  action: Exclude\n  items: []\nscopes:\n  action: Exclude\n  items: []\nuuids:\n  action: Exclude\n  items: []\ndiscoveryTimeoutSeconds: 1\n"), "name": String("onvif")}}}
validate_configuration - expected deserialized format: Object {"apiVersion": String("akri.sh/v0"), "kind": String("Configuration"), "metadata": Object {"annotations": Object {"meta.helm.sh/release-name": String("akri"), "meta.helm.sh/release-namespace": String("akri")}, "finalizers": Array [String("nixos")], "generation": Number(1), "labels": Object {"app.kubernetes.io/managed-by": String("Helm"), "helm.toolkit.fluxcd.io/name": String("akri"), "helm.toolkit.fluxcd.io/namespace": String("akri")}, "name": String("akri-onvif"), "namespace": String("akri"), "resourceVersion": String("17036990"), "uid": String("a706603b-b942-448f-acc4-b964dea31b8f")}, "spec": Object {"brokerProperties": Object {}, "capacity": Number(1), "discoveryHandler": Object {"discoveryDetails": String("ipAddresses: \n  action: Exclude\n  items: []\nmacAddresses:\n  action: Exclude\n  items: []\nscopes:\n  action: Exclude\n  items: []\nuuids:\n  action: Exclude\n  items: []\ndiscoveryTimeoutSeconds: 1\n"), "name": String("onvif")}}}
Handler invoked
Handler received: AdmissionRequest
Validating Configuration
validate_configuration - deserialized Configuration: Object {"apiVersion": String("akri.sh/v0"), "kind": String("Configuration"), "metadata": Object {"annotations": Object {"meta.helm.sh/release-name": String("akri"), "meta.helm.sh/release-namespace": String("akri")}, "creationTimestamp": String("2024-11-10T18:48:36Z"), "finalizers": Array [String("nixos")], "generation": Number(1), "labels": Object {"app.kubernetes.io/managed-by": String("Helm"), "helm.toolkit.fluxcd.io/name": String("akri"), "helm.toolkit.fluxcd.io/namespace": String("akri")}, "managedFields": Array [Object {"apiVersion": String("akri.sh/v0"), "fieldsType": String("FieldsV1"), "fieldsV1": Object {"f:metadata": Object {"f:finalizers": Object {"v:\"nixos\"": Object {}}}}, "manager": String("nixos-fin"), "operation": String("Apply"), "time": String("2024-11-10T18:48:46Z")}, Object {"apiVersion": String("akri.sh/v0"), "fieldsType": String("FieldsV1"), "fieldsV1": Object {"f:metadata": Object {"f:annotations": Object {".": Object {}, "f:meta.helm.sh/release-name": Object {}, "f:meta.helm.sh/release-namespace": Object {}}, "f:labels": Object {".": Object {}, "f:app.kubernetes.io/managed-by": Object {}, "f:helm.toolkit.fluxcd.io/name": Object {}, "f:helm.toolkit.fluxcd.io/namespace": Object {}}}, "f:spec": Object {".": Object {}, "f:brokerProperties": Object {}, "f:capacity": Object {}, "f:discoveryHandler": Object {".": Object {}, "f:discoveryDetails": Object {}, "f:name": Object {}}}}, "manager": String("helm-controller"), "operation": String("Update"), "time": String("2024-11-10T18:48:36Z")}], "name": String("akri-udev"), "namespace": String("akri"), "resourceVersion": String("17036988"), "uid": String("09c842e5-57f3-4d5f-8f52-a750ac07c9b1")}, "spec": Object {"brokerProperties": Object {}, "capacity": Number(1), "discoveryHandler": Object {"discoveryDetails": String("groupRecursive: false\nudevRules:\n- ATTR{idVendor}==\"1d50\", ATTR{idProduct}==\"614e\"\n- ATTR{idVendor}==\"10c4\", ATTR{idProduct}==\"ea60\"\n"), "name": String("udev")}}}
validate_configuration - expected deserialized format: Object {"apiVersion": String("akri.sh/v0"), "kind": String("Configuration"), "metadata": Object {"annotations": Object {"meta.helm.sh/release-name": String("akri"), "meta.helm.sh/release-namespace": String("akri")}, "finalizers": Array [String("nixos")], "generation": Number(1), "labels": Object {"app.kubernetes.io/managed-by": String("Helm"), "helm.toolkit.fluxcd.io/name": String("akri"), "helm.toolkit.fluxcd.io/namespace": String("akri")}, "name": String("akri-udev"), "namespace": String("akri"), "resourceVersion": String("17036988"), "uid": String("09c842e5-57f3-4d5f-8f52-a750ac07c9b1")}, "spec": Object {"brokerProperties": Object {}, "capacity": Number(1), "discoveryHandler": Object {"discoveryDetails": String("groupRecursive: false\nudevRules:\n- ATTR{idVendor}==\"1d50\", ATTR{idProduct}==\"614e\"\n- ATTR{idVendor}==\"10c4\", ATTR{idProduct}==\"ea60\"\n"), "name": String("udev")}}}

kubectl logs -n akri akri-udev-discovery-daemonset-5fj55

[2024-11-11T19:10:05Z TRACE akri_udev::discovery_handler] discover - for udev rules ["ATTR{idVendor}==\"1d50\", ATTR{idProduct}==\"614e\"", "ATTR{idVendor}==\"10c4\", ATTR{idProduct}==\"ea60\""]
[2024-11-11T19:10:05Z INFO  akri_udev::discovery_impl] parse_udev_rule - enter for udev rule string ATTR{idVendor}=="1d50", ATTR{idProduct}=="614e"
[2024-11-11T19:10:05Z TRACE akri_udev::discovery_impl] parse_udev_rule - parsing udev_rule "ATTR{idVendor}==\"1d50\", ATTR{idProduct}==\"614e\""
[2024-11-11T19:10:05Z TRACE akri_udev::discovery_impl] find_devices - enter with udev_filters [UdevFilter { field: Pair { rule: attribute, span: Span { str: "ATTR{idVendor}", start: 0, end: 14 }, inner: [Pair { rule: bounded_key, span: Span { str: "{idVendor}", start: 4, end: 14 }, inner: [Pair { rule: key, span: Span { str: "idVendor", start: 5, end: 13 }, inner: [] }] }] }, operation: equality, value: "1d50" }, UdevFilter { field: Pair { rule: attribute, span: Span { str: "ATTR{idProduct}", start: 24, end: 39 }, inner: [Pair { rule: bounded_key, span: Span { str: "{idProduct}", start: 28, end: 39 }, inner: [Pair { rule: key, span: Span { str: "idProduct", start: 29, end: 38 }, inner: [] }] }] }, operation: equality, value: "614e" }]
[2024-11-11T19:10:05Z TRACE akri_udev::discovery_impl] enumerator_match_udev_filters - enter with udev_filters [UdevFilter { field: Pair { rule: attribute, span: Span { str: "ATTR{idVendor}", start: 0, end: 14 }, inner: [Pair { rule: bounded_key, span: Span { str: "{idVendor}", start: 4, end: 14 }, inner: [Pair { rule: key, span: Span { str: "idVendor", start: 5, end: 13 }, inner: [] }] }] }, operation: equality, value: "1d50" }, UdevFilter { field: Pair { rule: attribute, span: Span { str: "ATTR{idProduct}", start: 24, end: 39 }, inner: [Pair { rule: bounded_key, span: Span { str: "{idProduct}", start: 28, end: 39 }, inner: [Pair { rule: key, span: Span { str: "idProduct", start: 29, end: 38 }, inner: [] }] }] }, operation: equality, value: "614e" }]
[2024-11-11T19:10:05Z TRACE akri_udev::discovery_impl] enumerator_nomatch_udev_filters - enter with udev_filters []
[2024-11-11T19:10:09Z TRACE akri_udev::discovery_impl] filter_by_remaining_udev_filters - enter with udev_filters []
[2024-11-11T19:10:09Z TRACE akri_udev::discovery_impl] do_parse_and_find - returning discovered devices with devpaths: [("/devices/pci0000:00/0000:00:14.0/usb3/3-2", Some("/dev/bus/usb/003/018"))]
[2024-11-11T19:10:09Z INFO  akri_udev::discovery_impl] parse_udev_rule - enter for udev rule string ATTR{idVendor}=="10c4", ATTR{idProduct}=="ea60"
[2024-11-11T19:10:09Z TRACE akri_udev::discovery_impl] parse_udev_rule - parsing udev_rule "ATTR{idVendor}==\"10c4\", ATTR{idProduct}==\"ea60\""
[2024-11-11T19:10:09Z TRACE akri_udev::discovery_impl] find_devices - enter with udev_filters [UdevFilter { field: Pair { rule: attribute, span: Span { str: "ATTR{idVendor}", start: 0, end: 14 }, inner: [Pair { rule: bounded_key, span: Span { str: "{idVendor}", start: 4, end: 14 }, inner: [Pair { rule: key, span: Span { str: "idVendor", start: 5, end: 13 }, inner: [] }] }] }, operation: equality, value: "10c4" }, UdevFilter { field: Pair { rule: attribute, span: Span { str: "ATTR{idProduct}", start: 24, end: 39 }, inner: [Pair { rule: bounded_key, span: Span { str: "{idProduct}", start: 28, end: 39 }, inner: [Pair { rule: key, span: Span { str: "idProduct", start: 29, end: 38 }, inner: [] }] }] }, operation: equality, value: "ea60" }]
[2024-11-11T19:10:09Z TRACE akri_udev::discovery_impl] enumerator_match_udev_filters - enter with udev_filters [UdevFilter { field: Pair { rule: attribute, span: Span { str: "ATTR{idVendor}", start: 0, end: 14 }, inner: [Pair { rule: bounded_key, span: Span { str: "{idVendor}", start: 4, end: 14 }, inner: [Pair { rule: key, span: Span { str: "idVendor", start: 5, end: 13 }, inner: [] }] }] }, operation: equality, value: "10c4" }, UdevFilter { field: Pair { rule: attribute, span: Span { str: "ATTR{idProduct}", start: 24, end: 39 }, inner: [Pair { rule: bounded_key, span: Span { str: "{idProduct}", start: 28, end: 39 }, inner: [Pair { rule: key, span: Span { str: "idProduct", start: 29, end: 38 }, inner: [] }] }] }, operation: equality, value: "ea60" }]
[2024-11-11T19:10:09Z TRACE akri_udev::discovery_impl] enumerator_nomatch_udev_filters - enter with udev_filters []

kubectl logs -n akri deploy/akri-controller-deployment

[2024-11-11T14:36:28Z INFO  controller::util::node_watcher] handle_node - Added or modified: nixos
[2024-11-11T14:36:28Z TRACE controller::util::node_watcher] is_node_ready - for node Some("nixos")
[2024-11-11T14:37:40Z TRACE controller::util::instance_action] internal_do_instance_watch - aquired sync lock
[2024-11-11T14:37:40Z TRACE controller::util::instance_action] handle_instance - enter
[2024-11-11T14:37:40Z INFO  controller::util::instance_action] handle_instance - added or modified Akri Instance Some("akri-udev-e4c669"): InstanceSpec { configuration_name: "akri-udev", cdi_name: "akri.sh/akri-udev=e4c669", capacity: 1, broker_properties: {"UDEV_DEVNODE": "/dev/bus/usb/003/018", "UDEV_DEVPATH": "/devices/pci0000:00/0000:00:14.0/usb3/3-2"}, shared: false, nodes: ["nixos"], device_usage: {} }
[2024-11-11T14:37:40Z TRACE controller::util::instance_action] handle_instance_change - enter Add
[2024-11-11T14:37:40Z TRACE akri_shared::akri::configuration] find_configuration enter
[2024-11-11T14:37:40Z TRACE akri_shared::akri::configuration] find_configuration getting instance with name akri-udev
[2024-11-11T14:37:40Z TRACE akri_shared::akri::configuration] find_configuration return
[2024-11-11T14:41:10Z TRACE controller::util::node_watcher] handle_node - enter
[2024-11-11T14:41:10Z INFO  controller::util::node_watcher] handle_node - Added or modified: nixos
[2024-11-11T14:41:10Z TRACE controller::util::node_watcher] is_node_ready - for node Some("nixos")
[2024-11-11T14:41:28Z TRACE controller::util::node_watcher] handle_node - enter
...
...
...
[2024-11-11T18:56:28Z TRACE controller::util::node_watcher] is_node_ready - for node Some("nixos")
[2024-11-11T18:57:45Z TRACE controller::util::node_watcher] handle_node - enter
[2024-11-11T18:57:45Z INFO  controller::util::node_watcher] handle_node - Added or modified: nixos
[2024-11-11T18:57:45Z TRACE controller::util::node_watcher] is_node_ready - for node Some("nixos")
[2024-11-11T18:58:40Z TRACE controller::util::instance_action] internal_do_instance_watch - aquired sync lock
[2024-11-11T18:58:40Z TRACE controller::util::instance_action] handle_instance - enter
[2024-11-11T18:58:40Z INFO  controller::util::instance_action] handle_instance - added or modified Akri Instance Some("akri-udev-fdb118"): InstanceSpec { configuration_name: "akri-udev", cdi_name: "akri.sh/akri-udev=fdb118", capacity: 1, broker_properties: {"UDEV_DEVNODE": "/dev/bus/usb/003/004", "UDEV_DEVPATH": "/devices/pci0000:00/0000:00:14.0/usb3/3-5/3-5.4"}, shared: false, nodes: ["nixos"], device_usage: {"akri-udev-fdb118-0": "nixos"} }
[2024-11-11T18:58:40Z TRACE controller::util::instance_action] handle_instance_change - enter Add
[2024-11-11T18:58:40Z TRACE akri_shared::akri::configuration] find_configuration enter
[2024-11-11T18:58:40Z TRACE akri_shared::akri::configuration] find_configuration getting instance with name akri-udev
[2024-11-11T18:58:41Z TRACE akri_shared::akri::configuration] find_configuration return

journalctl -u k3s -r -g akri

nov. 11 17:15:12 nixos k3s[1663]: I1111 17:15:12.472422    1663 trace.go:236] Trace[1420334602]: "Get" accept:application/json, */*,audit-id:fcf15163-ef2c-4f1a-9718-9e96f4c5047e,client:127.0.0.1,api-group:,api-version:v1,name:akri-agent-daemonset-q5fbr,subresource:log,namespace:akri,protocol:HTTP/2.0,resource:pods,scope:resource,url:/api/v1/namespaces/akri/pods/akri-agent-daemonset-q5fbr/log,user-agent:kubectl/v1.30.3+k3s1 (linux/amd64) kubernetes/f646604,verb:CONNECT (11-Nov-2024 17:15:11.435) (total time: 1036ms):
nov. 11 17:08:51 nixos k3s[1663]: I1111 17:08:51.950363    1663 trace.go:236] Trace[640270479]: "Get" accept:application/json, */*,audit-id:f8dda6f3-d5c7-483d-8b2f-658049451f57,client:127.0.0.1,api-group:,api-version:v1,name:akri-agent-daemonset-q5fbr,subresource:log,namespace:akri,protocol:HTTP/2.0,resource:pods,scope:resource,url:/api/v1/namespaces/akri/pods/akri-agent-daemonset-q5fbr/log,user-agent:kubectl/v1.30.3+k3s1 (linux/amd64) kubernetes/f646604,verb:CONNECT (11-Nov-2024 17:08:50.854) (total time: 1095ms):
nov. 11 16:57:25 nixos k3s[1663]: I1111 16:57:25.365232    1663 trace.go:236] Trace[817891226]: "Get" accept:application/json, */*,audit-id:f9b2bfe3-9820-4a97-b256-6cea6345fc36,client:127.0.0.1,api-group:,api-version:v1,name:akri-agent-daemonset-q5fbr,subresource:log,namespace:akri,protocol:HTTP/2.0,resource:pods,scope:resource,url:/api/v1/namespaces/akri/pods/akri-agent-daemonset-q5fbr/log,user-agent:kubectl/v1.30.3+k3s1 (linux/amd64) kubernetes/f646604,verb:CONNECT (11-Nov-2024 16:57:24.331) (total time: 1033ms):
nov. 11 01:22:34 nixos k3s[1663]: I1111 01:22:34.869003    1663 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="akri/akri-webhook-configuration-8d684cc56" duration="94.757µs"
nov. 11 01:22:34 nixos k3s[1663]: I1111 01:22:34.868685    1663 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="akri/akri-controller-deployment-54ff6d5c6c" duration="69.304µs"
nov. 10 19:49:06 nixos k3s[1663]: I1110 19:49:06.155104    1663 server.go:144] "Got registration request from device plugin with resource" resourceName="akri.sh/akri-udev"
nov. 10 19:49:05 nixos k3s[1663]: I1110 19:49:05.055985    1663 server.go:144] "Got registration request from device plugin with resource" resourceName="akri.sh/akri-udev-fdb118"
nov. 10 19:49:03 nixos k3s[1663]: I1110 19:49:03.924640    1663 controller.go:615] quota admission added evaluator for: instances.akri.sh
nov. 10 19:48:46 nixos k3s[1663]: I1110 19:48:46.220657    1663 job_controller.go:566] "enqueueing job" key="akri/akri-webhook-configuration-patch"
nov. 10 19:48:46 nixos k3s[1663]: I1110 19:48:46.191105    1663 job_controller.go:566] "enqueueing job" key="akri/akri-webhook-configuration-patch"
nov. 10 19:48:46 nixos k3s[1663]: I1110 19:48:46.181994    1663 job_controller.go:566] "enqueueing job" key="akri/akri-webhook-configuration-patch"
nov. 10 19:48:46 nixos k3s[1663]: I1110 19:48:46.175577    1663 job_controller.go:566] "enqueueing job" key="akri/akri-webhook-configuration-patch"
nov. 10 19:48:46 nixos k3s[1663]: I1110 19:48:46.169360    1663 job_controller.go:566] "enqueueing job" key="akri/akri-webhook-configuration-patch"
nov. 10 19:48:46 nixos k3s[1663]: I1110 19:48:46.058346    1663 job_controller.go:566] "enqueueing job" key="akri/akri-webhook-configuration-patch"
nov. 10 19:48:45 nixos k3s[1663]: I1110 19:48:45.160865    1663 job_controller.go:566] "enqueueing job" key="akri/akri-webhook-configuration-patch"
nov. 10 19:48:44 nixos k3s[1663]: I1110 19:48:44.043802    1663 job_controller.go:566] "enqueueing job" key="akri/akri-webhook-configuration-patch"
nov. 10 19:48:42 nixos k3s[1663]: I1110 19:48:42.657606    1663 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-gcnlr\" (UniqueName: \"kubernetes.io/projected/7b661bb6-a8aa-4ebe-b751-9947c55e1656-kube-api-access-gcnlr\") pod \"akri-webhook-configuration-patch-rwbrs\" (UID: \"7b661bb6-a8aa-4ebe-b751-9947c55e1656\") " pod="akri/akri-webhook-configuration-patch-rwbrs"
nov. 10 19:48:42 nixos k3s[1663]: W1110 19:48:42.571005    1663 dispatcher.go:217] Failed calling webhook, failing closed akri-webhook-configuration.akri.svc: failed calling webhook "akri-webhook-configuration.akri.svc": failed to call webhook: Post "https://akri-webhook-configuration.akri.svc:443/validate?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority
nov. 10 19:48:42 nixos k3s[1663]: W1110 19:48:42.570384    1663 dispatcher.go:217] Failed calling webhook, failing closed akri-webhook-configuration.akri.svc: failed calling webhook "akri-webhook-configuration.akri.svc": failed to call webhook: Post "https://akri-webhook-configuration.akri.svc:443/validate?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority
nov. 10 19:48:42 nixos k3s[1663]: I1110 19:48:42.540365    1663 job_controller.go:566] "enqueueing job" key="akri/akri-webhook-configuration-patch"
nov. 10 19:48:42 nixos k3s[1663]: I1110 19:48:42.524238    1663 topology_manager.go:215] "Topology Admit Handler" podUID="7b661bb6-a8aa-4ebe-b751-9947c55e1656" podNamespace="akri" podName="akri-webhook-configuration-patch-rwbrs

Additional context
I followed the cluster setup guide, except for the section about granting the regular user admin privileges on the kubeconfig. That looks like a security risk to me, and I wonder whether it is really necessary. Why does Akri need access to the kubelet socket? Can't it use the Kubernetes API like other applications?
And if Akri does need access to the kubeconfig, how can that be made more secure than it is now?

kubectl get akric -n akri -oyaml

apiVersion: v1
items:
- apiVersion: akri.sh/v0
  kind: Configuration
  metadata:
    annotations:
      meta.helm.sh/release-name: akri
      meta.helm.sh/release-namespace: akri
    creationTimestamp: "2024-11-10T18:48:36Z"
    finalizers:
    - nixos
    generation: 1
    labels:
      app.kubernetes.io/managed-by: Helm
      helm.toolkit.fluxcd.io/name: akri
      helm.toolkit.fluxcd.io/namespace: akri
    name: akri-onvif
    namespace: akri
    resourceVersion: "17037162"
    uid: a706603b-b942-448f-acc4-b964dea31b8f
  spec:
    brokerProperties: {}
    capacity: 1
    discoveryHandler:
      discoveryDetails: "ipAddresses: \n  action: Exclude\n  items: []\nmacAddresses:\n
        \ action: Exclude\n  items: []\nscopes:\n  action: Exclude\n  items: []\nuuids:\n
        \ action: Exclude\n  items: []\ndiscoveryTimeoutSeconds: 1\n"
      name: onvif
- apiVersion: akri.sh/v0
  kind: Configuration
  metadata:
    annotations:
      meta.helm.sh/release-name: akri
      meta.helm.sh/release-namespace: akri
    creationTimestamp: "2024-11-10T18:48:36Z"
    finalizers:
    - nixos
    generation: 1
    labels:
      app.kubernetes.io/managed-by: Helm
      helm.toolkit.fluxcd.io/name: akri
      helm.toolkit.fluxcd.io/namespace: akri
    name: akri-udev
    namespace: akri
    resourceVersion: "17037163"
    uid: 09c842e5-57f3-4d5f-8f52-a750ac07c9b1
  spec:
    brokerProperties: {}
    capacity: 1
    discoveryHandler:
      discoveryDetails: |
        groupRecursive: false
        udevRules:
        - ATTR{idVendor}=="1d50", ATTR{idProduct}=="614e"
        - ATTR{idVendor}=="10c4", ATTR{idProduct}=="ea60"
      name: udev
kind: List
metadata:
  resourceVersion: ""

kubectl get akrii -n akri -oyaml

apiVersion: v1
items:
- apiVersion: akri.sh/v0
  kind: Instance
  metadata:
    creationTimestamp: "2024-11-11T14:37:40Z"
    generation: 1
    name: akri-udev-e4c669
    namespace: akri
    ownerReferences:
    - apiVersion: akri.sh/v0
      controller: true
      kind: Configuration
      name: akri-udev
      uid: 09c842e5-57f3-4d5f-8f52-a750ac07c9b1
    resourceVersion: "17417974"
    uid: f9759b64-4912-4074-82a5-1fa01906a620
  spec:
    brokerProperties:
      UDEV_DEVNODE: /dev/bus/usb/003/018
      UDEV_DEVPATH: /devices/pci0000:00/0000:00:14.0/usb3/3-2
    capacity: 1
    cdiName: akri.sh/akri-udev=e4c669
    configurationName: akri-udev
    deviceUsage: {}
    nodes:
    - nixos
    shared: false
- apiVersion: akri.sh/v0
  kind: Instance
  metadata:
    creationTimestamp: "2024-11-10T18:49:03Z"
    finalizers:
    - nixos
    generation: 1
    name: akri-udev-fdb118
    namespace: akri
    ownerReferences:
    - apiVersion: akri.sh/v0
      controller: true
      kind: Configuration
      name: akri-udev
      uid: 09c842e5-57f3-4d5f-8f52-a750ac07c9b1
    resourceVersion: "17037297"
    uid: e7d85112-72f1-47b5-abd0-058665452995
  spec:
    brokerProperties:
      UDEV_DEVNODE: /dev/bus/usb/003/004
      UDEV_DEVPATH: /devices/pci0000:00/0000:00:14.0/usb3/3-5/3-5.4
    capacity: 1
    cdiName: akri.sh/akri-udev=fdb118
    configurationName: akri-udev
    deviceUsage: {}
    nodes:
    - nixos
    shared: false
kind: List
metadata:
  resourceVersion: ""
@ruzko ruzko added the bug Something isn't working label Nov 11, 2024
@github-project-automation github-project-automation bot moved this to Triage needed in Akri Roadmap Nov 11, 2024
@yujinkim-msft
Contributor

Hi @ruzko, thank you for your question! We've been pushing a lot of changes recently in preparation for a release. I see that you're using the akri-dev chart; can you please try reinstalling with the latest dev chart and see whether the same behavior occurs?

@kate-goldenring any thoughts here?

@kate-goldenring
Contributor

@ruzko, can you increase the capacity field on your Configuration to something greater than 1? That field sets how many containers are allowed to use a device at once. It looks like you have one node, so keeping it at 1 should be fine, but there may be a race condition here. This race may be more prevalent in the rewrite of the Agent, so it may be worth trying the previous v0.12.20 release.

@ruzko Akri creates device plugins which communicate with kubelet through its socket. This is why that socket is mounted in the Agent. More on device plugins: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
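
For reference, the two suggestions above (a higher capacity, or falling back to v0.12.20) could be expressed in the HelmRelease from the report roughly as follows. This is a sketch, not a tested configuration. It assumes that the chart exposes the Configuration capacity as `udev.configuration.capacity`, and that the released (non-dev) chart is published in the same Helm repository under the name `akri` with version `0.12.20`. Neither assumption is confirmed in this thread. The two changes are independent, so either can be tried on its own.

```yaml
# Fragment of the HelmRelease from the report; only the relevant keys are shown.
spec:
  chart:
    spec:
      # Option B (assumption): use the released chart instead of akri-dev,
      # pinned to the release mentioned above. Chart name "akri" is assumed.
      chart: akri
      version: '0.12.20'
      sourceRef:
        kind: HelmRepository
        name: akri
  values:
    udev:
      configuration:
        enabled: true
        # Option A (assumption): the key path for capacity is assumed.
        # The value 2 is only an example of "greater than 1".
        capacity: 2
```

If the capacity change is picked up, `kubectl get akric -n akri` should show the new value in the CAPACITY column for akri-udev.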

@ruzko
Author

ruzko commented Nov 26, 2024

Thank you for your comments :)

I've fallen back to using generic-device-plugin because it (at least so far) fulfills my needs.

I'll try to give your suggestions a go later this week and report back with my findings.
