
Add privileged mode for SST #105

Closed

Conversation

changzhi1990
Contributor

The NRI pod needs to access the /dev/isst_interface device in privileged mode.
@changzhi1990
Contributor Author

This PR addresses #101

@changzhi1990
Contributor Author

@marquiz Do you have any comments on this PR?

@@ -66,6 +70,10 @@ spec:
mountPath: /var/lib/nri-resource-policy
- name: hostsysfs
mountPath: /host/sys
{{ if .Values.privilegedMode }}
- name: hostdev
mountPath: /host/dev
Collaborator

I think mounting /dev should be a separate setting.
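As a sketch of that suggestion, the /dev mount could be gated on its own value instead of privilegedMode (`mountDev` here is a hypothetical name, not an existing chart value):

```yaml
        volumeMounts:
        # ... existing mounts ...
        {{ if .Values.mountDev }}
        - name: hostdev
          mountPath: /host/dev
        {{ end }}
```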

@@ -54,9 +54,13 @@ spec:
image: {{ .Values.image.name }}:{{ .Values.image.tag | default .Chart.AppVersion }}
imagePullPolicy: {{ .Values.image.pullPolicy }}
securityContext:
{{ if .Values.privilegedMode }}
privileged: true
Collaborator

I'd also suggest restricting the capabilities: to the minimum that is actually required.

I'm not sure what would be the nicest way to tie all the pieces (privileged mode, capabilities, /dev mount) together. Would there be three separate parameters or a single devAccess or smth 🤨 Thoughts?
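For illustration only, the values could be shaped either way (these value names are hypothetical suggestions, not settings that exist in the chart today):

```yaml
# Option A: three independent knobs
privilegedMode: false
extraCapabilities: []   # e.g. [SYS_RAWIO]
mountDev: false

# Option B: one umbrella setting
devAccess:
  enabled: false
```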

@changzhi1990
Contributor Author

changzhi1990 commented Sep 13, 2023

Hi, @marquiz. I tried to use capabilities to grant the privileges needed to access the /dev/isst_interface device, but I ran into a problem when using capabilities like:

          securityContext:
            allowPrivilegeEscalation: true
            capabilities:
              add:
                - AUDIT_CONTROL
                - AUDIT_READ
                - AUDIT_WRITE
                - BLOCK_SUSPEND
                - CHOWN
                - DAC_OVERRIDE
                - DAC_READ_SEARCH
                - FOWNER
                - FSETID
                - IPC_LOCK
                - IPC_OWNER
                - KILL
                - LEASE
                - LINUX_IMMUTABLE
                - MAC_ADMIN
                - MAC_OVERRIDE
                - MKNOD
                - NET_ADMIN
                - NET_BIND_SERVICE
                - NET_BROADCAST
                - NET_RAW
                - SETGID
                - SETFCAP
                - SETPCAP
                - SETUID
                - SYS_ADMIN
                - SYS_BOOT
                - SYS_CHROOT
                - SYS_MODULE
                - SYS_NICE
                - SYS_PACCT
                - SYS_PTRACE
                - SYS_RAWIO
                - SYS_RESOURCE
                - SYS_TIME
                - SYS_TTY_CONFIG
                - SYSLOG

With this securityContext, I didn't know which capabilities I should add, so I added all of them to the NRI pod. But the NRI pod still reported an error:

W0912 08:17:25.346297       1 system.go:297] failed to get SST info for package 1: failed to read SST PP info: Mbox command failed with failed to open isst device "/host/dev/isst_interface": open /host/dev/isst_interface: operation not permitted

Do you have any ideas on that?

@changzhi1990
Contributor Author

changzhi1990 commented Sep 13, 2023

If I don't add extra capabilities to the NRI pod, the grep command output is:

sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/1340183/status
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000

sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=00000000a80425fb
WARNING: libcap needs an update (cap=40 should have a name).
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap

And if I add the extra capabilities to the NRI pod, the grep command output is:

sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/1334365/status
CapInh: 0000000000000000
CapPrm: 00000037ffffffff
CapEff: 00000037ffffffff
CapBnd: 00000037ffffffff
CapAmb: 0000000000000000

sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=00000037ffffffff
WARNING: libcap needs an update (cap=40 should have a name).
0x00000037ffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_block_suspend,cap_audit_read

So, it seems that all the capabilities have already been added to the NRI pod, but the pod still can't access the /dev/isst_interface device.

@marquiz
Collaborator

marquiz commented Sep 13, 2023

> So, it seems that all the capabilities have already been added to the NRI pod, but the pod still can't access the /dev/isst_interface device.

Thanks @changzhi1990 for debugging this. Fair enough, seems like we need the privileged mode, then.

I'd still like to understand why that fails. Maybe it's the ambient capabilities somehow being required (although I don't quite understand why)...

@marquiz
Collaborator

marquiz commented Sep 13, 2023

> So, it seems that all the capabilities have already been added to the NRI pod, but the pod still can't access the /dev/isst_interface device.
>
> Thanks @changzhi1990 for debugging this. Fair enough, seems like we need the privileged mode, then.
>
> I'd still like to understand why that fails. Maybe it's the ambient capabilities somehow being required (although I don't quite understand why)...

Responding to myself: I think it's because runc uses cgroups to control access to device nodes. To make this work without privileged mode, /dev/isst_interface would need to be added as a device to the container, not just a mount :/
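Some background on that runc point, as a sketch: in an OCI runtime bundle, access to a device node is granted via a `linux.devices` entry plus a device-cgroup allow rule under `linux.resources.devices`; a bind mount alone does not open the device cgroup. A hypothetical config.json fragment (the 10:123 major:minor pair is the one used for isst_interface later in this thread):

```json
{
  "linux": {
    "devices": [
      { "path": "/dev/isst_interface", "type": "c", "major": 10, "minor": 123 }
    ],
    "resources": {
      "devices": [
        { "allow": false, "access": "rwm" },
        { "allow": true, "type": "c", "major": 10, "minor": 123, "access": "rw" }
      ]
    }
  }
}
```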

@changzhi1990
Contributor Author

> So, it seems that all the capabilities have already been added to the NRI pod, but the pod still can't access the /dev/isst_interface device.
>
> Thanks @changzhi1990 for debugging this. Fair enough, seems like we need the privileged mode, then.
> I'd still like to understand why that fails. Maybe it's the ambient capabilities somehow being required (although I don't quite understand why)...
>
> Responding to myself: I think it's because runc uses cgroups to control access to device nodes. To make this work without privileged mode, /dev/isst_interface would need to be added as a device to the container, not just a mount :/

Thanks for your response. I will try this.

@changzhi1990
Contributor Author

I have compared the capabilities of two scenarios.

  1. Use "privileged: true"
  2. Add all capabilities

In scenario 1:

sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ ps -ef|grep nri
root       29920   29869 15 18:24 ?        00:00:02 /bin/nri-resource-policy-topology-aware --host-root /host --fallback-config /etc/nri-resource-policy/nri-resource-policy.cfg --pid-file /tmp/nri-resource-policy.pid -metrics-interval 5s
sdp        30432    3890  0 18:24 pts/0    00:00:00 grep --color=auto nri
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/29920/status
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/29869/status
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000

sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=000001ffffffffff
WARNING: libcap needs an update (cap=40 should have a name).
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,38,39,40

In scenario 2:

sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ ps -ef|grep nri
root       25523   25395 11 18:22 ?        00:00:03 /bin/nri-resource-policy-topology-aware --host-root /host --fallback-config /etc/nri-resource-policy/nri-resource-policy.cfg --pid-file /tmp/nri-resource-policy.pid -metrics-interval 5s
sdp        27017    3890  0 18:22 pts/0    00:00:00 grep --color=auto nri
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/25523/status
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ grep Cap /proc/25395/status
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=000003ffffffffff
WARNING: libcap needs an update (cap=40 should have a name).
0x000003ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,38,39,40,41
sdp@b49691a74bec:~/zhi/own/nri-plugins/deployment/helm/resource-management-policies/topology-aware$ capsh --decode=000001ffffffffff
WARNING: libcap needs an update (cap=40 should have a name).
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,38,39,40

The only difference between these two scenarios is the extra "41" capability in scenario 2.
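As a quick cross-check, the two CapEff masks quoted above can also be diffed numerically (a sketch only; on recent kernels, bits 38, 39 and 40 are CAP_PERFMON, CAP_BPF and CAP_CHECKPOINT_RESTORE in the kernel's uapi numbering):

```python
# CapEff masks taken from the /proc/<pid>/status output above
privileged = 0x000001ffffffffff   # scenario 1: "privileged: true"
caps_added = 0x0000003fffffffff   # scenario 2: explicit capabilities list

# Capability bits present in one mask but not the other
diff = privileged ^ caps_added
extra_bits = [i for i in range(diff.bit_length()) if diff >> i & 1]
print(extra_bits)
```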

@changzhi1990
Contributor Author

Maybe.... we need an sst device plugin, like dsb device plugin, gpu device plugin, etc?

@marquiz
Collaborator

marquiz commented Sep 14, 2023

> Thanks for your response. I will try this.

@changzhi1990, ah sorry, I was a bit "equivocal" on this. I was merely speaking to myself, because there is nothing other than privileged mode in the current setup that can make it work.

> Maybe.... we need an sst device plugin, like dsb device plugin, gpu device plugin, etc?

Yes, this would be a way to do it. Another possible solution would be running another NRI plugin before this one that would insert the isst_interface device in the resource-policy container but let's not go there yet...

@changzhi1990
Contributor Author

> Thanks for your response. I will try this.
>
> @changzhi1990, ah sorry, I was a bit "equivocal" on this. I was merely speaking to myself, because there is nothing other than privileged mode in the current setup that can make it work.
>
> Maybe.... we need an sst device plugin, like dsb device plugin, gpu device plugin, etc?
>
> Yes, this would be a way to do it. Another possible solution would be running another NRI plugin before this one that would insert the isst_interface device in the resource-policy container but let's not go there yet...

Alright. Is it worth creating an SST plugin? I mean, it might be a lot of work.

@marquiz
Collaborator

marquiz commented Sep 14, 2023

> Alright. Is it worth creating an SST plugin? I mean, it might be a lot of work.

Talking about another NRI plugin, @klihub had an idea about this (it seems that there already exists a sample plugin for this purpose, i.e. injecting a device), so that could indeed be the solution. Let @klihub provide the details here.

@klihub
Collaborator

klihub commented Sep 14, 2023

@changzhi1990 You should be able to get read-only access to /dev/isst_interface using NRI itself and an(other) NRI plugin, without requiring any capabilities or using a privileged container. We have a sample NRI plugin which can inject devices, and the necessary cgroup access rules for them, based on annotations.

For instance, I used that plugin and the following pod spec to test whether I can do 'SST discovery':

apiVersion: v1
kind: Pod
metadata:
  name: sst-test
  annotations:
    devices.nri.io/container.c0: |+
      - path: /dev/isst_interface
        type: c
        major: 10
        minor: 123
        file_mode: 0o600
spec:
  containers:
  - name: c0
    image: quay.io/marquiz/goresctrl:test
    command:
      - sh
      - -c
      - sleep 3600
    resources:
      requests:
        cpu: 250m
        memory: 200M
      limits:
        cpu: 500m
        memory: 200M
    securityContext:
      privileged: false
      capabilities:
        drop:
          - all
    imagePullPolicy: IfNotPresent

Just run the device-injector plugin before starting the pod (for instance, run ./device-injector --idx 10 -name device-injector manually for testing). Then you should be able to do this:

klitkey1@emr-1:~/xfer$ kubectl apply -f sst-test.yaml
klitkey1@emr-1:~/xfer$ kubectl exec -ti sst-test -c c0 -- /bin/bash --login
root@sst-test:/go/builder# /go/bin/sst-ctl info
...
  PPCurrentLevel: 0
  PPLocked: true
  PPMaxLevel: 4
  PPSupported: true
  PPVersion: 3
  TFEnabled: false
  TFSupported: true
...

This sample plugin was not really meant for production; it's merely a sample that demonstrates some of NRI's capabilities. Anyway, if you'd like to use this plugin, or roll your own, you can enable it permanently on your cluster nodes by symlinking it into /opt/nri/plugins:

klitkey1@emr-1:~/xfer$ sudo mkdir -p /opt/nri/plugins
sudo ln -s $(pwd)/device-injector /opt/nri/plugins/10-device-injector
klitkey1@emr-1:~/xfer$ sudo systemctl restart containerd # or crio

After this you should be able to annotate your pods for device injection for testing...

In a production environment you might want to restrict somehow (for instance by namespaces) which pods can be annotated with injected devices. Also, you might want to deploy that plugin itself as a DaemonSet instead of having to install it separately on each of your worker nodes. If there is enough interest, we can consider polishing that plugin, adding any necessary mechanisms for restricting access to annotation-based device injection, etc. and creating images and other deployment artifacts for it within the nri-plugins repo.
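A DaemonSet deployment along those lines could be sketched roughly as follows (the image name is a hypothetical placeholder, not a published artifact, and the NRI socket path assumes the containerd/CRI-O default under /var/run/nri):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: device-injector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: device-injector
  template:
    metadata:
      labels:
        app: device-injector
    spec:
      containers:
      - name: device-injector
        image: example.invalid/nri-device-injector:latest  # hypothetical image
        args: ["--idx", "10", "-name", "device-injector"]
        volumeMounts:
        - name: nri-socket
          mountPath: /var/run/nri
      volumes:
      - name: nri-socket
        hostPath:
          path: /var/run/nri
```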

@changzhi1990
Contributor Author

> @changzhi1990 You should be able to get read-only access to /dev/isst_interface using NRI itself and an(other) NRI plugin, without requiring any capabilities or using a privileged container. We have a sample NRI plugin which can inject devices, and the necessary cgroup access rules for them, based on annotations. [...]

Hi, thanks for your detailed reply. Based on your message, we may have two options for this issue.

  1. Run the device-injector plugin before the NRI pod starts.
  2. Polish the plugin and add the necessary mechanisms.

@klihub
Collaborator

klihub commented Aug 26, 2024

Closing this as ancient stale.

@klihub klihub closed this Aug 26, 2024