amazon/bottlerocket-aws-k8s-1.25-nvidia-x86_64 ami in eks cluster #4155

salma-aneo · 2024-08-21T08:17:24Z

Environment:

AWS Region: eu-west-3
Instance Type: g4dn.xlarge
AMI version: amazon/bottlerocket-aws-k8s-1.25-nvidia-x86_64-v1.20.5-a3e8bda1

This is my EKS managed node group:

# Node group for workers of ArmoniK on the GPU
  gpu_workers = {
    name                        = "gpu_workers"
    launch_template_description = "Node group for ArmoniK Compute-plane pods which run on the GPU"
    ami_type                    = "BOTTLEROCKET_x86_64_NVIDIA"
    instance_types              = ["g4dn.xlarge"]
    capacity_type               = "SPOT"
    min_size                    = 0
    desired_size                = 0
    max_size                    = 50
    labels = {
      service                        = "gpu_workers"
      "node.kubernetes.io/lifecycle" = "spot"
    }
    taints = {
      dedicated = {
        key    = "service"
        value  = "gpu_workers"
        effect = "NO_SCHEDULE"
      }
    }
    iam_role_use_name_prefix = false
    iam_role_additional_policies = {
      AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
    }
  }

This is my worker configuration:

  # Partition that runs the workload on the GPU
  gputest = {
    node_selector = { service = "gpu_workers" }
    # number of replicas for each deployment of compute plane
    replicas = 1
    # ArmoniK polling agent
    polling_agent = {
      limits = {
        cpu    = "2000m"
        memory = "2048Mi"
      }
      requests = {
        cpu    = "500m"
        memory = "256Mi"
      }
    }
    # ArmoniK workers
    worker = [
      {
        image = "salmaaneo/testnvidia1"
        tag = "latest"
        limits = {
          cpu    = "4000m"
          memory = "16384Mi"
          "nvidia/gpu" = "1"
        }
        requests = {
          cpu    = "2000m"
          memory = "8192Mi"
          "nvidia/gpu" = "1"
        }
      }
    ]
    hpa = {
      type              = "prometheus"
      polling_interval  = 15
      cooldown_period   = 300
      min_replica_count = 0
      max_replica_count = 5
      behavior = {
        restore_to_original_replica_count = true
        stabilization_window_seconds      = 300
        type                              = "Percent"
        value                             = 100
        period_seconds                    = 15
      }
      triggers = [
        {
          type      = "prometheus"
          threshold = 2
        },
      ]
    }
  },

The container inside my pod, which needs to run a workload on the GPU, uses a custom image built on nvidia/cuda:12.3.1-runtime-ubuntu20.04. I have confirmed that my pod is scheduled on the correct node group. However, my pod is unable to access the GPU.

I am wondering if the AMI I’m using includes the nvidia-container-toolkit package. If not, how is the GPU exposed to the container inside the pod?
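
A quick check from inside the container can help narrow down this kind of problem. The sketch below is only illustrative: it assumes that an NVIDIA container runtime typically exposes the GPU by injecting `/dev/nvidia*` device nodes, the `nvidia-smi` binary, and the driver library `libcuda.so.1` into the container. The function name `gpu_visibility_report` is hypothetical.

```python
import ctypes
import glob
import shutil


def gpu_visibility_report():
    """Collect basic signs that a GPU has been exposed to this container."""
    report = {
        # Device nodes such as /dev/nvidia0 and /dev/nvidiactl, if any exist
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        # Full path of nvidia-smi if it is on PATH, otherwise None
        "nvidia_smi": shutil.which("nvidia-smi"),
        # Whether the dynamic loader can resolve the CUDA driver library
        "libcuda_loadable": False,
    }
    try:
        ctypes.CDLL("libcuda.so.1")
        report["libcuda_loadable"] = True
    except OSError:
        # Library missing, or a symlink in the loader path is broken
        pass
    return report


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

If the device nodes are present but `libcuda_loadable` is false, the problem is more likely inside the image (library paths or links) than on the node.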

KCSesh (Maintainer) · 2024-08-24T02:32:09Z

A Bottlerocket-vended aws-k8s-*-nvidia variant should contain the necessary packages for GPU access, and that is what you are using: bottlerocket-aws-k8s-1.25-nvidia-x86_64-v1.20.5-a3e8bda1 in eu-west-3.

Do you mind sharing your Pod Configuration file which you use to deploy your nvidia/cuda:12.3.1-runtime-ubuntu20.04 container?
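
For comparison, a minimal Pod manifest for a GPU smoke test might look like the sketch below. It is not the configuration from this thread: the Pod name, the `nvidia-smi` command, and the helper `gpu_smoke_test_pod` are assumptions for illustration. The node selector and toleration mirror the `service = "gpu_workers"` label and taint of the node group above, and the limit uses `nvidia.com/gpu`, the resource name advertised by the NVIDIA Kubernetes device plugin.

```python
import json


def gpu_smoke_test_pod(name="gpu-smoke-test",
                       image="nvidia/cuda:12.3.1-runtime-ubuntu20.04"):
    """Build a Pod manifest (as a dict) that requests one GPU and runs nvidia-smi."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Matches the node group label from the question
            "nodeSelector": {"service": "gpu_workers"},
            # Tolerates the service=gpu_workers NoSchedule taint from the question
            "tolerations": [{
                "key": "service",
                "operator": "Equal",
                "value": "gpu_workers",
                "effect": "NoSchedule",
            }],
            "containers": [{
                "name": "cuda",
                "image": image,
                # Assumption: nvidia-smi is made available by the container runtime
                "command": ["nvidia-smi"],
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML
    print(json.dumps(gpu_smoke_test_pod(), indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`; if this Pod sees the GPU but the custom image does not, the difference lies in the image rather than in the AMI.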

salma-aneo (Author) · 2024-08-26T13:07:15Z

Thanks for your reply. I managed to find the issue; it was related to using pip3 to install modules on my base image. Apparently, that caused the symbolic links to break. I added a script to reestablish them, and everything works perfectly now.
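
The repair script itself is not shown in this thread. As a starting point for diagnosing the same symptom, the following sketch lists dangling symbolic links under a directory. The function name `find_broken_symlinks` and the default directory `/usr/lib/x86_64-linux-gnu` are assumptions, not details from the original image.

```python
import os


def find_broken_symlinks(root):
    """Return sorted (link_path, target) pairs for symlinks under root whose target is missing."""
    broken = []
    # os.walk does not follow symlinked directories by default
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target
            if os.path.islink(path) and not os.path.exists(path):
                broken.append((path, os.readlink(path)))
    return sorted(broken)


if __name__ == "__main__":
    # Hypothetical library directory; adjust to wherever the image keeps its shared libraries
    for link, target in find_broken_symlinks("/usr/lib/x86_64-linux-gnu"):
        print(f"{link} -> {target} (missing)")
```

Running this in the image before and after the `pip3` install step would show which links, if any, the step left dangling.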
