This repository was archived by the owner on Mar 20, 2023. It is now read-only.
Deployment of Standard_NC4as_T4_v3 fails if GPU drivers are specified #370

@fayora

Problem Description

If I deploy a pool with vm_size Standard_NC4as_T4_v3 and omit the gpu:nvidia_driver:source setting in pool.yaml, pool creation succeeds but the NVIDIA drivers are not installed.

If I specify gpu:nvidia_driver:source, pool creation fails with the error: local variable 'gpu_driver' referenced before assignment

The same pool.yaml works fine with Standard_NC6s_v3.

Batch Shipyard Version

3.9.1

Steps to Reproduce

Add a pool using the pool.yaml below, which sets vm_size to Standard_NC4as_T4_v3 and specifies gpu:nvidia_driver:source.

Expected Results

The pool is deployed and the specified NVIDIA driver is installed.

Actual Results

An error is returned when the gpu:nvidia_driver:source specification is provided in pool.yaml:

2021-09-21 09:02:21.573 INFO - uploading file /tmp/_MEIRpaARG/scripts/shipyard_docker_exec_task_runner.sh as 'shipyard_docker_exec_task_runner.sh'
Traceback (most recent call last):
  File "shipyard.py", line 3136, in <module>
  File "site-packages/click/core.py", line 764, in __call__
  File "site-packages/click/core.py", line 717, in main
  File "site-packages/click/core.py", line 1137, in invoke
  File "site-packages/click/core.py", line 1137, in invoke
  File "site-packages/click/core.py", line 956, in invoke
  File "site-packages/click/core.py", line 555, in invoke
  File "site-packages/click/decorators.py", line 64, in new_func
  File "site-packages/click/core.py", line 555, in invoke
  File "shipyard.py", line 1546, in pool_add
  File "convoy/fleet.py", line 3451, in action_pool_add
  File "convoy/fleet.py", line 1849, in _add_pool
  File "convoy/fleet.py", line 1555, in _construct_pool_object
UnboundLocalError: local variable 'gpu_driver' referenced before assignment
[9269] Failed to execute script shipyard
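
For readers unfamiliar with this error class, the sketch below shows one common way an UnboundLocalError like the one above can arise: a local variable is assigned only inside some branches of an if/elif chain, and the input falls through all of them. This is an assumption about the general pattern, not the actual code in convoy/fleet.py; the function names and the VM size prefixes are hypothetical and chosen only for illustration.

```python
def select_gpu_driver_buggy(vm_size: str) -> str:
    """Hypothetical lookup that forgets to handle unknown VM sizes."""
    size = vm_size.lower()
    if size.startswith("standard_nc6"):
        gpu_driver = "compute"
    elif size.startswith("standard_nv"):
        gpu_driver = "visualization"
    # No else branch: for any other size, gpu_driver is never assigned,
    # so the return below raises UnboundLocalError.
    return gpu_driver


def select_gpu_driver_checked(vm_size: str) -> str:
    """Same hypothetical lookup, but unknown sizes fail with a clear message."""
    size = vm_size.lower()
    if size.startswith("standard_nc6"):
        return "compute"
    if size.startswith("standard_nv"):
        return "visualization"
    raise ValueError(f"no GPU driver mapping for VM size: {vm_size}")


if __name__ == "__main__":
    for fn in (select_gpu_driver_buggy, select_gpu_driver_checked):
        try:
            fn("Standard_NC4as_T4_v3")
        except Exception as exc:  # show which exception type each variant raises
            print(fn.__name__, "->", type(exc).__name__)
```

If the real cause is similar, it would also be consistent with the same pool.yaml working for Standard_NC6s_v3, but that remains a guess until the code at fleet.py line 1555 is inspected.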

Redacted Configuration

pool.yaml

pool_specification:
  id: test-cluster-gpus-t4
  vm_configuration:
    platform_image:
      publisher: canonical
      offer: ubuntuserver
      sku: 18.04-lts
      native: true
  vm_count:
    dedicated: 1
  vm_size: Standard_NC4as_T4_v3
  autoscale:
    evaluation_interval: 00:05:00
    formula: |-
      startingNumberOfVMs = 1;
      maxNumberofVMs = 4;
      pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(300 * TimeInterval_Second);
      pendingTaskSamples  = 70 > pendingTaskSamplePercent ? startingNumberOfVMs : avg($PendingTasks.GetSample(300 * TimeInterval_Second));
      vmForPendingTask = pendingTaskSamples <= 1 ? 1 : pendingTaskSamples;
      $TargetDedicatedNodes=min(maxNumberofVMs, vmForPendingTask);
      $NodeDeallocationOption = taskcompletion;
  gpu:
    nvidia_driver:
      source: https://us.download.nvidia.com/tesla/470.57.02/NVIDIA-Linux-x86_64-470.57.02.run
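
As a side note on the autoscale formula in this pool.yaml, the following is a simplified Python mirror of its arithmetic, included only to make the intended scaling behavior easier to read. It is a sketch under assumptions: the function name and parameters are hypothetical, the inputs stand in for what $PendingTasks.GetSamplePercent and $PendingTasks.GetSample would return, and it does not reproduce how the Azure Batch service actually evaluates or rounds formulas.

```python
def target_dedicated_nodes(sample_percent, pending_task_samples,
                           starting_vms=1, max_vms=4):
    """Mirror of the formula's arithmetic (hypothetical helper, not a Batch API)."""
    # With fewer than 70% of samples available, fall back to the starting count;
    # otherwise use the average number of pending tasks over the window.
    if sample_percent < 70:
        pending = starting_vms
    else:
        pending = sum(pending_task_samples) / len(pending_task_samples)
    # Never go below one node, and never above the configured maximum.
    vm_for_pending = 1 if pending <= 1 else pending
    return min(max_vms, vm_for_pending)


if __name__ == "__main__":
    print(target_dedicated_nodes(50, [10, 10]))   # too few samples
    print(target_dedicated_nodes(100, [3, 3]))    # average within limits
    print(target_dedicated_nodes(100, [10, 10]))  # capped at max_vms
```

The formula is unrelated to the reported error; it only determines how many dedicated nodes the pool targets once it exists.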

config.yaml

batch_shipyard:
  storage_account_settings: storage_source
global_resources:
  docker_images:
  - <<REDACTED>>
  additional_registries:
    docker:
    - <<REDACTED>>.azurecr.io
  volumes:
    shared_data_volumes:
      shared_storage_vol:
        volume_driver: azurefile
        storage_account_settings: storage_mount
        azure_file_share_name: <<REDACTED>>
        container_path: /mnt/integrate/<<REDACTED>>
        mount_options:
        - file_mode=0777
        - dir_mode=0777
        - mfsymlinks
        bind_options: rw

Additional Comments

I also tried source: https://us.download.nvidia.com/tesla/460.73.01/NVIDIA-Linux-x86_64-460.73.01.run, which deploys without issues on other NC-series sizes (e.g., NC6s v3), and got the same error.
