This repository was archived by the owner on Mar 20, 2023. It is now read-only.
Deployment of Standard_NC4as_T4_v3 fails if GPU drivers are specified #370
Open
Problem Description
If I deploy a pool with Standard_NC4as_T4_v3 without the gpu:nvidia_driver:source specification in pool.yaml, the pool succeeds but the NVIDIA drivers are not installed.
If I specify gpu:nvidia_driver:source, I get an error: local variable 'gpu_driver' referenced before assignment
The same pool.yaml works fine with Standard_NC6s_v3.
Batch Shipyard Version
3.9.1
Steps to Reproduce
Try to deploy a pool with vm_size Standard_NC4as_T4_v3 and gpu:nvidia_driver:source set in pool.yaml.
Expected Results
Pool is deployed
Actual Results
An error is returned when the gpu:nvidia_driver:source specification is provided in pool.yaml:
2021-09-21 09:02:21.573 INFO - uploading file /tmp/_MEIRpaARG/scripts/shipyard_docker_exec_task_runner.sh as 'shipyard_docker_exec_task_runner.sh'
Traceback (most recent call last):
  File "shipyard.py", line 3136, in <module>
  File "site-packages/click/core.py", line 764, in __call__
  File "site-packages/click/core.py", line 717, in main
  File "site-packages/click/core.py", line 1137, in invoke
  File "site-packages/click/core.py", line 1137, in invoke
  File "site-packages/click/core.py", line 956, in invoke
  File "site-packages/click/core.py", line 555, in invoke
  File "site-packages/click/decorators.py", line 64, in new_func
  File "site-packages/click/core.py", line 555, in invoke
  File "shipyard.py", line 1546, in pool_add
  File "convoy/fleet.py", line 3451, in action_pool_add
  File "convoy/fleet.py", line 1849, in _add_pool
  File "convoy/fleet.py", line 1555, in _construct_pool_object
UnboundLocalError: local variable 'gpu_driver' referenced before assignment
[9269] Failed to execute script shipyard
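For readers unfamiliar with this error class, the following is a minimal sketch of the kind of code path that typically produces an UnboundLocalError like the one above. It is not the actual convoy/fleet.py code: the function name, the KNOWN_GPU_SIZES lookup, and the returned dictionary are all hypothetical, chosen only to mirror the behaviour reported here (a recognized size works, an unrecognized size fails only when a driver source is given).

```python
# Hypothetical illustration of the failure pattern. This is NOT the real
# Batch Shipyard implementation; all names below are assumptions.
KNOWN_GPU_SIZES = {"standard_nc6s_v3"}  # assumed table of recognized GPU sizes


def construct_pool_object(vm_size, driver_source=None):
    """Build a simplified pool description for the given VM size."""
    if driver_source is None:
        # No driver requested: the pool is created without GPU drivers.
        return {"vm_size": vm_size, "gpu_driver": None}
    if vm_size.lower() in KNOWN_GPU_SIZES:
        gpu_driver = driver_source
    # There is no else branch, so gpu_driver is never assigned for an
    # unrecognized size, and the next line raises UnboundLocalError.
    return {"vm_size": vm_size, "gpu_driver": gpu_driver}


if __name__ == "__main__":
    print(construct_pool_object("Standard_NC6s_v3", "driver.run"))
    try:
        construct_pool_object("Standard_NC4as_T4_v3", "driver.run")
    except UnboundLocalError as exc:
        print("UnboundLocalError:", exc)
```

If the real code follows a similar shape, the fix would presumably be either to recognize the Standard_NC4as_T4_v3 size or to raise an explicit error for unsupported sizes instead of falling through.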
Redacted Configuration
pool.yaml
pool_specification:
  id: test-cluster-gpus-t4
  vm_configuration:
    platform_image:
      publisher: canonical
      offer: ubuntuserver
      sku: 18.04-lts
      native: true
  vm_count:
    dedicated: 1
  vm_size: Standard_NC4as_T4_v3
  autoscale:
    evaluation_interval: 00:05:00
    formula: |-
      startingNumberOfVMs = 1;
      maxNumberofVMs = 4;
      pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(300 * TimeInterval_Second);
      pendingTaskSamples = 70 > pendingTaskSamplePercent ? startingNumberOfVMs : avg($PendingTasks.GetSample(300 * TimeInterval_Second));
      vmForPendingTask = pendingTaskSamples <= 1 ? 1 : pendingTaskSamples;
      $TargetDedicatedNodes=min(maxNumberofVMs, vmForPendingTask);
      $NodeDeallocationOption = taskcompletion;
  gpu:
    nvidia_driver:
      source: https://us.download.nvidia.com/tesla/470.57.02/NVIDIA-Linux-x86_64-470.57.02.run
config.yaml
batch_shipyard:
  storage_account_settings: storage_source
global_resources:
  docker_images:
  - <<REDACTED>>
  additional_registries:
    docker:
    - <<REDACTED>>.azurecr.io
  volumes:
    shared_data_volumes:
      shared_storage_vol:
        volume_driver: azurefile
        storage_account_settings: storage_mount
        azure_file_share_name: <<REDACTED>>
        container_path: /mnt/integrate/<<REDACTED>>
        mount_options:
        - file_mode=0777
        - dir_mode=0777
        - mfsymlinks
        bind_options: rw
Additional Comments
I also tried with source: https://us.download.nvidia.com/tesla/460.73.01/NVIDIA-Linux-x86_64-460.73.01.run, which deploys without issues on other NC-series sizes (e.g., NC6s v3), and got the same error.