This repository was archived by the owner on Mar 20, 2023. It is now read-only.
TensorFlow-CPU quickstart issues #369
Following the TensorFlow CPU quickstart, I ran into a couple of issues:
- When creating the pool, I get the following error:
RuntimeError: Could not find an Azure Batch Node Agent Sku for this offer=ubuntuserver publisher=canonical sku=16.04-lts. You can list the valid and available Marketplace images with the command: account images
Judging from the Azure Portal, only 18.04 is currently available; indeed, changing pool.yaml to use 18.04-LTS instead is enough to get rid of this error. This probably affects many of the bundled recipes:
batch-shipyard/recipes$ grep -R 16.04 .
./Caffe-CPU/config/pool.yaml: sku: 16.04-LTS
./Caffe-GPU/config/pool.yaml: sku: 16.04-LTS
./Caffe2-CPU/config/pool.yaml: sku: 16.04-LTS
./Caffe2-GPU/config/pool.yaml: sku: 16.04-LTS
./Chainer-CPU/config/pool.yaml: sku: 16.04-LTS
./Chainer-GPU/config/pool.yaml: sku: 16.04-LTS
./CNTK-CPU-Infiniband-IntelMPI/docker/Dockerfile:FROM ubuntu:16.04
./CNTK-CPU-OpenMPI/config/multinode/pool.yaml: sku: 16.04-LTS
./CNTK-CPU-OpenMPI/config/singlenode/pool.yaml: sku: 16.04-LTS
./CNTK-GPU-Infiniband-IntelMPI/docker/Dockerfile:FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
./CNTK-GPU-OpenMPI/config/multinode-multigpu/pool.yaml: sku: 16.04-LTS
./CNTK-GPU-OpenMPI/config/singlenode-multigpu/pool.yaml: sku: 16.04-LTS
./CNTK-GPU-OpenMPI/config/singlenode-singlegpu/pool.yaml: sku: 16.04-LTS
./FFmpeg-GPU/config/pool.yaml: sku: 16.04-LTS
./HPMLA-CPU-OpenMPI/config/pool.yaml: sku: 16.04-LTS
./HPMLA-CPU-OpenMPI/Data-Shredding/README.md:* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
./HPMLA-CPU-OpenMPI/docker/Dockerfile:FROM ubuntu:16.04
./HPMLA-CPU-OpenMPI/docker/README.md:* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
./HPMLA-CPU-OpenMPI/README.md:* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
./Keras+Theano-CPU/config/pool.yaml: sku: 16.04-LTS
./Keras+Theano-GPU/config/pool.yaml: sku: 16.04-LTS
./MXNet-CPU/config/multinode/pool.yaml: sku: 16.04-LTS
./MXNet-CPU/config/singlenode/pool.yaml: sku: 16.04-LTS
./MXNet-CPU/docker/Dockerfile:FROM ubuntu:16.04
./MXNet-GPU/config/multinode/pool.yaml: sku: 16.04-LTS
./MXNet-GPU/config/singlenode/pool.yaml: sku: 16.04-LTS
./NAMD-GPU/config/pool.yaml: sku: 16.04-LTS
./NAMD-TCP/config/pool.yaml: sku: 16.04-LTS
./RemoteFS-GlusterFS+BatchPool/config/pool.yaml: sku: 16.04-LTS
./TensorFlow-CPU/config/pool.yaml: sku: 16.04-LTS
./TensorFlow-Distributed/config/cpu/pool.yaml: sku: 16.04-LTS
./TensorFlow-Distributed/config/gpu/pool.yaml: sku: 16.04-LTS
./TensorFlow-GPU/config/docker/pool.yaml: sku: 16.04-LTS
./TensorFlow-GPU/config/singularity/pool.yaml: sku: 16.04-LTS
./Torch-CPU/config/pool.yaml: sku: 16.04-LTS
./Torch-CPU/docker/Dockerfile:FROM ubuntu:16.04
./Torch-GPU/config/pool.yaml: sku: 16.04-LTS
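The manual edit described above could be applied to every affected pool.yaml in one pass. The following is only a minimal sketch, not part of Batch Shipyard: the helper names `bump_sku` and `bump_recipes` are hypothetical, and it assumes each affected line has the form `sku: 16.04-LTS` as shown in the grep output. Dockerfiles and README links are deliberately left alone. Only the TensorFlow-CPU recipe was confirmed to work with 18.04-LTS, so other recipes may need further changes.

```python
import re
from pathlib import Path

# Matches a YAML line such as "    sku: 16.04-LTS" and captures the
# indentation plus key so they can be written back unchanged.
_SKU_LINE = re.compile(r"^([ \t]*sku:[ \t]*)16\.04-LTS[ \t]*$", re.MULTILINE)


def bump_sku(text, new_sku="18.04-LTS"):
    """Return text with every 'sku: 16.04-LTS' line pointing at new_sku."""
    return _SKU_LINE.sub(lambda m: m.group(1) + new_sku, text)


def bump_recipes(root, dry_run=True):
    """Rewrite pool.yaml files under root; return the paths that would change.

    With dry_run=True (the default) nothing is written to disk.
    """
    changed = []
    for path in sorted(Path(root).rglob("pool.yaml")):
        original = path.read_text()
        updated = bump_sku(original)
        if updated != original:
            changed.append(str(path))
            if not dry_run:
                path.write_text(updated)
    return changed
```

Calling `bump_recipes("recipes")` from the repository root lists the files that would change; passing `dry_run=False` writes the updated files.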
- After the pool is created, adding the included job fails with another error:
$ ../shipyard jobs add --tail stdout.txt
2021-09-16 10:16:30.581 INFO - Adding job tensorflowjob to pool tensorflow-cpu
2021-09-16 10:16:30.673 DEBUG - constructing 1 task specifications for submission to job tensorflowjob
2021-09-16 10:16:30.738 DEBUG - submitting 1 task specifications to job tensorflowjob
2021-09-16 10:16:30.741 DEBUG - submitting 1 tasks (0 -> 0) to job tensorflowjob
2021-09-16 10:16:30.971 INFO - submitted all 1 tasks to job tensorflowjob
2021-09-16 10:16:30.971 DEBUG - attempting to stream file stdout.txt from job=tensorflowjob task=task-00000
Traceback (most recent call last):
File "/mnt/c/Users/username/repos/batch-shipyard/shipyard.py", line 3136, in <module>
cli()
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/decorators.py", line 64, in new_func
return ctx.invoke(f, obj, *args, **kwargs)
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/mnt/c/Users/username/repos/batch-shipyard/shipyard.py", line 1968, in jobs_add
convoy.fleet.action_jobs_add(
File "/mnt/c/Users/username/repos/batch-shipyard/convoy/fleet.py", line 4065, in action_jobs_add
batch.add_jobs(
File "/mnt/c/Users/username/repos/batch-shipyard/convoy/batch.py", line 5892, in add_jobs
stream_file_and_wait_for_task(
File "/mnt/c/Users/username/repos/batch-shipyard/convoy/batch.py", line 3309, in stream_file_and_wait_for_task
tfp = batch_client.file.get_properties_from_task(
File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/azure/batch/operations/_file_operations.py", line 328, in get_properties_from_task
raise models.BatchErrorException(self._deserialize, response)
azure.batch.models._models_py3.BatchErrorException: Request encountered an exception.
Code: None
Message: None
Removing the resource_files section is enough to take care of the issue, which is probably unsurprising, as the given blob_source (https://raw.githubusercontent.com/tensorflow/models/master/tutorials/image/mnist/convolutional.py) returns a 404.
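A dead blob_source like this one can be caught before submitting the job. The sketch below is not something Batch Shipyard provides: `url_is_reachable` is a hypothetical helper that sends a HEAD request with the standard library and treats any HTTP or network error as unreachable. Some servers reject HEAD requests, so a negative result is a hint rather than proof.

```python
import urllib.error
import urllib.request


def url_is_reachable(url, opener=urllib.request.urlopen, timeout=10.0):
    """Return True if a HEAD request to url succeeds with a 2xx/3xx status.

    opener is injectable (hypothetical parameter) so the check can be
    exercised without network access.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except urllib.error.URLError:
        # HTTPError (e.g. a 404) is a subclass of URLError, so this also
        # covers error status codes, DNS failures, and refused connections.
        return False
```

Running `url_is_reachable(...)` on each blob_source in the job configuration before `jobs add` would flag the missing file early, instead of surfacing as a BatchErrorException with no code or message.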