Tests and versions. (#116)
* Create conftest.py file to allow the number of iterations in the inference test to be configurable.
This allows the tests to be used as inference speed benchmarks as originally intended.
Also, the allow_tf32 flags are now logged as well.

* Add `tmux` to the default environment of the `simple` service.

* Make tests configurable for GPU as well.

* Add an explanation of how to run the configurable test for inference speed comparisons.

* Fix typo.

* Update Docker Compose installed version.

* Update pytest minimum version to 7.3.0, which fixes the walrus operator bug.

* Remove unnecessary import from conftest.py.

* Reformat project.

* Update documentation to mention that runtime speeds will probably be similar for `conda` installs and source builds. I have confirmed that the `conda` installs use `cuDNN`, and probably `magma`, properly; the speeds were identical on the hardware I tested. (A quick way to check this is sketched after this list.)

* Update all Docker BuildKit frontend versions to simply 1, which will use the latest BuildKit syntax until the next major release.
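For reference, a quick way to confirm that a PyTorch install is actually using `cuDNN` and `MKL` is shown below; this is a minimal illustrative sketch, not part of the commit:

```python
import torch

# True if PyTorch was built against cuDNN and can load it at runtime.
print("cuDNN available:", torch.backends.cudnn.is_available())
print("cuDNN version:", torch.backends.cudnn.version())
# True if PyTorch was built with Intel MKL support.
print("MKL available:", torch.backends.mkl.is_available())
# Full build configuration; BLAS/LAPACK and related settings appear here.
print(torch.__config__.show())
```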
veritas9872 authored Apr 10, 2023
1 parent 741f426 commit cc549d9
Showing 11 changed files with 44 additions and 20 deletions.
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,4 +1,4 @@
-# syntax = docker/dockerfile:1.4
+# syntax = docker/dockerfile:1
# The top line is used by BuildKit. _**DO NOT ERASE IT**_.

# Use `export BUILDKIT_PROGRESS=plain` in the host terminal to see full build logs.
2 changes: 1 addition & 1 deletion Makefile
@@ -120,7 +120,7 @@ ls: # List all services.

# Utility for installing Docker Compose on Linux (but not WSL) systems.
# Visit https://docs.docker.com/compose/install for the full documentation.
-COMPOSE_VERSION = v2.15.1
+COMPOSE_VERSION = v2.17.2
COMPOSE_OS_ARCH = linux-x86_64
COMPOSE_URL = https://github.com/docker/compose/releases/download/${COMPOSE_VERSION}/docker-compose-${COMPOSE_OS_ARCH}
COMPOSE_PATH = ${HOME}/.docker/cli-plugins
19 changes: 10 additions & 9 deletions README.md
@@ -55,7 +55,7 @@ If this is your first time using this project, follow these steps:
for the latest installation information. Note that Docker Compose V2
is available for WSL users with Docker Desktop by default.

-4. Run `make env SERVICE=(train|devel|ngc|hub|simple)` on the terminal 
+4. Run `make env SERVICE=(train|devel|ngc|hub|simple)` on the terminal
at project root to create a basic `.env` file.
The `.env` file provides environment variables for `docker-compose.yaml`,
allowing different users and machines to set their own variables as required.
@@ -67,17 +67,17 @@ If this is your first time using this project, follow these steps:
Add configurations that should not be shared via source control there.
For example, volume-mount pairs specific to each host machine.


### Explanation of services

Different Docker Compose services are organized to serve different needs.

- `train`, the default service, should be used when compiled dependencies are
-  necessary or when PyTorch needs to be compiled from source due to 
+  necessary or when PyTorch needs to be compiled from source due to
Compute Capability issues, etc.
-- `devel` is designed for PyTorch CUDA/C++ developers who need to recompile 
+- `devel` is designed for PyTorch CUDA/C++ developers who need to recompile
frequently and have many complex dependencies.
- `ngc` is derived from the official NVIDIA PyTorch HPC images with the option
-  to install additional packages. It is recommended for users who wish to base 
+  to install additional packages. It is recommended for users who wish to base
their projects on the NGC images provided by NVIDIA. Note that the NGC images
change greatly between different releases and that configurations for one
release may not work for another one.
@@ -91,7 +91,8 @@ Different Docker Compose services are organized to serve different needs.
`pip` packages can also be installed via `conda`. Also, the base image can
be configured to use images other than the Official Linux Docker images
by specifying the `BASE_IMAGE` argument directly in the `.env` file.
-  PyTorch runtime performance may be superior in official NVIDIA CUDA images.
+  PyTorch runtime performance may be superior in official NVIDIA CUDA images
+  under certain circumstances. Use the tests to benchmark runtime speeds.
**The `simple` service is recommended for users without compiled dependencies.**

The `Makefile` has been configured to take values specified in the `.env` file
@@ -250,7 +251,7 @@ Please read the Makefile to see the exact commands.
To fix this issue, create a new directory on the host to mount the containers' `.vscode-server` directories.
For example, one can set a volume pair as `${HOME}/.vscode-project1:/home/${USR}/.vscode-server` for project1.
Do not forget to create `${HOME}/.vscode-project1` on the host first. Otherwise, the directory will be owned by `root`,
-which will cause VSCode to stall indefinately.
+which will cause VSCode to stall indefinitely.
- If any networking issues arise, check `docker network ls` and check for conflicts.
Most networking and SSH problems can be solved by running `docker network prune`.

@@ -261,7 +262,7 @@ The main components of the project are as follows. The other files are utilities
1. Dockerfile
2. docker-compose.yaml
3. docker-compose.override.yaml
-4. reqs/\*requirements.txt
+4. reqs/(`*requirements.txt`|`*environment.yaml`)
5. .env

When the user inputs `make up` or another `make` command,
@@ -497,7 +498,7 @@ For other VSCode problems, try deleting `~/.vscode-server` on the host.
[not fail-safe](https://stackoverflow.com/a/8573310/9289275).

6. `torch.cuda.is_available()` will return a `... UserWarning:
-   CUDA initialization:...` error or the image will simply not start if 
+   CUDA initialization:...` error or the image will simply not start if
the CUDA driver on the host is incompatible with the CUDA version on
the Docker image. Either upgrade the host CUDA driver or downgrade
the CUDA version of the image. Check the
2 changes: 1 addition & 1 deletion dockerfiles/hub.Dockerfile
@@ -1,4 +1,4 @@
-# syntax = docker/dockerfile:1.4
+# syntax = docker/dockerfile:1
# The top line is used by BuildKit. _**DO NOT ERASE IT**_.
ARG PYTORCH_VERSION
ARG CUDA_SHORT_VERSION
2 changes: 1 addition & 1 deletion dockerfiles/ngc.Dockerfile
@@ -1,4 +1,4 @@
-# syntax = docker/dockerfile:1.4
+# syntax = docker/dockerfile:1
# The top line is used by BuildKit. _**DO NOT ERASE IT**_.

ARG INTERACTIVE_MODE
2 changes: 1 addition & 1 deletion dockerfiles/simple.Dockerfile
@@ -1,4 +1,4 @@
-# syntax = docker/dockerfile:1.4
+# syntax = docker/dockerfile:1
# The top line is used by BuildKit. _**DO NOT ERASE IT**_.

# This Dockerfile exists to provide a method of installing all packages from
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -13,7 +13,7 @@ target-version = ['py38', 'py39', 'py310']
include = '\.pyi?$'

[tool.pytest.ini_options]
minversion = "7.0" # Update to 7.2.3 as soon as it becomes available.
minversion = "7.3.0"
addopts = """\
--capture=tee-sys \
--doctest-modules \
@@ -85,7 +85,7 @@ max-doc-length = 80
[tool.ruff.per-file-ignores]
# Ignore `E402` (import violations) in all `__init__.py` files.
"__init__.py" = ["E402"]
"*test*.py" = ["D"] # ignore all docstring lints in tests
"*test*.py" = ["D"] # Ignore all docstring lints in tests.

[tool.ruff.mccabe]
# Unlike Flake8, default to a complexity level of 10.
3 changes: 2 additions & 1 deletion reqs/simple-environment.yaml
@@ -10,6 +10,7 @@ dependencies: # Use conda packages if possible.
- pytorch::pytorch-cuda==11.8
- jemalloc
- intel::mkl
-- intel::numpy # Use Numpy built with the Intel compiler for best performance with MKL. 
+- intel::numpy # Use Numpy built with the Intel compiler for best performance with MKL.
- pytest
+- tmux==3.2a
- tqdm
5 changes: 5 additions & 0 deletions tests/README.md
@@ -5,3 +5,8 @@ PyTest is the recommended testing platform.

Simple unit tests should preferably be written as doctests,
with more advanced tests being placed in this directory.
+
+To use the `test_run.py` file as an inference speed benchmark, which was its
+original purpose, use the following command to run 1024 iterations on GPU 0:
+
+`python -m pytest tests/test_run.py::test_inference_run --gpu 0 --num_steps 1024`
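For anyone scripting benchmarks, the same run can also be launched from Python via pytest's public `pytest.main` API; this is a minimal illustrative sketch, equivalent to the command above and not part of the commit:

```python
import sys

import pytest

# Run the inference benchmark with the custom options defined in tests/conftest.py.
exit_code = pytest.main(
    ["tests/test_run.py::test_inference_run", "--gpu", "0", "--num_steps", "1024"]
)
sys.exit(exit_code)
```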
3 changes: 3 additions & 0 deletions tests/conftest.py
@@ -0,0 +1,3 @@
+def pytest_addoption(parser):
+    parser.addoption("--num_steps", type=int, action="store", default=64)
+    parser.addoption("--gpu", type=int, action="store", default=0)
20 changes: 17 additions & 3 deletions tests/test_run.py
@@ -44,10 +44,16 @@ def enable_cudnn_benchmarking():
    torch.backends.cudnn.benchmark = True


+@pytest.fixture(scope="session", autouse=True)
+def allow_tf32():
+    torch.backends.cuda.matmul.allow_tf32 = True
+    torch.backends.cudnn.allow_tf32 = True
+
+
@pytest.fixture(scope="session")
-def device(gpu: int = 0) -> torch.device:
+def device(pytestconfig) -> torch.device:
    if torch.cuda.is_available():
-        device = torch.device(f"cuda:{int(gpu)}")
+        device = torch.device(f"cuda:{pytestconfig.getoption('gpu')}")
    else:
        device = torch.device("cpu")
        msg = "No GPUs found for this container. Please check run configurations."
@@ -77,13 +83,18 @@ class Config(NamedTuple):
]


@pytest.fixture(scope="session")
def num_steps(pytestconfig):
return pytestconfig.getoption("num_steps")


@pytest.mark.parametrize(["name", "network_func", "input_shapes"], _configs)
def test_inference_run(
    name: str,
    network_func: Callable[[], nn.Module],
    input_shapes: Sequence[Sequence[int]],
    device: torch.device,
-    num_steps: int = 64,
+    num_steps,
    enable_amp: bool = False,
    enable_scripting: bool = False,
):
@@ -153,6 +164,9 @@ def get_cuda_info(device): # Using as a fixture to get device info.
logger.info(f"PyTorch Architecture List: {al}")
logger.info(f"GPU Device Name: {dp.name}")
logger.info(f"GPU Compute Capability: {dp.major}.{dp.minor}")
# No way to check if the GPU has TF32 hardware, only whether it is allowed.
logger.info(f"MatMul TF32 Allowed: {torch.backends.cuda.matmul.allow_tf32}")
logger.info(f"cuDNN TF32 Allowed: {torch.backends.cudnn.allow_tf32}")

# Python3.7+ required for `subprocess` to work as intended.
if int(platform.python_version_tuple()[1]) > 6:
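For context on the new TF32 log lines: TF32 tensor cores exist on GPUs of compute capability 8.0 (Ampere) and newer, and the two `allow_tf32` flags only control whether PyTorch may use them. A minimal sketch of a heuristic hardware check, assuming a CUDA-capable build (this snippet is illustrative and not part of the commit):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # Compute capability 8.0+ (Ampere and newer) is a reasonable proxy for TF32 support.
    has_tf32_hardware = (props.major, props.minor) >= (8, 0)
    print(f"{props.name}: TF32 hardware likely present: {has_tf32_hardware}")
    print(f"MatMul TF32 allowed: {torch.backends.cuda.matmul.allow_tf32}")
    print(f"cuDNN TF32 allowed: {torch.backends.cudnn.allow_tf32}")
```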
