
System metrics #320

Open
andreasjansson opened this issue Nov 10, 2020 · 6 comments
Labels
type/roadmap High-level goals. https://github.com/replicate/replicate/projects/1

Comments


andreasjansson commented Nov 10, 2020

Why

Since ML models are often slow and expensive to train, we tend to spend a lot of time fine-tuning computational performance. If we run our own servers, we stare at nvidia-smi, htop, iotop, iftop, etc., which is far from ideal. If we're using Colab, we're mostly left guessing.

Additionally, for reproducibility it's important to know how much CPU, GPU, and memory was consumed when deciding what type of machine is required to replicate a result.

How

replicate.checkpoint() automatically attaches system_metrics to the checkpoint data, which includes the following (a rough collection sketch follows the list):

  • CPU usage per CPU (pegged CPUs in data loaders are a common bottleneck)
  • GPU usage per GPU
  • GPU memory usage (since TF allocates all the GPU memory, might have to think of something smart here)
  • System memory usage
  • Disk bytes read/written
  • Network bytes read/written
  • etc.
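
For illustration, here is a minimal sketch of how these numbers could be gathered with psutil and pynvml (both third-party packages, not part of replicate); the dictionary layout below is a hypothetical shape for system_metrics, not the actual checkpoint format:

```python
# Hypothetical sketch: collect the metrics listed above with psutil + pynvml.
# The returned dict shape is made up for illustration, not replicate's format.
import psutil
import pynvml


def collect_system_metrics():
    pynvml.nvmlInit()
    try:
        gpus = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            gpus.append({
                "utilization_percent": util.gpu,
                # Note: frameworks like TF reserve most GPU memory up front,
                # so "used" reflects the allocator, not live tensors.
                "memory_used_bytes": mem.used,
                "memory_total_bytes": mem.total,
            })
    finally:
        pynvml.nvmlShutdown()

    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    return {
        "cpu_percent_per_cpu": psutil.cpu_percent(percpu=True),
        "memory_used_bytes": psutil.virtual_memory().used,
        "gpus": gpus,
        "disk_read_bytes": disk.read_bytes,
        "disk_write_bytes": disk.write_bytes,
        "net_bytes_sent": net.bytes_sent,
        "net_bytes_recv": net.bytes_recv,
    }
```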

User data

One user asked for this because a change in CUDA version caused their results to not be replicable, in a horribly hard-to-find way.

andreasjansson added the type/roadmap label on Nov 10, 2020

KushalP commented Nov 19, 2020

What operating systems are you targeting? You could get away with something like eBPF here if you limit it to Linux. It would be very lightweight and wouldn't take resources away from the ML workload.

You could get all of this in a fairly straightforward way. The difficult question will be figuring out the sampling rate you want to aggregate at (15 seconds?).
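
To make the sampling-rate question concrete, here's a minimal sketch of a background sampler (plain psutil in a thread rather than eBPF; the 15-second window is just the number floated above):

```python
# Minimal background sampler sketch; the 15-second window is an assumption.
import threading

import psutil


def sample_cpu(window_seconds, stop_event, samples):
    while not stop_event.is_set():
        # cpu_percent blocks for the window and returns per-CPU utilization
        # averaged over that window, which is the aggregation step.
        samples.append(psutil.cpu_percent(interval=window_seconds, percpu=True))


stop_event = threading.Event()
samples = []
sampler = threading.Thread(target=sample_cpu, args=(15, stop_event, samples), daemon=True)
sampler.start()
# ... training loop runs here ...
stop_event.set()
```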


KushalP commented Nov 19, 2020

If you're a team using AWS/GCP, these metrics may not matter that much to you compared with tracking the instance types you're using. That gives you better signals on the kinds of resource/budget limitations you may have had.


kvthr commented Dec 16, 2020

Hi @andreasjansson. I had an idea along the same lines: adding some training metadata to the checkpoints, like basic GPU specifications (GPU name, memory, and driver version) and the time taken for each epoch. My motivation was that this helps with benchmarking models and hardware.

Basic GPU specifications can be obtained from pynvml, a Python wrapper for NVIDIA's NVML. I'm not sure how to implement the time-taken-per-epoch part.
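
For what it's worth, grabbing those specs with pynvml looks roughly like this (a sketch; older pynvml versions return bytes from nvmlDeviceGetName):

```python
# Sketch: basic GPU specs via pynvml (pip install pynvml).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(handle)           # may be bytes in older pynvml versions
total_memory = pynvml.nvmlDeviceGetMemoryInfo(handle).total
driver_version = pynvml.nvmlSystemGetDriverVersion()
pynvml.nvmlShutdown()

print(name, total_memory, driver_version)
```

For time per epoch, one straightforward option would be to record time.monotonic() before and after each epoch and store the difference alongside the checkpoint.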


turian commented Dec 16, 2020

https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/750c318c0fcf089bd430f4d58e69451eec55f0a9/pytorch/pytorch_utils.py#L144 has some code for counting the number of MFLOPS, which is a useful, non-machine-specific way of profiling different neural net architectures.

pytorch_memlab has some good profiling for PyTorch, which could also be useful.

andreasjansson (Member, Author) commented

@kvthr GPU metrics would be fantastic; I've found memory and per-GPU utilization to be very helpful when debugging bottlenecks.

FLOPS / MACs would be really good too. Thanks for that link @turian! There's also https://github.com/sovrasov/flops-counter.pytorch and https://github.com/Lyken17/pytorch-OpCounter; I haven't looked into them in detail, so I don't know how they compare.
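
For reference, a MACs/params count with one of those libraries looks roughly like this (a sketch using ptflops from sovrasov/flops-counter.pytorch; the resnet18 and input shape are just an example):

```python
# Sketch: count MACs and parameters with ptflops (pip install ptflops torchvision).
import torchvision.models as models
from ptflops import get_model_complexity_info

model = models.resnet18()
macs, params = get_model_complexity_info(
    model, (3, 224, 224), as_strings=True, print_per_layer_stat=False
)
print(f"MACs: {macs}, params: {params}")
```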


turian commented Dec 19, 2020

@andreasjansson just to follow up about what else I want that either doesn't exist yet or that I'm not aware of.

Here are two really serious questions I have about my current project:

  1. Apparently, I have a lot of GPU memory access. This makes no sense to me because I am not loading anything onto the GPU; everything should be pre-loaded. Nonetheless, a lot of time is spent moving things to and from the GPU. Here are more details.

  2. When I increase the batch size, I get GPU OOM errors. I have no idea why, because the data seems quite small. I tried pytorch_memlab but it hasn't helped yet. Related issue: Stonesjtu/pytorch_memlab#28 (Documentation for pl.LightningModule that includes many nn.Modules).

So here are one (or two) tools that I think would have broad adoption. As an added benefit, if they hooked into replicate.ai by default (optionally disabled, perhaps), it would increase adoption of your tool:

  1. A dead simple thing that shows me, for PyTorch (or Python GPU stuff in general), exactly what gets moved to and from the GPU, so I can very quickly spot memory-transfer bottlenecks (a rough profiler-based sketch follows this list).

  2. Improved GPU profiling that shows, in a fine-grained but easy-to-read way, what is causing high GPU memory usage and OOMs. This could be a pytorch_memlab extension.

These are things I would adopt today.
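
As a stopgap for point 1, the built-in PyTorch autograd profiler can already surface host-to-device copies. A rough sketch (the tiny model and batch below are stand-ins, not anything replicate-specific):

```python
# Sketch: spot host-to-device transfers with the built-in autograd profiler.
import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Linear(512, 512).to(device)   # stand-in for a real model
batch = torch.randn(64, 512)             # deliberately starts on the CPU

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    batch = batch.to(device)             # shows up as aten::to / aten::copy_
    loss = model(batch).sum()
    loss.backward()

# Rows with large CUDA time and "to"/"copy_" in the name point at transfer bottlenecks.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```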
