
System metrics #320

Open
andreasjansson opened this issue Nov 10, 2020 · 6 comments
Labels
type/roadmap High-level goals. https://github.com/replicate/replicate/projects/1

Comments


andreasjansson commented Nov 10, 2020

Why

Since ML models are often slow and expensive to train, we tend to spend a lot of time fine-tuning computational performance. If we run our own servers, we stare at nvidia-smi, htop, iotop, iftop, etc., which is far from ideal. If we're using Colab, we're mostly left guessing.

Additionally, for reproducibility it's important to know how much CPU, GPU, and memory was consumed when deciding what type of machine is required to replicate a result.

How

replicate.checkpoint() automatically attaches system_metrics to the checkpoint data, which includes the following (a rough collection sketch follows the list):

  • CPU usage per CPU (pegged CPUs in data loaders are a common bottleneck)
  • GPU usage per GPU
  • GPU memory usage (since TF allocates all the GPU memory, might have to think of something smart here)
  • System memory usage
  • Disk bytes read/written
  • Network bytes read/written
  • etc.
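
For illustration, here is a minimal sketch of how these numbers could be gathered with psutil and pynvml (both third-party packages, not part of replicate); the dictionary layout below is a hypothetical shape for system_metrics, not the actual checkpoint format:

```python
# Hypothetical sketch: collect the metrics listed above with psutil + pynvml.
# The returned dict shape is made up for illustration, not replicate's format.
import psutil
import pynvml


def collect_system_metrics():
    pynvml.nvmlInit()
    try:
        gpus = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            gpus.append({
                "utilization_percent": util.gpu,
                # Note: frameworks like TF reserve most GPU memory up front,
                # so "used" reflects the allocator, not live tensors.
                "memory_used_bytes": mem.used,
                "memory_total_bytes": mem.total,
            })
    finally:
        pynvml.nvmlShutdown()

    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    return {
        "cpu_percent_per_cpu": psutil.cpu_percent(percpu=True),
        "memory_used_bytes": psutil.virtual_memory().used,
        "gpus": gpus,
        "disk_read_bytes": disk.read_bytes,
        "disk_write_bytes": disk.write_bytes,
        "net_bytes_sent": net.bytes_sent,
        "net_bytes_recv": net.bytes_recv,
    }
```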

User data

One user asked for this because a change in CUDA version caused their results to not be replicable, in a horribly hard-to-find way.

andreasjansson added the type/roadmap label on Nov 10, 2020

KushalP commented Nov 19, 2020

What operating systems are you targeting? You could get away with something like eBPF here if you limit it to Linux. It would be very lightweight and wouldn't take resources away from the ML workload.

You could get all of this in a fairly straightforward way. The difficult question will be figuring out the sampling rate you want to aggregate at (15 seconds?).
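
To make the sampling-rate question concrete, here's a minimal sketch of a background sampler (plain psutil in a thread rather than eBPF; the 15-second window is just the number floated above):

```python
# Minimal background sampler sketch; the 15-second window is an assumption.
import threading

import psutil


def sample_cpu(window_seconds, stop_event, samples):
    while not stop_event.is_set():
        # cpu_percent blocks for the window and returns per-CPU utilization
        # averaged over that window, which is the aggregation step.
        samples.append(psutil.cpu_percent(interval=window_seconds, percpu=True))


stop_event = threading.Event()
samples = []
sampler = threading.Thread(target=sample_cpu, args=(15, stop_event, samples), daemon=True)
sampler.start()
# ... training loop runs here ...
stop_event.set()
```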


KushalP commented Nov 19, 2020

If you're a team using AWS/GCP, these metrics may not matter that much to you compared with tracking the instance types you're using. That gives you better signals on the kinds of resource/budget limitations you may have had.


kvthr commented Dec 16, 2020

Hi @andreasjansson. I had an idea along the same lines: adding some training metadata to the checkpoints, like basic GPU specifications (GPU name, memory, and driver version) and the time taken for each epoch. My motivation was that this helps with benchmarking models and hardware.

Basic GPU specifications can be obtained from pynvml, a Python wrapper for NVIDIA's NVML. I'm not sure how to implement the time-taken-per-epoch part.
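
For what it's worth, grabbing those specs with pynvml looks roughly like this (a sketch; older pynvml versions return bytes from nvmlDeviceGetName):

```python
# Sketch: basic GPU specs via pynvml (pip install pynvml).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(handle)           # may be bytes in older pynvml versions
total_memory = pynvml.nvmlDeviceGetMemoryInfo(handle).total
driver_version = pynvml.nvmlSystemGetDriverVersion()
pynvml.nvmlShutdown()

print(name, total_memory, driver_version)
```

For time per epoch, one straightforward option would be to record time.monotonic() before and after each epoch and store the difference alongside the checkpoint.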


turian commented Dec 16, 2020

https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/750c318c0fcf089bd430f4d58e69451eec55f0a9/pytorch/pytorch_utils.py#L144 has some code for counting the number of MFLOPS, which is a useful, non-machine-specific way of profiling different neural net architectures.

pytorch_memlab has some good profiling for PyTorch, which could also be useful.

andreasjansson (Member, Author) commented

@kvthr GPU metrics would be fantastic; I've found memory and per-GPU utilization to be very helpful when debugging bottlenecks.

FLOPS / MACs would be really good too. Thanks for that link @turian! There's also https://github.com/sovrasov/flops-counter.pytorch and https://github.com/Lyken17/pytorch-OpCounter; I haven't looked into them in detail, so I don't know how they compare.
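
For reference, a MACs/params count with one of those libraries looks roughly like this (a sketch using ptflops from sovrasov/flops-counter.pytorch; the resnet18 and input shape are just an example):

```python
# Sketch: count MACs and parameters with ptflops (pip install ptflops torchvision).
import torchvision.models as models
from ptflops import get_model_complexity_info

model = models.resnet18()
macs, params = get_model_complexity_info(
    model, (3, 224, 224), as_strings=True, print_per_layer_stat=False
)
print(f"MACs: {macs}, params: {params}")
```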


turian commented Dec 19, 2020

@andreasjansson just to follow up about what else I want that either doesn't exist yet or that I'm not aware of.

Here are two really serious questions I have about my current project:

  1. Apparently, I have a lot of GPU memory access. This makes no sense to me because I am not loading anything onto the GPU; everything should be pre-loaded. Nonetheless, a lot of time is spent moving things to and from the GPU. Here are more details.

  2. When I increase the batch size, I get GPU OOM errors. I have no idea why, because the data seems quite small. I tried pytorch_memlab but it hasn't helped yet. Related issue: Stonesjtu/pytorch_memlab#28 (Documentation for pl.LightningModule that includes many nn.Modules).

So here are one (or two) tools that I think would have broad adoption. As an added benefit, if they hooked into replicate.ai by default (optionally disabled, perhaps), it would increase adoption of your tool:

  1. A dead simple thing that shows me, for PyTorch (or Python GPU stuff in general), exactly what gets moved to and from the GPU, so I can very quickly spot memory-transfer bottlenecks (a rough profiler-based sketch follows this list).

  2. Improved GPU profiling that shows, in a fine-grained but easy-to-read way, what is causing high GPU memory usage and OOMs. This could be a pytorch_memlab extension.

These are things I would adopt today.
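
As a stopgap for point 1, the built-in PyTorch autograd profiler can already surface host-to-device copies. A rough sketch (the tiny model and batch below are stand-ins, not anything replicate-specific):

```python
# Sketch: spot host-to-device transfers with the built-in autograd profiler.
import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Linear(512, 512).to(device)   # stand-in for a real model
batch = torch.randn(64, 512)             # deliberately starts on the CPU

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    batch = batch.to(device)             # shows up as aten::to / aten::copy_
    loss = model(batch).sum()
    loss.backward()

# Rows with large CUDA time and "to"/"copy_" in the name point at transfer bottlenecks.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```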
