System metrics #320
Comments
What operating systems are you targeting? You could get away with something like eBPF here if limiting it to Linux. It will be very lightweight and won't detract from any ML processing resources. You could get all of this in a fairly straightforward way. The difficult question will be figuring out the sampling rate you want to aggregate at (15 seconds?).
If you're a team using AWS/GCP, these metrics may not matter that much to you compared with tracking the instance types you're using. That gives you better signals on the kinds of resource/budget limitations you may have had.
Hi @andreasjansson. I had an idea along the same lines: adding some of the training metadata to the checkpoints, like basic GPU specifications (GPU name, memory, and driver version) and the time taken for each epoch. The motive I had in mind was that this helps in benchmarking models and hardware. Basic GPU specifications can be obtained from
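For illustration, a minimal sketch of querying those basic GPU specifications via NVML; the library choice (pynvml) is an assumption, since the suggested source in the comment above is cut off:

```python
# Sketch: read basic GPU specs (name, total memory, driver version) via NVML.
# pynvml (nvidia-ml-py) is an assumed dependency, not one named in the thread.
import pynvml

pynvml.nvmlInit()
driver_version = pynvml.nvmlSystemGetDriverVersion()  # may be bytes on older pynvml versions
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    total_memory = pynvml.nvmlDeviceGetMemoryInfo(handle).total  # bytes
    print(f"GPU {i}: {name}, {total_memory / 1024**3:.1f} GiB, driver {driver_version}")
pynvml.nvmlShutdown()
```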
https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/750c318c0fcf089bd430f4d58e69451eec55f0a9/pytorch/pytorch_utils.py#L144 has some code for counting the number of MFLOPS, which is a useful, non-machine-specific way of profiling different neural net architectures.
@kvthr GPU metrics would be fantastic, I've found memory and per-GPU utilization to be very helpful when debugging bottlenecks. FLOPS / MACs would be really good too. Thanks for that link @turian! There's also https://github.com/sovrasov/flops-counter.pytorch and https://github.com/Lyken17/pytorch-OpCounter, I haven't looked into them in detail so I don't know how they compare.
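As a rough sketch of how one of those counters is typically used (pytorch-OpCounter / thop here; the model and input shape are placeholders, and this isn't an endorsement of one library over the others):

```python
# Sketch: count MACs and parameters for a model with pytorch-OpCounter (thop).
import torch
import torchvision.models as models
from thop import profile

model = models.resnet18()
dummy_input = torch.randn(1, 3, 224, 224)  # placeholder input shape
macs, params = profile(model, inputs=(dummy_input,))
print(f"MACs: {macs / 1e6:.1f}M, params: {params / 1e6:.1f}M")
```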
@andreasjansson just to follow up with you about what else I want that doesn't exist, or that I'm not aware of yet. Here are two really serious questions I have about my current project:
So here are one (or two) tools that I think would have broad adoption. As an added benefit, if they hooked into replicate.ai by default (it could perhaps be disabled optionally), it would increase adoption of your tool:
These are things I would adopt today.
Why
Since ML models are often slow and expensive to train, we tend to spend a lot of time fine-tuning computational performance. If we run our own servers we stare at nvidia-smi, htop, iotop, iftop, etc., which is far from ideal. If we're using Colab we're mostly left guessing.
Additionally, for reproducibility it's important to know how much CPU, GPU, and memory was consumed when deciding what type of machine is required to replicate a result.
How
`replicate.checkpoint()` automatically attaches `system_metrics` to the checkpoint data, which includes:

User data
One user asked for this because a change in CUDA version caused their results to not be replicable, in a horrible hard-to-find way.
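For concreteness, a rough sketch of what collecting a `system_metrics` snapshot at checkpoint time could look like; the psutil dependency and the field names are assumptions, since the proposal above doesn't specify an implementation:

```python
# Sketch: build a system_metrics dict that replicate.checkpoint() could attach.
# psutil is an assumed dependency; field names are illustrative, not a proposed schema.
import platform

import psutil
import torch


def system_metrics():
    vm = psutil.virtual_memory()
    return {
        "platform": platform.platform(),
        "cpu_count": psutil.cpu_count(logical=True),
        "cpu_percent": psutil.cpu_percent(interval=None),
        "memory_total_bytes": vm.total,
        "memory_used_bytes": vm.used,
        # Recording library/driver versions covers cases like the CUDA mismatch above.
        "cuda_version": torch.version.cuda,
        "gpus": [
            {
                "name": torch.cuda.get_device_name(i),
                "memory_allocated_bytes": torch.cuda.memory_allocated(i),
            }
            for i in range(torch.cuda.device_count())
        ],
    }
```

The idea in the proposal is that this collection would happen automatically inside `replicate.checkpoint()`, rather than being something users call by hand.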