Training Time #72

Closed
Huster-Hq opened this issue May 30, 2024 · 6 comments

Comments

@Huster-Hq

Huster-Hq commented May 30, 2024

Hi, does the '30 hours' here cover only the 'main_training' part?

Also, what is the total number of frames in your training dataset? And are the total number of training iterations 125000 and the batch_size 16 (following your train_config.yaml)?

image

@hkchengrex
Owner

It is for the entire training process.

I don't have the total number of training frames off the top of my head, but you can just count from the training dataset.

The default configuration in the code is correct.

@Huster-Hq
Author

Thanks!

I am currently facing a problem: training takes too long when I run train.py on our custom dataset (two 3090 GPUs).

My custom dataset has 19544 frames in total, and I set 'iteration=125000' and 'batch_size=14'.

The following figures show my training log and GPU status. Is this reasonable?
image
image

@hkchengrex
Owner

  1. The first printed "time" is not reliable as it includes warm-up and initialization.
  2. In general, we want high GPU utilization (see Training low perf #58 (comment)). Your GPUs are sitting at 18°C and 29°C, so I don't think they are working hard enough unless you have really good cooling. This usually indicates a bottleneck in CPU/disk IO.
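
One rough way to check for the CPU/disk IO bottleneck mentioned above is to split wall-clock time into "waiting for the next batch" and "running the training step". The sketch below is not part of this repository: `profile_loader`, `batches`, and `step_fn` are hypothetical names, and it uses only the Python standard library. It skips the first few steps because, as noted in point 1, early timings include warm-up and initialization.

```python
import time


def profile_loader(batches, step_fn, max_steps=50, warmup=5):
    """Measure time spent waiting for data vs. running the step.

    batches: any iterable that yields batches (e.g. a data loader).
    step_fn: callable that runs one training step on a batch.
             For GPU work it should block until the step has finished
             (e.g. by calling torch.cuda.synchronize() inside it),
             otherwise compute time is under-counted.
    """
    wait = 0.0
    compute = 0.0
    measured = 0
    it = iter(batches)
    for i in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(it)          # time here = waiting on CPU/disk
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)                # time here = the training step
        t2 = time.perf_counter()
        if i >= warmup:               # ignore warm-up steps
            wait += t1 - t0
            compute += t2 - t1
            measured += 1
    total = wait + compute
    return {
        "steps": measured,
        "data_wait_s": wait,
        "compute_s": compute,
        "data_wait_fraction": wait / total if total > 0 else 0.0,
    }
```

If `data_wait_fraction` is close to 1, the GPUs are mostly idle waiting for data, which would be consistent with low utilization and low temperatures; if it is close to 0, the bottleneck is more likely elsewhere.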

@Huster-Hq
Author

Huster-Hq commented May 31, 2024

At later 'time' readings, the 'avg_time' is still 12-13.

image
image
image
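
To put these numbers in context, here is a small arithmetic sketch. The thread does not say what unit 'avg_time' is in or how many iterations one reading covers, so `seconds_per_reading` and `iterations_per_reading` are assumptions to be adjusted to match the actual log. The function names are hypothetical and not part of the repository.

```python
def projected_hours(total_iterations, seconds_per_reading,
                    iterations_per_reading=1):
    """Projected total training time in hours.

    Assumes each logged reading is the average time in seconds for
    `iterations_per_reading` iterations (an assumption, not confirmed
    in the thread).
    """
    seconds_per_iteration = seconds_per_reading / iterations_per_reading
    return total_iterations * seconds_per_iteration / 3600.0


def required_seconds_per_iteration(total_iterations, target_hours):
    """Per-iteration time needed to finish within `target_hours`."""
    return target_hours * 3600.0 / total_iterations


if __name__ == "__main__":
    # 125000 iterations, assuming avg_time of 12-13 is seconds per iteration.
    for avg in (12.0, 13.0):
        print(f"avg_time={avg}: {projected_hours(125000, avg):.1f} hours")
    # Per-iteration budget for a 30-hour run of 125000 iterations.
    print(f"{required_seconds_per_iteration(125000, 30):.3f} s/iteration")
```

Under the per-iteration assumption, 125000 iterations at 12 to 13 seconds each would take roughly 417 to 451 hours, while finishing in 30 hours would need about 0.864 seconds per iteration. If a reading instead covers several iterations, set `iterations_per_reading` accordingly.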

@hkchengrex
Owner

Your CPUs and GPUs are not being used as much as they should be. Potentially related: pytorch/pytorch#99625

@hkchengrex
Owner

Please feel free to re-open if there are any updates.
