
[Discussion] How to train using more than 1 GPU? #275

Open
aidv opened this issue Feb 21, 2020 · 7 comments
Labels
question Further information is requested

Comments

@aidv

aidv commented Feb 21, 2020

Is this possible?

Can I train using multiple GPUs?

aidv added the `question` label Feb 21, 2020
@stickyninja3

I don't think so. TensorFlow's documentation states that it does not place operations onto multiple GPUs automatically, and TensorFlow does not easily share graphs or sessions among multiple processes. There are some blog posts discussing this on towardsdatascience.com.

I assume you have been able to get the training working. What is your set-up?

@aidv
Author

aidv commented Feb 27, 2020

@stickyninja3 There's something called distributed training, which implies that it is possible. From the TensorFlow documentation:

tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.
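
For the single-machine multi-GPU case, here's a minimal sketch of how this might plug into the Estimator API that Spleeter uses. The `model_fn` and `train_input_fn` below are toy stand-ins to show the shape of the API, not Spleeter's actual functions:

```python
import numpy as np
import tensorflow as tf  # TF 1.14+, matching the range Spleeter targets

# Toy model_fn / input_fn standing in for Spleeter's real ones.
def model_fn(features, labels, mode):
    preds = tf.layers.dense(features, 1)
    loss = tf.losses.mean_squared_error(labels, preds)
    train_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def train_input_fn():
    x = np.random.rand(64, 8).astype("float32")
    y = np.random.rand(64, 1).astype("float32")
    return tf.data.Dataset.from_tensor_slices((x, y)).batch(8).repeat(10)

# Mirror the model across all GPUs visible on this machine, and hand the
# strategy to the Estimator through RunConfig.
strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn=train_input_fn)
```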

Also, looking at the Spleeter source code, it seems that multiple machines can be used to train a model.

What I wonder now is why one would use multiple machines before taking full advantage of multiple GPUs in a single machine, unless the setup involves multiple machines with multiple GPUs each.

Either way, I'd love it if the Spleeter devs would address this, as it would greatly benefit the community.

So what would be nice to address is:

  1. How to train using multiple GPUs
  2. How to train using multiple machines
  3. Both of the above

@mmoussallam
Collaborator

Hi @aidv

We have no plans to work on this feature for the moment. We don't have much experience with distributed training strategies, and as @stickyninja3 said, it would probably require quite a lot of tuning to make it efficient.

If you feel that it can be achieved with minor changes, feel free to send us a gist of code and we'll look into it.

@aidv
Author

aidv commented Feb 27, 2020

@mmoussallam thank you for addressing that.
I have looked into it a little. I don't have much knowledge of anything TensorFlow-related, but I'm learning little by little.

So what about the multiple machines?

In Spleeter's train.py, at line 95, I can see tf.estimator.train_and_evaluate(..., and tracing the function train_and_evaluate takes me to the file training.py, located in C:\Users\username\Anaconda3\Lib\site-packages\tensorflow_estimator\python\estimator.

Reading some of the comments in that file, I can see a whole bunch of info regarding distributed training.
It seems very doable.
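
From those comments, the multi-machine path appears to work through the `TF_CONFIG` environment variable: every machine runs the same training script, and train_and_evaluate reads the cluster layout from the environment. A rough sketch, with made-up hostnames and ports:

```python
import json
import os

# Each machine in the cluster sets TF_CONFIG before launching the same
# training script; only the "task" entry differs per machine. The hostnames
# and ports below are illustrative, not a real cluster.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["host0.example.com:2222"],
        "worker": ["host1.example.com:2222", "host2.example.com:2222"],
    },
    "task": {"type": "worker", "index": 0},  # this process is worker 0
})

# With TF_CONFIG set, the call Spleeter already makes picks up the cluster:
# tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```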

@stickyninja3

Hi all,

I think distributed training is easier with version 2 of TensorFlow. The blog posts I read all stated that TensorFlow 1.13/1.14 doesn't share models across GPUs. It would be interesting to see what improvements could be made, but I can't even get training working on a single GPU. Nothing I have tried seems to work. It would be interesting to know the exact environments you use. I have been given my Dad's old work laptop, which has a GTX 1660. Going to reformat and try Ubuntu 18.04 now.
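
For comparison, in TF2 the single-machine multi-GPU case is supposed to be just a scope around model building; a minimal sketch with a toy model and random data:

```python
import numpy as np
import tensorflow as tf  # TF 2.x

strategy = tf.distribute.MirroredStrategy()

# Variables created inside the scope are mirrored across all visible GPUs.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
    model.compile(optimizer="adam", loss="mse")

# Keras splits each batch across the replicas automatically.
x = np.random.rand(64, 8).astype("float32")
y = np.random.rand(64, 1).astype("float32")
model.fit(x, y, batch_size=16, epochs=1)
```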

@aidv
Author

aidv commented Mar 4, 2020

@stickyninja3 I wonder how hard it would be to convert the Spleeter code to use v2 of TensorFlow 🤔

Are you on Windows or macOS? I'm on Windows, and it's actually pretty easy to get it up and running.

Give me your email and I'll send you a message.

@stickyninja3

Hi aidv,

[email protected] is my email. I have tried using Windows but couldn't get training working. I have a laptop to use: 64 GB of memory, a 9th-gen Core i7, and a GeForce GTX 1660 Ti.

It would be great to get this working.

Thanks,
Alec
