
Multi GPU (WIP) #169
Open
nakosung wants to merge 15 commits into master
Conversation

@nakosung (Contributor) commented on Nov 2, 2016

Multi-GPU implementation (multi-tower, with gradient averaging).
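A minimal sketch of the multi-tower, gradient-averaging pattern this PR describes, not the PR's actual code: one replica of the model is built per GPU, and the per-tower gradients are averaged before applying the update. It is written against the TensorFlow 0.x API in use at the time; `build_loss` and `NUM_GPUS` are hypothetical placeholders.

```python
import tensorflow as tf

NUM_GPUS = 2  # hypothetical; set to the number of available GPUs

optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
tower_grads = []
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i):
        # Reuse variables after the first tower so all towers share the
        # single set of parameters created via tf.get_variable().
        with tf.variable_scope('wavenet', reuse=(i > 0)):
            loss = build_loss()  # hypothetical: per-tower loss on its data shard
            tower_grads.append(optimizer.compute_gradients(loss))

# Average each variable's gradient across the towers.
average_grads = []
for grad_and_vars in zip(*tower_grads):
    grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
    mean_grad = tf.reduce_mean(tf.concat(0, grads), 0)  # TF 0.x concat signature
    average_grads.append((mean_grad, grad_and_vars[0][1]))

train_op = optimizer.apply_gradients(average_grads)
```

This is the same averaging scheme used in TensorFlow's CIFAR-10 multi-GPU example from that era, which PRs of this kind typically followed.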

@nakosung (Contributor, Author) commented on Nov 2, 2016

I will squash the commits after the review is completed. :)

@lemonzi (Collaborator) commented on Nov 2, 2016

@nakosung GitHub takes care of squashing during the merge; don't worry about that :)

@jyegerlehner (Contributor) left a comment:
Thanks, nakosung.

I don't see anything wrong with this. I'm assuming the somewhat complicated code for summing up gradients from the different GPUs works. With a change this big, it's probably better if someone else independently verifies that it works and hasn't broken anything before we merge. Can anyone confirm? I can check that it still works on a single GPU, but that's as much as I can do at the moment.

```diff
@@ -3,21 +3,36 @@
 from .ops import causal_conv, mu_law_encode


-def create_variable(name, shape):
+def _create_variable(name, shape):
```
@jyegerlehner (Contributor) commented:

I thought the underscore was for member functions (methods) of classes. Am I wrong?
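For reference on the naming question: per PEP 8, a single leading underscore marks any internal name, at module level as well as inside classes, not just methods. A minimal illustration:

```python
# Module-level "private" helper: the leading underscore signals internal use
# and keeps the name out of `from module import *`.
def _create_variable(name, shape):
    pass


class Model(object):
    def _helper(self):  # the same convention applies to methods
        pass
```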


# Set up logging for TensorBoard.
writer = tf.train.SummaryWriter(logdir)
writer.add_graph(tf.get_default_graph())
run_metadata = tf.RunMetadata()
summaries = tf.merge_all_summaries()
summaries = tf.merge_summary(summaries)
#summaries = tf.merge_all_summaries()
@jyegerlehner (Contributor) commented:

Delete the commented-out line.
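For context on this hunk, a hedged sketch (TensorFlow 0.x API) of the two merge styles involved: `tf.merge_all_summaries()` merges every summary registered in the graph, while `tf.merge_summary()` merges an explicit list, such as summaries collected per tower. The `tower_losses` list here is hypothetical.

```python
import tensorflow as tf

# Explicit list of per-tower summary ops (hypothetical values).
tower_losses = [tf.constant(0.0), tf.constant(0.0)]
tower_summaries = [tf.scalar_summary('tower_%d/loss' % i, loss)
                   for i, loss in enumerate(tower_losses)]

# Multi-tower style: merge only the listed summaries.
merged = tf.merge_summary(tower_summaries)

# Single-graph style: merge everything in the default summary collection.
all_merged = tf.merge_all_summaries()
```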

@astanway commented:
FWIW, I've tried running this branch on an Amazon p2.8xlarge (with 8 GPUs), and I'm finding that while all GPUs log full memory utilization, the volatile GPU-util reported by nvidia-smi stays at 0% across all of them, and the training time per step is the same as with a single GPU (~3 seconds).

@nakosung (Contributor, Author) commented:
@astanway Actually, this branch feeds an additional minibatch to each GPU, so the effective batch size changes; you should tweak your learning rate and batch size before judging whether the branch works or not.
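To illustrate the point (these numbers are my own, not from the PR): with one minibatch per GPU, the effective batch size scales with the GPU count, and the learning rate is often rescaled to match.

```python
# Hypothetical numbers for illustration only.
per_gpu_batch_size = 1
num_gpus = 8
effective_batch_size = per_gpu_batch_size * num_gpus  # 8x the single-GPU batch

base_learning_rate = 1e-3
# Linear scaling is one common heuristic, not something this PR prescribes.
scaled_learning_rate = base_learning_rate * num_gpus
```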

@akademi4eg (Collaborator) commented:
This PR is marked as WIP. Is anything still unfinished (besides resolving the conflicts)?
Also, has anyone had a chance to test this PR on a multi-GPU setup? @astanway @nakosung

```python
    '''Create a bias variable with the specified name and shape and initialize
    it to zero.'''
    initializer = tf.constant_initializer(value=0.0, dtype=tf.float32)
    return tf.Variable(initializer(shape=shape), name)


def _get_variable(name, shape):
```

@weixsong commented:

Should the create_embedding_table() function be changed to use tf.get_variable() too? I'm not quite sure: if the embedding table is also trained as part of the model parameters, should it be shared among the GPUs?

@nakosung (Contributor, Author) commented:

@weixsong Yes. This PR is outdated. :)
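For context on the sharing question above, a minimal sketch (TensorFlow 0.x API, not the PR's code) of why `tf.get_variable()` matters here: inside a reusing variable scope it returns the same underlying variable, so every tower, including an embedding table created this way, shares one set of weights.

```python
import tensorflow as tf

def _get_variable(name, shape):
    # Zero initializer is an arbitrary choice for this illustration.
    return tf.get_variable(name, shape=shape,
                           initializer=tf.constant_initializer(0.0))

with tf.variable_scope('wavenet'):
    table_a = _get_variable('embedding_table', [256, 32])
with tf.variable_scope('wavenet', reuse=True):
    table_b = _get_variable('embedding_table', [256, 32])

assert table_a is table_b  # both towers see the same shared parameters
```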

@weixsong commented:

Hi @nakosung, do you have a model that was trained with this multi-GPU code? Are the results as good as with a single GPU?

@nakosung (Contributor, Author) commented:

@weixsong I haven't generated samples with the multi-node version, but I think it should not differ from the single-node version, because its loss seemed good enough.
