Enquiries about Parameter Sharding #98
Comments
Thanks for your interest in our project! We have already encapsulated those collective operators in our schedule language, so users only need to write the schedule primitives in order to conduct parameter sharding.
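For illustration only, here is a minimal sketch of what "encapsulating the collective behind a schedule primitive" can mean in plain PyTorch: the framework registers a forward hook that all_reduces a sharded module's partial output, so the user never writes the collective call directly. The helper name `attach_output_all_reduce` is hypothetical and is not this repository's API.

```python
import torch.distributed as dist


def attach_output_all_reduce(module):
    """Hypothetical helper: make `module` aggregate its partial output.

    Assumes the process group is already initialized and that `module`'s
    parameters are sharded so its forward pass produces a partial result.
    A real training setup would use an autograd-aware collective.
    """
    def _all_reduce_hook(mod, inputs, output):
        # Sum the partial results produced on each rank.
        dist.all_reduce(output, op=dist.ReduceOp.SUM)
        return output

    module.register_forward_hook(_all_reduce_hook)
    return module
```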
Thanks for replying! I've further studied the code and there are some follow-up questions I would like to ask.
Thanks a lot!
Hello there, I have been reading your research paper "Decoupled Model Schedule for Deep Learning Training". In particular, in part (3) on tensor parallelism, it is mentioned that "Since the output tensor only holds partial results after sharding, we need to conduct all_reduce to aggregate outputs from different devices". May I know which part of the source code in this repository performs the all_reduce operation? Right now I am looking at build.py, and I believe the captured code below handles the sharded parameters and performs a split when the user adds a shard. However, I am not exactly sure where the all_reduce operation is done. Any help will be appreciated. Thank you.
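As a sketch of the mechanism the paper describes (plain PyTorch, not this repository's implementation): when the weight is sharded along the reduction dimension, each device's local matmul yields only a partial sum of the full output, and all_reduce adds those partial sums together so every rank ends up with the complete result.

```python
import torch.distributed as dist


def sharded_linear_forward(x_local, w_shard, b):
    """Row-sharded linear layer, assuming the process group is initialized.

    x_local: [batch, in_features // world_size]  (this rank's input slice)
    w_shard: [in_features // world_size, out_features]  (this rank's weight slice)
    b:       [out_features]
    """
    # Each rank contracts over only its slice of the reduction dimension,
    # so `partial` holds a partial sum of the true output.
    partial = x_local @ w_shard
    # Aggregate the partial results from all devices.
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    # Add the bias once, after the reduction.
    return partial + b
```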