Can not train a model with multi_process kvworkers #182

Open
kangshantong opened this issue Jul 23, 2021 · 2 comments · May be fixed by #183

Comments

@kangshantong

Hi,
I am trying to train models with ps-lite. It works well in multi-thread mode, as in test_kv_app_multi_workers, but in multi-process mode only one worker process makes progress and the others stay blocked in the ps::Start stage.

Tracing the code of ps::Start, we can see that this stage contains a barrier. After the scheduler and all servers and workers have sent the barrier command, every node should be released by setting its barrier_done_ flag to true. However, the code below sets barrier_done_ to true only for customer_id 0: with one customer per process, size is 1, so the loop touches only index 0 regardless of that customer's actual ID.

void Postoffice::Manage(const Message& recv) {
  CHECK(!recv.meta.control.empty());
  const auto& ctrl = recv.meta.control;
  if (ctrl.cmd == Control::BARRIER && !recv.meta.request) {
    barrier_mu_.lock();
    auto size = barrier_done_[recv.meta.app_id].size();
    for (size_t customer_id = 0; customer_id < size; customer_id++) {
      barrier_done_[recv.meta.app_id][customer_id] = true;
    }
    barrier_mu_.unlock();
    barrier_cond_.notify_all();
  }
}
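
To see why only one worker gets past the barrier, here is a minimal standalone sketch of the loop above. It assumes barrier_done_ is a nested std::unordered_map keyed by app ID and then by customer ID, which matches the iterator-style access used in this thread but is not copied from the ps-lite headers. BarrierMap, MarkByIndex, and CustomerReleased are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Assumed shape of barrier_done_: app_id -> (customer_id -> done flag).
using BarrierMap = std::unordered_map<int, std::unordered_map<int, bool>>;

// Mirrors the index-based loop from Postoffice::Manage shown above.
void MarkByIndex(BarrierMap& barrier_done, int app_id) {
  auto size = barrier_done[app_id].size();
  for (std::size_t customer_id = 0; customer_id < size; customer_id++) {
    // operator[] inserts key `customer_id` if it is not present yet.
    barrier_done[app_id][static_cast<int>(customer_id)] = true;
  }
}

// Simulates a process whose only customer has the given ID and reports
// whether that customer's flag is set after the barrier response arrives.
bool CustomerReleased(int customer_id) {
  BarrierMap barrier_done;
  barrier_done[0][customer_id] = false;  // the customer enters the barrier
  MarkByIndex(barrier_done, 0);
  return barrier_done[0][customer_id];
}
```

Under these assumptions, CustomerReleased(0) returns true while CustomerReleased(1) returns false: the loop sets a flag for key 0, which this process never registered, and leaves the real customer's flag untouched.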

@kangshantong

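The bug can be fixed with the following code, which iterates over the customer IDs actually stored in barrier_done_ instead of assuming they run from 0 to size - 1.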

void Postoffice::Manage(const Message& recv) {
  CHECK(!recv.meta.control.empty());
  const auto& ctrl = recv.meta.control;
  if (ctrl.cmd == Control::BARRIER && !recv.meta.request) {
    barrier_mu_.lock();
    // Release every customer registered under this app_id, whatever its ID.
    for (auto iter = barrier_done_[recv.meta.app_id].begin();
         iter != barrier_done_[recv.meta.app_id].end(); iter++) {
      size_t customer_id = iter->first;
      barrier_done_[recv.meta.app_id][customer_id] = true;
    }
    barrier_mu_.unlock();
    barrier_cond_.notify_all();
  }
}
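
As a possible tidy-up of the same idea, the release step could use a range-based loop and a scoped lock. This is only a sketch under assumptions: BarrierState is a hypothetical stand-in for the barrier-related members of Postoffice, the nested-map shape of the flags is assumed rather than taken from the ps-lite headers, and ReleaseAll and CountReleased are illustrative names, not part of the library.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the barrier-related members of Postoffice.
struct BarrierState {
  std::mutex mu;
  std::condition_variable cond;
  // Assumed shape: app_id -> (customer_id -> done flag).
  std::unordered_map<int, std::unordered_map<int, bool>> done;
};

// Marks every customer registered under app_id as done, then wakes waiters.
void ReleaseAll(BarrierState& state, int app_id) {
  {
    std::lock_guard<std::mutex> lock(state.mu);  // unlocks at end of scope
    auto it = state.done.find(app_id);
    if (it != state.done.end()) {
      for (auto& entry : it->second) {
        entry.second = true;  // entry.first is the customer ID
      }
    }
  }
  state.cond.notify_all();
}

// Registers the given customer IDs under app 0, releases them, and
// returns how many flags ended up set.
std::size_t CountReleased(const std::vector<int>& customer_ids) {
  BarrierState state;
  for (int id : customer_ids) state.done[0][id] = false;
  ReleaseAll(state, 0);
  std::size_t released = 0;
  for (const auto& entry : state.done[0]) {
    if (entry.second) ++released;
  }
  return released;
}
```

Writing through entry.second avoids a second map lookup and never inserts new keys, and the lock_guard releases the mutex even if the loop body were to throw.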

kangshantong added a commit to kangshantong/ps-lite that referenced this issue Jul 26, 2021
@kangshantong kangshantong linked a pull request Jul 26, 2021 that will close this issue
@kangshantong

@eric-haibin-lin can you review this commit?
