Hi,
I am trying to train models with ps-lite. It works well in multi-threaded mode, as in test_kv_app_multi_workers, but in multi-process mode only one worker process runs and the others block in the ps::Start stage.
Tracing the code of ps::Start, we can see that there is a barrier in this stage. After the scheduler, servers, and workers have all sent the barrier command, every node should be released by setting its barrier_done_ flag to true. However, the code below only sets barrier_done_ to true for customer_id 0.
void Postoffice::Manage(const Message& recv) {
  CHECK(!recv.meta.control.empty());
  const auto& ctrl = recv.meta.control;
  if (ctrl.cmd == Control::BARRIER && !recv.meta.request) {
    barrier_mu_.lock();
    auto size = barrier_done_[recv.meta.app_id].size();
    for (size_t customer_id = 0; customer_id < size; customer_id++) {
      barrier_done_[recv.meta.app_id][customer_id] = true;
    }
    barrier_mu_.unlock();
    barrier_cond_.notify_all();
  }
}
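To illustrate the blocking behaviour described above, here is a minimal standalone sketch of per-customer barrier flags guarded by a mutex and a condition variable. It is not ps-lite code: the class BarrierFlags and its methods ReleaseOne, ReleaseAll, and WaitFor are hypothetical names, and the timeout is added only so the sketch terminates instead of hanging.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the barrier bookkeeping; not part of ps-lite.
class BarrierFlags {
 public:
  // One "done" flag per customer, all initially false.
  explicit BarrierFlags(std::size_t num_customers)
      : done_(num_customers, false) {}

  // Release a single customer (the behaviour reported as problematic).
  void ReleaseOne(std::size_t customer_id) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      assert(customer_id < done_.size());
      done_[customer_id] = true;
    }
    cond_.notify_all();
  }

  // Release every customer at once.
  void ReleaseAll() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      for (std::size_t id = 0; id < done_.size(); ++id) done_[id] = true;
    }
    cond_.notify_all();
  }

  // Wait until this customer's flag is set or the timeout expires.
  // Returns true if the customer was released, false on timeout.
  bool WaitFor(std::size_t customer_id, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lk(mu_);
    assert(customer_id < done_.size());
    return cond_.wait_for(lk, timeout, [&] { return bool(done_[customer_id]); });
  }

 private:
  std::mutex mu_;
  std::condition_variable cond_;
  std::vector<bool> done_;
};
```

With two customers, calling ReleaseOne(0) lets WaitFor(0, ...) return true, while WaitFor(1, ...) times out and returns false, which mirrors a waiter that stays blocked. After ReleaseAll(), both waits return true.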