bridge for trainer (#607)
* bridge for trainer based on new channel

* bridge module dev (#578): bridge-core

* grpc client/server interceptor

* feat: train bridge adds waiting_alert_timeout to log a warning if the peer blocks

* [feat]: bridge stream transmit closes on idle timeout (default idle timeout: 30 sec)

* fix(channel): add grpc call timeout=heartbeat_interval

* fix: follower blocks while waiting for a data block if the leader raises

* feat: add complete-checkpoint

* fix: saver_hook on worker 1

* fix: load_checkpoint_filename

Co-authored-by: zhangzihui <[email protected]>
whisylan and ZhZhang711 authored Mar 24, 2021
1 parent 2506074 commit 438d603
Showing 36 changed files with 2,523 additions and 791 deletions.
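
Reviewer note on the commit message: one bullet says the bridge stream transmit closes after an idle timeout, 30 seconds by default. The bridge code itself is not shown on this page, so the following is only a minimal sketch of that idea, not the commit's implementation. The function name `transmit_until_idle` and the queue-based design are hypothetical, chosen for illustration.

```python
import queue


def transmit_until_idle(outbox, idle_timeout=30.0):
    """Yield queued messages; stop once nothing arrives for idle_timeout seconds.

    Hypothetical helper: a generator like this could feed a streaming call,
    which ends when the generator returns.
    """
    while True:
        try:
            # Block for at most idle_timeout seconds waiting for the next message.
            item = outbox.get(timeout=idle_timeout)
        except queue.Empty:
            # No traffic within the idle window: end the stream.
            return
        yield item


# Usage: a short timeout keeps the demo fast.
outbox = queue.Queue()
outbox.put("a")
outbox.put("b")
sent = list(transmit_until_idle(outbox, idle_timeout=0.1))
print(sent)
```

The timeout restarts after every message, so a busy stream stays open indefinitely and only a quiet one is closed.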
5 changes: 5 additions & 0 deletions .gitignore
@@ -97,3 +97,8 @@ server.config.js
es.match_phrase.js
es.match_phrase.json
web_console/config
example/**/data
example/**/model
example/**/exp
output
!test/channel/greeter_pb2*.py
7 changes: 7 additions & 0 deletions Makefile
@@ -18,6 +18,13 @@ protobuf:
		--grpc_python_out=. \
		protocols/fedlearner/common/*.proto

	python -m grpc_tools.protoc -I protocols -I$(TF_PATH) \
		--python_out=. \
		--grpc_python_out=. \
		protocols/fedlearner/channel/*.proto

	cd web_console_v2/api; make protobuf

lint:
	pylint --rcfile ci/pylintrc fedlearner example

11 changes: 9 additions & 2 deletions deploy/scripts/trainer/run_trainer_worker.sh
@@ -65,10 +65,15 @@ if [ -n "$CHECKPOINT_PATH" ]; then
else
checkpoint_path="$OUTPUT_BASE_DIR/checkpoints"
fi
load_checkpoint_filename=$(normalize_env_to_args "--load-checkpoint-filename" "$LOAD_CHECKPOINT_FILENAME")
load_checkpoint_filename_with_path=$(normalize_env_to_args "--load-checkpoint-filename-with-path" "$LOAD_CHECKPOINT_FILENAME_WITH_PATH")

if [[ -n "$LOAD_CHECKPOINT_FROM" ]] && (( $WORKER_RANK == 0 )); then
python -c "
-import tensorflow as tf
+try:
+    import tensorflow.compat.v1 as tf
+except ImportError:
+    import tensorflow as tf
import tensorflow_io
src = '${STORAGE_ROOT_PATH}/job_output/${LOAD_CHECKPOINT_FROM}/checkpoints'
dst = '${checkpoint_path}'
@@ -103,4 +108,6 @@ python main.py \
--export-path=$export_path \
$mode $verbosity \
$save_checkpoint_steps $sparse_estimator $summary_save_steps \
-$save_checkpoint_secs $batch_size $learning_rate
+$save_checkpoint_secs $batch_size $learning_rate \
+$load_checkpoint_filename \
+$load_checkpoint_filename_with_path
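
Reviewer note on this file: the two new variables are built with `normalize_env_to_args`, whose definition is not part of this diff. The Python sketch below assumes the helper emits `flag=value` when the environment variable is non-empty and nothing otherwise. That behaviour is an assumption, and `build_command` is a hypothetical name used only to show how the optional flags would end up on the `python main.py` command line.

```python
def normalize_env_to_args(flag, value):
    """Return ['flag=value'] when value is set, else an empty list.

    Assumed behaviour of the shell helper of the same name; the real
    definition lives elsewhere in the repository.
    """
    return [f"{flag}={value}"] if value else []


def build_command(env):
    """Assemble the trainer command with the two optional checkpoint flags."""
    args = ["python", "main.py"]
    args += normalize_env_to_args(
        "--load-checkpoint-filename",
        env.get("LOAD_CHECKPOINT_FILENAME"))
    args += normalize_env_to_args(
        "--load-checkpoint-filename-with-path",
        env.get("LOAD_CHECKPOINT_FILENAME_WITH_PATH"))
    return args


# Usage: only variables that are actually set produce a flag.
print(build_command({"LOAD_CHECKPOINT_FILENAME": "model.ckpt-100"}))
print(build_command({}))
```

Under this assumption an unset variable contributes no argument at all, which is why the script can append `$load_checkpoint_filename` unconditionally.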
1 change: 1 addition & 0 deletions example/wide_n_deep/test.sh
@@ -23,6 +23,7 @@ elif [ "$ROLE" == "follower" ]; then
--save-checkpoint-steps=100 \
--export-path=model/follower/saved_model \
--verbosity=2

else
echo "usage: $0 [leader | follower]"
fi
17 changes: 17 additions & 0 deletions fedlearner/channel/__init__.py
@@ -0,0 +1,17 @@
# Copyright 2020 The FedLearner Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# -*- coding: utf-8 -*-

from fedlearner.channel.channel import Channel