Provide TF_CONFIG environment variable for distributed TensorFlow #15
Hi Damien,

Great, thanks for the response. It is a JSON-serialised dictionary in an environment variable, but it would mean distributed training would 'just work ™️'.
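To illustrate what "a JSON serialised dictionary in an environment variable" looks like, here is a small sketch; the addresses and cluster shape are made up, but the overall structure matches what `tf.estimator.RunConfig` parses from `TF_CONFIG`:

```python
import json
import os

# Hypothetical illustration (addresses made up) of the JSON-serialised
# dictionary that TF_CONFIG holds. tf.estimator.RunConfig reads this
# automatically to build its cluster spec.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'ps': ['10.0.0.4:2222'],
        'worker': ['10.0.0.5:2222', '10.0.0.6:2222'],
    },
    'task': {'type': 'worker', 'index': 0},
    'environment': 'cloud',
})
print(os.environ['TF_CONFIG'])
```

Because the Estimator picks this up from the environment, no extra command line plumbing is needed per node.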
Looking at the Azure Batch AI environment variables, it seems this is now available.
Sorry, the functionality is not released yet. |
May I ask if anyone or @damienpontifex knows what the env variable for the master host is? I encounter "ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node."
@wtam my understanding is that you must have the task type and index set appropriately for the chief node, as in the page you linked. Without seeing your actual code, that seems to be the minimum requirement.
@damienpontifex Thanks so much for the response. Since Batch AI only has these env vars: $AZ_BATCHAI_PS_HOSTS, $AZ_BATCHAI_WORKER_HOSTS and $AZ_BATCHAI_TASK_INDEX, I overcame the chief node definition issue above by manually reserving the 1st worker host as the chief node and putting it into the cluster spec. Now I've moved a bit forward but encounter another issue from RunConfig below. This is the cluster spec for the failed worker:
Stupid mistake I made on the cluster spec naming: RunConfig was trying to find "worker" from my worker_hosts, and that's why I got the ValueError. For people playing around with distributed GPU Estimators on Batch AI, better to wait for official support, as the way I reserved the worker node also requires me to decrement $AZ_BATCHAI_TASK_INDEX manually in the cluster spec for the workers.
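The manual remapping described above can be sketched roughly like this. This is a hypothetical reconstruction, not the actual code from the thread: the host values are made up, and I'm assuming the $AZ_BATCHAI_* host variables are comma-separated host:port lists:

```python
import json
import os

# Made-up stand-ins for what Batch AI would provide on a worker node
# (assumed comma-separated host:port format).
os.environ.setdefault('AZ_BATCHAI_PS_HOSTS', '10.0.0.4:2222')
os.environ.setdefault('AZ_BATCHAI_WORKER_HOSTS',
                      '10.0.0.4:2223,10.0.0.5:2222,10.0.0.6:2222')
os.environ.setdefault('AZ_BATCHAI_TASK_INDEX', '1')

ps_hosts = os.environ['AZ_BATCHAI_PS_HOSTS'].split(',')
worker_hosts = os.environ['AZ_BATCHAI_WORKER_HOSTS'].split(',')
task_index = int(os.environ['AZ_BATCHAI_TASK_INDEX'])

# Reserve the 1st worker host as the chief node.
cluster = {
    'chief': [worker_hosts[0]],
    'ps': ps_hosts,
    'worker': worker_hosts[1:],
}

# Worker 0 becomes the chief; remaining workers decrement their index.
# (This covers worker/chief nodes; a ps node would set type 'ps' instead.)
if task_index == 0:
    task = {'type': 'chief', 'index': 0}
else:
    task = {'type': 'worker', 'index': task_index - 1}

os.environ['TF_CONFIG'] = json.dumps({'cluster': cluster, 'task': task})
print(os.environ['TF_CONFIG'])
```

The index decrement is the fragile part noted above: every worker has to agree on the shifted numbering or the cluster spec and task indices fall out of sync.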
Hi @damienpontifex, maybe you already know this: Batch AI now automatically generates the TF_CONFIG env var when running a TensorFlow job. Would you please try it out and let us know if it works for you? Thanks!
Hi @lliimsft, I'm seeing the automatically generated TF_CONFIG env var with nodeCount 1 as: |
Getting this error in the stderr-ps-0.txt log when running with nodeCount=3: "ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node." For this task the TF_CONFIG variable was:
The worker logs just had "Warning: Permanently added '[10.0.0.5]:23' (ECDSA) to the list of known hosts."
I put the code I'm running here: https://github.com/damienpontifex/BatchAIMnist. From the repo, I do:

```sh
sh prepare-cluster.sh
sh data-prep.sh
# Wait until data prep is done
sh train.sh
```
Looking at the documentation, I'm wondering what the TF_CONFIG value should be on the parameter server and on the chief. I can't seem to find guidance on having all of chief, ps and worker on the same machine, as the docstring https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/estimator/run_config.py#L351-L376 seems to have them all as separate machines. How can we assist to test and get this working?
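For concreteness, here is my guess at what the per-node values might look like, purely illustrative (addresses made up, shape based on the `RunConfig` docstring linked above):

```python
import json

# Purely illustrative cluster; all addresses are made up.
cluster = {
    'chief': ['10.0.0.4:2223'],
    'ps': ['10.0.0.4:2222'],
    'worker': ['10.0.0.5:2222', '10.0.0.6:2222'],
}

# On the parameter server:
ps_tf_config = json.dumps(
    {'cluster': cluster, 'task': {'type': 'ps', 'index': 0},
     'environment': 'cloud'})

# On the chief:
chief_tf_config = json.dumps(
    {'cluster': cluster, 'task': {'type': 'chief', 'index': 0},
     'environment': 'cloud'})

print(ps_tf_config)
print(chief_tf_config)
```

Every node would share the same `cluster` value; only the `task` entry differs per node.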
@lliimsft, @AlexanderYukhanov, can we please get some update on this? :) |
@damienpontifex @yangsiyu007 |
Thank you @lliimsft @yangsiyu007. I also wasn't aware of the change; thank you for the continued effort to support this.
Hello guys, just wondering if Batch AI is generating the new format of TF_CONFIG now? |
I don't think so - not when I tried it the week before last... @lliimsft updates? |
@yangsiyu007 @awan-10 This work is still in progress. We will keep you updated in this post. |
I was looking at what is currently being set and what changes are needed. To verify what JSON structure worked, I set up:

```python
os.environ['TF_CONFIG'] = TF_CONFIG_JSON_STRING

config = tf.estimator.RunConfig()
print('master => {}'.format(config.master))
print('task_id => {}'.format(config.task_id))
print('num_ps_replicas => {}'.format(config.num_ps_replicas))
print('num_worker_replicas => {}'.format(config.num_worker_replicas))
print('cluster_spec => {}'.format(config.cluster_spec))
print('task_type => {}'.format(config.task_type))
print('is_chief => {}'.format(config.is_chief))
```

This was run with a 3 node job configured with 1 parameter server and 3 worker count.

**Current**

Currently in Batch AI we get the TF_CONFIG environment variable being:

ps-0

```json
{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"ps","index":0},"environment":"cloud"}
```

wk-0

```json
{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}
```

wk-1

```json
{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":1},"environment":"cloud"}
```

wk-2

```json
{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":2},"environment":"cloud"}
```

With these, the python code above gave the error: ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node.
**Working**

To get this working, we apparently need the master worker defined under chief in the cluster. As such, the 'cluster' part of the JSON object would become:

```json
"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]}
```

Then the task component would be changed for whichever node is initiated from:

```json
"task":{"type":"chief","index":0}
```

The other worker nodes would have the same as before with index now being 0 or 1, e.g.

```json
"task":{"type":"worker","index":1}
```

**Testing**

This sample code parses each of these definitions into a RunConfig and logs the result:

```python
import os
import json
import tensorflow as tf


def log_config_for(runconfig_string):
    os.environ['TF_CONFIG'] = runconfig_string
    config = tf.estimator.RunConfig()
    print('master => {}'.format(config.master))
    print('task_id => {}'.format(config.task_id))
    print('num_ps_replicas => {}'.format(config.num_ps_replicas))
    print('num_worker_replicas => {}'.format(config.num_worker_replicas))
    print('cluster_spec => {}'.format(config.cluster_spec))
    print('task_type => {}'.format(config.task_type))
    print('is_chief => {}'.format(config.is_chief))
    print()


def main():
    machine_definitions = [
        # Machine expected from settings with parameterServerCommandLineArgs
        '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"ps","index":0},"environment":"cloud"}',
        # Machine expected from settings with masterCommandLineArgs
        '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"chief","index":0},"environment":"cloud"}',
        # Machine expected from settings with workerCommandLineArgs
        '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}',
        # Machine expected from settings with workerCommandLineArgs
        '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":1},"environment":"cloud"}'
    ]

    for definition in machine_definitions:
        log_config_for(definition)


if __name__ == '__main__':
    main()
```
I found a workaround: I was able to manipulate the TF_CONFIG environment variable and get it working, and put the code here: https://github.com/damienpontifex/batchai-tfconfig-workaround. The environment variable manipulation was:

```python
def remap_tfconfig(is_master):
    tf_config = json.loads(os.environ['TF_CONFIG'])

    # Promote the first worker to be the chief
    master_worker = tf_config['cluster']['worker'][0]
    tf_config['cluster']['worker'] = tf_config['cluster']['worker'][1:]
    tf_config['cluster']['chief'] = [master_worker]

    if is_master:
        tf_config['task']['type'] = 'chief'
        tf_config['task']['index'] = 0
    elif tf_config['task']['type'] == 'worker':
        # Remaining workers shift their index down by one
        tf_config['task']['index'] -= 1

    os.environ['TF_CONFIG'] = json.dumps(tf_config)
```

And I pass in `is_master` appropriately for each node. Hopefully this can help in getting the fix into Batch AI 😄
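One hypothetical way to wire up the `is_master` flag (not necessarily how the linked repo does it) is a command line switch passed only in `masterCommandLineArgs`, so the same script runs everywhere:

```python
import argparse

# Hypothetical wiring: masterCommandLineArgs would include --is-master,
# while workerCommandLineArgs and parameterServerCommandLineArgs would not.
parser = argparse.ArgumentParser()
parser.add_argument('--is-master', action='store_true')

master_args, _ = parser.parse_known_args(['--is-master'])  # simulated master argv
worker_args, _ = parser.parse_known_args([])               # simulated worker argv
print(master_args.is_master, worker_args.is_master)
```

`remap_tfconfig(args.is_master)` would then be called at startup, before `tf.estimator.RunConfig()` reads TF_CONFIG.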
Tried this again today in Azure ML Workspace with 'Machine Learning Compute', following the Parameter Server setup, and got an error:
I found this description of chief vs. master: Based on it, |
The TensorFlow `ClusterConfig` can parse worker and parameter server settings from a `TF_CONFIG` environment variable (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py#L64-L156).

I was trying to pass it via an environment variable in the job configuration file like so:

```
"{ 'cluster': { 'ps': $AZ_BATCHAI_PS_HOSTS, 'worker': $AZ_BATCHAI_WORKER_HOSTS }, 'task': { 'index': $AZ_BATCHAI_TASK_INDEX, 'type': '' } }"
```

Which is kind of fine, but falls down for a few cases.

More generally though, providing this configuration via a `TF_CONFIG` environment variable would significantly lower the bar to getting distributed training working in TensorFlow and Azure Batch. It would also simplify command line arg parsing: just the appropriate data directories would need to be passed, and the same arguments could be used across master, worker and ps, potentially simplifying the tensorflowSettings property further.
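As a sketch of what Batch AI would need to do internally, the mapping from its own variables to `TF_CONFIG` is small. This is a hypothetical reconstruction, assuming the host variables are comma-separated host:port strings (values below are made up):

```python
import json
import os

# Made-up stand-ins for what Batch AI provides.
os.environ.setdefault('AZ_BATCHAI_PS_HOSTS', '10.0.0.4:2222')
os.environ.setdefault('AZ_BATCHAI_WORKER_HOSTS', '10.0.0.5:2222,10.0.0.6:2222')
os.environ.setdefault('AZ_BATCHAI_TASK_INDEX', '0')

tf_config = {
    'cluster': {
        # Split the comma-separated host lists into JSON arrays.
        'ps': os.environ['AZ_BATCHAI_PS_HOSTS'].split(','),
        'worker': os.environ['AZ_BATCHAI_WORKER_HOSTS'].split(','),
    },
    # The task type still has to be filled per node (ps vs worker),
    # which is the part the plain string substitution above cannot do.
    'task': {'index': int(os.environ['AZ_BATCHAI_TASK_INDEX']),
             'type': 'worker'},
    'environment': 'cloud',
}
os.environ['TF_CONFIG'] = json.dumps(tf_config)
print(os.environ['TF_CONFIG'])
```

The per-node task type is why doing this in the job configuration file alone is awkward: the cluster part is the same everywhere, but each node needs its own task entry.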