This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Provide TF_CONFIG environment variable for distributed TensorFlow #15

damienpontifex opened this issue Dec 21, 2017 · 23 comments

@damienpontifex

The TensorFlow ClusterConfig can parse worker and parameter server settings from a TF_CONFIG environment variable (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py#L64-L156)

I was trying to pass it via an environment variable in the job configuration file like so:

"{ 'cluster': { 'ps': $AZ_BATCHAI_PS_HOSTS, 'worker': $AZ_BATCHAI_WORKER_HOSTS }, 'task': { 'index': $AZ_BATCHAI_TASK_INDEX, 'type': '' } }"

Which is kind of fine, but falls down for a few cases:

  1. When there are no parameter servers (i.e. single node) the ps hosts should be an empty array, but in this case it's just an empty string.
  2. The variables for hosts and workers are comma separated and the TF code parses it as JSON, so would ideally be an array type inside this string.
  3. The 'task.type' property can be 'master', 'worker' or 'ps', but there doesn't seem to be an appropriate environment variable for it, so I had to pass the option via command line args.

More generally though, providing this configuration via a TF_CONFIG environment variable would significantly lower the bar to getting distributed training working with TensorFlow and Azure Batch. It would also simplify command line arguments: only the appropriate data directories would need to be passed, the same arguments could be used across master, worker and ps, and the tensorflowSettings property could potentially be simplified further.
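To illustrate the parsing problems above, here is a minimal sketch of turning the comma-separated host variables into the JSON arrays TF_CONFIG needs. The AZ_BATCHAI_* names follow the convention in the job file, but the host values and the build_tf_config helper are illustrative, not part of Batch AI:

```python
import json
import os

# Illustrative values; in a real job Batch AI would set these.
os.environ.setdefault('AZ_BATCHAI_PS_HOSTS', '10.0.0.4:2222')
os.environ.setdefault('AZ_BATCHAI_WORKER_HOSTS', '10.0.0.5:2222,10.0.0.6:2222')
os.environ.setdefault('AZ_BATCHAI_TASK_INDEX', '0')

def build_tf_config(task_type):
    """Turn comma-separated host lists into the JSON arrays TF_CONFIG expects."""
    # An empty variable becomes an empty list, not [''], covering case 1 above.
    split = lambda value: [h for h in value.split(',') if h]
    return json.dumps({
        'cluster': {
            'ps': split(os.environ['AZ_BATCHAI_PS_HOSTS']),
            'worker': split(os.environ['AZ_BATCHAI_WORKER_HOSTS']),
        },
        'task': {
            'index': int(os.environ['AZ_BATCHAI_TASK_INDEX']),
            'type': task_type,  # case 3: the type still has to come from elsewhere
        },
    })

os.environ['TF_CONFIG'] = build_tf_config('worker')
```

Note the task type still has to be chosen per node, which is exactly point 3 above.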

@AlexanderYukhanov
Contributor

Hi Damien,
Thank you for the feedback! We will figure out how to make use of TF_CONFIG for the tensorflow framework. At first glance, we could just introduce dedicated environment variables for use with TF_CONFIG.

@damienpontifex
Author

Great, thanks for the response. It is a JSON serialised dictionary in an environment variable, but would mean distributed training would ‘just work ™️’.

@damienpontifex
Author

Looking at the Azure Batch AI environment variables, it seems this is now available.

@AlexanderYukhanov
Contributor

Sorry, the functionality is not released yet.

@wtam

wtam commented Mar 18, 2018

May I ask if anyone (or @damienpontifex) knows what the env variable for the master host is? I encounter "ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node."
The chief node in the cluster spec is described here:
https://www.tensorflow.org/api_docs/python/tf/estimator/RunConfig

@damienpontifex
Author

@wtam my understanding is you must have the task type and index set appropriately for the chief. In the page you linked this is {'cluster': cluster, 'task': {'type': 'chief', 'index': 0}}, where the cluster variable has three keys: chief, ps and worker.

Without seeing your actual code, it seems the minimum requirement is for cluster to have {'chief': ['host0:2222']}. You can look at the logic in RunConfig to see if there's a case in your setup that you have configured wrong.
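A quick way to sanity-check this before constructing a RunConfig is to inspect the parsed TF_CONFIG directly. This is only a sketch; has_chief is an illustrative helper, not part of TensorFlow:

```python
import json

def has_chief(tf_config_json):
    """Return True if the cluster spec defines at least one chief host."""
    cluster = json.loads(tf_config_json).get('cluster', {})
    return bool(cluster.get('chief'))

print(has_chief('{"cluster": {"chief": ["host0:2222"]}}'))  # True
print(has_chief('{"cluster": {"ps": ["host0:2222"]}}'))     # False
```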

@wtam

wtam commented Mar 19, 2018

@damienpontifex Thanks so much for the response. Since Batch AI only has the env vars $AZ_BATCHAI_PS_HOSTS, $AZ_BATCHAI_WORKER_HOSTS and $AZ_BATCHAI_TASK_INDEX, I overcame the chief-node definition issue above by manually reserving the first worker host as the chief node and putting it into the cluster spec. Now I've moved a bit forward but encounter another issue from RunConfig:

ValueError: worker is not a valid task_type in the cluster_spec:
<tensorflow.python.training.server_lib.ClusterSpec object at 0x7fdd35049750>

Not sure where it goes wrong. My cluster is 3 nodes: 1 node reserved for the PS and chief, and the other 2 as worker nodes. I'd appreciate any comment or suggestion to help me out.

This is the cluster spec for the failed worker:
'{"cluster": {"chief": ["10.0.0.4:2223"], "worker_hosts": ["10.0.0.5:2222", "10.0.0.6:2222"], "ps_hosts": ["10.0.0.4:2222"]}, "task": {"index": "1", "type": "worker"}}'

@wtam

wtam commented Mar 20, 2018

It was a stupid mistake I made in the cluster spec naming: RunConfig tries to find worker, but my key was worker_hosts, and that's why I got the ValueError. For people playing around with distributed-GPU Estimators on Batch AI, it's better to wait for official support, as the way I reserved the worker node also required me to manually decrement $AZ_BATCHAI_TASK_INDEX in the cluster spec for the workers.
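The naming mistake above can be fixed with a small key rename before handing the spec to TensorFlow. A sketch, assuming the mistaken *_hosts keys from the failing spec; normalize_cluster_keys is a hypothetical helper:

```python
import json

def normalize_cluster_keys(tf_config_json):
    """Rename worker_hosts/ps_hosts to the worker/ps keys RunConfig expects."""
    cfg = json.loads(tf_config_json)
    for wrong, right in (('worker_hosts', 'worker'), ('ps_hosts', 'ps')):
        if wrong in cfg['cluster']:
            cfg['cluster'][right] = cfg['cluster'].pop(wrong)
    return json.dumps(cfg)
```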

@llidev
Contributor

llidev commented Apr 17, 2018

Hi @damienpontifex, maybe you already know this, but Batch AI now automatically generates the TF_CONFIG env var when running a tensorflow job. Would you please try it out and let us know if it works for you? Thanks!

@yangsiyu007

Hi @lliimsft, I'm seeing the automatically generated TF_CONFIG env var with nodeCount 1 as:
{'task': {'type': 'master', 'index': 0}, 'cluster': {'ps': [''], 'worker': ['10.0.0.4:2222']}, 'environment': 'cloud'}
which doesn't seem to work in this 1-node cluster scenario?
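One way around the single-node case is to strip out roles whose host lists are empty before TensorFlow sees the variable. A sketch only; drop_empty_roles is a hypothetical helper, and whether TensorFlow then accepts the result on a 1-node cluster would still need testing:

```python
import json

def drop_empty_roles(tf_config_json):
    """Remove cluster roles like 'ps' whose host list is empty or ['']."""
    cfg = json.loads(tf_config_json)
    cfg['cluster'] = {role: hosts for role, hosts in cfg['cluster'].items()
                      if any(h for h in hosts)}
    return json.dumps(cfg)
```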

@damienpontifex
Author

Getting this error when running with nodeCount=3 (from the stderr-ps-0.txt log):

"ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node."

For this task the TF_CONFIG variable was:

{'cluster': {'worker': ['10.0.0.4:2223', '10.0.0.5:2222', '10.0.0.6:2222'], 'ps': ['10.0.0.4:2222']}, 'task': {'type': 'ps', 'index': 0}, 'environment': 'cloud'}

The worker logs just had "Warning: Permanently added '[10.0.0.5]:23' (ECDSA) to the list of known hosts."

@damienpontifex
Author

I put the code I'm running here https://github.com/damienpontifex/BatchAIMnist

From the repo, I do:

sh prepare-cluster.sh
sh data-prep.sh
# Wait until data prep done
sh train.sh

@damienpontifex
Author

Looking at the documentation, I'm wondering whether the TF_CONFIG value should be:

On the parameter server:
{'cluster': {'chief': ['10.0.0.4:2224'], 'worker': ['10.0.0.4:2223', '10.0.0.5:2222', '10.0.0.6:2222'], 'ps': ['10.0.0.4:2222']}, 'task': {'type': 'ps', 'index': 0}, 'environment': 'cloud'}

On the chief:
{'cluster': {'chief': ['10.0.0.4:2224'], 'worker': ['10.0.0.4:2223', '10.0.0.5:2222', '10.0.0.6:2222'], 'ps': ['10.0.0.4:2222']}, 'task': {'type': 'chief', 'index': 0}, 'environment': 'cloud'}

I can't seem to find guidance on having chief, ps and worker all on the same machine, as the docstring https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/estimator/run_config.py#L351-L376 seems to have them all as separate machines.

How can we assist to test and get this working?

@yangsiyu007

@lliimsft, @AlexanderYukhanov, can we please get some update on this? :)

@llidev
Contributor

llidev commented Apr 30, 2018

@damienpontifex @yangsiyu007
The TF_CONFIG environment variable offered by Batch AI is based on TensorFlow Trainer Development Considerations, where the cluster only contains ps/worker and the task type will be master, worker, or ps. According to run_config.py, TensorFlow now accepts more options such as "chief", which is confusing to us (we're not sure how it differs from "master"). We are looking into this.

@damienpontifex
Author

Thank you @lliimsft @yangsiyu007. I also wasn't aware of the change and thank you for the continued effort to support this.

@awan-10

awan-10 commented Jul 24, 2018

Hello guys, just wondering if Batch AI is generating the new format of TF_CONFIG now?

@yangsiyu007

I don't think so - not when I tried it the week before last... @lliimsft updates?

@llidev
Contributor

llidev commented Jul 26, 2018

@yangsiyu007 @awan-10 This work is still in progress. We will keep you updated in this post.

@damienpontifex
Author

I was looking at what is currently being set and what changes are needed to make RunConfig parse it correctly. My investigation is outlined below; I will also look into updating the TF_CONFIG variable on each machine through code to confirm this change works. @lliimsft, could the below help in making the appropriate changes?

To verify what JSON structure worked, I set up:

os.environ['TF_CONFIG'] = TF_CONFIG_JSON_STRING

config = tf.estimator.RunConfig()
print('master => {}'.format(config.master))
print('task_id => {}'.format(config.task_id))
print('num_ps_replicas => {}'.format(config.num_ps_replicas))
print('num_worker_replicas => {}'.format(config.num_worker_replicas))
print('cluster_spec => {}'.format(config.cluster_spec))
print('task_type => {}'.format(config.task_type))
print('is_chief => {}'.format(config.is_chief))

Run with a 3-node job configured with 1 parameter server and 3 workers.

Current

Currently in Batch AI we get the TF_CONFIG environment variable being:
In ps-0

{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"ps","index":0},"environment":"cloud"}

wk-0

{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}

wk-1

{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":1},"environment":"cloud"}

wk-2

{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":2},"environment":"cloud"}

With these, the python code above gave the error:

ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node.

Working

To get this working, we apparently need the master worker defined under chief in the cluster. The 'cluster' part of the JSON object would then become:

"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]}

Then the task component would be changed, for whichever node is launched via masterCommandLineArgs, to:

"task":{"type":"chief","index":0}

The other worker nodes would have the same as before, with the index now being 0 or 1, e.g.

"task":{"type":"worker","index":1}

Testing

This sample code parses into RunConfig correctly, but I haven't tested it on a cluster with an estimator yet to see if it hooks everything up correctly:

import os
import json
import tensorflow as tf

def log_config_for(runconfig_string):
  os.environ['TF_CONFIG'] = runconfig_string

  config = tf.estimator.RunConfig()
  print('master => {}'.format(config.master))
  print('task_id => {}'.format(config.task_id))
  print('num_ps_replicas => {}'.format(config.num_ps_replicas))
  print('num_worker_replicas => {}'.format(config.num_worker_replicas))
  print('cluster_spec => {}'.format(config.cluster_spec))
  print('task_type => {}'.format(config.task_type))
  print('is_chief => {}'.format(config.is_chief))
  print()

def main():

  machine_definitions = [ 
    # Machine expected from settings with parameterServerCommandLineArgs
    '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"ps","index":0},"environment":"cloud"}',
    # Machine expected from settings with masterCommandLineArgs
    '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"chief","index":0},"environment":"cloud"}',
    # Machine expected from settings with workerCommandLineArgs
    '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}',
    # Machine expected from settings with workerCommandLineArgs
    '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":1},"environment":"cloud"}'
  ]

  for definition in machine_definitions:
    log_config_for(definition)

if __name__ == '__main__':
  main()

@damienpontifex
Author

I found a workaround and I was able to manipulate the TF_CONFIG environment variable and get it working and put the code here https://github.com/damienpontifex/batchai-tfconfig-workaround

The environment variable manipulation was:

import json
import os

def remap_tfconfig(is_master):
  tf_config = json.loads(os.environ['TF_CONFIG'])
  # Promote the first worker to be the chief
  master_worker = tf_config['cluster']['worker'][0]
  tf_config['cluster']['worker'] = tf_config['cluster']['worker'][1:]
  tf_config['cluster']['chief'] = [master_worker]
  if is_master:
    tf_config['task']['type'] = 'chief'
    tf_config['task']['index'] = 0
  elif tf_config['task']['type'] == 'worker':
    # Remaining workers shift down one index
    tf_config['task']['index'] -= 1

  os.environ['TF_CONFIG'] = json.dumps(tf_config)

And I pass --master through masterCommandLineArgs, which is received by ArgumentParser via parser.add_argument('--master', action='store_true'). Then I just call remap_tfconfig(args.master) after parse_args.
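That wiring can be sketched as follows. remap_tfconfig is stubbed out here only to keep the snippet self-contained; the real body is in the function above. The ['--master'] argument list simulates what masterCommandLineArgs would pass on the master node:

```python
import argparse

def remap_tfconfig(is_master):
    pass  # stub; see the full remapping function in the workaround above

parser = argparse.ArgumentParser()
parser.add_argument('--master', action='store_true')
# Simulate masterCommandLineArgs passing --master on the master node.
args, _ = parser.parse_known_args(['--master'])
remap_tfconfig(args.master)
print(args.master)  # True
```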

Hopefully this can help in getting the fix into Batch AI 😄

@damienpontifex
Author

Tried this again today in Azure ML Workspace with 'Machine Learning Compute', following the Parameter Server setup, and got an error:

Run failed: argument of type 'ClusterSpec' is not iterable

Getting TF_CONFIG quite right still seems to be an issue.

@robertlugg

robertlugg commented Apr 27, 2020


I found this description of chief vs. master:
https://cloud.google.com/ai-platform/training/docs/distributed-training-details#chief-versus-master

Based on it, "master" is unsupported in TF2 and should be replaced with "chief".
