Provide TF_CONFIG environment variable for distributed TensorFlow #15
Hi Damien,

Great, thanks for the response. It is a JSON-serialised dictionary in an environment variable, but it would mean distributed training would 'just work ™️'.
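To illustrate what "a JSON serialised dictionary in an environment variable" looks like, here is a small sketch; the addresses and cluster shape are made up, but the overall structure matches what `tf.estimator.RunConfig` parses from `TF_CONFIG`:

```python
import json
import os

# Hypothetical illustration (addresses made up) of the JSON-serialised
# dictionary that TF_CONFIG holds. tf.estimator.RunConfig reads this
# automatically to build its cluster spec.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'ps': ['10.0.0.4:2222'],
        'worker': ['10.0.0.5:2222', '10.0.0.6:2222'],
    },
    'task': {'type': 'worker', 'index': 0},
    'environment': 'cloud',
})
print(os.environ['TF_CONFIG'])
```

Because the Estimator picks this up from the environment, no extra command line plumbing is needed per node.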
Looking at the Azure Batch AI environment variables, it seems this is now available.
Sorry, the functionality is not released yet. |
May I ask if anyone or @damienpontifex knows what the env variable for the master host is? I encounter "ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node."
@wtam my understanding is that you must have the task type and index set appropriately for the chief node, as in the page you linked. Without seeing your actual code, that seems to be the minimum requirement.
@damienpontifex Thanks so much for the response. Since Batch AI only has these env vars: $AZ_BATCHAI_PS_HOSTS, $AZ_BATCHAI_WORKER_HOSTS and $AZ_BATCHAI_TASK_INDEX, I overcame the chief node definition issue above by manually reserving the 1st worker host as the chief node and putting it into the cluster spec. Now I've moved a bit forward but encounter another issue from RunConfig below. This is the cluster spec for the failed worker:
Stupid mistake I made on the cluster spec naming: RunConfig was trying to find "worker" from my worker_hosts, and that's why I got the ValueError. For people playing around with distributed GPU Estimators on Batch AI, better to wait for official support, as the way I reserved the worker node also requires me to decrement $AZ_BATCHAI_TASK_INDEX manually in the cluster spec for the workers.
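The manual remapping described above can be sketched roughly like this. This is a hypothetical reconstruction, not the actual code from the thread: the host values are made up, and I'm assuming the $AZ_BATCHAI_* host variables are comma-separated host:port lists:

```python
import json
import os

# Made-up stand-ins for what Batch AI would provide on a worker node
# (assumed comma-separated host:port format).
os.environ.setdefault('AZ_BATCHAI_PS_HOSTS', '10.0.0.4:2222')
os.environ.setdefault('AZ_BATCHAI_WORKER_HOSTS',
                      '10.0.0.4:2223,10.0.0.5:2222,10.0.0.6:2222')
os.environ.setdefault('AZ_BATCHAI_TASK_INDEX', '1')

ps_hosts = os.environ['AZ_BATCHAI_PS_HOSTS'].split(',')
worker_hosts = os.environ['AZ_BATCHAI_WORKER_HOSTS'].split(',')
task_index = int(os.environ['AZ_BATCHAI_TASK_INDEX'])

# Reserve the 1st worker host as the chief node.
cluster = {
    'chief': [worker_hosts[0]],
    'ps': ps_hosts,
    'worker': worker_hosts[1:],
}

# Worker 0 becomes the chief; remaining workers decrement their index.
# (This covers worker/chief nodes; a ps node would set type 'ps' instead.)
if task_index == 0:
    task = {'type': 'chief', 'index': 0}
else:
    task = {'type': 'worker', 'index': task_index - 1}

os.environ['TF_CONFIG'] = json.dumps({'cluster': cluster, 'task': task})
print(os.environ['TF_CONFIG'])
```

The index decrement is the fragile part noted above: every worker has to agree on the shifted numbering or the cluster spec and task indices fall out of sync.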
Hi @damienpontifex, maybe you already know this: Batch AI now automatically generates the TF_CONFIG env var when running a TensorFlow job. Would you please try it out and let us know if it works for you? Thanks!
Hi @lliimsft, I'm seeing the automatically generated TF_CONFIG env var with nodeCount 1 as: |
Getting this error in the stderr-ps-0.txt log when running with nodeCount=3: "ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node." For this task the TF_CONFIG variable was:
The worker logs just had "Warning: Permanently added '[10.0.0.5]:23' (ECDSA) to the list of known hosts."
I put the code I'm running here: https://github.com/damienpontifex/BatchAIMnist. From the repo, I do:

```sh
sh prepare-cluster.sh
sh data-prep.sh
# Wait until data prep is done
sh train.sh
```
Looking at the documentation, I'm wondering what the TF_CONFIG value should be on the parameter server and on the chief. I can't seem to find guidance on having all of chief, ps and worker on the same machine, as the docstring https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/estimator/run_config.py#L351-L376 seems to have them all as separate machines. How can we assist to test and get this working?
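For concreteness, here is my guess at what the per-node values might look like, purely illustrative (addresses made up, shape based on the `RunConfig` docstring linked above):

```python
import json

# Purely illustrative cluster; all addresses are made up.
cluster = {
    'chief': ['10.0.0.4:2223'],
    'ps': ['10.0.0.4:2222'],
    'worker': ['10.0.0.5:2222', '10.0.0.6:2222'],
}

# On the parameter server:
ps_tf_config = json.dumps(
    {'cluster': cluster, 'task': {'type': 'ps', 'index': 0},
     'environment': 'cloud'})

# On the chief:
chief_tf_config = json.dumps(
    {'cluster': cluster, 'task': {'type': 'chief', 'index': 0},
     'environment': 'cloud'})

print(ps_tf_config)
print(chief_tf_config)
```

Every node would share the same `cluster` value; only the `task` entry differs per node.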
@lliimsft, @AlexanderYukhanov, can we please get some update on this? :) |
@damienpontifex @yangsiyu007 |
Thank you @lliimsft @yangsiyu007. I also wasn't aware of the change; thank you for the continued effort to support this.
Hello guys, just wondering if Batch AI is generating the new format of TF_CONFIG now? |
I don't think so - not when I tried it the week before last... @lliimsft updates? |
@yangsiyu007 @awan-10 This work is still in progress. We will keep you updated in this post. |
I was looking at what is currently being set and what changes are needed. To verify what JSON structure worked, I set up:

```python
os.environ['TF_CONFIG'] = TF_CONFIG_JSON_STRING

config = tf.estimator.RunConfig()
print('master => {}'.format(config.master))
print('task_id => {}'.format(config.task_id))
print('num_ps_replicas => {}'.format(config.num_ps_replicas))
print('num_worker_replicas => {}'.format(config.num_worker_replicas))
print('cluster_spec => {}'.format(config.cluster_spec))
print('task_type => {}'.format(config.task_type))
print('is_chief => {}'.format(config.is_chief))
```

This was run with a 3 node job configured with 1 parameter server and 3 worker count.

**Current**

Currently in Batch AI we get the TF_CONFIG environment variable being:

ps-0

```json
{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"ps","index":0},"environment":"cloud"}
```

wk-0

```json
{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}
```

wk-1

```json
{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":1},"environment":"cloud"}
```

wk-2

```json
{"cluster":{"ps":["10.0.0.4:2222"],"worker":["10.0.0.4:2223","10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":2},"environment":"cloud"}
```

With these, the python code above gave the error: ValueError: If "cluster" is set in TF_CONFIG, it must have one "chief" node.
**Working**

To get this working, we apparently need the master worker defined under chief in the cluster. As such, the 'cluster' part of the JSON object would become:

```json
"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]}
```

Then the task component would be changed for whichever node is initiated from:

```json
"task":{"type":"chief","index":0}
```

The other worker nodes would have the same as before with index now being 0 or 1, e.g.

```json
"task":{"type":"worker","index":1}
```

**Testing**

This sample code parses each of these definitions into a RunConfig and logs the result:

```python
import os
import json
import tensorflow as tf


def log_config_for(runconfig_string):
    os.environ['TF_CONFIG'] = runconfig_string
    config = tf.estimator.RunConfig()
    print('master => {}'.format(config.master))
    print('task_id => {}'.format(config.task_id))
    print('num_ps_replicas => {}'.format(config.num_ps_replicas))
    print('num_worker_replicas => {}'.format(config.num_worker_replicas))
    print('cluster_spec => {}'.format(config.cluster_spec))
    print('task_type => {}'.format(config.task_type))
    print('is_chief => {}'.format(config.is_chief))
    print()


def main():
    machine_definitions = [
        # Machine expected from settings with parameterServerCommandLineArgs
        '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"ps","index":0},"environment":"cloud"}',
        # Machine expected from settings with masterCommandLineArgs
        '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"chief","index":0},"environment":"cloud"}',
        # Machine expected from settings with workerCommandLineArgs
        '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}',
        # Machine expected from settings with workerCommandLineArgs
        '{"cluster":{"chief": ["10.0.0.4:2223"],"ps":["10.0.0.4:2222"],"worker":["10.0.0.5:2222","10.0.0.6:2222"]},"task":{"type":"worker","index":1},"environment":"cloud"}'
    ]

    for definition in machine_definitions:
        log_config_for(definition)


if __name__ == '__main__':
    main()
```
I found a workaround: I was able to manipulate the TF_CONFIG environment variable and get it working, and put the code here: https://github.com/damienpontifex/batchai-tfconfig-workaround. The environment variable manipulation was:

```python
def remap_tfconfig(is_master):
    tf_config = json.loads(os.environ['TF_CONFIG'])

    # Promote the first worker to be the chief
    master_worker = tf_config['cluster']['worker'][0]
    tf_config['cluster']['worker'] = tf_config['cluster']['worker'][1:]
    tf_config['cluster']['chief'] = [master_worker]

    if is_master:
        tf_config['task']['type'] = 'chief'
        tf_config['task']['index'] = 0
    elif tf_config['task']['type'] == 'worker':
        # Remaining workers shift their index down by one
        tf_config['task']['index'] -= 1

    os.environ['TF_CONFIG'] = json.dumps(tf_config)
```

And I pass in `is_master` appropriately for each node. Hopefully this can help in getting the fix into Batch AI 😄
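One hypothetical way to wire up the `is_master` flag (not necessarily how the linked repo does it) is a command line switch passed only in `masterCommandLineArgs`, so the same script runs everywhere:

```python
import argparse

# Hypothetical wiring: masterCommandLineArgs would include --is-master,
# while workerCommandLineArgs and parameterServerCommandLineArgs would not.
parser = argparse.ArgumentParser()
parser.add_argument('--is-master', action='store_true')

master_args, _ = parser.parse_known_args(['--is-master'])  # simulated master argv
worker_args, _ = parser.parse_known_args([])               # simulated worker argv
print(master_args.is_master, worker_args.is_master)
```

`remap_tfconfig(args.is_master)` would then be called at startup, before `tf.estimator.RunConfig()` reads TF_CONFIG.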
Tried this again today in Azure ML Workspace with 'Machine Learning Compute', following the Parameter Server setup, and got an error:
I found this description of chief vs. master: Based on it, |
The TensorFlow `ClusterConfig` can parse worker and parameter server settings from a `TF_CONFIG` environment variable (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py#L64-L156).

I was trying to pass it via an environment variable in the job configuration file like so:

```
"{ 'cluster': { 'ps': $AZ_BATCHAI_PS_HOSTS, 'worker': $AZ_BATCHAI_WORKER_HOSTS }, 'task': { 'index': $AZ_BATCHAI_TASK_INDEX, 'type': '' } }"
```

Which is kind of fine, but falls down for a few cases.

More generally though, providing this configuration via a `TF_CONFIG` environment variable would significantly lower the bar to getting distributed training working in TensorFlow and Azure Batch. It would also simplify command line arg parsing: just the appropriate data directories would need to be passed, and the same arguments could be used across master, worker and ps, potentially simplifying the tensorflowSettings property further.
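As a sketch of what Batch AI would need to do internally, the mapping from its own variables to `TF_CONFIG` is small. This is a hypothetical reconstruction, assuming the host variables are comma-separated host:port strings (values below are made up):

```python
import json
import os

# Made-up stand-ins for what Batch AI provides.
os.environ.setdefault('AZ_BATCHAI_PS_HOSTS', '10.0.0.4:2222')
os.environ.setdefault('AZ_BATCHAI_WORKER_HOSTS', '10.0.0.5:2222,10.0.0.6:2222')
os.environ.setdefault('AZ_BATCHAI_TASK_INDEX', '0')

tf_config = {
    'cluster': {
        # Split the comma-separated host lists into JSON arrays.
        'ps': os.environ['AZ_BATCHAI_PS_HOSTS'].split(','),
        'worker': os.environ['AZ_BATCHAI_WORKER_HOSTS'].split(','),
    },
    # The task type still has to be filled per node (ps vs worker),
    # which is the part the plain string substitution above cannot do.
    'task': {'index': int(os.environ['AZ_BATCHAI_TASK_INDEX']),
             'type': 'worker'},
    'environment': 'cloud',
}
os.environ['TF_CONFIG'] = json.dumps(tf_config)
print(os.environ['TF_CONFIG'])
```

The per-node task type is why doing this in the job configuration file alone is awkward: the cluster part is the same everywhere, but each node needs its own task entry.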