Replies: 6 comments
-
A quick workaround is to set `train=False` during init. I'll have to think a bit more about how to fix this nicely.
-
Thanks for your quick answer. From your answer, I understood that lazy init is part of the issue. I therefore switched to direct init (`module.init` instead of `init_by_shape`; full updated and working colab here: https://colab.research.google.com/drive/1Cm1MIHkKBmZ-xBq21fEG6byzd1IbrGSz?usp=sharing). It now works, but I wonder: is there any reason why I would prefer the lazy `init_by_shape` over a direct `init`?

For the record, the new `create_model` in the new colab:

```python
def create_model(prng_key, use_bn=True, shared=True):
  input_shape = (100, 64, 64, 2)
  model_dtype = jnp.float32
  module = MySiameseNet.partial(train=True, use_bn=use_bn, shared=shared)
  with nn.stateful() as init_state:
    with flax.nn.stochastic(prng_key):
      x = jnp.zeros(input_shape, dtype=model_dtype)
      _, initial_params = module.init(prng_key, x)
  model = nn.Model(module, initial_params)
  return model, init_state
```

Thanks
-
You don't need train=True during init. The problem is that with train=True you are trying to gather batch statistics (they don't exist because the init is lazy). Of course you can set train=True again during the actual train steps.
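For context on why that fails: in the pre-Linen API, `train` is typically wired into `use_running_average`, so `train=True` asks BatchNorm to gather batch statistics that a lazy, shape-only init never creates. A minimal sketch of that pattern (illustrative only, not the actual `MySiameseNet` code from the colab):

```python
import jax.numpy as jnp
from flax import nn  # pre-Linen flax.nn API

class Branch(nn.Module):
  """One branch of a Siamese net (illustrative sketch)."""

  def apply(self, x, train=True, use_bn=True):
    x = nn.Conv(x, features=32, kernel_size=(3, 3), bias=False)
    if use_bn:
      # train=True: compute statistics from the current batch and update the
      # running averages kept in the nn.stateful() state collection.
      # train=False: read the stored running averages instead.
      x = nn.BatchNorm(x, use_running_average=not train,
                       momentum=0.9, epsilon=1e-5)
    return nn.relu(x)
```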
-
Thanks, your answers made everything super clear to me. My mind was biased by Keras: in Keras, if you set trainable = False at BatchNorm creation, it is not possible to go back by setting it to True, since the layer runs in inference mode forever (at least from what I understand of this post: https://keras.io/guides/transfer_learning/#do-a-round-of-finetuning-of-the-entire-model). Flax's behavior, where the operator is the same in both modes, is excellent news.

For the record, based on your proposition: https://colab.research.google.com/drive/12Bgq0XSy-Y8G2a3HhHLaF6xKuRRQa5Z9?usp=sharing

```python
def create_model(prng_key, use_bn=True, shared=True):
  input_shape = (100, 64, 64, 2)
  model_dtype = jnp.float32
  # workaround for avoiding shared lazy init issues with bn
  module_for_init = MySiameseNet.partial(train=False, use_bn=use_bn, shared=shared)
  with nn.stateful() as init_state:
    _, initial_params = module_for_init.init_by_shape(prng_key, [(input_shape, model_dtype)])
  module_for_train = MySiameseNet.partial(train=True, use_bn=use_bn, shared=shared)
  model = nn.Model(module_for_train, initial_params)
  return model, init_state
```
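A hedged sketch of how the returned `model` and `init_state` might then be used in a train step (the actual training loop is not shown in this thread):

```python
# Hypothetical continuation of the snippet above.
model, state = create_model(prng_key, use_bn=True, shared=True)

# In each train step, re-enter nn.stateful with the current batch-norm state
# so the train=True module can update the running statistics.
x = jnp.zeros((100, 64, 64, 2), dtype=jnp.float32)  # stand-in for a real batch
with nn.stateful(state) as new_state:
  with flax.nn.stochastic(prng_key):
    outputs = model(x)
# Carry `new_state` (the updated batch statistics) into the next step.
```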
-
Yes, this works correctly, although I would prefer to write it as follows for simplicity:
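A plausible shape for that simpler version (a sketch, not the exact code from this reply; it assumes `init_by_shape` forwards keyword arguments like `train` to the module, so one partial can serve both init and training):

```python
def create_model(prng_key, use_bn=True, shared=True):
  input_shape = (100, 64, 64, 2)
  model_dtype = jnp.float32
  # A single partial; train is passed per call instead of being baked in.
  module = MySiameseNet.partial(use_bn=use_bn, shared=shared)
  with nn.stateful() as init_state:
    _, initial_params = module.init_by_shape(
        prng_key, [(input_shape, model_dtype)], train=False)
  model = nn.Model(module, initial_params)
  return model, init_state
```

During training, `train=True` would then be passed explicitly at call time, e.g. `model(x, train=True)`.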
-
IIUC, this is no longer an issue in Linen because we no longer have lazy initialization.
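For reference, a minimal Linen sketch of the same idea (names here are illustrative, not taken from the colabs): BatchNorm state lives in an explicit `batch_stats` collection, initialization is eager, and `train` is just a call-time argument.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class Branch(nn.Module):
  @nn.compact
  def __call__(self, x, train: bool = True):
    x = nn.Conv(features=32, kernel_size=(3, 3))(x)
    x = nn.BatchNorm(use_running_average=not train)(x)
    return nn.relu(x)

x = jnp.zeros((100, 64, 64, 2), dtype=jnp.float32)
variables = Branch().init(jax.random.PRNGKey(0), x, train=False)
# During training, declare batch_stats as mutable so the updated running
# averages are returned alongside the outputs.
y, mutated = Branch().apply(variables, x, train=True, mutable=['batch_stats'])
```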
-
Problem you have encountered:
While trying to train a Siamese network based on the ImageNet example, which replicates model states across devices, I found an issue with batch norms.
Running on:
Logs, error messages, etc:
Steps to reproduce:
Whenever possible, please provide a minimal example. Please consider submitting it as a Colab link.
https://colab.research.google.com/drive/1tRqd7rykyvjxgF6Cqv9OqTdXVffYbqL2?usp=sharing