[Question] How to tie/share weights in a flax neural network #1264
-
@jheek -- do you have an example handy of how a variant of ...
-
For now I would use something like this:
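(The code in this reply did not survive the export. Independent of what was originally shown, a minimal sketch of one way to tie the input embedding to the output projection in Linen, using a made-up module name, is to create the embedding table once and reuse it transposed:)

import jax
import jax.numpy as jnp
import flax.linen as nn

class TiedLM(nn.Module):  # hypothetical name, not from the original reply
    vocab_size: int = 10
    features: int = 8

    @nn.compact
    def __call__(self, xs):
        # A single parameter serves both as the input embedding table
        # and, transposed, as the output projection kernel.
        embedding = self.param('embedding',
                               nn.initializers.normal(stddev=0.02),
                               (self.vocab_size, self.features))
        ys = jnp.take(embedding, xs, axis=0)  # embedding lookup
        # ... intermediate layers would go here ...
        logits = ys @ embedding.T             # tied output projection
        return logits

variables = TiedLM().init(jax.random.PRNGKey(0), jnp.arange(6).reshape(2, 3))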
-
you can use ...
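(The reference in this reply was also cut off in the export. Whatever was meant, one built-in option is flax.linen.Embed's attend method, which computes logits against the same embedding table; a minimal sketch with a hypothetical module name:)

import flax.linen as nn

class TiedDecoder(nn.Module):  # hypothetical example, not from the original reply
    vocab_size: int = 10
    features: int = 8

    @nn.compact
    def __call__(self, xs):
        embed = nn.Embed(self.vocab_size, self.features)
        ys = embed(xs)             # token ids -> feature vectors
        # ... transformer blocks, etc. ...
        logits = embed.attend(ys)  # dot product against the same embedding table
        return logits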
-
@patrickvonplaten @jheek Check out my solution. It works pretty well and looks quite idiomatic.

import jax
import jax.numpy as jnp
import flax.linen as nn
from flax.traverse_util import flatten_dict, unflatten_dict


def tie(target, mappings, collections='params', transpose=True):
    """Tie weights of the `target` module enumerated in `mappings` from
    `collections`.

    Example::

        >>> class Model(nn.Module):
        ...     @nn.compact
        ...     def __call__(self, xs):
        ...         ys = nn.Embed(10, 8)(xs)
        ...         zs = nn.Dense(10)(ys)
        ...         return zs
        ...
        >>> rules = {('params', 'Embed_0', 'embedding'):
        ...          ('params', 'Dense_0', 'kernel')}
        >>> TiedModel = tie(Model, rules)
        >>> model = TiedModel()
        >>> variables = model.init(jax.random.PRNGKey(42),
        ...                        jnp.arange(6).reshape(2, 3))

    Args:
      target: the module or function to be transformed.
      mappings: weight sharing rules.
      collections: the collection(s) to be transformed.
      transpose: transpose tied weights or not.

    Returns:
      a wrapped version of ``target`` with shared weights.
    """
    if isinstance(mappings, dict):
        mappings = [*mappings.items()]

    def tie_in(variables):
        # On the way in, materialize every tied (destination) weight from its
        # source so the wrapped module sees a full set of parameters.
        variables = flatten_dict(variables)
        for src, dst in mappings:
            if transpose:
                variables[dst] = variables[src].T
            else:
                variables[dst] = variables[src]
        return unflatten_dict(variables)

    def tie_out(variables):
        # On the way out, drop the derived weights again so only the source
        # copy is ever stored.
        variables = flatten_dict(variables)
        for _, dst in mappings:
            variables.pop(dst, None)
        return unflatten_dict(variables)

    return nn.map_variables(target, collections, tie_in, tie_out, init=True)
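A note on the design: because tie_out strips the derived entries (including at init, thanks to init=True), only the source weight is stored, and tie_in recreates the tied copy every time the module runs, so a single parameter receives all the gradient. Continuing the docstring example, the stored variables should look roughly like this (a hedged expectation, not verified here):

'kernel' in variables['params'].get('Dense_0', {})     # expected False: the tied kernel is not stored
variables['params']['Embed_0']['embedding'].shape      # expected (10, 8): the single stored copy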
-
Sorry, this is not really a bug report but more a question. Lots of language models tie their word embedding matrix to the output logits matrix (this both saves quite a bit of memory, since vocab sizes can be huge, and can lead to better results). In PyTorch it's quite easy to do so - one can simply do the following:
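(The PyTorch snippet was not preserved in the export; a typical version of what is described, with hypothetical attribute names, looks like:)

# Hypothetical PyTorch attribute names; the original snippet was not preserved.
# Assigning the same Parameter object to both layers ties them.
model.lm_head.weight = model.embed_tokens.weight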
This will simply make sure that both weights point to the same weight node in the graph and consequently the weights are tied in the network.
Is there a way to do this in Flax?
I tried to simply set the weights of state to each other (roughly the snippet shown below this question), but this doesn't seem to actually tie the weights: during gradient descent the weights are updated independently. Do you know if there is an elegant way to share weights in Flax?
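(The attempted snippet was not preserved either; presumably it was something along these lines, which only copies values once rather than tying the parameters:)

from flax.core import freeze, unfreeze

# Hypothetical reconstruction of the attempt: copy one array into the other.
params = unfreeze(variables['params'])
params['Dense_0']['kernel'] = params['Embed_0']['embedding'].T
variables = freeze({'params': params})
# After this the two leaves are still independent arrays, so the optimizer
# updates them separately -- the weights start out equal but drift apart.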
Thanks a lot!