
tweak torch parameter registration mechanism #19908

Open · wants to merge 3 commits into master
Conversation

@haohuanw (Contributor) commented Jun 23, 2024

This is a follow-up from the #19885 discussion, where I am trying to make torch and Keras play well together on tracking parameters.

The solution I ended up with:

  1. Since sub-modules are properly tracked by the torch module, each torch_params only stores the layer's own variables. Nested variable resolution is done by torch with recurse=True.
  2. Change back to using a parameter list instead of a dict. I did consider keeping the dict for readability, since the key in torch params could then be variable.name once each layer tracks only the variables it holds. However, the current seed generator creates duplicate variable names. If https://github.com/keras-team/keras/blob/master/keras/src/random/seed_generator.py#L80 could be changed to something like f"{self.name}_generator_state", the ParameterDict approach would work.
  3. In _post_track/untrack_variables, refresh the entire torch params of the layer and its sublayers. This could be changed to avoid re-creating all sublayers if this function ever becomes too slow.
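The recursive resolution in point 1 can be sketched with plain torch modules (Inner/Outer are illustrative names, not the actual Keras implementation):

```python
# Illustrative sketch: once sub-modules are registered as attributes, torch
# resolves nested parameters itself, so each module only registers its *own*
# variables in a ParameterList.
import torch
import torch.nn as nn

class Inner(nn.Module):
    def __init__(self):
        super().__init__()
        self.torch_params = nn.ParameterList([nn.Parameter(torch.zeros(3))])

class Outer(nn.Module):
    def __init__(self):
        super().__init__()
        self.torch_params = nn.ParameterList([nn.Parameter(torch.ones(2))])
        self.inner = Inner()  # attribute assignment registers the sub-module

outer = Outer()
# recurse=True walks sub-modules, so Inner's parameter is found even though
# Outer never registered it directly.
names = [name for name, _ in outer.named_parameters(recurse=True)]
print(names)
```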

I also added a few torch-specific tests to reflect some of the assumptions and use cases a torch user might have, e.g. using state_dict.

@codecov-commenter commented Jun 23, 2024

Codecov Report

Attention: Patch coverage is 2.17391% with 45 lines in your changes missing coverage. Please review.

Project coverage is 73.44%. Comparing base (c8a7f28) to head (de3479c).
Report is 17 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| keras/src/backend/torch/layer.py | 0.00% | 39 Missing ⚠️ |
| keras/src/backend/torch/trainer.py | 0.00% | 4 Missing ⚠️ |
| keras/src/testing/test_case.py | 0.00% | 1 Missing and 1 partial ⚠️ |

❗ There is a different number of reports uploaded between BASE (c8a7f28) and HEAD (de3479c). Click for more details.

HEAD has 1 upload less than BASE:

| Flag | BASE (c8a7f28) | HEAD (de3479c) |
|---|---|---|
| keras | 4 | 3 |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master   #19908      +/-   ##
==========================================
- Coverage   79.01%   73.44%   -5.57%     
==========================================
  Files         499      499              
  Lines       46441    46476      +35     
  Branches     8550     8556       +6     
==========================================
- Hits        36694    34134    -2560     
- Misses       8020    10670    +2650     
+ Partials     1727     1672      -55     
```
| Flag | Coverage Δ |
|---|---|
| keras | 73.37% <2.17%> (-5.51%) ⬇️ |
| keras-jax | 62.41% <2.17%> (+<0.01%) ⬆️ |
| keras-numpy | 57.21% <2.17%> (-0.01%) ⬇️ |
| keras-tensorflow | 63.60% <2.17%> (-0.04%) ⬇️ |
| keras-torch | ? |

Flags with carried forward coverage won't be shown.


@haohuanw (Author) commented

The failing pytorch test is actually passing in my env:

```
(keras-dev-minimum) haohuanw@haohuanw-ThinkPad-X1-Extreme:~/Documents/keras$ KERAS_BACKEND=torch python integration_tests/numerical_test.py 
2024-06-23 16:13:12.028332: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-23 16:13:12.031879: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-23 16:13:12.080855: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-23 16:13:12.900432: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-06-23 16:13:14.305362: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-06-23 16:13:14.305867: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Checking training histories:
accuracy:
[0.20999999344348907]
[0.20999999344348907]
loss:
[2.6727349758148193]
[2.6727302074432373]
mae:
[0.1606517732143402]
[0.16065318882465363]
Training histories match.

Checking trained weights:
Trained weights match.

Checking predict:
Predict results match.

Checking evaluate:
[2.2113966941833496, 0.17798443138599396, 0.3799999952316284]
[2.2114176750183105, 0.17798538506031036, 0.3799999952316284]
Evaluate results match.
```

@fchollet (Member) commented

> if https://github.com/keras-team/keras/blob/master/keras/src/random/seed_generator.py#L80 can be changed to something like f"{self.name}_generator_state" it will work with ParameterDict approach.

The uniqueness of the variable path should come from the parent object name, not from the variable name (e.g. "dense_1/kernel"). What paths do you currently see for seed generators?

@haohuanw (Author) commented

> > if https://github.com/keras-team/keras/blob/master/keras/src/random/seed_generator.py#L80 can be changed to something like f"{self.name}_generator_state" it will work with ParameterDict approach.
>
> The uniqueness of the variable path should come from the parent object name, not from the variable name (e.g. "dense_1/kernel"). What paths do you currently see for seed generators?

For the seed generator, using the path gives seed_generator_{idx}/seed_generator_state, while using the name gives just seed_generator_state. The spirit of this change is to let the torch module handle the recursive collection, so I was planning to use the variable name, but found out that there are collisions on the seed generator state.
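The collision being described can be reproduced with a bare nn.ParameterDict (a minimal sketch, not Keras code): keying on the bare variable name silently overwrites an entry when two seed generators in one layer produce the same "seed_generator_state" name.

```python
# Minimal sketch of the name collision: a ParameterDict keyed on bare
# variable names drops one of two parameters that share a name.
import torch
import torch.nn as nn

params = nn.ParameterDict()
# Two seed generators in the same layer would both produce this bare name:
params["seed_generator_state"] = nn.Parameter(torch.tensor([0.0]))
params["seed_generator_state"] = nn.Parameter(torch.tensor([1.0]))  # overwrites
print(len(params))  # 1 -- the first state was silently lost
```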

@fchollet (Member) commented

> the spirit of this change is to let torch module to handle the recursive collection so i was planning to use variable name but find out that there are collisions on seed generator state.

Variable names are never unique. For a unique string you can use variable.path.

@haohuanw (Author) commented Jun 24, 2024

> > the spirit of this change is to let torch module to handle the recursive collection so i was planning to use variable name but find out that there are collisions on seed generator state.
>
> Variable names are never unique. For a unique string you can use variable.path.

I thought that within the same layer, variable names (excluding variables from sub-layers) should be unique as an implicit requirement, since otherwise variable.path would also be non-unique?

My original thought was that self.torch_params would only hold the layer's own variables, excluding variables in sub-layers, since those are collected automatically when calling named_parameters(): all sub-layers are properly recognized as sub torch modules and the recurse option is respected (e.g. https://github.com/pytorch/pytorch/blob/662e9e10766b040bea000e18e54a4f9e69889fc1/torch/nn/modules/module.py#L2496C20-L2496C34, where _named_members includes all registered sub-layers).

Then I found out that seed generators can create variables with the same name when there are multiple seed generators in one layer, since a seed generator is not a layer.

@haohuanw (Author) commented

I also notice that I probably want to add a test with a nested seed generator. In theory, seed states should be recursively collected by torch, since it basically gathers module._parameters for all submodules.

@fchollet (Member) left a comment

Code looks good -- thanks for the changes. I will apply docstring fixes after merging.

@google-ml-butler bot added the kokoro:force-run and ready to pull (Ready to be merged into the codebase) labels on Jun 24, 2024
@fchollet (Member) commented

> the failing pytorch test is actually passing on my env:

Works for me locally as well. Might be a fluke.

@fchollet (Member) commented

There are actually various tests that reliably fail here: https://btx.cloud.google.com/invocations/c55a2ca4-5df3-411b-bd52-7c9873e839ce/targets/keras%2Fgithub%2Fubuntu%2Fgpu%2Ftorch%2Fpresubmit/log (not the numerical integration test)

@haohuanw (Author) commented Jun 24, 2024

> There are actually various tests that reliably fail here: https://btx.cloud.google.com/invocations/c55a2ca4-5df3-411b-bd52-7c9873e839ce/targets/keras%2Fgithub%2Fubuntu%2Fgpu%2Ftorch%2Fpresubmit/log (not the numerical integration test)

I will address those today/tomorrow 👍. Also, is it possible to configure CI to run pytest regardless of whether the integration test passes?

@fchollet (Member) commented

> is it possible to configure ci to run pytest regardless whether integration test passes or not?

We'd have to move the integration testing to run after the general pytest command in .github/workflows/actions.yml (job name: Run tests).

@google-ml-butler bot removed the ready to pull (Ready to be merged into the codebase) label on Jun 25, 2024
@haohuanw (Author) commented

I am seeing a weird issue in keras/src/dtype_policies/dtype_policy_map_test.py::DTypePolicyMapTest::test_basic_usage, with an error like this:


```
error_msgs = {133475339071600: (<Dense name=subclass_dense, built=True>, ValueError("Layer 'subclass_dense' expected 4 variables, but received 3 variables during loading. Expected: ['bias', 'kernel', 'kernel', 'kernel_scale']"))}
warn_only = False

    def _raise_loading_failure(error_msgs, warn_only=False):
        first_key = list(error_msgs.keys())[0]
        ex_saveable, ex_error = error_msgs[first_key]
        msg = (
            f"A total of {len(error_msgs)} objects could not "
            "be loaded. Example error message for "
            f"object {ex_saveable}:\n\n"
            f"{ex_error}\n\n"
            "List of objects that could not be loaded:\n"
            f"{[x[0] for x in error_msgs.values()]}"
        )
        if warn_only:
            warnings.warn(msg)
        else:
>           raise ValueError(msg)
E           ValueError: A total of 1 objects could not be loaded. Example error message for object <Dense name=subclass_dense, built=True>:
E           
E           Layer 'subclass_dense' expected 4 variables, but received 3 variables during loading. Expected: ['bias', 'kernel', 'kernel', 'kernel_scale']
E           
E           List of objects that could not be loaded:
E           [<Dense name=subclass_dense, built=True>]
```

I was able to isolate that the model JSON looks good, but the restored model here: https://github.com/keras-team/keras/blob/master/keras/src/saving/saving_lib.py#L242 has a duplicated <KerasVariable shape=(4, 8), dtype=int8, path=subclass/subclass_dense/kernel>. Any idea how this is related to this change?

@fchollet (Member) commented

I don't understand the connection. You could try pruning things from your change until the test passes, then you'll have a good idea what particular lines are causing the issue.

@haohuanw (Author) commented

@fchollet most of the unit tests are fixed, with one issue left: torch basically requires users to wrap sub-modules in nn.ModuleList, so for Keras 3, any sub-layers passed directly in a plain list won't be tracked.
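This torch behavior is easy to demonstrate in isolation (plain torch, illustrative class names):

```python
# Sub-modules held in a plain Python list are invisible to torch's
# parameter tracking; nn.ModuleList registers them properly.
import torch
import torch.nn as nn

class WithPlainList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(2, 2)]  # NOT registered as a sub-module

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(2, 2)])  # registered

print(len(list(WithPlainList().parameters())))   # 0
print(len(list(WithModuleList().parameters())))  # 2 (weight + bias)
```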

Here are the options I think could work; let me know your thoughts:

  1. Create a keras.layers.LayerList that mimics nn.ModuleList, so all modules can be properly tracked by torch. A check in setattr could be added to automatically wrap a list[Layer] in a LayerList.
  2. Specific to the torch backend, when a list[Layer] is observed, wrap it in nn.ModuleList() and then in TorchModuleWrapper(). This might work, but I will need to double-check the parameter tracking logic.
  3. (Doesn't seem to work.) Specific to the torch backend, when a list[Layer] is observed, also call self.register_module() in the setattr hook to double-register the layer. I tried this, and it works in most cases except serialization, since the setattr hook is not called during deserialization.

Let me know what you think.

@fchollet added the keras-team-review-pending (Pending review by a Keras team member) label on Jun 27, 2024
@fchollet (Member) commented

> specific in torch backend, when a list[Layer] is observed, wrap it in nn.ModuleList() and then wrap it with TorchModuleWrapper(). this might work but i will need to double check the parameter tracking logic.

I think we could do this, via __setattr__ in TorchLayer. There should not be any downsides?
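A hedged sketch of that idea with a plain torch module (TrackingModule is hypothetical, not Keras's TorchLayer): intercept list-of-module assignments in __setattr__ and wrap them in nn.ModuleList before handing them to torch.

```python
# Hypothetical illustration of auto-wrapping lists of modules in
# __setattr__ so that torch registers their parameters.
import torch
import torch.nn as nn

class TrackingModule(nn.Module):
    def __setattr__(self, name, value):
        if (
            isinstance(value, list)
            and value
            and all(isinstance(v, nn.Module) for v in value)
        ):
            value = nn.ModuleList(value)  # auto-wrap so parameters register
        super().__setattr__(name, value)

class Model(TrackingModule):
    def __init__(self):
        super().__init__()
        self.blocks = [nn.Linear(2, 2), nn.Linear(2, 2)]  # silently wrapped

m = Model()
print(type(m.blocks).__name__)    # ModuleList
print(len(list(m.parameters())))  # 4 (two weights, two biases)
```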

@haohuanw (Author) commented

> > specific in torch backend, when a list[Layer] is observed, wrap it in nn.ModuleList() and then wrap it with TorchModuleWrapper(). this might work but i will need to double check the parameter tracking logic.
>
> I think we could do this, via __setattr__ in TorchLayer. There should not be any downsides?

in theory - let me try it

@haohuanw (Author) commented

> > specific in torch backend, when a list[Layer] is observed, wrap it in nn.ModuleList() and then wrap it with TorchModuleWrapper(). this might work but i will need to double check the parameter tracking logic.
>
> I think we could do this, via __setattr__ in TorchLayer. There should not be any downsides?

It technically works, but I think it would be a pretty impactful workflow change for pytorch users:

  1. Every time a layer list is referenced (for example to run forward or to quantize those layers), the code needs to change from for l in self.layers to for l in self.layers.module, which is not really ideal and is specific to torch.

  2. Another issue is re-tracking the parameters: the current idea is to have every layer track only its own variables, with nesting handled recursively, so additional logic is needed for the special case where a Keras layer is wrapped in the torch wrapper.

I think supporting a keras.LayerList is actually a cleaner approach (not sure if it introduces any challenges for serialization) to better support the pytorch backend without much impact on the tf/jax side. What we can do is make this an opt-in feature: warn users in TorchLayer that they have to use keras.LayerList so that torch params are properly tracked, while users of other backends don't need to worry about it.

Labels: keras-team-review-pending (Pending review by a Keras team member), size:L
5 participants