
tweak torch parameter registration mechanism #19908

Open · wants to merge 3 commits into master
Conversation

@haohuanw (Contributor) commented Jun 23, 2024

This is a follow-up from the #19885 discussion, where I am trying to make torch and Keras play well together on tracking parameters.

The solution I ended up with:

  1. Since sub-modules are properly tracked by the torch module, each torch_params only stores the layer's own variables. Nested variable resolution is done by torch with recurse=True.
  2. Change back to using a parameter list instead of a dict. I did consider keeping the dict for readability, since the key in torch params could then be variable.name once each layer tracks only the variables it holds. However, the current seed generator creates duplicate variable names. If https://github.com/keras-team/keras/blob/master/keras/src/random/seed_generator.py#L80 could be changed to something like f"{self.name}_generator_state", the ParameterDict approach would work.
  3. In _post_track/untrack_variables, refresh the entire torch params of the layer and its sublayers. This could be changed to avoid re-creating all sublayers if this function ever becomes too slow.
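The recursive resolution in point 1 can be sketched with plain torch modules (Inner/Outer are illustrative names, not the actual Keras implementation):

```python
# Illustrative sketch: once sub-modules are registered as attributes, torch
# resolves nested parameters itself, so each module only registers its *own*
# variables in a ParameterList.
import torch
import torch.nn as nn

class Inner(nn.Module):
    def __init__(self):
        super().__init__()
        self.torch_params = nn.ParameterList([nn.Parameter(torch.zeros(3))])

class Outer(nn.Module):
    def __init__(self):
        super().__init__()
        self.torch_params = nn.ParameterList([nn.Parameter(torch.ones(2))])
        self.inner = Inner()  # attribute assignment registers the sub-module

outer = Outer()
# recurse=True walks sub-modules, so Inner's parameter is found even though
# Outer never registered it directly.
names = [name for name, _ in outer.named_parameters(recurse=True)]
print(names)
```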

I also added a few torch-specific tests to reflect some of the assumptions and use cases a torch user might have, e.g. using state_dict.

@codecov-commenter commented Jun 23, 2024

Codecov Report

Attention: Patch coverage is 2.17391% with 45 lines in your changes missing coverage. Please review.

Project coverage is 73.44%. Comparing base (c8a7f28) to head (de3479c).
Report is 17 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| keras/src/backend/torch/layer.py | 0.00% | 39 Missing ⚠️ |
| keras/src/backend/torch/trainer.py | 0.00% | 4 Missing ⚠️ |
| keras/src/testing/test_case.py | 0.00% | 1 Missing and 1 partial ⚠️ |

❗ There is a different number of reports uploaded between BASE (c8a7f28) and HEAD (de3479c). Click for more details.

HEAD has 1 upload less than BASE:

| Flag | BASE (c8a7f28) | HEAD (de3479c) |
|---|---|---|
| keras | 4 | 3 |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master   #19908      +/-   ##
==========================================
- Coverage   79.01%   73.44%   -5.57%     
==========================================
  Files         499      499              
  Lines       46441    46476      +35     
  Branches     8550     8556       +6     
==========================================
- Hits        36694    34134    -2560     
- Misses       8020    10670    +2650     
+ Partials     1727     1672      -55     
```
| Flag | Coverage Δ |
|---|---|
| keras | 73.37% <2.17%> (-5.51%) ⬇️ |
| keras-jax | 62.41% <2.17%> (+<0.01%) ⬆️ |
| keras-numpy | 57.21% <2.17%> (-0.01%) ⬇️ |
| keras-tensorflow | 63.60% <2.17%> (-0.04%) ⬇️ |
| keras-torch | ? |

Flags with carried forward coverage won't be shown.


@haohuanw (Author) commented

The failing pytorch test is actually passing in my env:

```
(keras-dev-minimum) haohuanw@haohuanw-ThinkPad-X1-Extreme:~/Documents/keras$ KERAS_BACKEND=torch python integration_tests/numerical_test.py 
2024-06-23 16:13:12.028332: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-23 16:13:12.031879: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-23 16:13:12.080855: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-23 16:13:12.900432: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-06-23 16:13:14.305362: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-06-23 16:13:14.305867: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Checking training histories:
accuracy:
[0.20999999344348907]
[0.20999999344348907]
loss:
[2.6727349758148193]
[2.6727302074432373]
mae:
[0.1606517732143402]
[0.16065318882465363]
Training histories match.

Checking trained weights:
Trained weights match.

Checking predict:
Predict results match.

Checking evaluate:
[2.2113966941833496, 0.17798443138599396, 0.3799999952316284]
[2.2114176750183105, 0.17798538506031036, 0.3799999952316284]
Evaluate results match.
```

@fchollet (Member) commented

> if https://github.com/keras-team/keras/blob/master/keras/src/random/seed_generator.py#L80 can be changed to something like f"{self.name}_generator_state" it will work with ParameterDict approach.

The uniqueness of the variable path should come from the parent object name, not from the variable name (e.g. "dense_1/kernel"). What paths do you currently see for seed generators?

@haohuanw (Author) commented

> > if https://github.com/keras-team/keras/blob/master/keras/src/random/seed_generator.py#L80 can be changed to something like f"{self.name}_generator_state" it will work with ParameterDict approach.
>
> The uniqueness of the variable path should come from the parent object name, not from the variable name (e.g. "dense_1/kernel"). What paths do you currently see for seed generators?

For the seed generator, using the path gives seed_generator_{idx}/seed_generator_state, while using the name gives just seed_generator_state. The spirit of this change is to let the torch module handle the recursive collection, so I was planning to use the variable name, but found out that there are collisions on the seed generator state.
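The collision being described can be reproduced with a bare nn.ParameterDict (a minimal sketch, not Keras code): keying on the bare variable name silently overwrites an entry when two seed generators in one layer produce the same "seed_generator_state" name.

```python
# Minimal sketch of the name collision: a ParameterDict keyed on bare
# variable names drops one of two parameters that share a name.
import torch
import torch.nn as nn

params = nn.ParameterDict()
# Two seed generators in the same layer would both produce this bare name:
params["seed_generator_state"] = nn.Parameter(torch.tensor([0.0]))
params["seed_generator_state"] = nn.Parameter(torch.tensor([1.0]))  # overwrites
print(len(params))  # 1 -- the first state was silently lost
```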

@fchollet (Member) commented

> the spirit of this change is to let torch module to handle the recursive collection so i was planning to use variable name but find out that there are collisions on seed generator state.

Variable names are never unique. For a unique string you can use variable.path.

@haohuanw (Author) commented Jun 24, 2024

> > the spirit of this change is to let torch module to handle the recursive collection so i was planning to use variable name but find out that there are collisions on seed generator state.
>
> Variable names are never unique. For a unique string you can use variable.path.

I thought that within the same layer, variable names (excluding variables from sub-layers) should be unique as an implicit requirement, since otherwise variable.path would also be non-unique?

My original thought was that self.torch_params would only hold the layer's own variables, excluding variables in sub-layers, since those are collected automatically when calling named_parameters(): all sub-layers are properly recognized as sub torch modules and the recurse option is respected (e.g. https://github.com/pytorch/pytorch/blob/662e9e10766b040bea000e18e54a4f9e69889fc1/torch/nn/modules/module.py#L2496C20-L2496C34, where _named_members includes all registered sub-layers).

Then I found out that seed generators can create variables with the same name when there are multiple seed generators in one layer, since a seed generator is not a layer.

@haohuanw (Author) commented

I also notice that I probably want to add a test with a nested seed generator. In theory, seed states should be recursively collected by torch, since it basically gathers module._parameters for all submodules.

@fchollet (Member) left a comment

Code looks good -- thanks for the changes. I will apply docstring fixes after merging.

@google-ml-butler bot added the kokoro:force-run and ready to pull (Ready to be merged into the codebase) labels on Jun 24, 2024
@fchollet (Member) commented

> the failing pytorch test is actually passing on my env:

Works for me locally as well. Might be a fluke.

@fchollet (Member) commented

There are actually various tests that reliably fail here: https://btx.cloud.google.com/invocations/c55a2ca4-5df3-411b-bd52-7c9873e839ce/targets/keras%2Fgithub%2Fubuntu%2Fgpu%2Ftorch%2Fpresubmit/log (not the numerical integration test)

@haohuanw (Author) commented Jun 24, 2024

> There are actually various tests that reliably fail here: https://btx.cloud.google.com/invocations/c55a2ca4-5df3-411b-bd52-7c9873e839ce/targets/keras%2Fgithub%2Fubuntu%2Fgpu%2Ftorch%2Fpresubmit/log (not the numerical integration test)

I will address those today/tomorrow 👍. Also, is it possible to configure CI to run pytest regardless of whether the integration test passes?

@fchollet (Member) commented

> is it possible to configure ci to run pytest regardless whether integration test passes or not?

We'd have to move the integration testing to run after the general pytest command in .github/workflows/actions.yml (job name: Run tests).

@google-ml-butler bot removed the ready to pull (Ready to be merged into the codebase) label on Jun 25, 2024
@haohuanw (Author) commented

I am seeing a weird issue in keras/src/dtype_policies/dtype_policy_map_test.py::DTypePolicyMapTest::test_basic_usage, with an error like this:


```
error_msgs = {133475339071600: (<Dense name=subclass_dense, built=True>, ValueError("Layer 'subclass_dense' expected 4 variables, but received 3 variables during loading. Expected: ['bias', 'kernel', 'kernel', 'kernel_scale']"))}
warn_only = False

    def _raise_loading_failure(error_msgs, warn_only=False):
        first_key = list(error_msgs.keys())[0]
        ex_saveable, ex_error = error_msgs[first_key]
        msg = (
            f"A total of {len(error_msgs)} objects could not "
            "be loaded. Example error message for "
            f"object {ex_saveable}:\n\n"
            f"{ex_error}\n\n"
            "List of objects that could not be loaded:\n"
            f"{[x[0] for x in error_msgs.values()]}"
        )
        if warn_only:
            warnings.warn(msg)
        else:
>           raise ValueError(msg)
E           ValueError: A total of 1 objects could not be loaded. Example error message for object <Dense name=subclass_dense, built=True>:
E           
E           Layer 'subclass_dense' expected 4 variables, but received 3 variables during loading. Expected: ['bias', 'kernel', 'kernel', 'kernel_scale']
E           
E           List of objects that could not be loaded:
E           [<Dense name=subclass_dense, built=True>]
```

I was able to isolate that the model JSON looks good, but the restored model here: https://github.com/keras-team/keras/blob/master/keras/src/saving/saving_lib.py#L242 has a duplicated <KerasVariable shape=(4, 8), dtype=int8, path=subclass/subclass_dense/kernel>. Any idea how this is related to this change?

@fchollet (Member) commented

I don't understand the connection. You could try pruning things from your change until the test passes, then you'll have a good idea what particular lines are causing the issue.

@haohuanw (Author) commented

@fchollet most of the unit tests are fixed, with one issue left: torch basically requires users to wrap sub-modules in nn.ModuleList, so for Keras 3, any sub-layers passed directly in a plain list won't be tracked.
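This torch behavior is easy to demonstrate in isolation (plain torch, illustrative class names):

```python
# Sub-modules held in a plain Python list are invisible to torch's
# parameter tracking; nn.ModuleList registers them properly.
import torch
import torch.nn as nn

class WithPlainList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(2, 2)]  # NOT registered as a sub-module

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(2, 2)])  # registered

print(len(list(WithPlainList().parameters())))   # 0
print(len(list(WithModuleList().parameters())))  # 2 (weight + bias)
```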

Here are the options I think could work; let me know your thoughts:

  1. Create a keras.layers.LayerList that mimics nn.ModuleList, so all modules can be properly tracked by torch. A check in setattr could be added to automatically wrap a list[Layer] in a LayerList.
  2. Specific to the torch backend, when a list[Layer] is observed, wrap it in nn.ModuleList() and then in TorchModuleWrapper(). This might work, but I will need to double-check the parameter tracking logic.
  3. (Doesn't seem to work.) Specific to the torch backend, when a list[Layer] is observed, also call self.register_module() in the setattr hook to double-register the layer. I tried this, and it works in most cases except serialization, since the setattr hook is not called during deserialization.

Let me know what you think.

@fchollet added the keras-team-review-pending (Pending review by a Keras team member) label on Jun 27, 2024
@fchollet (Member) commented

> specific in torch backend, when a list[Layer] is observed, wrap it in nn.ModuleList() and then wrap it with TorchModuleWrapper(). this might work but i will need to double check the parameter tracking logic.

I think we could do this, via __setattr__ in TorchLayer. There should not be any downsides?
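A hedged sketch of that idea with a plain torch module (TrackingModule is hypothetical, not Keras's TorchLayer): intercept list-of-module assignments in __setattr__ and wrap them in nn.ModuleList before handing them to torch.

```python
# Hypothetical illustration of auto-wrapping lists of modules in
# __setattr__ so that torch registers their parameters.
import torch
import torch.nn as nn

class TrackingModule(nn.Module):
    def __setattr__(self, name, value):
        if (
            isinstance(value, list)
            and value
            and all(isinstance(v, nn.Module) for v in value)
        ):
            value = nn.ModuleList(value)  # auto-wrap so parameters register
        super().__setattr__(name, value)

class Model(TrackingModule):
    def __init__(self):
        super().__init__()
        self.blocks = [nn.Linear(2, 2), nn.Linear(2, 2)]  # silently wrapped

m = Model()
print(type(m.blocks).__name__)    # ModuleList
print(len(list(m.parameters())))  # 4 (two weights, two biases)
```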

@haohuanw (Author) commented

> > specific in torch backend, when a list[Layer] is observed, wrap it in nn.ModuleList() and then wrap it with TorchModuleWrapper(). this might work but i will need to double check the parameter tracking logic.
>
> I think we could do this, via __setattr__ in TorchLayer. There should not be any downsides?

in theory - let me try it

@haohuanw (Author) commented

> > specific in torch backend, when a list[Layer] is observed, wrap it in nn.ModuleList() and then wrap it with TorchModuleWrapper(). this might work but i will need to double check the parameter tracking logic.
>
> I think we could do this, via __setattr__ in TorchLayer. There should not be any downsides?

It technically works, but I think it would be a pretty impactful workflow change for pytorch users:

  1. Every time a layer list is referenced (for example to run forward or to quantize those layers), the code needs to change from for l in self.layers to for l in self.layers.module, which is not really ideal and is specific to torch.

  2. Another issue is re-tracking the parameters: the current idea is to have every layer track only its own variables, with nesting handled recursively, so additional logic is needed for the special case where a Keras layer is wrapped in the torch wrapper.

I think supporting a keras.LayerList is actually a cleaner approach (not sure if it introduces any challenges for serialization) to better support the pytorch backend without much impact on the tf/jax side. What we can do is make this an opt-in feature: warn users in TorchLayer that they have to use keras.LayerList so that torch params are properly tracked, while users of other backends don't need to worry about it.

Labels: keras-team-review-pending (Pending review by a Keras team member), size:L
5 participants