Training with Rational Activations on very deep ResNets. #3

Open
23Uday opened this issue May 28, 2023 · 1 comment
23Uday commented May 28, 2023

Hi,
I am using your PyTorch implementation to train a rational ResNet-164 on CIFAR-10. While the model behaves well for ResNets with 18-38 layers, I cannot get very deep ResNets to train without dramatically lowering the learning rate.
Here is one example with --lr 1e-6 --wd 1e-5:
Train Epoch: 0 [0/47500 (0%)] Loss: 2.517
Train Epoch: 0 [1920/47500 (4%)] Loss: nan
While I understand that the model with rational activations is supposed to represent a rational function of degree 3^(number of layers), the training process for deeper models isn't clear to me.
Could you provide some help?

NBoulle (Owner) commented May 30, 2023

Thanks for your interest in our work. We haven't tried training very deep rational networks, so my intuition is limited here. It is possible that the weight initialization has a bad effect on the rational layers as the depth increases. One potential remedy would be to fine-tune a pretrained ReLU ResNet by replacing its activation functions with rationals and training only the rational functions.
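A minimal sketch of that fine-tuning setup, assuming a hypothetical `Rational` module (the class name, the (3,2) coefficient layout, and the initial values below are assumptions for illustration, not the repo's actual API):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18  # stand-in; in practice load your pretrained ReLU ResNet-164

class Rational(nn.Module):
    """Trainable type-(3,2) rational activation P(x)/Q(x).

    The coefficient initialization here is a placeholder, not the paper's
    ReLU-approximating initialization.
    """
    def __init__(self):
        super().__init__()
        self.p = nn.Parameter(torch.tensor([0.0, 1.0, 0.5, 0.02]))  # numerator, degree 3
        self.q = nn.Parameter(torch.tensor([0.0, 1.0]))             # denominator, degree 2 (constant term fixed to 1)

    def forward(self, x):
        num = self.p[0] + self.p[1] * x + self.p[2] * x ** 2 + self.p[3] * x ** 3
        den = 1.0 + self.q[0] * x + self.q[1] * x ** 2
        return num / den

def replace_relu_with_rational(module):
    """Recursively swap every nn.ReLU submodule for a fresh Rational."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, Rational())
        else:
            replace_relu_with_rational(child)

model = resnet18()  # load your pretrained ReLU weights here instead
replace_relu_with_rational(model)

# Freeze everything except the rational coefficients, so only the activations are trained.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith((".p", ".q"))

optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)
```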
I'm curious to see why the loss becomes NaN in your example. Perhaps you could plot the different rational functions (there should be approximately one function per layer) to see whether one of them becomes singular (develops a simple pole) and which layer is affected.
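One way to run that check, continuing with the hypothetical `Rational` module from the snippet above (the input range and tolerance are arbitrary choices):

```python
import matplotlib.pyplot as plt
import torch

@torch.no_grad()
def inspect_rationals(model, x_range=(-5.0, 5.0), n=1001, tol=1e-3):
    """Plot every rational activation and flag those whose denominator nearly vanishes."""
    xs = torch.linspace(x_range[0], x_range[1], n)
    for name, module in model.named_modules():
        if isinstance(module, Rational):
            den = 1.0 + module.q[0] * xs + module.q[1] * xs ** 2
            ys = module(xs)
            if den.abs().min().item() < tol or not torch.isfinite(ys).all():
                print(f"{name}: min |denominator| = {den.abs().min().item():.2e} -- possible pole")
            plt.plot(xs.numpy(), ys.numpy(), alpha=0.3)
    plt.xlabel("x")
    plt.ylabel("rational(x)")
    plt.show()

inspect_rationals(model)
```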
Finally, depending on the result of the above suggestion, there could be numerical instabilities due to the overall network representing a rational function of very large degree (3^164). One option would be to use rational functions for the first few layers (like the 18-38 layers in your experiments, to benefit from the extra approximation power) and ReLU for the rest of the network.
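A sketch of that hybrid layout, reusing `replace_relu_with_rational` from the first snippet on a torchvision-style ResNet with stages `layer1`..`layer4` (the CIFAR-10 ResNet-164 is structured differently, so treat this purely as an illustration):

```python
from torchvision.models import resnet18

hybrid = resnet18()                        # stand-in for a deep ResNet
replace_relu_with_rational(hybrid.layer1)  # rational activations in the early stages only
replace_relu_with_rational(hybrid.layer2)
# hybrid.layer3, hybrid.layer4 and the stem keep their original ReLUs,
# so the composed rational degree stays bounded while the early layers
# keep the extra approximation power.
```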
