Very slow algorithm, is that normal? #6
We did experience longer training times with SPE, but not 6×, more like 2× in the case of […]. Note that we shared the positional codes across layers in most of our experiments; this means storing the result of […] once and reusing it across layers. Also, due to sample-wise sharing (i.e. among samples within a batch), SPE benefits from large batch sizes. If your batch size is small, you may indeed get a big performance hit.
We did not do this in our experiments; we simply used the 1D indexing that the APE baseline uses. It should be possible to achieve 2D indexing with […].
Dear @lucastononrodrigues, sorry for the delay. I would add that you can actually implement sineSPE with 2D signals quite straightforwardly. The trick would be to replace the term

\sum_{k=1}^K \lambda_{kd}^2 \cos(2\pi f_{kd} (m-n) + \theta_{kd})

in equation (18) of the paper by its 2D-vector counterpart

\sum_{k=1}^K \lambda_{kd}^2 \cos(2\pi \textbf{f}_{kd}^\top (\textbf{m}-\textbf{n}) + \theta_{kd}),

and this can be implemented pretty straightforwardly by simply replacing the […].

I don't have the time right now to do it, but I would be glad to get a pull request for it, or we may also collaborate on this further, for instance on a fork of yours that we would merge later? I would also be interested, of course, in identifying what exactly is slowing down your experiments, so that we may work on this.

Best,
antoine
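To make the 2D generalization concrete, here is a minimal sketch (an illustration, not code from this repository) that evaluates the resulting positional pattern P_d(m − n) on an H × W grid of patch positions, with each frequency f_{kd} promoted to a 2D vector; all sizes and initializations below are assumed. Note that the actual sineSPE approximates this kernel with random features rather than materializing the full T × T matrix.

```python
import torch

# Illustrative sizes (assumed): K sine components, D query/key feature dims, H x W patch grid.
K, D, H, W = 4, 32, 8, 8

# sineSPE parameters, generalized to 2D: each frequency f_{kd} is now a 2-vector.
freqs  = torch.randn(K, D, 2) * 0.1   # f_{kd} in R^2 (arbitrary initialization)
phases = torch.rand(K, D)             # theta_{kd}
gains  = torch.randn(K, D)            # lambda_{kd}

# 2D positions m = (row, col) for every patch, flattened to length T = H * W.
rows, cols = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
pos = torch.stack([rows, cols], dim=-1).reshape(-1, 2).float()   # (T, 2)

# Pairwise displacements m - n, shape (T, T, 2).
disp = pos[:, None, :] - pos[None, :, :]

# P_d(m, n) = sum_k lambda_{kd}^2 * cos(2*pi * f_{kd}^T (m - n) + theta_{kd})
angle = 2 * torch.pi * torch.einsum("kdc,mnc->kdmn", freqs, disp) + phases[:, :, None, None]
P = (gains[:, :, None, None] ** 2 * torch.cos(angle)).sum(dim=0)   # (D, T, T)

print(P.shape)  # torch.Size([32, 64, 64])
```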
Hello,
I implemented the algorithm in the vision transformer architecture in the following way:
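Roughly along these lines (a simplified sketch rather than my exact code; the SineSPE and SPEFilter module names follow this repository, but the import path, constructor arguments, and tensor layout here are assumptions):

```python
import torch
from spe import SineSPE, SPEFilter   # names taken from this repo; exact import path/signatures may differ

# Setup roughly matching the run described below: 6 heads, 384-dim embedding (64 per head),
# CIFAR with patch_size=4 -> 8x8 = 64 patches per image.
num_heads, head_dim, seq_len, batch = 6, 64, 64, 32

# Positional-code generator and the filter that combines the codes with queries/keys.
# (num_realizations / num_sines values here are placeholders, not necessarily my settings.)
spe_encoder = SineSPE(num_heads=num_heads, in_features=head_dim,
                      num_realizations=64, num_sines=5)
spe_filter = SPEFilter(gated=True, code_shape=spe_encoder.code_shape)

# Queries/keys from the attention projections, assumed shape (batch, seq, heads, head_dim).
queries = torch.randn(batch, seq_len, num_heads, head_dim)
keys = torch.randn(batch, seq_len, num_heads, head_dim)

# Draw positional codes for this (batch, length) shape, filter q/k with them,
# then feed the filtered queries/keys to the Performer attention as usual.
pos_codes = spe_encoder(queries.shape[:2])
queries, keys = spe_filter(queries, keys, pos_codes)
```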
The model I am using has 4 layers, 6 heads, and embedding dimension 384, with patch_size=4.
Training for 100 epochs on CIFAR-100 converges to 42.3% with SPE and to 45.3% without it. Although this may be expected, with SPE the training time is around 6x longer; is that normal?
Performers + ViT takes 39 minutes.
Performers + ViT + SPE takes around 4 hours.
For both I am using 2 Titan XP GPUs.
This is very problematic for me because I was considering scaling up these experiments to ImageNet.
I would also like to know how I can implement the indexing T = N^2 for images (where did you do it in the LRA benchmark?), as described in Section 2 of the paper.
Many thanks!