
Maxvit #65

Open · wants to merge 69 commits into main
Conversation

isamu-isozaki
Collaborator

Draft PR where I'm adapting MaxViT from lucidrains' code. The corresponding issue is here

@isamu-isozaki isamu-isozaki marked this pull request as draft May 1, 2023 01:31
@isamu-isozaki isamu-isozaki marked this pull request as ready for review May 2, 2023 02:15
@isamu-isozaki
Collaborator Author

Needs testing, but the main part of the code should be done.

@isamu-isozaki isamu-isozaki changed the title Maxvit WIP: Maxvit May 5, 2023
@isamu-isozaki
Collaborator Author

It works on random input. After testing the VQGAN on WebDataset with the sbatch script, I'll test this too.

@isamu-isozaki
Collaborator Author

I added a custom TransformerLayer for MaxViT. Let me know if anyone has ideas on formulating this differently! My next step is mainly to test it out and compare the VRAM usage. Then I'll open it for review.
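
For reference, here's a minimal sketch of the layer structure I mean (not the actual code in this PR): plain `nn.MultiheadAttention` stands in for the project's attention module, norms are omitted, and einops handles the window/grid reshapes.

```python
import torch
import torch.nn as nn
from einops import rearrange


def feed_forward(dim, mult=4):
    return nn.Sequential(nn.Linear(dim, dim * mult), nn.GELU(), nn.Linear(dim * mult, dim))


class MaxVitTransformerLayer(nn.Module):
    """Block attention (within local windows) followed by grid attention (across a dilated grid)."""

    def __init__(self, dim, heads=8, window_size=8):
        super().__init__()
        self.window_size = window_size
        self.block_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.block_ff = feed_forward(dim)
        self.grid_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.grid_ff = feed_forward(dim)

    def forward(self, x):
        # x: (batch, height, width, dim); height and width must be divisible by window_size
        b, h, _, _ = x.shape
        p = self.window_size

        # block attention: partition into non-overlapping p x p windows and attend inside each
        x = rearrange(x, "b (nh p1) (nw p2) d -> (b nh nw) (p1 p2) d", p1=p, p2=p)
        x = x + self.block_attn(x, x, x, need_weights=False)[0]
        x = x + self.block_ff(x)
        x = rearrange(x, "(b nh nw) (p1 p2) d -> b (nh p1) (nw p2) d", b=b, nh=h // p, p1=p)

        # grid attention: same window size, but the p x p axes are strided across the whole image
        x = rearrange(x, "b (p1 nh) (p2 nw) d -> (b nh nw) (p1 p2) d", p1=p, p2=p)
        x = x + self.grid_attn(x, x, x, need_weights=False)[0]
        x = x + self.grid_ff(x)
        x = rearrange(x, "(b nh nw) (p1 p2) d -> b (p1 nh) (p2 nw) d", b=b, nh=h // p, p1=p)
        return x


layer = MaxVitTransformerLayer(dim=64, heads=4, window_size=8)
print(layer(torch.randn(2, 16, 16, 64)).shape)  # torch.Size([2, 16, 16, 64])
```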

@isamu-isozaki
Collaborator Author

I checked the paper and noticed I was missing the latter half:

[image: missing steps]

@isamu-isozaki
Collaborator Author

Fixed!

@isamu-isozaki
Collaborator Author

The code now runs without any shape errors, but I'm noticing that the MaxViT layers OOM while the counterpart doesn't. I think I'm initializing some parameters to be too large, which I plan to check tomorrow.
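
In case it's useful, this is the kind of quick audit I mean (a sketch; `model` here is just a stand-in for the actual MaxViT transformer being debugged): list the largest parameters by element count to spot anything initialized too large.

```python
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in: replace with the MaxViT transformer being debugged

# sort parameters by element count, largest first, and print the top ten
sizes = sorted(((p.numel(), name, tuple(p.shape)) for name, p in model.named_parameters()), reverse=True)
for numel, name, shape in sizes[:10]:
    print(f"{name:<40} {shape!s:<20} {numel:,}")
```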

@isamu-isozaki
Collaborator Author

Ok! I found that the main memory issue was the feed-forward networks in each transformer layer. They have the most parameters of anything in a transformer layer, and in MaxViT we need 3 of them instead of just 1, which makes the memory usage per layer roughly 3 times higher. I've fixed it so the model is now only about 2 times the size. The checklist now is

  1. Do some batch-size tests
  2. Resolve conflicts
  3. Open for review
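
To illustrate the feed-forward point above, here's a rough back-of-the-envelope parameter count (hypothetical dimensions, not this repo's actual config): the FFNs dominate the per-layer parameter count, so going from one to three roughly triples it.

```python
import torch.nn as nn


def feed_forward(dim, mult=4):
    return nn.Sequential(nn.Linear(dim, dim * mult), nn.GELU(), nn.Linear(dim * mult, dim))


def n_params(module):
    return sum(p.numel() for p in module.parameters())


dim = 1024  # hypothetical hidden size
attn = nn.MultiheadAttention(dim, num_heads=16)

print(f"attention:            {n_params(attn):>12,}")                   # ~4.2M
print(f"one feed-forward:     {n_params(feed_forward(dim)):>12,}")      # ~8.4M
print(f"three feed-forwards:  {3 * n_params(feed_forward(dim)):>12,}")  # ~25.2M
```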

@isamu-isozaki
Collaborator Author

isamu-isozaki commented Aug 12, 2023

|  | Without MaxViT | With MaxViT |
| --- | --- | --- |
| memory allocated before training | 3.43 GB | 6.26 GB |
| max memory allocated after one forward step | 11.61 GB | 15.32 GB |
| max memory allocated after optimizer step | 12.88 GB | 27.91 GB |

I think the main way Google got around this higher VRAM usage is by using optimizers like Lion or Adafactor instead of AdamW, since AdamW keeps two extra state buffers (the first and second moments) the same size as the model's weights, while Lion keeps only one.

With Lion:
with MaxViT = 17.63 GB
without MaxViT = 11.61 GB

So relative to the size of the model's weights, the overhead is smaller with MaxViT.
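
For reference, numbers like the ones above can be collected with `torch.cuda.max_memory_allocated` (a sketch with a stand-in model, optimizer, and batch, not the exact training script used here):

```python
import torch
import torch.nn as nn

# stand-ins for the real model / optimizer / batch
model = nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # swap for Lion to compare
batch = torch.randn(64, 4096, device="cuda")


def gb(n_bytes):
    return f"{n_bytes / 1024 ** 3:.2f}GB"


torch.cuda.reset_peak_memory_stats()
print("memory allocated before training =", gb(torch.cuda.memory_allocated()))

out = model(batch)
print("max memory allocated after one forward step =", gb(torch.cuda.max_memory_allocated()))

out.mean().backward()
optimizer.step()  # optimizer state is materialized here, which moves the peak
optimizer.zero_grad()
print("max memory allocated after optimizer step =", gb(torch.cuda.max_memory_allocated()))
```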

@isamu-isozaki isamu-isozaki changed the title WIP: Maxvit Maxvit Aug 13, 2023
@isamu-isozaki
Collaborator Author

@williamberman @patil-suraj @sayakpaul @pcuenca I think I'm pretty much done. Let me know if there are any experiments or code changes you'd recommend!

The TLDR for this PR: this is the attention format Google used for the second stage of Muse to reduce VRAM usage with the longer sequence length that comes from using an f8 VQGAN instead of an f16 VQGAN. This PR is heavily inspired by lucidrains' MaxViT implementation here.
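
To make the sequence-length point concrete, here's a back-of-the-envelope comparison (assuming 256x256 images and a hypothetical window size of 8; the actual resolutions and configs may differ):

```python
image_size = 256  # assumed for illustration
window = 8        # hypothetical MaxViT window size

for f in (16, 8):
    tokens = (image_size // f) ** 2       # f16 -> 16x16 = 256 tokens, f8 -> 32x32 = 1024 tokens
    full_attn = tokens ** 2               # pairwise scores with full self-attention
    windowed_attn = tokens * window ** 2  # each token only attends within its window/grid
    print(f"f{f}: {tokens} tokens | full attention scores: {full_attn:,} | "
          f"block or grid attention scores: {windowed_attn:,}")
```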
