[LoRA] Implement hot-swapping of LoRA #9453
Conversation
This PR adds the possibility to hot-swap LoRA adapters. It is WIP.

Description

As of now, users can already load multiple LoRA adapters. They can offload existing adapters or they can unload them (i.e. delete them). However, they cannot "hotswap" adapters yet, i.e. substitute the weights from one LoRA adapter with the weights of another, without the need to create a separate LoRA adapter.

Generally, hot-swapping may not appear super useful, but when the model is compiled, it is necessary to prevent recompilation. See huggingface#9279 for more context.

Caveats

To hot-swap a LoRA adapter for another, these two adapters should target exactly the same layers and the "hyper-parameters" of the two adapters should be identical. For instance, the LoRA alpha has to be the same: given that we keep the alpha from the first adapter, the LoRA scaling would otherwise be incorrect for the second adapter.

Theoretically, we could override the scaling dict with the alpha values derived from the second adapter's config, but changing the dict will trigger a guard for recompilation, defeating the main purpose of the feature.

I also found that compilation flags can have an impact on whether this works or not. E.g. when passing "reduce-overhead", there will be errors of the type:

> input name: arg861_1. data pointer changed from 139647332027392 to 139647331054592

I don't know enough about compilation to determine whether this is problematic or not.

Current state

This is obviously WIP right now to collect feedback and discuss which direction to take this. If this PR turns out to be useful, the hot-swapping functions will be added to PEFT itself and can be imported here (or there is a separate copy in diffusers to avoid the need for a min PEFT version to use this feature).

Moreover, more tests need to be added to better cover this feature, although we don't necessarily need tests for the hot-swapping functionality itself, since those tests will be added to PEFT.

Furthermore, as of now, this is only implemented for the unet. Other pipeline components have yet to implement this feature.

Finally, it should be properly documented.

I would like to collect feedback on the current state of the PR before putting more time into finalizing it.
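To make the mechanics concrete: the swap essentially boils down to copying the new adapter's weights into the already-loaded LoRA parameters in place, so the compiled model keeps seeing the same tensor objects. A rough sketch of that idea (not the actual PEFT/diffusers code; the helper below is made up):

```python
import torch

@torch.no_grad()
def hotswap_lora_weights(module: torch.nn.Module, new_state_dict: dict):
    """Hypothetical helper: overwrite existing LoRA weights of `module` in place."""
    params = dict(module.named_parameters())
    for name, new_weight in new_state_dict.items():
        if name not in params:
            raise ValueError(f"{name} not found: both adapters must target the same layers")
        param = params[name]
        # In-place copy keeps the same parameter object, so a compiled graph
        # does not see a new input and does not need to recompile.
        param.data.copy_(new_weight.to(device=param.device, dtype=param.dtype))
```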
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks a lot for working on this. I left some comments.
cc @apolinario
Do most LoRAs have the same scaling?
So I played around a little bit, I have two main questions:

1. Do we support hotswap with different LoRA ranks? The rank config does not seem to be checked.
2. I think we should also look into supporting hot-swap with different scaling. I checked some popular LoRAs on our hub, and I think most of them have different ranks/alphas, so this feature will be a lot more impactful if we are able to support different rank & scaling.

Based on this thread #9279, I understand that the change in the "scaling" dict would trigger a recompilation. But maybe there are ways to avoid it? For example, this triggers a recompile:

```python
import torch

scaling = {}

def fn(x, key):
    return x * scaling[key]

opt_fn = torch.compile(fn, backend="eager")

x = torch.rand(4)
scaling["first"] = 1.0
opt_fn(x, "first")
print(f" finish first run, updating scaling")
scaling["first"] = 2.0
opt_fn(x, "first")
```

this won't:

```python
import torch

scaling = {}

def fn(x, key):
    return x * scaling[key]

opt_fn = torch.compile(fn, backend="eager")

x = torch.rand(4)
scaling["first"] = torch.tensor(1.0)
opt_fn(x, "first")
print(f" finish first run, updating scaling")
scaling["first"] = torch.tensor(2.0)
opt_fn(x, "first")
```

I'm very excited about having this in diffusers! I think it would be a super nice feature, especially for production use cases :)
I agree with your point on supporting LoRAs with different scaling in this context. With backend="eager", we may not get the full benefits of torch.compile, though. A good way to verify it would be to measure the performance of a pipeline compiled with the eager backend. Cc: @anijain2305.
I will let @BenjaminBossan comment further, but this might require a lot of changes within the tuner modules inside peft.
Thanks for all the feedback. I haven't forgotten about this PR, I was just occupied with other things. I'll come back to this as soon as I have a bit of time on my hands. The idea of using a tensor instead of float for scaling is intriguing, thanks for testing it. It might just work OOTB, as torch broadcasts 0-dim tensors automatically. Another possibility would be to multiply the scaling directly into one of the weights, so that the original alpha can be retained, but that is probably very error prone. Regarding different ranks, I have yet to test that.
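For illustration, folding the scaling into one of the LoRA matrices could look like the sketch below. Since the LoRA delta is scaling * (B @ A), rescaling B by the ratio of new to old scaling lets the registered scaling (and thus the compile guards) stay untouched; the function here is hypothetical:

```python
import torch

def fold_scaling_into_lora_B(lora_B: torch.Tensor, old_scaling: float, new_scaling: float) -> torch.Tensor:
    # old_scaling * ((new_scaling / old_scaling) * B) @ A == new_scaling * (B @ A),
    # so the layer can keep reporting `old_scaling` while producing the new adapter's output.
    return lora_B * (new_scaling / old_scaling)
```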
Yes,
If different ranks become a problem, then https://huggingface.co/sayakpaul/lower-rank-flux-lora could provide a meaningful direction.
Indeed, although avoiding recompilation altogether with different ranks would be even better for real-time swap applications.
yep can be a nice feature indeed!
Indeed. For different ranks, a few things come to mind:
A reverse direction of what I showed in #9453 is also possible (increase the rank of a LoRA):
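Along those lines, one way to make differing ranks compatible with a fixed compiled graph would be to zero-pad both adapters to a shared maximum rank: the padded rows and columns contribute nothing to B @ A, so the delta is unchanged while tensor shapes stay constant between swaps. A sketch under the usual LoRA shape conventions (the function name is illustrative):

```python
import torch

def pad_lora_to_rank(lora_A: torch.Tensor, lora_B: torch.Tensor, max_rank: int):
    # lora_A: (r, in_features), lora_B: (out_features, r).
    r = lora_A.shape[0]
    if r > max_rank:
        raise ValueError(f"adapter rank {r} exceeds max_rank {max_rank}")
    pad_A = torch.zeros(max_rank - r, lora_A.shape[1], dtype=lora_A.dtype, device=lora_A.device)
    pad_B = torch.zeros(lora_B.shape[0], max_rank - r, dtype=lora_B.dtype, device=lora_B.device)
    # Zero-padding keeps lora_B @ lora_A identical but fixes the shapes across adapters.
    return torch.cat([lora_A, pad_A], dim=0), torch.cat([lora_B, pad_B], dim=1)
```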
hi @BenjaminBossan, I made some changes and they work for the 4 LoRAs I tested (all with different ranks and scaling). I'm not as familiar with peft and just made enough changes for the purpose of the experiment & to provide a reference point, so the code is very hacky there, sorry for that! To test:

```python
# testing hotswap PR
# TORCH_LOGS="guards,recompiles" TORCH_COMPILE_DEBUG=1 TORCH_LOGS_OUT=traces.txt python yiyi_test_3.py
from diffusers import DiffusionPipeline
import torch
import time

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

branch = "test-hotswap"

loras = [
    "Norod78/sd15-megaphone-lora",  # rank 16, scaling 0.5
    "artificialguybr/coloringbook-redmond-1-5v-coloring-book-lora-for-liberteredmond-sd-1-5",  # rank 64, scaling 1.0
    "Norod78/SD15-Rubber-Duck-LoRA",  # rank 16, scaling 0.5
    "wooyvern/sd-1.5-dark-fantasy-1.1",  # rank 128, scaling 1.0
]

prompts = [
    "Marge Simpson holding a megaphone in her hand with her town in the background",
    "A lion, minimalist, Coloring Book, ColoringBookAF",
    "The girl with a pearl earring Rubber duck",
    "<lora:fantasyV1.1:1>, a painting of a skeleton with a long cloak and a group of skeletons in a forest with a crescent moon in the background, David Wojnarowicz, dark art, a screenprint, psychedelic art",
]

def print_rank_scaling(pipe):
    print(f" rank: {pipe.unet.peft_config['default_0'].r}")
    print(f" scaling: {pipe.unet.down_blocks[0].attentions[0].proj_in.scaling}")

# pipe_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipe_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda")

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

pipe.unet = pipe.unet.to(memory_format=torch.channels_last)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

for i, (lora_repo, prompt) in enumerate(zip(loras, prompts)):
    hotswap = False if i == 0 else True
    print(f"\nProcessing LoRA {i}: {lora_repo}")
    print(f" prompt: {prompt}")
    print(f" hotswap: {hotswap}")

    # Start timing for the entire iteration
    start_time = time.time()

    # Load LoRA weights
    pipe.load_lora_weights(lora_repo, hotswap=hotswap, adapter_name="default_0")
    print_rank_scaling(pipe)

    # Time image generation
    generator = torch.Generator(device="cuda").manual_seed(42)
    generate_start_time = time.time()
    image = pipe(prompt, num_inference_steps=50, generator=generator).images[0]
    generate_time = time.time() - generate_start_time

    # Save the image
    image.save(f"yiyi_test_3_out_{branch}_lora{i}.png")

    # Unload LoRA weights
    pipe.unload_lora_weights()

    # Calculate and print total time for this iteration
    total_time = time.time() - start_time
    print(f"Image generation time: {generate_time:.2f} seconds")
    print(f"Total time for LoRA {i}: {total_time:.2f} seconds")

mem_bytes = torch.cuda.max_memory_allocated()
print(f"total Memory: {mem_bytes/(1024*1024):.3f} MB")
```
Confirmed the outputs are the same as on main.
Very cool! Could you also try logging the traces just to confirm it does not trigger any recompilation? `TORCH_LOGS="guards,recompiles" TORCH_LOGS_OUT=traces.txt python my_code.py`
I did, and it doesn't.
Also, I think from the user experience perspective it might be more convenient to have a "hotswap" mode: once it's on, everything is hot-swapped by default. I think it is not something you toggle on and off, no? Maybe a question for @apolinario.
I think that is the case, yes! I also agree that the ability to hot-swap LoRAs (with different ranks and scalings) would make the feature a lot more impactful. But just in case it becomes a memory problem, users can explore the LoRA resizing path to bring everything down to a small unified rank (if it doesn't lead to too much quality degradation).
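For context, with the hotswap argument exercised in the test script above, an explicit swap currently looks roughly like this (repos and prompts reused from that script; exact argument handling may still change):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# First adapter: regular load; compilation happens on the first inference call.
pipe.load_lora_weights("Norod78/sd15-megaphone-lora", adapter_name="default_0")
pipe("Marge Simpson holding a megaphone in her hand", num_inference_steps=30)

# Second adapter: hot-swapped into the same adapter slot, so no recompilation is expected.
pipe.load_lora_weights("Norod78/SD15-Rubber-Duck-LoRA", hotswap=True, adapter_name="default_0")
pipe("The girl with a pearl earring Rubber duck", num_inference_steps=30)
```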
Update more methods with hotswap argument:
- SDXL
- SD3
- Flux

No changes were made to load_lora_into_transformer.
That's great to hear. I'm just not that familiar with what is highly demanded, so I asked. @sayakpaul I updated the PR with
Oh, Rest of the changes are fine. Thank you!
Add hotswap argument to load_lora_into_transformer: for SD3 and Flux. Use shorter docstring for brevity.
@sayakpaul I updated the PR. There was also a merge conflict caused by #10187. I did my best to resolve it, but please double check. I think as a consequence of that PR, there is now an error about requiring prefix=None.

Edit: Some hotswap tests are failing locally, I'll investigate.
Full docstring is preferred.
```python
@slow
@require_torch_2
@require_torch_accelerator
@require_peft_backend
```
Should we also impose a minimum peft version?
Rest of the changes are good.
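If a minimum version is imposed, a guard along these lines would do it. This is an illustrative sketch; the decorator name and exact threshold are assumptions, not necessarily what diffusers ends up using:

```python
import unittest
from importlib.metadata import PackageNotFoundError, version

from packaging.version import parse


def require_peft_version_greater(min_version: str):
    """Skip a test unless the installed peft is newer than `min_version`."""
    def decorator(test_case):
        try:
            ok = parse(version("peft")) > parse(min_version)
        except PackageNotFoundError:
            ok = False
        return unittest.skipUnless(ok, f"test requires peft > {min_version}")(test_case)
    return decorator
```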
@sayakpaul I extended the docstrings as discussed and added PEFT version guards to the tests. Some tests are still failing locally for me, the cause is most likely this. For that reason, I also haven't applied make fix-copies yet.
Replied.
Thanks, now the tests are passing again!

Should I run make fix-copies?
I think this should be fine. Ccing @stevhliu to decide where this should go in the docs.
Nice! Let's add the hotswap feature to this LoRA section.
@yiyixuxu could you review this once more? The failing tests are unrelated.
Super nice, thank you @BenjaminBossan!
* [WIP][LoRA] Implement hot-swapping of LoRA (the full description is in the PR body above)
* Reviewer feedback
* Reviewer feedback, adjust test
* Fix, doc
* Make fix
* Fix for possible g++ error
* Add test for recompilation w/o hotswapping
* Make hotswap work. Requires huggingface/peft#2366. More changes to make hotswapping work; together with the mentioned PEFT PR, the tests pass for me locally. List of changes:
  - docstring for hotswap
  - remove code copied from PEFT, import from PEFT now
  - adjustments to PeftAdapterMixin.load_lora_adapter (unfortunately, some state dict renaming was necessary, LMK if there is a better solution)
  - adjustments to UNet2DConditionLoadersMixin._process_lora: LMK if this is even necessary or not, I'm unsure what the overall relationship is between this and PeftAdapterMixin.load_lora_adapter
  - also in UNet2DConditionLoadersMixin._process_lora, I saw that there is no LoRA unloading when loading the adapter fails, so I added it there (in line with what happens in PeftAdapterMixin.load_lora_adapter)
  - rewritten tests to avoid shelling out, make the test more precise by making sure that the outputs align, parametrize it
  - also checked the pipeline code mentioned in this comment: huggingface#9453 (comment); when running this inside the with torch._dynamo.config.patch(error_on_recompile=True) context, there is no error, so I think hotswapping is now working with pipelines
* Address reviewer feedback: revert deprecated method, fix PEFT doc link to main, don't use private function, clarify magic numbers, add pipeline test. Moreover: extend docstrings, extend existing test for outputs != 0, extend existing test for wrong adapter name
* Change order of test decorators: parameterized.expand seems to ignore skip decorators if added in last place (i.e. innermost decorator)
* Split model and pipeline tests; also increase test coverage by also targeting conv2d layers (support of which was added recently on the PEFT PR)
* Reviewer feedback: move decorator to test classes instead of having them on each test method
* Apply suggestions from code review (Co-authored-by: hlky <[email protected]>)
* Reviewer feedback: version check, TODO comment
* Add enable_lora_hotswap method
* Reviewer feedback: check _lora_loadable_modules
* Revert changes in unet.py
* Add possibility to ignore enabled at wrong time
* Fix docstrings
* Log possible PEFT error, test
* Raise helpful error if hotswap not supported, i.e. for the text encoder
* Formatting
* More linter
* More ruff
* Doc-builder complaint
* Update docstring: mention no text encoder support yet, make it clear that LoRA is meant, mention that the same adapter name should be passed
* Fix error in docstring
* Update more methods with hotswap argument (SDXL, SD3, Flux). No changes were made to load_lora_into_transformer.
* Add hotswap argument to load_lora_into_transformer for SD3 and Flux. Use shorter docstring for brevity.
* Extend docstrings
* Add version guards to tests
* Formatting
* Fix LoRA loading call to add prefix=None. See: huggingface#10187 (comment)
* Run make fix-copies
* Add hot swap documentation to the docs
* Apply suggestions from code review (Co-authored-by: Steven Liu <[email protected]>)

---------

Co-authored-by: Benjamin Bossan <[email protected]>
Co-authored-by: Sayak Paul <[email protected]>
Co-authored-by: hlky <[email protected]>
Co-authored-by: YiYi Xu <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
Anything else from my side that needs doing? :) Before merging, I'd suggest running the full test suite to ensure nothing was broken (I'm not sure how much is covered by the standard CI being run on PRs).