Model Parallel assertion error #17

NickLucche · 2022-10-02T09:39:41Z

I put up a g4dn.12xlarge instance with 4 T4's, tried a command but ended up with AssertionError :/

[ec2-user@ip ~]$ docker run --name stable-diffusion --gpus all -it -e DEVICES=0,1,2,3 -e MODEL_PARALLEL=1 -e TOKEN=token -p 7860:7860 nicklucche/stable-diffusion:multi-gpu Loading model.. Looking for a valid assignment in which to split model parts to device(s): [0, 1, 2, 3] Free GPU memory (per device): [8665, 8665, 8665, 8665] Search has found that 17 model(s) can be split over 4 device(s)! Assignments: [{0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}] Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.34k/1.34k [00:00<00:00, 739kB/s] Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.5k/12.5k [00:00<00:00, 12.9MB/s] Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 342/342 [00:00<00:00, 182kB/s] Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 543/543 [00:00<00:00, 307kB/s] Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.63k/4.63k [00:00<00:00, 2.48MB/s] Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 608M/608M [00:07<00:00, 77.8MB/s] Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 209/209 [00:00<00:00, 117kB/s] Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 209/209 [00:00<00:00, 122kB/s] Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 572/572 [00:00<00:00, 317kB/s] Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 246M/246M [00:03<00:00, 72.5MB/s] Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 58.8MB/s] Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 472/472 [00:00<00:00, 563kB/s] Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 788/788 [00:00<00:00, 1.07MB/s] Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 62.3MB/s] Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 772/772 [00:00<00:00, 1.07MB/s] Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.72G/1.72G [00:22<00:00, 75.2MB/s] Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 71.2k/71.2k [00:00<00:00, 37.7MB/s] Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 550/550 [00:00<00:00, 300kB/s] Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 167M/167M [00:02<00:00, 74.1MB/s] Traceback (most recent call last): File "server.py", line 9, in <module> from main import inference, MP as model_parallel File "/app/main.py", line 55, in <module> n_procs, devices, model_parallel_assignment=model_ass, **kwargs File "/app/parallel.py", line 149, in from_pretrained assert d AssertionError

It's loading a lot of models, 17 in fact. Might that be the culprit?

Anyways, if I can participate in testing or help in any way, I'm here to do so :)
Also wondering why it says only 8665MB of free memory when nvidia-smi told me I had 15360MiB per GPU free just before that.

Originally posted by @huotarih in #8 (comment)

The text was updated successfully, but these errors were encountered:

NickLucche · 2022-10-02T09:51:18Z

Ok I've taken a first look at this bug, it seems indeed memory related (it shouldn't try to allocate 17 models given the memory capacity).
I'll take some more time to dig into that and probably refactor this model parallel feature, I think it has gotten far from the intended usage to support small GPUs and I've probably tried to optimize some behavior that shouldn't be present.

Meanwhile, I suggest you use the standard "Data Parallel" mode that's more stable and much more suitable for a setup in which you have GPUs with same memory size! Feel free to report issues on that if you find any.

To do so, simply run the same command omitting the -e MODEL_PARALLEL=1 option.

docker run --name stable-diffusion --gpus all -it -e DEVICES=0,1,2,3 -e TOKEN=token -p 7860:7860 nicklucche/stable-diffusion:latest
This should spawn 4 workers that evenly share the workload and easily provide a speedup of 4, up to memory limits (e.g. if you generate 4 images it will take you the same time as generating one).

Thanks once again for submitting bug reports!

mchaker · 2022-10-02T16:37:26Z

I'm still interested in the smaller or heterogeneous GPU feature! I think it would be great to be able to use my 3070s too (and good for hobbyists) 😊

NickLucche · 2022-10-19T09:14:23Z

Sorry for the long delay, I fixed most of things here more than a week ago, but I'm still dealing with a nasty bug that I haven't quite figured out yet (tensor sizes don't match after wrapping the different components).
I hope I can get back to you with good news soon!

mchaker · 2022-10-19T09:54:51Z

Sending good energy and support!! :)

NickLucche added the bug Something isn't working label Oct 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Model Parallel assertion error #17

Model Parallel assertion error #17

NickLucche commented Oct 2, 2022 •

edited

Loading

NickLucche commented Oct 2, 2022

mchaker commented Oct 2, 2022

NickLucche commented Oct 19, 2022

mchaker commented Oct 19, 2022

Model Parallel assertion error #17

Model Parallel assertion error #17

Comments

NickLucche commented Oct 2, 2022 • edited Loading

NickLucche commented Oct 2, 2022

mchaker commented Oct 2, 2022

NickLucche commented Oct 19, 2022

mchaker commented Oct 19, 2022

NickLucche commented Oct 2, 2022 •

edited

Loading