RuntimeError: CUDA error: invalid device ordinal #3

Closed
elter-tef opened this issue Nov 15, 2022 · 14 comments · May be fixed by #16

Comments

@elter-tef

When I load the model, I get this error.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "test/env/lib/python3.9/site-packages/galai/__init__.py", line 39, in load_model
    model._load_checkpoint(checkpoint_path=get_checkpoint_path(name))
  File "test/env/lib/python3.9/site-packages/galai/model.py", line 63, in _load_checkpoint
    load_checkpoint_and_dispatch(
  File "test/env/lib/python3.9/site-packages/accelerate/big_modeling.py", line 366, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "test/env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 701, in load_checkpoint_in_model
    set_module_tensor_to_device(model, param_name, param_device, value=param)
  File "test/env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 124, in set_module_tensor_to_device
    new_value = value.to(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

dginev commented Nov 15, 2022

Trying this with

model = galai.load_model("base")

it looks like there is a device map that expects 8 GPUs, if I'm seeing this right:

{'decoder.embed_tokens': 0,
 'decoder.embed_positions': 0,
 'decoder.layer_norm': 0,
 'decoder.layers.0': 0,
 'decoder.layers.1': 0,
 'decoder.layers.2': 0,
 'decoder.layers.3': 1,
 'decoder.layers.4': 1,
 'decoder.layers.5': 1,
 'decoder.layers.6': 2,
 'decoder.layers.7': 2,
 'decoder.layers.8': 2,
 'decoder.layers.9': 3,
 'decoder.layers.10': 3,
 'decoder.layers.11': 3,
 'decoder.layers.12': 4,
 'decoder.layers.13': 4,
 'decoder.layers.14': 4,
 'decoder.layers.15': 5,
 'decoder.layers.16': 5,
 'decoder.layers.17': 5,
 'decoder.layers.18': 6,
 'decoder.layers.19': 6,
 'decoder.layers.20': 6,
 'decoder.layers.21': 7,
 'decoder.layers.22': 7,
 'decoder.layers.23': 7}
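
Each integer in that map is a CUDA device ordinal, so every entry mapped to devices 1–7 fails immediately on a single-GPU machine. A minimal sketch of the failure mode, assuming PyTorch and exactly one visible GPU (not galai-specific):

import torch

# On a single-GPU box, only ordinal 0 is valid.
print(torch.cuda.device_count())  # -> 1

x = torch.zeros(1).to("cuda:0")  # works

try:
    x.to("cuda:7")  # what the map above asks for, e.g. 'decoder.layers.23': 7
except RuntimeError as e:
    print(e)  # CUDA error: invalid device ordinal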

ZQ-Dev8 commented Nov 15, 2022

If you have fewer than the default number of GPUs (8), you have to specify how many when you load the model. Try:
model = gal.load_model(name="base", num_gpus=1)
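
For anyone copy-pasting, a fuller sketch (assuming galai is imported as gal; the generate call is as shown in the project README):

import galai as gal

# num_gpus=1 makes galai build a device map for a single GPU
# instead of the 8-way map shown above.
model = gal.load_model(name="base", num_gpus=1)
print(model.generate("The Transformer architecture"))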

dginev commented Nov 15, 2022

Thanks @dcruiz01, that worked like a charm.
Unsure if it deserves a mention in the README, but thanks for letting us know! We can probably close this issue.

@metaphorz

Confirmed. I had the same error, and num_gpus=1 resolved it.

@KnutJaegersberg

Please mention that in your documentation/README.

@KnutJaegersberg

A model size between base and standard would be nice. I think standard just barely doesn't fit on my RTX 3090.

@KnutJaegersberg

Do you offer 8-bit versions/compatibility, like BLOOM?

@KnutJaegersberg

I see, dtype='float16' does the job, sorry. Please mention that in the README. Many folks will want to try this on a local GPU as well.
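
For others landing here with a single 24 GB card, a sketch combining both workarounds from this thread (parameter names as used in the comments above; check your galai version):

import galai as gal

# float16 halves the memory footprint relative to float32,
# which is what lets the larger checkpoints fit on one consumer GPU.
model = gal.load_model(name="standard", num_gpus=1, dtype="float16")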

@KnutJaegersberg

Hmm, 8-bit would still be handy for playing with larger models. Is that possible?
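
Not in galai itself (see the maintainer's reply below), but for what it's worth: the Galactica checkpoints are also published on the Hugging Face Hub, and transformers can load them in 8-bit via bitsandbytes. A sketch under those assumptions (transformers with accelerate and bitsandbytes installed; facebook/galactica-1.3b used as an example model id):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")

# load_in_8bit quantizes the weights with bitsandbytes (LLM.int8);
# it requires a CUDA GPU and device_map="auto" for placement.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/galactica-1.3b",
    device_map="auto",
    load_in_8bit=True,
)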

zzj0402 commented Nov 18, 2022

The num_gpus parameter defaults to None.

Bachstelze commented Nov 18, 2022

> If you have fewer than the default number of GPUs (8)

Who has a default number of 8 GPUs?

ZQ-Dev8 commented Nov 18, 2022

> > If you have fewer than the default number of GPUs (8)
>
> Who has a default number of 8 GPUs?

People who work at Meta AI, probably XD

@FurkanGozukara

> If you have fewer than the default number of GPUs (8), you have to specify how many when you load the model. Try: model = gal.load_model(name="base", num_gpus=1)

Why isn't this written on the main page?

mkardas (Collaborator) commented Dec 9, 2022

galai 1.1.0 uses all available GPUs by default, which should fix the issue. You can still manually specify the number of GPUs using the num_gpus parameter. Setting num_gpus=0 (or keeping the default None when no GPUs are available) loads the model into RAM. 8-bit inference is not supported yet. Please reopen if you still experience any issues.
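
In other words, a quick sketch of the behaviors described above (assuming galai >= 1.1.0):

import galai as gal

model = gal.load_model("base")              # default: use all available GPUs
model = gal.load_model("base", num_gpus=1)  # pin to a single GPU
model = gal.load_model("base", num_gpus=0)  # no GPUs: load into RAM, run on CPU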
