Add low_cpu_mem_usage flag in inference test #221

lokoppakmsft · 2022-11-15T23:07:59Z

No description provided.

lokoppakmsft · 2022-11-15T23:11:19Z

poedator · 2023-08-31T15:42:20Z

Can confirm: when the model is loaded in safetensors format, this can reduce the memory usage by a factor of 5+.

When experimenting with llama2-70b, we found that memory usage before this fix climbed past 260GB per process, at which point the run OOMed. After the fix, it took <250GB in total across all processes. This is likely because safetensors can memory-map the weight files into the same OS-wide cache, so that different ranks point to the same memory.
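To illustrate the memory-mapping idea (not what safetensors does internally, which I have not checked), here is a minimal stdlib-only sketch. It maps one file read-only twice, standing in for two ranks opening the same weight file. The helper names `map_readonly` and `demo` are made up for this example, and two mappings in one process are only a stand-in for separate processes.

```python
import mmap
import os
import tempfile


def map_readonly(path):
    """Open `path` and return (file object, read-only memory map of the whole file)."""
    f = open(path, "rb")
    # Length 0 maps the entire file; ACCESS_READ keeps the mapping read-only,
    # so the OS can back it with cached file pages instead of private copies.
    return f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def demo(size=1 << 20):
    """Write a dummy 'weights' file, map it twice, and compare the two views."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(b"\x01" * size)

        # Two independent mappings of the same file, as two ranks would create.
        f_a, view_a = map_readonly(path)
        f_b, view_b = map_readonly(path)
        try:
            same = view_a[:16] == view_b[:16]
            total = len(view_a)
        finally:
            view_a.close()
            view_b.close()
            f_a.close()
            f_b.close()
        return same, total
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Neither mapping reads the file into a private buffer up front, which is consistent with the explanation above, though the actual savings depend on the OS and the loader.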

To reproduce:

deepspeed --num_gpus 4 inference-test.py --model meta-llama/Llama-2-70b-hf --batch_size 2 --dtype float16 --max_new_tokens 32 --test_performance

lokoppakmsft added 2 commits November 15, 2022 23:06

Add low_cpu_mem_usage flag in inference test

50814b2

fix hardcoded False

e1aa27c
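For readers without the diff at hand, the two commits above can be pictured roughly as follows. This is a hedged sketch, not the PR's actual code: the `store_true` form of the flag and the helper names `build_parser` and `build_load_kwargs` are assumptions. What is known is that `low_cpu_mem_usage` is a keyword argument accepted by `from_pretrained` in Hugging Face transformers.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical subset of the test script's options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True, help="model name or path")
    # Assumed to be a boolean switch; the real script may define it differently.
    parser.add_argument(
        "--low_cpu_mem_usage",
        action="store_true",
        help="load weights with reduced peak CPU memory",
    )
    return parser


def build_load_kwargs(args):
    """Forward the flag's value instead of a hardcoded False."""
    return {"low_cpu_mem_usage": args.low_cpu_mem_usage}


if __name__ == "__main__":
    args = build_parser().parse_args()
    # These kwargs would then be passed on to the model loader, e.g.
    # AutoModelForCausalLM.from_pretrained(args.model, **build_load_kwargs(args))
    print(build_load_kwargs(args))
```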
