Reduce peak memory for prompt_logprobs requests #907

50h100a · 2024-12-16T20:12:43Z

First order of business: make prompt_logprobs "compatible" with prefix caching.
It can't take advantage of the cache, but at least it will run.

Second order of business: reduce the peak memory usage of the samplers.
This PR reduces the memory load slightly, but not nearly enough:
On a single GPU, sampling can still take dozens of gigabytes at peak. (An 8B model at 16k was >10 GB.)
On multi-GPU, sampling is no cheaper, and there's also a colossal memory spike when gathering the logits.

Thoughts:

In this PR, some operations are split into smaller batches. Can we split the entire sampling process the same way, leaving it mostly unchanged but handling only a fixed number k of rows at a time?
No idea what the fix is for the gather spikes; deferring to @AlpinDale on that. The gather might not even be the specific issue, just where I ran out of VRAM, but something about multi-GPU is aggravating the memory peaks.
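
To make the "fixed k rows at a time" idea concrete, here is a minimal pure-Python sketch. It is not the sampler code from this PR: the helper names (`iter_row_chunks`, `log_softmax_row`, `chunked_token_logprobs`) and the list-of-lists logits layout are assumptions made for illustration, and a real version would operate on GPU tensors. The point is only that the full log-softmax exists for one chunk of rows at a time, while the per-row result that is kept is a single float.

```python
import math
from typing import Iterator, List, Sequence


def iter_row_chunks(num_rows: int, k: int) -> Iterator[range]:
    """Yield consecutive row-index ranges of at most k rows each."""
    if k <= 0:
        raise ValueError("k must be positive")
    for start in range(0, num_rows, k):
        yield range(start, min(start + k, num_rows))


def log_softmax_row(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def chunked_token_logprobs(
    logits: Sequence[Sequence[float]],  # hypothetical layout: [rows][vocab]
    token_ids: Sequence[int],           # the token actually at each row
    k: int,
) -> List[float]:
    """Return the logprob of token_ids[i] for every row i, k rows at a time."""
    out: List[float] = []
    for rows in iter_row_chunks(len(logits), k):
        # Only this chunk's [<=k][vocab] log-softmax is alive at once.
        chunk = [log_softmax_row(logits[i]) for i in rows]
        out.extend(chunk[j][token_ids[i]] for j, i in enumerate(rows))
    return out
```

The result does not depend on k, so k can be tuned purely for peak memory. This sketch does nothing about the multi-GPU gather spike.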

AlpinDale · 2024-12-19T17:35:37Z

Will probably need some restructuring after #925

50h100a added 2 commits December 16, 2024 19:47

do not use cached chunks for prompt_logprobs

bc1a2bd

reduce sampler peak memory usage

d49ead7

AlpinDale self-requested a review December 19, 2024 17:35
