A PyTorch library that allows tensor memory to be temporarily released and resumed later.
Please refer to sgl-project/sglang#2542 (comment) for details.
import torch
from torch_memory_saver import torch_memory_saver  # the package's global saver instance (import path assumed)

# 1. For tensors that should be pauseable, create them within `region`
with torch_memory_saver.region():
    pauseable_tensor = torch.full((1_000_000_000,), 100, dtype=torch.uint8, device='cuda')
# 2. After `pause`, CUDA memory is released for those tensors.
# For example, check `nvidia-smi`'s memory usage to verify.
torch_memory_saver.pause()
# 3. After `resume`, CUDA memory is re-occupied for those tensors.
torch_memory_saver.resume()
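To verify the release programmatically instead of eyeballing `nvidia-smi`, you can query the device's used memory through NVML. A minimal sketch, assuming the `pynvml` package is installed (it reports the same device-wide numbers as `nvidia-smi`):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_mib():
    # Device-wide used memory in MiB, as `nvidia-smi` would report it
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / (1024 * 1024)

before = used_mib()
torch_memory_saver.pause()
print(f"freed ~{before - used_mib():.0f} MiB")  # roughly the size of the pauseable tensor
torch_memory_saver.resume()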
While a tensor is paused, its physical memory is released but its virtual addresses are preserved. On resume, the virtual addresses stay unchanged while physical memory is allocated again.
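Because the virtual addresses survive a pause/resume cycle, existing references to the tensor remain valid, which is what makes the tensors compatible with CUDA graphs (see below). A small sketch, reusing `pauseable_tensor` from the example above:

# The tensor's device pointer is the same before and after a pause/resume cycle
ptr_before = pauseable_tensor.data_ptr()
torch_memory_saver.pause()
torch_memory_saver.resume()
assert pauseable_tensor.data_ptr() == ptr_before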
Please refer to sgl-project/sglang#7009 for details.
# 1. Create tensors with different tags
with torch_memory_saver.region(tag="type1"):
    tensor1 = torch.full((5_000_000_000,), 100, dtype=torch.uint8, device='cuda')

with torch_memory_saver.region(tag="type2"):
    tensor2 = torch.full((5_000_000_000,), 100, dtype=torch.uint8, device='cuda')
# 2. Pause and resume with different tags selectively
torch_memory_saver.pause("type1")
torch_memory_saver.pause("type2")
torch_memory_saver.resume("type2")
torch_memory_saver.resume("type1")
torch_memory_saver.pause("type1")
torch_memory_saver.resume("type1")
Not only does torch_memory_saver make tensors compatible with CUDA graphs, it can also release the memory held by the CUDA graph itself (i.e. the intermediate tensors).
API: change `torch.cuda.graph(...)` to `torch_memory_saver.cuda_graph(...)`.
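A minimal sketch of the swap; it assumes `torch_memory_saver.cuda_graph(...)` is used as a context manager around capture, taking the `torch.cuda.CUDAGraph` object the same way `torch.cuda.graph(...)` does:

static_input = torch.zeros((1024,), device='cuda')
g = torch.cuda.CUDAGraph()

# Capture through torch_memory_saver instead of torch.cuda.graph, so that the
# graph's intermediate buffers can also be paused/resumed later
with torch_memory_saver.cuda_graph(g):
    static_output = static_input * 2

g.replay()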
By default, to save time, the tensor content is thrown away on pause. This is fine for data that is about to become stale anyway, for example a KV cache that will be invalidated or model weights that will be updated. If you need the content to be preserved across pause/resume, use `enable_cpu_backup`.
with torch_memory_saver.region(enable_cpu_backup=True):
    tensor1 = torch.full((5_000_000_000,), 42, dtype=torch.uint8, device='cuda')
torch_memory_saver.pause()
torch_memory_saver.resume()
assert tensor1[0] == 42, "content is kept unchanged"
There are two hook modes:
- preload: Use `LD_PRELOAD` to hook CUDA's malloc and free APIs and change the allocation behavior.
- torch: Use torch's custom allocator API to change the allocation behavior.
The mode can be chosen by:
torch_memory_saver.hook_mode = "torch"
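A short sketch of using the torch hook mode; setting `hook_mode` before the first `region` is entered is an assumption here, since the allocator hook presumably needs to be in place before the first pauseable allocation:

torch_memory_saver.hook_mode = "torch"  # use torch's custom allocator API instead of LD_PRELOAD

with torch_memory_saver.region():
    t = torch.empty((1_000_000,), dtype=torch.uint8, device='cuda')

torch_memory_saver.pause()
torch_memory_saver.resume()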
Please refer to rl_example.py for details.
To reinstall the package locally (e.g. during development):

make reinstall
You can use this command for local testing:
pytest /path/to/torch_memory_saver/test
Or this one to test a single case (e.g. the simple one here):
pytest /path/to/torch_memory_saver/test/test_examples.py::test_simple -s