RAM offloading like AutoGPTQ? #109
manyotherfunctions started this conversation in Ideas

Are there any plans to add the ability to split a model between VRAM and system RAM, like AutoGPTQ does? For example, the oobabooga webui, through AutoGPTQ, lets you load even a 65B-parameter model on an 8GB VRAM GPU, where only about 1GB is loaded in VRAM and the rest sits in system RAM.
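For concreteness, here is a minimal sketch of the kind of split being asked about, using the accelerate-style `max_memory` map that AutoGPTQ and the oobabooga webui build on. The model id and memory caps are placeholders for illustration, not values from this thread.

```python
# Sketch of an accelerate-style VRAM/RAM split: cap GPU 0 at ~1 GiB and let
# the remaining weights spill over into system RAM. Model id and memory caps
# are illustrative placeholders.
from transformers import AutoModelForCausalLM

model_id = "facebook/opt-1.3b"  # placeholder model

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                        # let accelerate place layers
    max_memory={0: "1GiB", "cpu": "48GiB"},   # VRAM cap on GPU 0, rest in RAM
)

# Shows which layers ended up on the GPU and which were offloaded to CPU RAM.
print(model.hf_device_map)
```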
Replies: 1 comment
- This would be horrendously slow without a dedicated CPU inference engine. And Llama.cpp is already very well optimized for this use case.
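For reference, the partial offload llama.cpp already supports can be driven from Python through the llama-cpp-python binding, where `n_gpu_layers` controls how many transformer layers stay in VRAM (the same knob as the CLI's `--n-gpu-layers` flag). The model path and layer count below are placeholders.

```python
# Sketch of llama.cpp-style layer offloading via llama-cpp-python:
# n_gpu_layers layers are kept in VRAM, the rest stay in system RAM and run
# on the CPU backend. Path and counts are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-65b.q4_0.gguf",  # placeholder path
    n_gpu_layers=20,   # layers held in VRAM; remaining layers run on CPU
    n_ctx=2048,
)

out = llm("Q: Why offload layers to system RAM? A:", max_tokens=32)
print(out["choices"][0]["text"])
```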