server: Add timeout to stop the server automatically when idling for too long. #10742
base: master
Conversation
Add standby-timeout: a timeout for automatically terminating the server after it has been unused for a certain amount of time.
This is not a clean implementation and I would say that it's not very useful for most users. The problem is that after the timeout passes, the server shuts itself down and the user still needs to re-launch it manually if they want to use it again.
Can you elaborate? I don't feel that there is anything particularly smelly in this PR; can you show the code pieces you mean by that?
I mean, this is kind of the idea. The software driving the server is supposed to start it; if the user started the server by themselves, I think it's fair to assume that they will start the server again manually. If you want something to automatically start the server, then you end up at what Ollama is doing, which requires wrapping their own API around llama.cpp's entirely because they can't monitor requests otherwise. This is the same problem that I outlined in the OP.
Here are comments on why I think the implementation is not clean.
The software driving the server is supposed to start it; if the user started the server by themselves, I think it's fair to assume that they will start the server again manually
Then why not simply delegate the task of stopping the server after a certain amount of time to "the software driving the server" that you mentioned?
If you want something to automatically start the server then you end up at what Ollama is doing
I would say that this is preferable and what most users want. Having the server auto stop/start itself makes more sense than having it automatically stop and the user having to manually start it again. That's bad UX and no one is going to use it.
Because the software driving the server is unable to get that information unless you wrap the entire API. Also, this would force an application to terminate the server when it terminates. For terminal applications, being restarted is a common thing to happen. You don't want to wait 10+ seconds before talking to the LLM every time you Ctrl+C'd it to do something else.
I don't see why it can't. A reverse proxy can know whether a connection is still on-going or not, and it can terminate the server after an amount of time since the last request ended.
Then why shouldn't it terminate the server when it (the parent process) is itself terminated? Why do you want to leave a zombie process in the system?
It's only common if the restart is done automatically. Again, I can't see anyone needing this feature unless the server can re-load itself, which, as you said, is the same thing Ollama is doing.
A zombie/defunct process is not what you are talking about. Those happen when the Linux kernel has an open reference to a task_struct because somebody called get_task_struct but not put_task_struct. What you mean is a disowned, unused process. But even then you are blatantly ignoring what the PR is doing: the idea is that when the server becomes unused, it stops itself to save on resources. So essentially this PR is handling the state you are talking about. Also, starting the server can take some time, especially for big models, even more so if the hard drive the model resides on isn't the fastest, so there is a legitimate interest in keeping models loaded.
There are 2 options for how this feature is used. Either a human wants the server to shut down and specifies it explicitly (in which case, why would it ever restart?), or it is used by another program, in which case the logic looks something like this:

```python
def request_completion(text):
    if not is_llama_cpp_running() or not is_parameters_equal():
        start_llama_cpp()
    # do the request
```

If the server shuts down and the program still wants it to be alive, then it will be restarted automatically.
A reverse proxy sounds like it would work, but quickly falls apart once you realize the implications. Another way of doing the reverse proxy would be to make it start the server when it gets a request; this way you don't need to check whether the server is running or not. Apart from the horrible response time on the first request, which is something Ollama actually fails to account for and thus requires your read timeouts to be several minutes, the even bigger problem is that you aren't even solving the 2-different-clients problem.

Now you have the choice: do you want to implement one of these (broken) systems, or are you fine with adding 10 lines of code to the server that make it handle this problem properly itself? With this PR you can make a client simply check the options from the process arguments (something that you can query on Linux, FreeBSD, Windows and probably macOS as well, see the sketch below) and then use the already existing server if it is the same one you wanted to start anyway, or, if it isn't started yet, start it yourself. You can then forget about it, because it will terminate itself, but if it ever does go offline and you actually still need it, you just restart it.
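For what it's worth, here is a minimal sketch of that "check the process arguments" idea on Linux, walking /proc and reading each process's cmdline. The helper names (read_cmdline, matching_server_running) and the naive matching logic are my own illustration and not part of this PR; FreeBSD, Windows and macOS would need their own mechanisms.

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

// Return the argv of a process read from /proc/<pid>/cmdline
// (entries are separated by NUL bytes).
static std::vector<std::string> read_cmdline(const std::filesystem::path & proc_dir) {
    std::vector<std::string> argv;
    std::ifstream f(proc_dir / "cmdline", std::ios::binary);
    std::string   arg;
    for (char c; f.get(c); ) {
        if (c == '\0') { argv.push_back(arg); arg.clear(); }
        else           { arg.push_back(c); }
    }
    if (!arg.empty()) { argv.push_back(arg); }
    return argv;
}

// True if some running process looks like a llama-server started with the
// wanted arguments. The comparison is deliberately naive: every wanted
// argument just has to appear somewhere in the existing command line.
static bool matching_server_running(const std::vector<std::string> & wanted_args) {
    for (const auto & entry : std::filesystem::directory_iterator("/proc")) {
        const std::string name = entry.path().filename().string();
        if (name.find_first_not_of("0123456789") != std::string::npos) {
            continue; // not a numeric PID directory
        }
        const auto argv = read_cmdline(entry.path());
        if (argv.empty() || argv[0].find("llama-server") == std::string::npos) {
            continue; // some other process
        }
        bool all_found = true;
        for (const auto & a : wanted_args) {
            if (std::find(argv.begin(), argv.end(), a) == argv.end()) {
                all_found = false;
                break;
            }
        }
        if (all_found) { return true; }
    }
    return false;
}
```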
Well, I would, otherwise I wouldn't have added it. I commonly forget about llama.cpp servers that I started in tmux until some program tells me that I only have 500 megabytes of VRAM left. Also, I would prefer not to import my GGUFs into Ollama just to use them in a chat UI, and not to wait 10s every time I start my chat UI because I need to wait for the server to start and load the model. I'm sorry for making this so long, but it appears that you have fundamentally misunderstood why it is needed.
Yes, it does. The OS process(es) never exit on your phone. It waits for the interrupt from the power button to turn on the screen. Even when your screen is off, the CPU is still working to some extent. The same can be said for this PR: what you are doing is exit the whole server process once the deadline has passed. This is equivalent to powering off your phone, which "exits" the OS. So let me ask: do you power your phone on/off each time you use it, or do you put it in "standby" mode? I don't want to discuss this PR further; there is no point for me to continue. If any other contributors/collaborators want to take over this PR, please feel free to do so. Thank you. P/s: if this PR ever gets merged, please at least change the param
This adds a new feature. I called the new param "standby-timeout" because it basically does what standby does on phones: if you don't use it for a long time, it turns itself off. By default this feature is disabled, as timeouts <= 0 imply it being disabled.
On one hand this allows for better resource management and helps with forgotten server instances, as I often launch them in a tmux session and forget about them, then wonder why my GPU only has a tenth of its usual VRAM available.

On the other hand it also allows applications to use the server as their way of communicating with llama.cpp. Basically, a program can check whether llama-server is running and, if it is not, run it with a standby-timeout of, let's say, 10 minutes and use llama.cpp as usual (a rough sketch of this client-side flow follows below). The program may be restarted several times without the server stopping. This is common for terminal applications/terminal chat UIs. And if the user hasn't used the chat UI within the 10 minutes defined earlier, then llama.cpp will shut down and the next start-up of the chat UI will take a bit longer.

Before, a chat UI would have been required to add a watchdog process that shuts down llama.cpp manually, and on top of that it would somehow need to communicate with that watchdog process to inform it about the requests that llama.cpp receives, because it can't really look into the server queue by itself (except maybe if it acts like a debugger and reads the server's memory).
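As a rough sketch of that client-side flow (not part of this PR): the helper below assumes the option is exposed as a `--standby-timeout` flag taking a value in seconds, and `ensure_server`, `already_running` and `model.gguf` are placeholders.

```cpp
// Illustrative client-side startup logic: reuse a running llama-server if one
// exists, otherwise launch one that will stop itself after 10 idle minutes.
// The flag spelling "--standby-timeout" and its unit are assumptions based on
// this PR's description; check the final option parsing for the exact form.
#include <csignal>
#include <cstdio>
#include <unistd.h>

static void ensure_server(bool already_running) {
    if (already_running) {
        return; // reuse the existing instance; it shuts itself down when idle
    }
    std::signal(SIGCHLD, SIG_IGN); // let the kernel reap the child automatically
    pid_t pid = fork();
    if (pid == 0) {
        // Child: start llama-server with a 10 minute standby timeout.
        execlp("llama-server", "llama-server",
               "-m", "model.gguf",         // placeholder model path
               "--standby-timeout", "600", // assumed flag, seconds
               (char *) nullptr);
        std::perror("execlp");             // only reached if exec failed
        _exit(1);
    }
    // Parent: continue as usual; the server outlives this program, which is
    // what makes restarting a terminal chat UI cheap.
}
```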
I have implemented this by using wait_for instead of wait on the std::condition_variable. I must admit that I'm not very familiar with this part of the STL. It seems to do what I want based on my testing, but it would be nice to hear from someone familiar with the condition_variable API that this isn't completely wrong.

I tried to implement this without breaking anything running in production, which is why it is turned off by default. I also added a way of specifying shutdown reasons in the shutdown handler: there is termination_signal, which was the previous behavior, and now there is also standby_timeout as one of the reasons. The shutdown_handler didn't use the signal before, but in case there are patches that people maintain locally, it should be fairly simple to adjust them by simply doing a std::holds_alternative.
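To make the wait_for / holds_alternative part concrete, here is a minimal, self-contained sketch of the pattern. This is not the PR's actual code: `queue`, `running` and `standby_timeout_seconds` are illustrative names, and the concrete types used for the shutdown reasons in the PR may differ.

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <variant>

std::mutex              mtx;
std::condition_variable cv;
std::deque<int>         queue;                         // pending tasks (placeholder type)
bool                    running = true;
int                     standby_timeout_seconds = 600; // <= 0 means the feature is disabled

// Shutdown reasons as a variant, mirroring the idea described above; the real
// names/types in the PR may differ.
struct termination_signal { int signum; };
struct standby_timeout    {};
using shutdown_reason = std::variant<termination_signal, standby_timeout>;

void shutdown_handler(const shutdown_reason & reason) {
    if (std::holds_alternative<standby_timeout>(reason)) {
        // the server stopped itself because it idled past the standby timeout
    }
}

void worker_loop() {
    std::unique_lock<std::mutex> lock(mtx);
    while (running) {
        bool have_work = true;
        if (standby_timeout_seconds > 0) {
            // Wait until a task arrives or the standby timeout elapses.
            have_work = cv.wait_for(lock, std::chrono::seconds(standby_timeout_seconds),
                                    [] { return !queue.empty() || !running; });
        } else {
            // Timeout disabled: behave like the old code and wait indefinitely.
            cv.wait(lock, [] { return !queue.empty() || !running; });
        }
        if (!have_work) {
            // wait_for returned false: the predicate was still false when the
            // timeout expired, i.e. the server was idle for the whole period.
            running = false;
            shutdown_handler(standby_timeout{});
            break;
        }
        while (!queue.empty()) {
            queue.pop_front(); // process the task here
        }
    }
}
```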