server: Add timeout to stop the server automatically when idling for too long. #10742
base: master
Conversation
Add standby-timeout: a timeout for automatically terminating the server after it has been unused for a certain amount of time.
This is not a clean implementation and I would say that it's not very useful for most users. The problem is that after the timeout passes, the server shuts itself down and the user still needs to re-launch it manually if they want to use it again.
Can you elaborate? I don't feel that there is anything particularly smelly in this PR; can you show the code pieces you mean by that?
I mean, this is kind of the idea. The software driving the server is supposed to start it; if the user started the server by themselves, I think it's fair to assume that they will start the server again manually. If you want something to automatically start the server, then you end up at what Ollama is doing, which requires wrapping their own API around llama.cpp's entirely because they can't monitor requests otherwise. This is the same problem that I outlined in the OP.
Here are comments on why I think the implementation is not clean.
The software driving the server is supposed to start it; if the user started the server by themselves, I think it's fair to assume that they will start the server again manually
Then why not simply delegate the task of stopping the server after a certain amount of time to "the software driving the server" that you mentioned?
If you want something to automatically start the server then you end up at what Ollama is doing
I would say that this is preferable and what most users want. Having the server auto stop/start itself makes more sense than having it automatically stop and the user having to manually start it again. That's bad UX and no one is going to use it.
Because the software driving the server is unable to get that information unless you wrap the entire API. Also, this would force an application to terminate the server when it terminates. For terminal applications, being restarted is a common thing to happen. You don't want to wait 10+ seconds before talking to the LLM every time you Ctrl+C'd it to do something else.
I don't see why it can't. A reverse proxy can know whether a connection is still on-going or not, and it can terminate the server after an amount of time since the last request ended.
Then why shouldn't it terminate the server when it (the parent process) is itself terminated? Why do you want to leave a zombie process in the system?
It's only common if the restart is done automatically. Again, I can't see anyone needing this feature unless the server can re-load itself, which, as you said, is the same thing Ollama is doing.
A zombie/defunct process is not what you are talking about. Those happen when the Linux kernel has an open reference to a task_struct because somebody called get_task_struct but not put_task_struct. What you mean is a disowned, unused process. But even then you are blatantly ignoring what the PR is doing: the idea is that when the server becomes unused, it stops itself to save on resources. So essentially this PR is handling the state you are talking about. Also, starting the server can take some time, especially for big models, even more so if the hard drive the model resides on isn't the fastest, so there is a legitimate interest in keeping models loaded.
There are 2 options for how this feature is used. Either a human wants the server to shut down and specifies it explicitly (in which case, why would it ever restart?), or it is used by another program, in which case the logic looks something like this:

```python
def request_completion(text):
    if not is_llama_cpp_running() or not is_parameters_equal():
        start_llama_cpp()
    # do the request
```

If the server shuts down and the program still wants it to be alive, then it will be restarted automatically.
A reverse proxy sounds like it would work, but quickly falls apart once you realize the implications. Another way of doing the reverse proxy would be to make it start the server when it gets a request; this way you don't need to check whether the server is running or not. Apart from the horrible response time on the first request, which is something Ollama actually fails to account for and thus requires your read timeouts to be several minutes, the even bigger problem is that you aren't even solving the 2-different-clients problem.

Now you have the choice: do you want to implement one of these (broken) systems, or are you fine with adding 10 lines of code to the server that make it handle this problem properly itself? With this PR you can make a client simply check the options from the process arguments (something that you can query on Linux, FreeBSD, Windows and probably macOS as well, see the sketch below) and then use the already existing server if it is the same one you wanted to start anyway, or, if it isn't started yet, start it yourself. You can then forget about it, because it will terminate itself, but if it ever does go offline and you actually still need it, you just restart it.
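For what it's worth, here is a minimal sketch of that "check the process arguments" idea on Linux, walking /proc and reading each process's cmdline. The helper names (read_cmdline, matching_server_running) and the naive matching logic are my own illustration and not part of this PR; FreeBSD, Windows and macOS would need their own mechanisms.

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

// Return the argv of a process read from /proc/<pid>/cmdline
// (entries are separated by NUL bytes).
static std::vector<std::string> read_cmdline(const std::filesystem::path & proc_dir) {
    std::vector<std::string> argv;
    std::ifstream f(proc_dir / "cmdline", std::ios::binary);
    std::string   arg;
    for (char c; f.get(c); ) {
        if (c == '\0') { argv.push_back(arg); arg.clear(); }
        else           { arg.push_back(c); }
    }
    if (!arg.empty()) { argv.push_back(arg); }
    return argv;
}

// True if some running process looks like a llama-server started with the
// wanted arguments. The comparison is deliberately naive: every wanted
// argument just has to appear somewhere in the existing command line.
static bool matching_server_running(const std::vector<std::string> & wanted_args) {
    for (const auto & entry : std::filesystem::directory_iterator("/proc")) {
        const std::string name = entry.path().filename().string();
        if (name.find_first_not_of("0123456789") != std::string::npos) {
            continue; // not a numeric PID directory
        }
        const auto argv = read_cmdline(entry.path());
        if (argv.empty() || argv[0].find("llama-server") == std::string::npos) {
            continue; // some other process
        }
        bool all_found = true;
        for (const auto & a : wanted_args) {
            if (std::find(argv.begin(), argv.end(), a) == argv.end()) {
                all_found = false;
                break;
            }
        }
        if (all_found) { return true; }
    }
    return false;
}
```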
Well, I would, otherwise I wouldn't have added it. I commonly forget about llama.cpp servers that I started in tmux until some program tells me that I only have 500 megabytes of VRAM left. Also, I would prefer not to import my GGUFs into Ollama just to use them in a chat UI, and not to wait 10s every time I start my chat UI because I need to wait for the server to start and load the model. I'm sorry for making this so long, but it appears that you have fundamentally misunderstood why it is needed.
Yes, it does. The OS process(es) never exit on your phone. It waits for the interrupt from the power button to turn on the screen. Even when your screen is off, the CPU is still working to some extent. The same can be said for this PR: what you are doing is exit the whole server process once the deadline has passed. This is equivalent to powering off your phone, which "exits" the OS. So let me ask: do you power your phone on/off each time you use it, or do you put it in "standby" mode? I don't want to discuss this PR further; there is no point for me to continue. If any other contributors/collaborators want to take over this PR, please feel free to do so. Thank you. P/s: if this PR ever gets merged, please at least change the param
This adds a new feature. I called the new param "standby-timeout" because it basically does what standby does on phones: if you don't use it for a long time, it turns itself off. By default this feature is disabled, as timeouts <= 0 imply it being disabled.
On one hand this allows for better resource management and helps with forgotten server instances, as I often launch them in a tmux session and forget about them, then wonder why my GPU only has a tenth of its usual VRAM available.

On the other hand it also allows applications to use the server as their way of communicating with llama.cpp. Basically, a program can check whether llama-server is running and, if it is not, run it with a standby-timeout of, let's say, 10 minutes and use llama.cpp as usual (a rough sketch of this client-side flow follows below). The program may be restarted several times without the server stopping. This is common for terminal applications/terminal chat UIs. And if the user hasn't used the chat UI within the 10 minutes defined earlier, then llama.cpp will shut down and the next start-up of the chat UI will take a bit longer.

Before, a chat UI would have been required to add a watchdog process that shuts down llama.cpp manually, and on top of that it would somehow need to communicate with that watchdog process to inform it about the requests that llama.cpp receives, because it can't really look into the server queue by itself (except maybe if it acts like a debugger and reads the server's memory).
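As a rough sketch of that client-side flow (not part of this PR): the helper below assumes the option is exposed as a `--standby-timeout` flag taking a value in seconds, and `ensure_server`, `already_running` and `model.gguf` are placeholders.

```cpp
// Illustrative client-side startup logic: reuse a running llama-server if one
// exists, otherwise launch one that will stop itself after 10 idle minutes.
// The flag spelling "--standby-timeout" and its unit are assumptions based on
// this PR's description; check the final option parsing for the exact form.
#include <csignal>
#include <cstdio>
#include <unistd.h>

static void ensure_server(bool already_running) {
    if (already_running) {
        return; // reuse the existing instance; it shuts itself down when idle
    }
    std::signal(SIGCHLD, SIG_IGN); // let the kernel reap the child automatically
    pid_t pid = fork();
    if (pid == 0) {
        // Child: start llama-server with a 10 minute standby timeout.
        execlp("llama-server", "llama-server",
               "-m", "model.gguf",         // placeholder model path
               "--standby-timeout", "600", // assumed flag, seconds
               (char *) nullptr);
        std::perror("execlp");             // only reached if exec failed
        _exit(1);
    }
    // Parent: continue as usual; the server outlives this program, which is
    // what makes restarting a terminal chat UI cheap.
}
```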
I have implemented this by using wait_for instead of wait on the std::condition_variable. I must admit that I'm not very familiar with this part of the STL. It seems to do what I want based on my testing, but it would be nice to hear from someone familiar with the condition_variable API that this isn't completely wrong.

I tried to implement this without breaking anything running in production, which is why it is turned off by default. I also added a way of specifying shutdown reasons in the shutdown handler: there is termination_signal, which was the previous behavior, and now there is also standby_timeout as one of the reasons. The shutdown_handler didn't use the signal before, but in case there are patches that people maintain locally, it should be fairly simple to adjust them by simply doing a std::holds_alternative.
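To make the wait_for / holds_alternative part concrete, here is a minimal, self-contained sketch of the pattern. This is not the PR's actual code: `queue`, `running` and `standby_timeout_seconds` are illustrative names, and the concrete types used for the shutdown reasons in the PR may differ.

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <variant>

std::mutex              mtx;
std::condition_variable cv;
std::deque<int>         queue;                         // pending tasks (placeholder type)
bool                    running = true;
int                     standby_timeout_seconds = 600; // <= 0 means the feature is disabled

// Shutdown reasons as a variant, mirroring the idea described above; the real
// names/types in the PR may differ.
struct termination_signal { int signum; };
struct standby_timeout    {};
using shutdown_reason = std::variant<termination_signal, standby_timeout>;

void shutdown_handler(const shutdown_reason & reason) {
    if (std::holds_alternative<standby_timeout>(reason)) {
        // the server stopped itself because it idled past the standby timeout
    }
}

void worker_loop() {
    std::unique_lock<std::mutex> lock(mtx);
    while (running) {
        bool have_work = true;
        if (standby_timeout_seconds > 0) {
            // Wait until a task arrives or the standby timeout elapses.
            have_work = cv.wait_for(lock, std::chrono::seconds(standby_timeout_seconds),
                                    [] { return !queue.empty() || !running; });
        } else {
            // Timeout disabled: behave like the old code and wait indefinitely.
            cv.wait(lock, [] { return !queue.empty() || !running; });
        }
        if (!have_work) {
            // wait_for returned false: the predicate was still false when the
            // timeout expired, i.e. the server was idle for the whole period.
            running = false;
            shutdown_handler(standby_timeout{});
            break;
        }
        while (!queue.empty()) {
            queue.pop_front(); // process the task here
        }
    }
}
```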