
server: Add timeout to stop the server automatically when idling for too long. #10742

Open

wants to merge 7 commits into master
Conversation

Sumandora

This adds a new feature. I called the new param "standby-timeout" because it basically does what standby does on phones: if you don't use it for a long time, it turns itself off. This feature is disabled by default, since timeouts <= 0 disable it.

On one hand this allows for better resource management and helps with forgotten server instances: I often launch servers in a tmux session, forget about them, and then wonder why my GPU only has a tenth of its usual VRAM available.

On the other hand it allows applications to use the server as their way of communicating with llama.cpp. A program can check whether llama-server is running, launch it with a standby-timeout of, say, 10 minutes if it is not, and then use llama.cpp as usual. The program may be restarted several times without the server stopping, which is common for terminal applications/terminal chat UIs. If the user hasn't used the chat UI within those 10 minutes, llama.cpp shuts down and the next startup of the chat UI takes a bit longer. Previously, a chat UI would have needed a watchdog process that shuts down llama.cpp manually, and on top of that it would somehow have to inform that watchdog about the requests llama.cpp receives, because it can't look into the server queue by itself (except perhaps by acting like a debugger and reading the server's memory).

I have implemented this by using wait_for instead of wait on the std::condition_variable. I must admit that I'm not very familiar with this part of the STL. It seems to do what I want based on my testing, but it would be nice to hear from someone familiar with the condition_variable API that this isn't completely wrong. I tried to implement this without breaking anything running in production, which is why it is turned off by default. I also added a way of specifying shutdown reasons in the shutdown handler: there is termination_signal, which was the previous behavior, and now also standby_timeout as one of the reasons. The shutdown_handler didn't use the signal before, but in case people maintain local patches, they should be fairly simple to adjust with a std::holds_alternative check.
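To make the mechanism concrete, here is a minimal sketch of the idea, not the actual diff from this PR; the names `queue_tasks`, `standby_timeout_seconds`, and the exact variant types are illustrative assumptions:

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <variant>

// possible reasons passed to the shutdown handler
struct termination_signal { int signum; };
struct standby_timeout    {};
using shutdown_reason = std::variant<termination_signal, standby_timeout>;

std::mutex              mutex_tasks;
std::condition_variable condition_tasks;
std::deque<int>         queue_tasks;                   // stand-in for the real task queue
int                     standby_timeout_seconds = 600; // <= 0 disables the feature

// returns false when the standby timeout expired without a task arriving
bool wait_for_new_task() {
    std::unique_lock<std::mutex> lock(mutex_tasks);
    if (standby_timeout_seconds > 0) {
        // wait_for returns the predicate's value on wake-up, so a false
        // result means the deadline passed while the queue was still empty
        return condition_tasks.wait_for(
            lock,
            std::chrono::seconds(standby_timeout_seconds),
            [] { return !queue_tasks.empty(); });
    }
    condition_tasks.wait(lock, [] { return !queue_tasks.empty(); });
    return true;
}

void shutdown_handler(shutdown_reason reason) {
    if (std::holds_alternative<standby_timeout>(reason)) {
        // idle shutdown: release the model/VRAM and exit cleanly
    }
}
```

A locally patched shutdown_handler would only need a std::holds_alternative check like the one above to distinguish the new reason from a termination signal.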

@ngxson
Collaborator

ngxson commented Dec 10, 2024

This is not a clean implementation, and I would say it's not very useful for most users. The problem is that after the timeout passes, the server shuts itself down, and the user still needs to re-launch it manually if they want to use it again.

@Sumandora
Author

Sumandora commented Dec 10, 2024

> This is not a clean implementation

Can you elaborate? I don't feel there is anything particularly smelly in this PR; can you point to the code pieces you mean?

> and I would say it's not very useful for most users. The problem is that after the timeout passes, the server shuts itself down, and the user still needs to re-launch it manually if they want to use it again.

I mean, this is kind of the idea. The software driving the server is supposed to start it; if the user started the server themselves, I think it's fair to assume they will start it again manually. If you want something to automatically start the server, you end up with what Ollama is doing, which requires wrapping llama.cpp's API completely, because requests can't be monitored otherwise. This is the same problem I outlined in the OP.

Collaborator

@ngxson ngxson left a comment


Here are comments on why I think the implementation is not clean.

> The software driving the server is supposed to start it; if the user started the server themselves, I think it's fair to assume they will start it again manually

Then why not simply delegate the task of stopping the server after a certain amount of time to "the software driving" that you mentioned?

> If you want something to automatically start the server, you end up with what Ollama is doing

I would say that this is preferable and what most users want. Having the server auto stop/start itself makes more sense than having it stop automatically while the user has to restart it manually. That's bad UX and no one is going to use it.

examples/server/server.cpp: 4 review comments (outdated, resolved)
@Sumandora
Author

> Then why not simply delegate the task of stopping the server after a certain amount of time to "the software driving" that you mentioned?

Because the software driving the server is unable to get that information unless it wraps the entire API. Also, this would force an application to terminate the server when the application itself terminates. For terminal applications, being restarted is a common thing to happen; you don't want to wait 10+ seconds before talking to the LLM every time you Ctrl+C to do something else.

@ngxson
Collaborator

ngxson commented Dec 11, 2024

> Because the software driving the server is unable to get that information

I don't see why it can't. A reverse proxy can know whether a connection is still ongoing, and it can terminate the server after some amount of time since the last request ended.

> Also, this would force an application to terminate the server when the application itself terminates

Then why shouldn't it terminate the server when it (the parent process) terminates? Why do you want to leave a zombie process in the system?

> For terminal applications, being restarted is a common thing to happen.

It's only common if the restart is done automatically.

Again, I can't see anyone needing this feature unless the server can reload itself, which, as you said, is the same thing Ollama is doing.

@Sumandora
Author

> Then why shouldn't it terminate the server when it (the parent process) terminates? Why do you want to leave a zombie process in the system?

A zombie/defunct process is not what you are talking about; those happen when the Linux kernel keeps an open reference to a task_struct because somebody called get_task_struct but not put_task_struct. What you mean is a disowned, unused process. But even then you are ignoring what the PR does: when the server becomes unused, it stops itself to save resources, so this PR is precisely handling the state you are talking about. Also, starting the server can take some time, especially for big models, and even more if the drive the model resides on isn't fast, so there is a legitimate interest in keeping models loaded.

> It's only common if the restart is done automatically.

There are two ways this feature can be used. Either a human wants the server to shut down and specifies the timeout explicitly, in which case, why would it ever restart? Or it is used by another program, in which case the logic looks something like this:

```python
def request_completion(text):
    if not is_llama_cpp_running() or not is_parameters_equal():
        start_llama_cpp()

    # do the request
```

If the server shuts down and the program still wants it to be alive then it will be restarted automatically.
Let me put it this way: If your phone goes into standby, does it turn on by itself automatically?

> I don't see why it can't. A reverse proxy can know whether a connection is still ongoing, and it can terminate the server after some amount of time since the last request ended.

A reverse proxy sounds like it would work, but it quickly falls apart once you consider the implications.

First, you are required to create a daemon service that plays the reverse proxy, which is more work than I (or likely many others, for that matter) want to do.

Second, what happens if you start the interface twice? That can definitely happen when you work with multiple terminals and forget you already have it open in another one. Either you spawn a second server, at which point your GPU memory is already occupied and you are forced to run off the CPU; even worse, it's occupied by the same model, so loading it a second time is quite literally the biggest waste of resources. Or you check whether llama.cpp is already running; if it is, you need to check whether it was spawned by the daemon, because if not, the user ran it manually. Now, how do you detect the daemon? If it has a constant name, this is easy, but that falls apart when two different applications use different daemons. There is no official reverse proxy, so there is no common name to check for. At that point you are forced to start the same model again, because you can't reliably detect whether it is a reverse proxy, and even if you could, you don't know which port it runs on.

Another way of doing the reverse proxy would be to have it start the server when it receives a request, so you don't need to check whether the server is running. Apart from the horrible response time on the first request, which is something Ollama actually fails to account for and which forces your read timeouts up to several minutes, the even bigger problem is that this doesn't solve the two-different-clients problem at all.

Now you have the choice: do you want to implement one of these (broken) systems, or are you fine with adding 10 lines of code to the server so that it handles this problem properly itself?

With this PR, a client can simply check the options from the server's process arguments (something you can query on Linux, FreeBSD, Windows, and probably macOS as well) and then use the already-running server if it matches what the client wanted to start anyway, or start the server itself if it isn't running yet. The client can then forget about it, because the server will terminate itself; if it ever does go offline while still needed, the client just restarts it.
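As an illustration of that check, here is a minimal Linux-only sketch that scans /proc for a running llama-server and prints its arguments for comparison; the matching logic is hypothetical and not part of this PR:

```cpp
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// read the NUL-separated argument list of a process from /proc/<pid>/cmdline
static std::vector<std::string> read_cmdline(const std::filesystem::path & proc_dir) {
    std::ifstream f(proc_dir / "cmdline", std::ios::binary);
    std::string raw((std::istreambuf_iterator<char>(f)), std::istreambuf_iterator<char>());
    std::vector<std::string> args;
    size_t pos = 0;
    while (pos < raw.size()) {
        size_t end = raw.find('\0', pos);
        if (end == std::string::npos) end = raw.size();
        args.push_back(raw.substr(pos, end - pos));
        pos = end + 1;
    }
    return args;
}

int main() {
    for (const auto & entry : std::filesystem::directory_iterator("/proc")) {
        const std::string name = entry.path().filename().string();
        if (name.find_first_not_of("0123456789") != std::string::npos) continue; // PIDs only
        const auto args = read_cmdline(entry.path());
        if (args.empty() || args[0].find("llama-server") == std::string::npos) continue;
        std::cout << "llama-server already running (pid " << name << "):";
        for (const auto & a : args) std::cout << ' ' << a; // compare these to the desired params
        std::cout << '\n';
    }
    return 0;
}
```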

> Again, I can't see anyone needing this feature unless the server can reload itself, which, as you said, is the same thing Ollama is doing.

Well, I would, otherwise I wouldn't have added it. I commonly forget about llama.cpp servers I started in tmux until some program tells me I only have 500 megabytes of VRAM left. I would also prefer not to import my GGUFs into Ollama just to use them in a chat UI, and not to wait 10 seconds every time I start my chat UI while the server starts and loads the model.

I'm sorry for making this so long, but it appears you have fundamentally misunderstood why this is needed.
I understand if you think this is out of scope for llama.cpp and want the problem solved by other software such as, as you mentioned, Ollama. But then please say so, and don't make up reasons why this wouldn't work, because it would. It is just a question of whether this is desired as a feature/handled case of llama.cpp.

@ngxson
Collaborator

ngxson commented Dec 11, 2024

> Let me put it this way: If your phone goes into standby, does it turn on by itself automatically?

Yes, it does. The OS processes never exit on your phone; the system waits for the interrupt from the power button to turn the screen back on. Even when the screen is off, the CPU is still working to some extent.

The same can be said for this PR: what you are doing is exiting the whole server process once the deadline has passed. That is equivalent to powering off your phone, which "exits" the OS. So let me ask: do you power your phone on and off each time you use it, or do you put it in "standby" mode?

I don't want to discuss this PR any further; there is no point for me to continue. If any other contributors/collaborators want to take over this PR, please feel free to do so. Thank you.


P.S.: if this PR ever gets merged, please at least change the param --standby-timeout to something else, because it's misleading. "Standby" means the process is still there but some parts are unloaded (for example, the llama_context may be deallocated while the llama_model is still there). But in this case the whole process is terminated; that is not standby.
