-
One problem with the global resource cache is that it needs to be async. It can't just destroy resources in its own execution context, as this might cause races with the provider's context. So in order to free a resource it needs to notify the submitter. This also means that a requester for space can't just call a synchronous function; it needs an asynchronous one. And it also means that the destroyer of a resource needs to notify the cache once the resource has been freed. This complicates the integration a lot. Can we have something simpler?
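To make the complication concrete, here is a minimal sketch of the flow being described, with hypothetical names (CachedItem, requestFree, requestSpace are assumptions, not the actual cache API): the cache never destroys a resource itself; it asks the owner to free it, and the requester's callback only runs once the owner has reported back.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical names; not the actual cache API.
struct CachedItem {
    std::string key;
    std::size_t bytes;
    // Invoked by the cache. The owner frees the resource on its own execution
    // context and then calls `done` to tell the cache the memory is really gone.
    std::function<void(std::function<void()> done)> requestFree;
};

class AsyncCache {
public:
    void add(CachedItem item) { m_items.push_back(std::move(item)); }

    // A requester can't get space synchronously: it passes a callback which
    // only runs once the evicted item's owner has reported that it is freed.
    void requestSpace(std::size_t /*bytes*/, std::function<void()> onSpaceAvailable) {
        if (m_items.empty()) { onSpaceAvailable(); return; } // nothing to evict
        auto victim = std::move(m_items.front());
        m_items.erase(m_items.begin());
        victim.requestFree([cb = std::move(onSpaceAvailable)]() {
            // a real implementation would re-check how much was actually freed
            // and possibly evict more items before calling back
            cb();
        });
    }

private:
    std::vector<CachedItem> m_items;
};
```

Even in this stripped-down form every participant has to deal with callbacks, which is exactly the integration cost the comment is asking to avoid.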
-
Applications have multiple sessions with the same model which start and stop. We don't want to reload the same model for every session. Ideally we want to cache the model.
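As a baseline, a per-process cache along these lines (hypothetical types, not the SDK's API) is all it takes for one component in isolation; the rest of the post is about where such a cache should live and what it should free when space runs out.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical types; not the SDK's actual API.
struct Model { /* weights, backend state, ... */ };

class ModelCache {
public:
    // Returns the already-loaded instance if present, otherwise loads it once.
    std::shared_ptr<Model> acquire(const std::string& id) {
        auto& slot = m_models[id];
        if (!slot) slot = std::make_shared<Model>(); // loading happens only here
        return slot; // later sessions reuse the same instance
    }

private:
    std::map<std::string, std::shared_ptr<Model>> m_models;
};
```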
Plugin-level cache
This only makes sense if this is the only plugin on the system. If we have multiple plugins, they will have no way of communicating with each other: even if, say, ilib-whisper has a stale model that it can free, ilib-llama has no way of telling it to do so. We want to have a central SDK-level cache which can free items when needed.
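A rough sketch of what such a central cache's surface could look like, with hypothetical names (CacheEntryDesc, registerEntry, requestSpace are assumptions, not the SDK's actual interface): each plugin registers the items it holds together with a callback that frees them, so a request coming from ilib-llama can cause a stale ilib-whisper model to be released.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical names; not the SDK's actual interface.
struct CacheEntryDesc {
    std::string pluginId;         // e.g. "ilib-whisper"
    std::string key;              // e.g. model path or hash
    std::size_t bytes;            // how much memory the item occupies
    std::function<void()> free;   // provided by the owning plugin
};

class SdkCache {
public:
    virtual ~SdkCache() = default;
    // plugins register items they are willing to have evicted
    virtual void registerEntry(CacheEntryDesc desc) = 0;
    // plugins call this before loading; the cache may invoke `free` on entries
    // registered by other plugins to make room
    virtual void requestSpace(std::size_t bytes) = 0;
};
```

Note that requestSpace is synchronous in this sketch; the reply at the top of the page argues that in practice it would have to be asynchronous to avoid racing with the providers.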
Plain LRU
Plain LRU is too restrictive. If the system can load multiple models, it's a shame to only have a single slot in the cache, especially for apps which need two or more models. Such apps will end up thrashing the cache on every run.
Multi-element cache
Ok, so we will have this, but the problem then is: how do we know what to free when a plugin requests space? In the simplest case, if we have several CPU-RAM and several GPU-memory models and resource space is requested, how do we know which ones to free?

ilib-whisper-cuda-0 and ilib-llama-vulkan-1 are the same thing (this can be learned based on the fact that freeing space in ilib-whisper-cuda-0 led to enough space in ilib-llama-vulkan-1).

For now we will go with 2. and maybe do 3. in the future. 1. is for the distant future if ever.
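As an illustration of the multi-element direction, here is a hedged sketch under assumed conventions (the scope tags like "cpu" or "gpu:cuda:0" and all names are made up for the example): entries are tagged with the resource scope they occupy, and when a plugin requests space in a scope, the least recently used entries in that same scope are freed until enough bytes are released.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// All names and the scope-tag convention are assumptions for the example.
struct ScopedEntry {
    std::string scope;            // which resource pool the item lives in, e.g. "cpu", "gpu:cuda:0"
    std::string key;
    std::size_t bytes;
    std::uint64_t lastUsed = 0;   // monotonically increasing use counter
    std::function<void()> free;   // provided by the owning plugin
};

class MultiElementCache {
public:
    void add(ScopedEntry e) {
        e.lastUsed = ++m_clock;
        m_entries.push_back(std::move(e));
    }

    void touch(const std::string& key) {
        for (auto& e : m_entries) {
            if (e.key == key) e.lastUsed = ++m_clock;
        }
    }

    // Free least-recently-used entries in the requested scope until `bytes`
    // have been released or there is nothing left to evict in that scope.
    void requestSpace(const std::string& scope, std::size_t bytes) {
        while (bytes > 0) {
            auto victim = m_entries.end();
            for (auto it = m_entries.begin(); it != m_entries.end(); ++it) {
                if (it->scope == scope &&
                    (victim == m_entries.end() || it->lastUsed < victim->lastUsed)) {
                    victim = it;
                }
            }
            if (victim == m_entries.end()) break; // nothing more to free here
            victim->free();
            bytes -= std::min(bytes, victim->bytes);
            m_entries.erase(victim);
        }
    }

private:
    std::vector<ScopedEntry> m_entries;
    std::uint64_t m_clock = 0;
};
```

Learning that two differently named scopes are actually the same pool (as in the ilib-whisper-cuda-0 / ilib-llama-vulkan-1 example above) would then just mean mapping both to one scope tag.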