Plugins and Token Count #7200
Is my understanding correct that the more plugins are loaded into the Kernel, the more prompt tokens are required for the planner to decide on a plugin? Is this the same as when many functions are loaded as available tools (e.g., 10 functions)? Or do the prompt tokens only include the functions actually called (e.g., 2) during the chat completion invocation?
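To make the question concrete, here is a back-of-the-envelope sketch (with made-up schemas and a crude ~4-characters-per-token heuristic, not a real tokenizer) of how the per-request overhead scales with the number of loaded functions, since every loaded function's JSON schema travels with every request:

```python
import json

# Hypothetical tool schemas; real plugin functions would have richer
# descriptions and parameter definitions, making the effect larger.
tool_schemas = [
    {
        "name": f"plugin_function_{i}",
        "description": "does something useful",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
        },
    }
    for i in range(10)
]

def rough_tokens(obj):
    # Crude proxy: roughly 4 characters per token for English/JSON text.
    return len(json.dumps(obj)) // 4

print(rough_tokens(tool_schemas))      # overhead with all 10 functions loaded
print(rough_tokens(tool_schemas[:2]))  # overhead if only 2 were loaded
```

The overhead grows with every loaded function, regardless of how many the model actually calls.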
I have this plugin list and triggered 2 conversations:

A) Time - get current time

Conversation 1:
Query 2) Use RAG Search (B) - prompt_tokens=3013, completion_tokens=215, total_tokens=3228 (breakdown: Q2 + Q1 total)
Query 3) Use RAG Search (B) - prompt_tokens=9972, completion_tokens=433, total_tokens=10405 (breakdown: Q3 + Q2 total)

Conversation 2:
Query 1) below is exactly the same as Query 3) from Conversation #1
My observation when comparing Conv 2) Q1 with Conv 1) Q3: the Q3 breakdown seems bloated, which I can only attribute to the retrieval function's RAG results from Conv 1) Q2 being carried along in the history. While I see the tool call's response is useful for citations or references, is there a way to optimize the token count, for example by omitting previous tool calls' results from the history? I don't think they are necessary for subsequent LLM calls, since the assistant's output would be more relevant.
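One way to sketch the idea of omitting earlier tool results: filter the chat history before the next completion call, dropping old "tool" messages (and the assistant tool-call stubs that produced them) while keeping the assistant's final answers. The message shape below mimics the common OpenAI-style chat format and assumes each tool result immediately follows its stub; adapt to whatever history type your SDK uses:

```python
def prune_tool_results(history, keep_last_n=1):
    """Drop 'tool' messages and their assistant tool-call stubs,
    keeping only the most recent keep_last_n tool results.
    Simplifying assumption: each tool message directly follows
    the assistant message whose tool_calls produced it."""
    tool_indices = [i for i, m in enumerate(history) if m["role"] == "tool"]
    keep = set(tool_indices[-keep_last_n:]) if keep_last_n else set()
    pruned = []
    for i, msg in enumerate(history):
        if msg["role"] == "tool" and i not in keep:
            continue  # drop the bulky retrieval payload
        if msg["role"] == "assistant" and msg.get("tool_calls") and (i + 1) not in keep:
            continue  # drop the matching tool-call stub too
        pruned.append(msg)
    return pruned

history = [
    {"role": "user", "content": "Q1"},
    {"role": "assistant", "tool_calls": [{"id": "c1"}]},
    {"role": "tool", "content": "<large RAG chunk 1>"},
    {"role": "assistant", "content": "Answer 1"},
    {"role": "user", "content": "Q2"},
    {"role": "assistant", "tool_calls": [{"id": "c2"}]},
    {"role": "tool", "content": "<large RAG chunk 2>"},
    {"role": "assistant", "content": "Answer 2"},
]

print([m.get("content") for m in prune_tool_results(history)])
```

Only the latest RAG payload survives; the assistant answers (which usually summarize what was retrieved) still carry the conversational context forward.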
Yes. Every time you send a request to the service with function calling enabled, a description of each function needs to be included in the request, so the more functions there are, and the longer the description of each, the more prompt tokens will be consumed.

This is one of the reasons it's a good idea to partition your plugins/functions per use case, so that you only send the relevant functions with a given request/prompt. Another important reason: in general, with today's models, you'll get better planning out of them if you limit the number of functions offered for them to choose from. This is also one of the reasons Kernel is designed to be transient in nature: it's lightweight enough to create on demand, populate with just the plugins you need, and throw away afterwards. The AddKernel method for registering a Kernel in the dependency injection container registers it as transient, for example.
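The per-request partitioning idea can be sketched independently of any SDK. The keyword routing below is a deliberately simple stand-in for whatever selection logic fits your app (static per-feature sets, embedding-based selection, etc.); the tool names and tags are hypothetical:

```python
# Hypothetical registry of available functions, each tagged by topic.
TOOLS = {
    "get_current_time": {"description": "Get the current time", "tags": {"time"}},
    "rag_search": {"description": "Search internal documents", "tags": {"search", "docs"}},
    "send_email": {"description": "Send an email", "tags": {"email"}},
}

def select_tools(query, tools=TOOLS):
    """Pick only the functions whose tags overlap the query's words,
    so the request carries a smaller tool schema payload."""
    words = set(query.lower().split())
    picked = {name: t for name, t in tools.items() if t["tags"] & words}
    return picked or tools  # fall back to everything if nothing matches

print(sorted(select_tools("search the docs for onboarding")))
```

In Semantic Kernel terms, this corresponds to building a short-lived Kernel per request and adding only the selected plugins to it before invoking the model.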