Skip to content

support prompt caching for claude models to cut latency and api costs#2348

Open
thechaitanyaanand wants to merge 2 commits into
huggingface:mainfrom
thechaitanyaanand:feature/prompt-caching
Open

support prompt caching for claude models to cut latency and api costs#2348
thechaitanyaanand wants to merge 2 commits into
huggingface:mainfrom
thechaitanyaanand:feature/prompt-caching

Conversation

@thechaitanyaanand
Copy link
Copy Markdown

This PR adds support for ephemeral prompt caching for anthropic models (claude) within the model interaction layer.

  1. The Problem: When running multi-step agent loops (like ReAct), the system prompt, available tools, and message history get sent back and forth to the model on every single step. This leads to redundant token processing, which slows down response times and increases costs.

  2. Where it matters: For longer runs or complex agents with many tools, this can cut down token latency and costs by up to 90% since Claude can retrieve the prompt context from its cache instantly instead of reading it again.

  3. Added a clean unit test in tests/test_models.py (test_prepare_completion_kwargs_prompt_caching) to ensure the headers are formatted correctly for Claude, and completely ignored for other models.

@thechaitanyaanand thechaitanyaanand changed the title Add prompt caching adapter for Anthropic models and include unit tests support prompt caching for claude models to cut latency and api costs Jun 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant