support prompt caching for claude models to cut latency and api costs by thechaitanyaanand · Pull Request #2348 · huggingface/smolagents

thechaitanyaanand · 2026-06-06T17:10:23Z

This PR adds support for ephemeral prompt caching for anthropic models (claude) within the model interaction layer.

The Problem: When running multi-step agent loops (like ReAct), the system prompt, available tools, and message history get sent back and forth to the model on every single step. This leads to redundant token processing, which slows down response times and increases costs.
Where it matters: For longer runs or complex agents with many tools, this can cut down token latency and costs by up to 90% since Claude can retrieve the prompt context from its cache instantly instead of reading it again.
Added a clean unit test in tests/test_models.py (test_prepare_completion_kwargs_prompt_caching) to ensure the headers are formatted correctly for Claude, and completely ignored for other models.

thechaitanyaanand and others added 2 commits June 6, 2026 22:22

Add prompt caching adapter for Anthropic models and include unit tests

0a2a63e

remove artifact from briefed-cli in agent file

445c1da

thechaitanyaanand changed the title ~~Add prompt caching adapter for Anthropic models and include unit tests~~ support prompt caching for claude models to cut latency and api costs Jun 6, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support prompt caching for claude models to cut latency and api costs#2348

support prompt caching for claude models to cut latency and api costs#2348
thechaitanyaanand wants to merge 2 commits into
huggingface:mainfrom
thechaitanyaanand:feature/prompt-caching

thechaitanyaanand commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

thechaitanyaanand commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant