
Commit 731d9ef

Merge pull request #23840 from dvdksn/cagent-a2a-evals-misc
cagent: model provider setup, a2a, evals, mcp/dmr links
2 parents 249de51 + 8a0f520

12 files changed (+824 −26 lines)


_vale/config/vocabularies/Docker/accept.txt

Lines changed: 1 addition & 1 deletion
@@ -294,4 +294,4 @@ Zsh
[Ww]alkthrough
[Tt]oolsets?
[Rr]erank(ing|ed)?
-
+[Ee]vals?

content/manuals/ai/cagent/best-practices.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
title: Best practices
description: Patterns and techniques for building effective cagent agents
keywords: [cagent, best practices, patterns, agent design, optimization]
-weight: 20
+weight: 40
---

Patterns you learn from building and running cagent agents. These aren't

content/manuals/ai/cagent/evals.md

Lines changed: 163 additions & 0 deletions
---
title: Evals
description: Test your agents with saved conversations
keywords: [cagent, evaluations, testing, evals]
weight: 80
---

Evaluations (evals) help you track how your agent's behavior changes over time.
When you save a conversation as an eval, you can replay it later to see if the
agent responds differently. Evals measure consistency, not correctness - they
tell you if behavior changed, not whether it's right or wrong.

## What are evals

An eval is a saved conversation you can replay. When you run evals, cagent
replays the user messages and compares the new responses against the original
saved conversation. High scores mean the agent behaved similarly; low scores
mean behavior changed.

What you do with that information depends on why you saved the conversation.
You might save successful conversations to catch regressions, or save failure
cases to document known issues and track whether they improve.

## Common workflows

How you use evals depends on what you're trying to accomplish:

Regression testing: Save conversations where your agent performs well. When you
make changes later (upgrade models, update prompts, refactor code), run the
evals. High scores mean behavior stayed consistent, which is usually what you
want. Low scores mean something changed - examine the new behavior to see if
it's still correct.

Tracking improvements: Save conversations where your agent struggles or fails.
As you make improvements, run these evals to see how behavior evolves. Low
scores indicate the agent now behaves differently, which might mean you fixed
the issue. You'll need to manually verify the new behavior is actually better.

Documenting edge cases: Save interesting or unusual conversations regardless of
quality. Use them to understand how your agent handles edge cases and whether
that behavior changes over time.

Evals measure whether behavior changed. You determine if that change is good or
bad.

## Creating an eval

Save a conversation from an interactive session:

```console
$ cagent run ./agent.yaml
```

Have a conversation with your agent, then save it as an eval:

```console
> /eval test-case-name
Eval saved to evals/test-case-name.json
```

The conversation is saved to the `evals/` directory in your current working
directory. You can organize eval files in subdirectories if needed.

## Running evals

Run all evals in the default directory:

```console
$ cagent eval ./agent.yaml
```

Use a custom eval directory:

```console
$ cagent eval ./agent.yaml ./my-evals
```

Run evals against an agent from a registry:

```console
$ cagent eval agentcatalog/myagent
```

Example output:

```console
$ cagent eval ./agent.yaml
--- 0
First message: tell me something interesting about kil
Eval file: c7e556c5-dae5-4898-a38c-73cc8e0e6abe
Tool trajectory score: 1.000000
Rouge-1 score: 0.447368
Cost: 0.00
Output tokens: 177
```

## Understanding results

For each eval, cagent shows:

- **First message** - The initial user message from the saved conversation
- **Eval file** - The UUID of the eval file being run
- **Tool trajectory score** - How similarly the agent used tools (0-1 scale,
  higher is better)
- **[ROUGE-1](https://en.wikipedia.org/wiki/ROUGE_(metric)) score** - Text
  similarity between responses (0-1 scale, higher is better)
- **Cost** - The cost for this eval run
- **Output tokens** - Number of tokens generated

Higher scores mean the agent behaved more similarly to the original recorded
conversation. A score of 1.0 means identical behavior.

### What the scores mean

**Tool trajectory score** measures whether the agent called the same tools in
the same order as the original conversation. Lower scores might indicate the
agent found a different approach to solve the problem, which isn't necessarily
wrong but worth investigating.
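
If it helps to build intuition, you can think of this score as measuring how
much of the recorded tool-call sequence reappears, in order, in the new run.
The sketch below only illustrates that idea; it is not cagent's actual scoring
code, and the real formula may differ.

```python
# Illustrative only: score how similar two ordered lists of tool calls are.
# This is a plausible trajectory-style comparison, not cagent's implementation.
from difflib import SequenceMatcher

def trajectory_score(recorded: list[str], replayed: list[str]) -> float:
    """Return a 0-1 similarity between two ordered lists of tool names."""
    if not recorded and not replayed:
        return 1.0  # neither run called any tools: identical behavior
    return SequenceMatcher(None, recorded, replayed).ratio()

# Example: the replayed run skipped one of the recorded tool calls.
print(trajectory_score(["search", "fetch_page", "summarize"],
                       ["search", "summarize"]))  # 0.8
```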

**Rouge-1 score** measures how similar the response text is to the original.
This is a heuristic measure - different wording might still be correct, so use
this as a signal rather than absolute truth.
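
ROUGE-1 is based on unigram (single word) overlap between the new response and
the saved one. The sketch below shows the general idea; the tokenization and
the exact variant cagent reports (recall, precision, or F1) may differ.

```python
# Rough illustration of ROUGE-1 as a unigram-overlap F1 score.
# Not cagent's implementation; shown only to explain the metric.
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # shared words, counted with multiplicity
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)

# Similar wording scores high but not 1.0.
print(rouge1_f1("docker desktop bundles cagent",
                "cagent ships with docker desktop"))  # ~0.67
```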

### Interpreting your results

Scores close to 1.0 mean your changes maintained consistent behavior - the
agent is using the same approach and producing similar responses. This is
generally good; your changes didn't break existing functionality.

Lower scores mean behavior changed compared to the saved conversation. This
could be a regression where the agent now performs worse, or it could be an
improvement where the agent found a better approach.

When scores drop, examine the actual behavior to determine if it's better or
worse. The eval files are stored as JSON in your evals directory - open the
file to see the original conversation. Then test your modified agent with the
same input to compare responses. If the new response is better, save a new
conversation to replace the eval. If it's worse, you found a regression.
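
If you want to skim several saved evals rather than opening each file by hand,
a short script like the one below works. It assumes nothing about the files
beyond them being valid JSON.

```python
# Print a short preview of every saved eval so you can find the conversation
# you want to compare against. Assumes only that the files are valid JSON.
import json
from pathlib import Path

for path in sorted(Path("evals").rglob("*.json")):
    data = json.loads(path.read_text())
    print(f"=== {path} ===")
    print(json.dumps(data, indent=2)[:400])  # first few hundred characters
```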

The scores guide you to what changed. Your judgment determines if the change is
good or bad.

## When to use evals

Evals help you track behavior changes over time. They're useful for catching
regressions when you upgrade models or dependencies, documenting known failure
cases you want to fix, and understanding how edge cases evolve as you iterate.

Evals aren't appropriate for determining which agent configuration works best -
they measure similarity to a saved conversation, not correctness. Use manual
testing to evaluate different configurations and decide which works better.

Save conversations worth tracking. Build a collection of important workflows,
interesting edge cases, and known issues. Run your evals when making changes to
see what shifted.

## What's next

- Check the [CLI reference](reference/cli.md#eval) for all `cagent eval`
  options
- Learn [best practices](best-practices.md) for building effective agents
- Review [example configurations](https://github.com/docker/cagent/tree/main/examples)
  for different agent types

Lines changed: 50 additions & 3 deletions
@@ -1,6 +1,53 @@
---
-build:
-  render: never
title: Integrations
-weight: 50
+description: Connect cagent agents to editors, MCP clients, and other agents
+keywords: [cagent, integration, acp, mcp, a2a, editor, protocol]
+weight: 60
---

cagent agents can integrate with different environments depending on how you
want to use them. Each integration type serves a specific purpose.

## Integration types

### ACP - Editor integration

Run cagent agents directly in your editor (Neovim, Zed). The agent sees your
editor's file context and can read and modify files through the editor's
interface. Use ACP when you want an AI coding assistant embedded in your
editor.

See [ACP integration](./acp.md) for setup instructions.

### MCP - Tool integration

Expose cagent agents as tools in MCP clients like Claude Desktop or Claude
Code. Your agents appear in the client's tool list, and the client can call
them when needed. Use MCP when you want Claude Desktop (or another MCP client)
to have access to your specialized agents.

See [MCP integration](./mcp.md) for setup instructions.

### A2A - Agent-to-agent communication

Run cagent agents as HTTP servers that other agents or systems can call using
the Agent-to-Agent protocol. Your agent becomes a service that other systems
can discover and invoke over the network. Use A2A when you want to build
multi-agent systems or expose your agent as an HTTP service.

See [A2A integration](./a2a.md) for setup instructions.
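
As a rough illustration of the discovery step, an A2A server publishes an
agent card that clients fetch before calling the agent. The snippet below is a
sketch that assumes a locally running server exposing its card at the
conventional `/.well-known/agent.json` path; the address is a placeholder, not
a cagent default.

```python
# Sketch: fetch an A2A agent card to discover a running agent.
# The base URL is a placeholder for wherever your agent is actually served;
# the well-known path follows the A2A convention for publishing agent cards.
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # placeholder, not a cagent default

with urllib.request.urlopen(f"{BASE_URL}/.well-known/agent.json") as resp:
    card = json.load(resp)

# The card describes the agent so other agents or scripts can decide how to call it.
print(card.get("name"), "-", card.get("description"))
```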

## Choosing the right integration

| Feature       | ACP                | MCP                | A2A                  |
| ------------- | ------------------ | ------------------ | -------------------- |
| Use case      | Editor integration | Agents as tools    | Agent-to-agent calls |
| Transport     | stdio              | stdio/SSE          | HTTP                 |
| Discovery     | Editor plugin      | Server manifest    | Agent card           |
| Best for      | Code editing       | Tool integration   | Multi-agent systems  |
| Communication | Editor calls agent | Client calls tools | Between agents       |

Choose ACP if you want your agent embedded in your editor while you code.
Choose MCP if you want Claude Desktop (or another MCP client) to be able to
call your specialized agents as tools. Choose A2A if you're building
multi-agent systems where agents need to call each other over HTTP.
