
perf: parallelize MCP server initialization in get_all_tool_definitions#138

Open
JasonOA888 wants to merge 2 commits into MiroMindAI:main from JasonOA888:perf/parallel-mcp-init

Conversation

@JasonOA888

Partially addresses #137

Problem:
MCP tool servers were being initialized sequentially in a for loop:

  • tool-python (E2B sandbox): ~33s
  • search_and_scrape_webpage (Serper): ~21s
  • jina_scrape_llm_summary: ~17s
  • Total: ~71s per task

With 1266 BC-EN tasks, this adds significant overhead to evaluation runs.

Solution:

  • Refactored server connection logic into _get_server_tools() helper function
  • Used asyncio.gather() to connect to all servers in parallel
  • Expected savings: ~40-50s per task (parallel time = max of individual times, not sum)

Changes:

  • libs/miroflow-tools/src/miroflow_tools/manager.py:
    • New internal async function _get_server_tools(config)
    • Parallel execution via asyncio.gather(..., return_exceptions=True)
    • Graceful handling of exceptions from parallel execution

Error handling preserved:

  • Failed connections still add an error entry to results
  • Exceptions are logged and handled without crashing the entire initialization
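Under the assumption that the refactor has the shape described above (the names _get_server_tools and get_all_tool_definitions come from the PR text; the config fields and simulated handshake are illustrative, not the real manager.py code), a minimal runnable sketch of the pattern:

```python
import asyncio
import time

# Minimal sketch of the parallel-init pattern. The MCP handshake is
# simulated with asyncio.sleep; the real _get_server_tools opens a
# session to the server and lists its tools.
async def _get_server_tools(config):
    try:
        await asyncio.sleep(config["startup_s"])  # stands in for connect + list-tools
        return {"name": config["name"], "tools": config["tools"]}
    except Exception as exc:
        # Failed connections still yield an entry so the server stays visible downstream.
        return {"name": config["name"],
                "tools": [{"error": f"Unable to fetch tools: {exc}"}]}

async def get_all_tool_definitions(server_configs):
    # gather() starts every connection at once, so wall time is roughly
    # max(startup times) rather than their sum.
    results = await asyncio.gather(
        *(_get_server_tools(cfg) for cfg in server_configs),
        return_exceptions=True,
    )
    return [r for r in results if not isinstance(r, Exception)]

configs = [
    {"name": "tool-python", "startup_s": 0.10, "tools": ["run_python"]},
    {"name": "serper", "startup_s": 0.05, "tools": ["search"]},
]
start = time.perf_counter()
servers = asyncio.run(get_all_tool_definitions(configs))
elapsed = time.perf_counter() - start  # close to the slowest server, not the total
```

With the simulated startups above, sequential init would take ~0.15s while the parallel version finishes in ~0.10s, which is the "max of individual times, not sum" claim in miniature.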

Benchmark impact:

  • BC-EN (1266 tasks): ~40-50s × 1266 = ~14-17 hours saved per run
  • BC-ZH (289 tasks): ~4 hours saved per run

Partially addresses MiroMindAI#137

MCP tool servers were being initialized sequentially in a for loop,
causing ~70-80s overhead per task (tool-python ~33s, search ~21s, jina ~17s).

This change:
- Refactors server connection logic into a helper function _get_server_tools()
- Uses asyncio.gather() to connect to all servers in parallel
- Expected savings: ~40-50s per task initialization

The parallel approach maintains the same error handling behavior:
- Failed connections still add an error entry
- Exceptions from asyncio.gather are logged and handled gracefully
Member

@Vanint Vanint left a comment


Good performance improvement — parallelizing the MCP server init is a clear win. A couple of issues to address before merging:

Must Fix

  1. Exception handling loses server identity: When return_exceptions=True catches a failure, the current log message has no indication of which server failed:
f"Unexpected error during parallel server initialization: {result}"

Use zip to preserve the mapping:

for config, result in zip(self.server_configs, results):
    if isinstance(result, Exception):
        self._log("error", "ToolManager | Parallel Init Error",
                  f"Server '{config['name']}' failed: {result}")
    else:
        all_servers_for_prompt.append(result)
  2. Exception path drops the fallback error entry: In the original sequential code, a failed server still gets an entry with {"error": ...} appended to all_servers_for_prompt, so downstream code knows the server exists but failed. In the new code, if an exception escapes past the internal try/except in _get_server_tools, the outer handler just logs and skips; the server silently disappears from the result. Add a fallback entry in the outer exception branch as well:
if isinstance(result, Exception):
    self._log(...)
    all_servers_for_prompt.append({
        "name": config["name"],
        "tools": [{"error": f"Unable to fetch tools: {result}"}]
    })

Suggestion

  1. Confirm task_log is safe under concurrent writes: Multiple _get_server_tools coroutines now call self._log concurrently. If task_log.log_step isn't designed for concurrent access, logs could interleave or error. Worth a quick check.

Clean change overall — fix the exception handling gaps and this is good to go.

- Use zip(configs, results) to identify which server failed
- Append fallback error entry instead of silently dropping failed servers
- Matches original sequential behavior where failed servers stay in result list
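Putting both review fixes together, a hedged sketch of the corrected result loop (the simulated failure and config fields are illustrative; only the zip-and-fallback shape follows the review comments):

```python
import asyncio

# Sketch of both fixes: zip() recovers which server an exception belongs
# to, and failed servers keep a fallback error entry instead of
# vanishing from the result list.
async def _get_server_tools(config):
    if config.get("fail"):
        raise ConnectionError("handshake timed out")  # simulated failure
    return {"name": config["name"], "tools": config["tools"]}

async def get_all_tool_definitions(server_configs):
    results = await asyncio.gather(
        *(_get_server_tools(cfg) for cfg in server_configs),
        return_exceptions=True,
    )
    all_servers_for_prompt = []
    for config, result in zip(server_configs, results):
        if isinstance(result, Exception):
            # Identity is preserved in the log, and the server stays in
            # the list, matching the original sequential behavior.
            print(f"Server '{config['name']}' failed: {result}")
            all_servers_for_prompt.append({
                "name": config["name"],
                "tools": [{"error": f"Unable to fetch tools: {result}"}],
            })
        else:
            all_servers_for_prompt.append(result)
    return all_servers_for_prompt

servers = asyncio.run(get_all_tool_definitions([
    {"name": "serper", "tools": ["search"]},
    {"name": "jina", "fail": True, "tools": []},
]))
```

Note that both servers appear in the output even though one failed, which is what downstream consumers of all_servers_for_prompt rely on.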
@JasonOA888
Author

Thanks for the thorough review @Vanint! Both issues addressed in the latest push (2f4c336):

  1. Server identity preserved — now using zip(self.server_configs, results) so error logs clearly show Server '{config[name]}' failed: {result}.

  2. Fallback error entry — failed servers now get a {"name": ..., "tools": [{"error": ...}]} entry appended, matching the original sequential behavior where failed servers stay in the result list.

Re: concurrent _log: log_step calls self.step_logs.append(), which is atomic under CPython's GIL, so concurrent writes from coroutines are safe. Happy to add an asyncio.Lock guard if you'd prefer belt-and-suspenders.
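For reference, the optional belt-and-suspenders guard could look like the sketch below (TaskLog, step_logs, and log_step are names assumed from the discussion; the real class and its sync/async signature likely differ):

```python
import asyncio

# Sketch of a lock-guarded task log. A single append is GIL-atomic, but
# an asyncio.Lock also keeps any future multi-statement log updates
# consistent across concurrently running coroutines.
class TaskLog:
    def __init__(self):
        self.step_logs = []
        self._lock = asyncio.Lock()

    async def log_step(self, level, title, message):
        async with self._lock:
            self.step_logs.append(
                {"level": level, "title": title, "message": message}
            )

async def main():
    log = TaskLog()
    # Simulate several server-init coroutines logging at the same time.
    await asyncio.gather(
        *(log.log_step("info", "Init", f"server {i}") for i in range(5))
    )
    return log

log = asyncio.run(main())
```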
