simonprovost
Collaborator

@simonprovost simonprovost commented Jul 18, 2025

Hiya guys!

I hope everything is great for everyone. I'm very happy to finally release this PR; it's been a tough week of learning, designing, developing, and reworking to propose this new architecture for the M3 library (I know it hasn't been discussed, but hear me out), and it turns out it's not too bad-looking, hopefully! 🤞

Important

Please note that I am worlds away from wanting anyone to take it personally that I reworked most of the architecture. The sole purpose of this PR is to improve the library's long-term trajectory and, as Leo mentioned in the email, to move towards this "toolbox". The prior codebase was enjoyable to stroll through and investigate (thanks for such a cool tool!), but I believe what is presented below could be an excellent next step towards that toolbox.

❝ In a nutshell

The new M3 architecture is (1) tool-agnostic, (2) shaped as a chaining-based (method-chaining) API, (3) heavily type-checked to avoid side effects when data flows in and out, (4) ready to scale to more than one MCP tool and to extend MIMIC to more datasets from PhysioNet, and (5) introduces presets (see below). Plus, a lil bonus: the UI is evolving too, see below.

On top of that, it should function similarly to 0.2.0, hopefully. Speaking of which, I would much appreciate it if someone (maybe @rafiattrach?) could kindly jump on the branch and try setting up the MIMIC tool via the BigQuery / authentication route, as I do not have such access. Mistakes happen easily in such a large refactoring, so a second pair of eyes and a real try would be safer, please 🙏

❝ The New Architecture

Below are the ground lines of the new architecture. While the code says it all, I believe it won't hurt to discuss it here for context, in case it is needed.

Before anything else, two things to note. (1) The new design is not only tool-agnostic, it also follows the Scikit-learn pipeline philosophy, allowing users to construct an M3 pipeline the same way Sklearn pipelines are constructed. That is, we can stack (compose) any of the M3 tools on offer and connect the dots between those we want to play with. Once composed, instead of running fit(.), we build(.) it, and instead of calling predict(.), we run(.) it. (2) As previously stated, the library uses a chaining-based, API-style approach. The rationale is to avoid ending up with 50-parameter constructors in the long run; chaining methods keep things more resilient and user-friendly, e.g., M3().with_config(<...>).with_tool(<tool_1>).with_tool(<tool_2>) or M3().with_preset(<preset_of_interest>), etc.
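For context, here is a minimal sketch of what the chaining style boils down to. It is a toy and not the code in this PR: EchoTool, the dict-based config, and run() returning action names (instead of starting an MCP server) are all assumptions made purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EchoTool:
    """Hypothetical stand-in for an M3 tool; not the real MIMIC tool."""

    name: str

    def actions(self) -> list[str]:
        # A real tool would expose MCP actions; here we only return labels.
        return [f"{self.name}.query"]


@dataclass
class M3:
    """Toy pipeline that only illustrates the chaining style."""

    config: dict = field(default_factory=dict)
    tools: list[EchoTool] = field(default_factory=list)
    built: bool = False

    def with_config(self, **options) -> M3:
        self.config.update(options)
        return self  # returning self is what makes chaining possible

    def with_tool(self, tool: EchoTool) -> M3:
        self.tools.append(tool)
        return self

    def build(self) -> M3:
        # Rough counterpart of fit(): check the composition before use.
        if not self.tools:
            raise ValueError("a pipeline needs at least one tool")
        self.built = True
        return self

    def run(self) -> list[str]:
        # Rough counterpart of predict(): here it just lists the actions.
        if not self.built:
            raise RuntimeError("call build() before run()")
        return [action for tool in self.tools for action in tool.actions()]


pipeline = (
    M3()
    .with_config(log_level="INFO")
    .with_tool(EchoTool("mimic"))
    .with_tool(EchoTool("notes"))
)
exposed = pipeline.build().run()
```

Every with_* method returns the pipeline itself, so adding a new option later means adding one method rather than one more constructor parameter.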

More specifically,

Core Components (core/)

This core layer manages the library's essential building blocks and, broadly speaking, enforces consistency across the library and its ecosystem. It should not be used much on the user's side; it is more for authors and contributors. As follows:

  • M3 Tool Framework: The abstract BaseTool enforces a uniform structure for M3-supported tools (e.g., MIMIC), defining lifecycle methods (e.g., initialize, teardown) and actions. It also includes BaseToolCLI for standardized CLI commands (e.g., init, configure) so that each tool's CLI follows a similar structure. Anyway, all of these are the components for creating M3-supported tools, and MIMIC has been refactored accordingly.

  • MCP Config Generation: Enables generation of configs for various MCP hosts (e.g., FastMCP, ClaudeDesktop) via the MCPConfigGenerator base class. In the long term, one may want to support exporting to more MCP hosts; this is where the new primitives will reside, and they will be picked up automatically by the whole system.

  • Presets: Pre-configured Python M3 pipelines created via the Preset base class. Basically, they allow M3 pipelines to be run from a fixed configuration defined within the script, which is great for (1) reproducing benchmarks and (2) fast instantiation, like defaults. E.g., default_m3 creates the MIMIC tool with the SQLite backend and the default dataset, basically like m3 run config claude does in 0.2.0.

  • M3 Configuration: The M3Config class manages log levels, env vars, paths (e.g., data dirs), validation (e.g., for tools), etc. It supports serialization (to_dict, from_dict) and applying env vars for a consistent setup.
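To make the configuration bullet a bit more concrete, below is a rough sketch of a config object with a to_dict/from_dict round trip and env application. Only the to_dict and from_dict names come from this PR description; ConfigSketch, its fields, validate, apply_env, and the M3_SKETCH_BACKEND variable are hypothetical and not taken from the actual M3Config.

```python
from __future__ import annotations

import json
import os
from dataclasses import asdict, dataclass, field


@dataclass
class ConfigSketch:
    """Hypothetical config holder; the real M3Config may look different."""

    log_level: str = "INFO"
    data_dir: str = "m3_data"
    env_vars: dict[str, str] = field(default_factory=dict)

    def validate(self) -> None:
        allowed = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}
        if self.log_level not in allowed:
            raise ValueError(f"unknown log level: {self.log_level!r}")

    def to_dict(self) -> dict:
        # Plain dict, so it can be dumped to JSON as-is.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> ConfigSketch:
        config = cls(**data)
        config.validate()  # reject bad values at load time, not at use time
        return config

    def apply_env(self) -> None:
        # Push the configured variables into the current process environment.
        os.environ.update(self.env_vars)


original = ConfigSketch(log_level="DEBUG", env_vars={"M3_SKETCH_BACKEND": "sqlite"})
restored = ConfigSketch.from_dict(json.loads(json.dumps(original.to_dict())))
restored.apply_env()
```

Validating inside from_dict means a hand-edited or stale saved config fails loudly when loaded instead of later, mid-run.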

Tools Section (tools/)

Yay, finally, this is where all the M3 tools will reside. What's nice is that all tools are auto-registered and validated: the registry checks that they follow the required architecture before registering them and making them usable throughout the library. Currently supports MIMIC only. As follows:

  • Tool Registry: Auto-registers and validates M3 tools via registry.py, ensuring structural compliance (e.g., main class, CLI presence). In the long term, if the tools evolve, more validation checks could be added here alongside the existing ones.

  • MIMIC Tool: Core implementation for MIMIC-IV. Nothing new for you here, but I might have missed some stuff on the BigQuery/OAuth2 route, as explained in the In a nutshell section. What's actually cool in the MIMIC tool, however, is the configuration: YAML-based declarative files that list the supported datasets (datasets.yaml), which can be called via init (more datasets could simply be added here, MIMIC-type of course), the environment variables and whether they're required (env_vars.yaml), and all the security checks (security.yaml) that were hardcoded in .py files before.
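As an illustration of the "validate, then register" idea, here is a small sketch built on an abstract base class. It is not what registry.py does: the decorator-based registration, the REGISTRY dict, MimicSketch, and the execute_query action name are assumptions, and this BaseTool is a simplified stand-in that only borrows the initialize/teardown/actions vocabulary from above.

```python
from abc import ABC, abstractmethod


class BaseTool(ABC):
    """Simplified stand-in for the abstract tool contract."""

    @abstractmethod
    def initialize(self) -> None: ...

    @abstractmethod
    def teardown(self) -> None: ...

    @abstractmethod
    def actions(self) -> list[str]: ...


# Hypothetical registry: tool name -> tool class.
REGISTRY: dict[str, type[BaseTool]] = {}


def register(name: str):
    """Class decorator that validates a tool before registering it."""

    def decorator(cls):
        if not (isinstance(cls, type) and issubclass(cls, BaseTool)):
            raise TypeError(f"{cls!r} must subclass BaseTool")
        # ABCs record still-unimplemented abstract methods on the class.
        missing = getattr(cls, "__abstractmethods__", frozenset())
        if missing:
            raise TypeError(f"{cls.__name__} is missing: {sorted(missing)}")
        REGISTRY[name] = cls
        return cls

    return decorator


@register("mimic")
class MimicSketch(BaseTool):
    def initialize(self) -> None:
        self.ready = True

    def teardown(self) -> None:
        self.ready = False

    def actions(self) -> list[str]:
        return ["execute_query"]  # hypothetical action name
```

A class that forgets teardown or actions is rejected when the module is imported, which is the kind of structural check the registry bullet describes.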

Main M3 Orchestration (src/m3/)

Basically the main entry point of the library, either programmatically or via the CLI. As follows:

  • M3 Class: Chaining-based API in m3.py (e.g., .with_config(...), .with_tool(...), .with_preset(...)), avoiding complex constructors; supports build (for MCP hosts like FastMCP/Claude), run (starts MCP server), save/load (JSON serialization), and validation/initialization.

  • M3 CLI: An enhanced Typer-based UI in cli.py compared to the previous interface, leveraging the new M3 class as much as possible. I won't say much more about the library's main outer CLI, as the videos below recap it very well, I guess.

A great foundation always makes it easier in the long term. ❞

Side notes. The unit tests should now be more thorough too, despite the possibility of certain loopholes :) but more than 75% of the codebase is now covered by tests, against 30% in 0.2.0. Lastly: the code now enforces typing via Beartype (O(1) runtime checks); refer to #45 for the why. I also removed redundant top-of-file docstrings (favoring class/method docs), added the great TheFuzz for fuzzy error handling when, say, you are looking for a tool but made a typo when calling it (e.g., the system says Did you mean X?), and made UI enhancements for better usability.
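To give an idea of the "Did you mean X?" behaviour, here is a tiny sketch. Note that it swaps in the standard library's difflib instead of TheFuzz so that it runs without extra dependencies; KNOWN_TOOLS, resolve_tool, and the 0.6 cutoff are hypothetical and not taken from the PR.

```python
from difflib import get_close_matches

# Hypothetical list of registered tool names.
KNOWN_TOOLS = ["mimic", "echo", "notes"]


def resolve_tool(name: str, known: list[str] = KNOWN_TOOLS) -> str:
    """Return the tool name if known, otherwise fail with a suggestion."""
    if name in known:
        return name
    # difflib is a stdlib stand-in here; the PR itself uses TheFuzz.
    suggestions = get_close_matches(name, known, n=1, cutoff=0.6)
    hint = f" Did you mean '{suggestions[0]}'?" if suggestions else ""
    raise ValueError(f"Unknown tool '{name}'.{hint}")


try:
    resolve_tool("mimc")
except ValueError as error:
    message = str(error)
```

With a typo such as mimc, the error message carries the closest known name; an input that resembles nothing simply gets the plain "Unknown tool" error.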

Caution

While the refactoring is complete, docstrings are minimal (one-liner per class only) pending (1) the final PR state post-reviews, to avoid wasting time on such writing, and (2) the repo org migration + ReadTheDocs setup (#40). Bear with me, very happy to fully docstring once ready! :)

❝ Stop Waffling & Show Some Before/After

m3 --help:

1-help.mp4

Various m3 utilities CLI commands:

2-utilities.mp4

UI enhancements when downloading datasets via m3 run mimic init

3-mimic_init.mp4

m3 run config claude:

4-mimic_claude.mp4

Leverage presets (basically doing the m3 run config claude from above in one line too):

5-presets.mp4

Build the first M3 Pipeline w/ two tools in play for Claude 🎉

I've since removed the tool, as it was non-useful / out of scope and only there for the sake of the example :)

6-multi_tools_pipeline.mp4

Hope it helps,

Cheeeeers!

@simonprovost
Collaborator Author

simonprovost commented Jul 18, 2025

cc-ing @rafiattrach @rajna-fani @MoreiraP12.

So sorry for the long, long read, guys; this was somewhat necessary as we do not meet online, and to justify the +6,736 / −3,689. If you have any questions, fire away in the comments, even prior to reviewing 🫡

@simonprovost simonprovost force-pushed the refactor/toolbox_based_architecture branch 2 times, most recently from 7736b74 to ce7fe71 Compare July 18, 2025 20:44
@simonprovost simonprovost self-assigned this Jul 18, 2025
@simonprovost simonprovost added the enhancement New feature or request label Jul 18, 2025
@simonprovost simonprovost requested a review from rafiattrach July 18, 2025 21:36
@simonprovost simonprovost force-pushed the refactor/toolbox_based_architecture branch from ee94d58 to 83e7d37 Compare August 1, 2025 22:45
@simonprovost
Collaborator Author

Now externalised in https://github.com/MCP-Pipeline/MCPStack via the initiative @ https://github.com/MCP-Pipeline. Thanks for the help!
