Conversation

@htahir1 htahir1 commented Nov 20, 2025

Describe changes

Fixes the MLflow experiment tracker crashing with RESOURCE_DOES_NOT_EXIST error when attempting to resume runs on Azure ML.

Changes

  • Added validation to check if cached run_id exists on MLflow server before resuming
  • If run doesn't exist, gracefully creates a new run instead of crashing
  • Added warning log to inform users when creating a new run due to stale cache
  • Added test coverage for the missing run scenario

Why

The issue occurred when ZenML's cached run_id was out of sync with the MLflow server state (common in distributed/cloud environments like Azure ML). Previously this caused a hard failure. Now it's handled gracefully.
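The fallback pattern described above can be sketched as follows. This is a simplified, hypothetical illustration, not the actual `prepare_step_run()` implementation: `StubMlflowException` stands in for `mlflow.exceptions.MlflowException` and `get_run` stands in for `mlflow.get_run`, so the sketch runs without mlflow installed.

```python
import logging

logger = logging.getLogger(__name__)


class StubMlflowException(Exception):
    """Stand-in for mlflow.exceptions.MlflowException in this sketch."""


def resolve_run_id(cached_run_id, get_run):
    """Return a usable run_id, or None when the cached one is stale.

    `get_run` stands in for mlflow.get_run; it raises when the run
    does not exist on the tracking server.
    """
    if cached_run_id:
        try:
            get_run(cached_run_id)  # validate against the server
        except StubMlflowException as e:
            logger.warning(
                "Run with id %s not found in MLflow tracking server. "
                "Creating a new run instead. Error: %s",
                cached_run_id,
                e,
            )
            return None  # caller falls back to creating a new run
    return cached_run_id
```

Returning `None` here mirrors the fix: downstream code that sees no run id starts a fresh run rather than crashing on the stale one.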

Testing

  • Added unit test that verifies graceful handling of missing runs
  • Existing tests continue to pass
  • Fix preserves backward compatibility

Fixes #4207

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop, read the contribution guide on rebasing your branch to develop.
  • IMPORTANT: I made sure that my changes are reflected properly in the following resources:
    • ZenML Docs
    • Dashboard: Needs to be communicated to the frontend team.
    • Templates: Might need adjustments (that are not reflected in the template tests) in case of non-breaking changes and deprecations.
    • Projects: Depending on the version dependencies, different projects might get affected.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

@github-actions github-actions bot added the internal To filter out internal PRs and issues label Nov 20, 2025
@htahir1 htahir1 added run-slow-ci Tag that is used to trigger the slow-ci and removed internal To filter out internal PRs and issues labels Nov 20, 2025
Previously, the MLflow experiment tracker would crash with a
RESOURCE_DOES_NOT_EXIST error when attempting to resume a run
that existed in ZenML's cache but not on the MLflow server.
This was particularly problematic with Azure ML deployments.

The fix validates that a cached run_id actually exists on the
MLflow server before attempting to resume it. If validation fails,
the code gracefully creates a new run instead of crashing.

Changes:
- Added MlflowException import
- Added run existence validation in prepare_step_run()
- Log warning when creating new run instead of resuming stale one
- Added test to verify graceful handling of missing runs

Fixes #4207
@htahir1 htahir1 force-pushed the claude/zenml-issue-4207-01MAvdNLZMe5H4rgYv7GBrcu branch from 7335d52 to 28899d3 on November 20, 2025 at 09:27
@github-actions

Documentation Link Check Results

Absolute links check failed
There are broken absolute links in the documentation. See the workflow logs for details.
Relative links check passed
Last checked: 2025-11-20 09:29:38 UTC

@htahir1 htahir1 mentioned this pull request Nov 20, 2025
@htahir1

htahir1 commented Nov 21, 2025

@claude full-review

@claude

claude bot commented Nov 21, 2025

Claude finished @htahir1's task


Full PR Review

Tasks

  • Read and analyze changed files
  • Review implementation for correctness and best practices
  • Review test coverage
  • Provide comprehensive feedback

Summary

This PR effectively fixes the MLflow experiment tracker crash when attempting to resume non-existent runs. The implementation is solid, the test coverage is appropriate, and the fix follows ZenML coding standards. I have a few minor suggestions for improvement below.

Detailed Review

Implementation (src/zenml/integrations/mlflow/experiment_trackers/mlflow_experiment_tracker.py:199-209)

Positives:

  • ✅ The fix correctly addresses the root cause by validating run existence before resuming
  • ✅ Graceful degradation - creates a new run instead of crashing
  • ✅ Proper exception handling with specific MlflowException catch
  • ✅ Informative warning message helps with debugging
  • ✅ Minimal change - doesn't affect other code paths

Suggestions:

  1. Error message verbosity (lines 205-207): The warning message includes the full exception {e}, which is useful for debugging, but consider whether it might expose sensitive information in logs (tracking URIs, etc.). If MLflow exceptions can contain sensitive data, you may want to be more selective about what you log.

  2. Exception specificity: The code catches all MlflowException instances. While this works, you could be more specific by checking the error code:

    except MlflowException as e:
        if "RESOURCE_DOES_NOT_EXIST" in str(e):
            logger.warning(...)
            run_id = None
        else:
            raise  # Re-raise unexpected MLflow errors

    This would make the fix more targeted and not hide other MLflow issues.

  3. Comment clarity (line 199): The comment "Validate that the run exists before attempting to resume it" is accurate but could explain why this validation is needed (i.e., cached run_id may be stale in distributed environments).
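The more targeted exception handling suggested in point 2 above can be sketched as runnable code. This is a hypothetical illustration: `MlflowError` and `validate_run` are stand-ins (not real mlflow or ZenML names), and the string-matching check assumes the server's error code appears in the exception message, as the suggestion does.

```python
class MlflowError(Exception):
    """Stand-in for mlflow.exceptions.MlflowException in this sketch."""


def validate_run(run_id, get_run):
    """Return run_id if it exists on the server, or None if the cached
    id is stale; re-raise any other MLflow error instead of hiding it."""
    try:
        get_run(run_id)
        return run_id
    except MlflowError as e:
        if "RESOURCE_DOES_NOT_EXIST" in str(e):
            return None  # stale cache: fall back to creating a new run
        raise  # unexpected MLflow error: surface it to the caller
```

The re-raise branch is the point of the suggestion: connectivity or permission errors still propagate instead of silently triggering a new run.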

Test Coverage (tests/integration/integrations/mlflow/experiment_trackers/test_mlflow_experiment_tracker.py:252-316)

Positives:

  • ✅ Test directly addresses the issue described in #4207 (MLflow is broken on AzureML)
  • ✅ Comprehensive mocking of MLflow dependencies
  • ✅ Clear test documentation explaining the scenario
  • ✅ Proper assertion verifying that run_id=None when creating new run
  • ✅ Good test name: test_mlflow_experiment_tracker_handles_missing_run

Suggestions:

  1. Test edge cases: Consider adding a test for the happy path where the run does exist and should be resumed successfully. This would demonstrate that the fix doesn't break the existing resume functionality.

  2. Mock location (lines 252-255): The test patches mlflow.start_run, mlflow.get_run, etc. at the module level. Consider patching them where they're used (e.g., "zenml.integrations.mlflow.experiment_trackers.mlflow_experiment_tracker.mlflow.get_run") for more precise control, though the current approach is acceptable for integration tests.

  3. Test assertion (lines 312-314): The assertion message is helpful, but you could also verify that the warning was logged using caplog or similar pytest fixtures to ensure users are properly notified.
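The log-capture check in suggestion 3 could look roughly like this. It is a self-contained sketch using unittest's assertLogs rather than pytest's caplog, with stand-in names (`FakeMlflowException`, `prepare_run`, the logger name) since the actual test module isn't shown here.

```python
import logging
import unittest
from unittest import mock


class FakeMlflowException(Exception):
    """Stand-in for MlflowException so the sketch runs without mlflow."""


logger = logging.getLogger("mlflow_tracker_sketch")


def prepare_run(run_id, get_run):
    """Simplified version of the validation added in this PR."""
    if run_id:
        try:
            get_run(run_id)
        except FakeMlflowException as e:
            logger.warning(
                "Run with id %s not found in MLflow tracking server. "
                "Creating a new run instead. Error: %s",
                run_id,
                e,
            )
            return None
    return run_id


class TestMissingRunWarning(unittest.TestCase):
    def test_warning_is_logged(self):
        get_run = mock.Mock(
            side_effect=FakeMlflowException("RESOURCE_DOES_NOT_EXIST")
        )
        # assertLogs fails the test if no WARNING is emitted at all
        with self.assertLogs("mlflow_tracker_sketch", level="WARNING") as captured:
            self.assertIsNone(prepare_run("stale-id", get_run))
        self.assertIn("not found", captured.output[0])
```

With pytest, the equivalent is the caplog fixture plus an assertion on caplog.records; the key point is the same: assert the user-facing warning, not just the return value.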

Code Style & Standards

Alignment with ZenML Guidelines:

  • ✅ Follows Google Python style for docstrings
  • ✅ Type hints are preserved
  • ✅ Import properly added and alphabetically sorted
  • ✅ Error handling follows early-return guard clause pattern
  • ✅ Logging is appropriately used
  • ⚠️ Comment could be more explanatory per CLAUDE.md guidelines (explain why, not just what)

Security & Performance

  • ✅ No security concerns introduced
  • ✅ Performance impact is negligible (one extra MLflow API call only when resuming)
  • ✅ No SQL injection, XSS, or other OWASP vulnerabilities
  • ✅ Backward compatible - doesn't break existing functionality

Documentation

  • The PR description is excellent and explains the issue well
  • No user-facing documentation changes needed (this is internal error handling)
  • Commit message is descriptive and follows conventions

CI Failure Note

The CI check shows a broken absolute links issue in documentation. This appears to be unrelated to your changes since you only modified Python code. However, you'll want to verify this isn't blocking the PR merge.

Recommendation

APPROVE with minor suggestions for improvement. The fix is correct, well-tested, and solves the reported issue. The suggestions above are optional enhancements, not blockers.


@htahir1 htahir1 added the staging-workspace Label to use for the staging-workspace workflow. label Nov 21, 2025
@github-actions

✅ Branch tenant has been deployed! Access it at: https://staging.cloud.zenml.io/workspaces/claude-zenml-issue-4207-01mavdnlzme5h4rgyv7gbrcu/projects

@bcdurak bcdurak linked an issue Nov 24, 2025 that may be closed by this pull request
@htahir1 htahir1 requested a review from stefannica November 24, 2025 08:16

@stefannica stefannica left a comment


I don't understand how this fixes the issue because the run ID is fetched from MLflow itself before being used (resumed).

Comment on lines +199 to +210
    # Validate that the run exists before attempting to resume it
    if run_id:
        try:
            mlflow.get_run(run_id)
        except MlflowException as e:
            # Run doesn't exist on the MLflow server, create a new one
            logger.warning(
                f"Run with id {run_id} not found in MLflow tracking server. "
                f"Creating a new run instead. Error: {e}"
            )
            run_id = None

@stefannica stefannica Nov 24, 2025


I don't expect this MlflowException to ever be raised given that the run is fetched from MLflow itself above with the get_run_id call. If the run does indeed not exist, then the get_run_id implementation is broken.
