Conversation

@htahir1 htahir1 commented Nov 20, 2025

Describe changes

Fixes the MLflow experiment tracker crashing with RESOURCE_DOES_NOT_EXIST error when attempting to resume runs on Azure ML.

Changes

  • Added validation to check if cached run_id exists on MLflow server before resuming
  • If run doesn't exist, gracefully creates a new run instead of crashing
  • Added warning log to inform users when creating a new run due to stale cache
  • Added test coverage for the missing run scenario

Why

The issue occurred when ZenML's cached run_id was out of sync with the MLflow server state (common in distributed/cloud environments like Azure ML). Previously this caused a hard failure. Now it's handled gracefully.
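The fallback pattern described above can be sketched as follows. This is a simplified, hypothetical illustration, not the actual `prepare_step_run()` implementation: `StubMlflowException` stands in for `mlflow.exceptions.MlflowException` and `get_run` stands in for `mlflow.get_run`, so the sketch runs without mlflow installed.

```python
import logging

logger = logging.getLogger(__name__)


class StubMlflowException(Exception):
    """Stand-in for mlflow.exceptions.MlflowException in this sketch."""


def resolve_run_id(cached_run_id, get_run):
    """Return a usable run_id, or None when the cached one is stale.

    `get_run` stands in for mlflow.get_run; it raises when the run
    does not exist on the tracking server.
    """
    if cached_run_id:
        try:
            get_run(cached_run_id)  # validate against the server
        except StubMlflowException as e:
            logger.warning(
                "Run with id %s not found in MLflow tracking server. "
                "Creating a new run instead. Error: %s",
                cached_run_id,
                e,
            )
            return None  # caller falls back to creating a new run
    return cached_run_id
```

Returning `None` here mirrors the fix: downstream code that sees no run id starts a fresh run rather than crashing on the stale one.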

Testing

  • Added unit test that verifies graceful handling of missing runs
  • Existing tests continue to pass
  • Fix preserves backward compatibility

Fixes #4207

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop, read the contribution guide on rebasing your branch to develop.
  • IMPORTANT: I made sure that my changes are reflected properly in the following resources:
    • ZenML Docs
    • Dashboard: Needs to be communicated to the frontend team.
    • Templates: Might need adjustments (that are not reflected in the template tests) in case of non-breaking changes and deprecations.
    • Projects: Depending on the version dependencies, different projects might get affected.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

@github-actions github-actions bot added the internal To filter out internal PRs and issues label Nov 20, 2025
@htahir1 htahir1 added run-slow-ci Tag that is used to trigger the slow-ci and removed internal To filter out internal PRs and issues labels Nov 20, 2025
Previously, the MLflow experiment tracker would crash with a
RESOURCE_DOES_NOT_EXIST error when attempting to resume a run
that existed in ZenML's cache but not on the MLflow server.
This was particularly problematic with Azure ML deployments.

The fix validates that a cached run_id actually exists on the
MLflow server before attempting to resume it. If validation fails,
the code gracefully creates a new run instead of crashing.

Changes:
- Added MlflowException import
- Added run existence validation in prepare_step_run()
- Log warning when creating new run instead of resuming stale one
- Added test to verify graceful handling of missing runs

Fixes #4207
@htahir1 htahir1 force-pushed the claude/zenml-issue-4207-01MAvdNLZMe5H4rgYv7GBrcu branch from 7335d52 to 28899d3 on November 20, 2025 at 09:27
@github-actions

Documentation Link Check Results

Absolute links check failed
There are broken absolute links in the documentation. See the workflow logs for details.
Relative links check passed
Last checked: 2025-11-20 09:29:38 UTC

@htahir1 htahir1 mentioned this pull request Nov 20, 2025
@htahir1

htahir1 commented Nov 21, 2025

@claude full-review

@claude

claude bot commented Nov 21, 2025

Claude finished @htahir1's task


Full PR Review

Tasks

  • Read and analyze changed files
  • Review implementation for correctness and best practices
  • Review test coverage
  • Provide comprehensive feedback

Summary

This PR effectively fixes the MLflow experiment tracker crash when attempting to resume non-existent runs. The implementation is solid, the test coverage is appropriate, and the fix follows ZenML coding standards. I have a few minor suggestions for improvement below.

Detailed Review

Implementation (src/zenml/integrations/mlflow/experiment_trackers/mlflow_experiment_tracker.py:199-209)

Positives:

  • ✅ The fix correctly addresses the root cause by validating run existence before resuming
  • ✅ Graceful degradation - creates a new run instead of crashing
  • ✅ Proper exception handling with specific MlflowException catch
  • ✅ Informative warning message helps with debugging
  • ✅ Minimal change - doesn't affect other code paths

Suggestions:

  1. Error message verbosity (lines 205-207): The warning message includes the full exception {e}, which is useful for debugging, but consider whether it might expose sensitive information in logs (tracking URIs, etc.). If MLflow exceptions can contain sensitive data, you may want to be more selective about what you log.

  2. Exception specificity: The code catches all MlflowException instances. While this works, you could be more specific by checking the error code:

    except MlflowException as e:
        if "RESOURCE_DOES_NOT_EXIST" in str(e):
            logger.warning(...)
            run_id = None
        else:
            raise  # Re-raise unexpected MLflow errors

    This would make the fix more targeted and not hide other MLflow issues.

  3. Comment clarity (line 199): The comment "Validate that the run exists before attempting to resume it" is accurate but could explain why this validation is needed (i.e., cached run_id may be stale in distributed environments).
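The more targeted exception handling suggested in point 2 above can be sketched as runnable code. This is a hypothetical illustration: `MlflowError` and `validate_run` are stand-ins (not real mlflow or ZenML names), and the string-matching check assumes the server's error code appears in the exception message, as the suggestion does.

```python
class MlflowError(Exception):
    """Stand-in for mlflow.exceptions.MlflowException in this sketch."""


def validate_run(run_id, get_run):
    """Return run_id if it exists on the server, or None if the cached
    id is stale; re-raise any other MLflow error instead of hiding it."""
    try:
        get_run(run_id)
        return run_id
    except MlflowError as e:
        if "RESOURCE_DOES_NOT_EXIST" in str(e):
            return None  # stale cache: fall back to creating a new run
        raise  # unexpected MLflow error: surface it to the caller
```

The re-raise branch is the point of the suggestion: connectivity or permission errors still propagate instead of silently triggering a new run.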

Test Coverage (tests/integration/integrations/mlflow/experiment_trackers/test_mlflow_experiment_tracker.py:252-316)

Positives:

  • ✅ Test directly addresses the issue described in #4207 (MLflow is broken on AzureML)
  • ✅ Comprehensive mocking of MLflow dependencies
  • ✅ Clear test documentation explaining the scenario
  • ✅ Proper assertion verifying that run_id=None when creating new run
  • ✅ Good test name: test_mlflow_experiment_tracker_handles_missing_run

Suggestions:

  1. Test edge cases: Consider adding a test for the happy path where the run does exist and should be resumed successfully. This would demonstrate that the fix doesn't break the existing resume functionality.

  2. Mock location (lines 252-255): The test patches mlflow.start_run, mlflow.get_run, etc. at the module level. Consider patching them where they're used (e.g., "zenml.integrations.mlflow.experiment_trackers.mlflow_experiment_tracker.mlflow.get_run") for more precise control, though the current approach is acceptable for integration tests.

  3. Test assertion (lines 312-314): The assertion message is helpful, but you could also verify that the warning was logged using caplog or similar pytest fixtures to ensure users are properly notified.
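The log-capture check in suggestion 3 could look roughly like this. It is a self-contained sketch using unittest's assertLogs rather than pytest's caplog, with stand-in names (`FakeMlflowException`, `prepare_run`, the logger name) since the actual test module isn't shown here.

```python
import logging
import unittest
from unittest import mock


class FakeMlflowException(Exception):
    """Stand-in for MlflowException so the sketch runs without mlflow."""


logger = logging.getLogger("mlflow_tracker_sketch")


def prepare_run(run_id, get_run):
    """Simplified version of the validation added in this PR."""
    if run_id:
        try:
            get_run(run_id)
        except FakeMlflowException as e:
            logger.warning(
                "Run with id %s not found in MLflow tracking server. "
                "Creating a new run instead. Error: %s",
                run_id,
                e,
            )
            return None
    return run_id


class TestMissingRunWarning(unittest.TestCase):
    def test_warning_is_logged(self):
        get_run = mock.Mock(
            side_effect=FakeMlflowException("RESOURCE_DOES_NOT_EXIST")
        )
        # assertLogs fails the test if no WARNING is emitted at all
        with self.assertLogs("mlflow_tracker_sketch", level="WARNING") as captured:
            self.assertIsNone(prepare_run("stale-id", get_run))
        self.assertIn("not found", captured.output[0])
```

With pytest, the equivalent is the caplog fixture plus an assertion on caplog.records; the key point is the same: assert the user-facing warning, not just the return value.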

Code Style & Standards

Alignment with ZenML Guidelines:

  • ✅ Follows Google Python style for docstrings
  • ✅ Type hints are preserved
  • ✅ Import properly added and alphabetically sorted
  • ✅ Error handling follows early-return guard clause pattern
  • ✅ Logging is appropriately used
  • ⚠️ Comment could be more explanatory per CLAUDE.md guidelines (explain why, not just what)

Security & Performance

  • ✅ No security concerns introduced
  • ✅ Performance impact is negligible (one extra MLflow API call only when resuming)
  • ✅ No SQL injection, XSS, or other OWASP vulnerabilities
  • ✅ Backward compatible - doesn't break existing functionality

Documentation

  • The PR description is excellent and explains the issue well
  • No user-facing documentation changes needed (this is internal error handling)
  • Commit message is descriptive and follows conventions

CI Failure Note

The CI check shows a broken absolute links issue in documentation. This appears to be unrelated to your changes since you only modified Python code. However, you'll want to verify this isn't blocking the PR merge.

Recommendation

APPROVE with minor suggestions for improvement. The fix is correct, well-tested, and solves the reported issue. The suggestions above are optional enhancements, not blockers.


@htahir1 htahir1 added the staging-workspace Label to use for the staging-workspace workflow. label Nov 21, 2025
@github-actions

✅ Branch tenant has been deployed! Access it at: https://staging.cloud.zenml.io/workspaces/claude-zenml-issue-4207-01mavdnlzme5h4rgyv7gbrcu/projects

@bcdurak bcdurak linked an issue Nov 24, 2025 that may be closed by this pull request
@htahir1 htahir1 requested a review from stefannica November 24, 2025 08:16

@stefannica stefannica left a comment


I don't understand how this fixes the issue because the run ID is fetched from MLflow itself before being used (resumed).

Comment on lines +199 to +210
    # Validate that the run exists before attempting to resume it
    if run_id:
        try:
            mlflow.get_run(run_id)
        except MlflowException as e:
            # Run doesn't exist on the MLflow server, create a new one
            logger.warning(
                f"Run with id {run_id} not found in MLflow tracking server. "
                f"Creating a new run instead. Error: {e}"
            )
            run_id = None

@stefannica stefannica Nov 24, 2025


I don't expect this MlflowException to ever be raised given that the run is fetched from MLflow itself above with the get_run_id call. If the run does indeed not exist, then the get_run_id implementation is broken.
