Fix MLflow experiment tracker crash with non-existent runs #4227
base: develop
Previously, the MLflow experiment tracker would crash with a RESOURCE_DOES_NOT_EXIST error when attempting to resume a run that existed in ZenML's cache but not on the MLflow server. This was particularly problematic with Azure ML deployments.

The fix validates that a cached run_id actually exists on the MLflow server before attempting to resume it. If validation fails, the code gracefully creates a new run instead of crashing.

Changes:
- Added MlflowException import
- Added run existence validation in prepare_step_run()
- Log warning when creating new run instead of resuming stale one
- Added test to verify graceful handling of missing runs

Fixes #4207
7335d52 to 28899d3
Documentation Link Check Results
❌ Absolute links check failed

@claude full-review
Claude finished @htahir1's task

Full PR Review

Tasks

Summary
This PR effectively fixes the MLflow experiment tracker crash when attempting to resume non-existent runs. The implementation is solid, the test coverage is appropriate, and the fix follows ZenML coding standards. I have a few minor suggestions for improvement below.

Detailed Review

Implementation (src/zenml/integrations/mlflow/experiment_trackers/mlflow_experiment_tracker.py:199-209)
Positives:
Suggestions:

Test Coverage (tests/integration/integrations/mlflow/experiment_trackers/test_mlflow_experiment_tracker.py:252-316)
Positives:
Suggestions:

Code Style & Standards
Alignment with ZenML Guidelines:

Security & Performance

Documentation

CI Failure Note
The CI check shows a broken absolute links issue in documentation. This appears to be unrelated to your changes since you only modified Python code. However, you'll want to verify this isn't blocking the PR merge.

Recommendation
APPROVE with minor suggestions for improvement. The fix is correct, well-tested, and solves the reported issue. The suggestions above are optional enhancements, not blockers.
✅ Branch tenant has been deployed! Access it at: https://staging.cloud.zenml.io/workspaces/claude-zenml-issue-4207-01mavdnlzme5h4rgyv7gbrcu/projects
stefannica
left a comment
I don't understand how this fixes the issue because the run ID is fetched from MLflow itself before being used (resumed).
    # Validate that the run exists before attempting to resume it
    if run_id:
        try:
            mlflow.get_run(run_id)
        except MlflowException as e:
            # Run doesn't exist on the MLflow server, create a new one
            logger.warning(
                f"Run with id {run_id} not found in MLflow tracking server. "
                f"Creating a new run instead. Error: {e}"
            )
            run_id = None
I don't expect this MlflowException to ever be raised given that the run is fetched from MLflow itself above with the get_run_id call. If the run does indeed not exist, then the get_run_id implementation is broken.
Describe changes
Fixes the MLflow experiment tracker crashing with RESOURCE_DOES_NOT_EXIST error when attempting to resume runs on Azure ML.
Changes
Why
The issue occurred when ZenML's cached run_id was out of sync with the MLflow server state (common in distributed/cloud environments like Azure ML). Previously this caused a hard failure. Now it's handled gracefully.
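The fallback described above can be sketched in isolation, without an MLflow server. This is only an illustration of the pattern, not the code in the PR: `resolve_run_id`, `RunLookupError`, and `fake_get_run` are hypothetical names. In the actual change, the lookup is `mlflow.get_run` and the exception is `MlflowException`, as shown in the diff.

```python
import logging
from typing import Callable, FrozenSet, Optional

logger = logging.getLogger(__name__)


class RunLookupError(Exception):
    """Hypothetical stand-in for mlflow.exceptions.MlflowException."""


def resolve_run_id(
    run_id: Optional[str],
    get_run: Callable[[str], object],
) -> Optional[str]:
    """Return run_id if the lookup succeeds, otherwise None.

    A None result signals the caller to start a new run instead of
    resuming one. `get_run` is injected so the sketch runs without a
    tracking server; it is assumed to raise RunLookupError for
    unknown ids.
    """
    if not run_id:
        return None
    try:
        get_run(run_id)
    except RunLookupError as e:
        # The cached id is stale: warn and fall back to a new run.
        logger.warning(
            f"Run with id {run_id} not found in tracking server. "
            f"Creating a new run instead. Error: {e}"
        )
        return None
    return run_id


def fake_get_run(
    run_id: str, known: FrozenSet[str] = frozenset({"abc123"})
) -> str:
    """Hypothetical server lookup that only knows the ids in `known`."""
    if run_id not in known:
        raise RunLookupError(f"RESOURCE_DOES_NOT_EXIST: {run_id}")
    return run_id


# A known id is kept; a stale id is dropped so a new run gets created.
kept = resolve_run_id("abc123", fake_get_run)
dropped = resolve_run_id("stale-id", fake_get_run)
```

Injecting the lookup keeps the sketch testable offline. It does not settle whether the stale id can actually reach this point in ZenML, which is the question raised in the review comments.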
Testing
Fixes #4207
Pre-requisites
Please ensure you have done the following:
develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.
Types of changes