Testing for Data Sanitization and Dataset Manager #16
Sudhendra merged 19 commits into Sudhendra:main
@Gautam-Galada address the following comments:
Response to Review Comments
All comments have been addressed. Changes are organized by file:
Pull request overview
Copilot reviewed 9 out of 10 changed files in this pull request and generated 10 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Fix NameError on malformed JSON lines by initializing stats/unsanitized_data before the JSONL loading loop instead of after it
- Move dagshub/mlflow from core dependencies to optional [mlflow] extra
- Remove stale RULE C FIX docstring, duplicate import re, and unused Tuple import; use builtin tuple[] instead of typing.Tuple
- Add test for malformed JSON line handling in sanitize_and_extract
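For readers following the NameError fix, here is a minimal sketch of the pattern, not the PR's actual code. The function name `sanitize_and_extract_sketch`, the `stats` keys, and the test name are hypothetical. Only the idea comes from the commit message: the accumulators are created before the JSONL loop, so the error-handling branch can safely reference them.

```python
import json


def sanitize_and_extract_sketch(lines):
    """Hypothetical stand-in for sanitize_and_extract (illustration only)."""
    # Initialize the accumulators BEFORE the loop. If they were created
    # after it, the except branch below would raise NameError on the
    # first malformed line.
    stats = {"total": 0, "malformed": 0}  # assumed keys, not from the PR
    unsanitized_data = []

    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        stats["total"] += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Count the bad line and keep going instead of crashing.
            stats["malformed"] += 1
            continue
        unsanitized_data.append(record)

    return unsanitized_data, stats


def test_malformed_json_line_is_skipped():
    # pytest-style test: one malformed line between two valid ones.
    lines = ['{"id": 1}', '{not valid json', '{"id": 2}']
    data, stats = sanitize_and_extract_sketch(lines)
    assert data == [{"id": 1}, {"id": 2}]
    assert stats == {"total": 3, "malformed": 1}
```

The test mirrors the last bullet: a malformed line should be counted and skipped without raising, and the valid records around it should still be returned.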
fix: resolve 3 remaining issues from PR #16 review
PR Title
Add sanitization tests, dataset manager coverage, and MLflow logging utility
Description
This PR introduces comprehensive test coverage for sanitization validation rules and dataset switching functionality, and adds a post-training MLflow logging utility.
Testing Improvements
MLflow Logging Utility
- scripts/mlflow_logger.py for post-training logging of existing runs
- dagshub and mlflow as runtime dependencies
Testing