
Testing for Data Sanitization and Dataset Manager #16

Merged
Sudhendra merged 19 commits into Sudhendra:main from Gautam-Galada:phase-6-loss_function
Feb 11, 2026

Conversation


@Gautam-Galada (Contributor) commented Feb 4, 2026

PR Title

Add sanitization tests, dataset manager coverage, and MLflow logging utility


Description

This PR introduces comprehensive test coverage for sanitization validation rules and dataset switching functionality, and adds a post-training MLflow logging utility.

Testing Improvements

  • 25 test cases covering sanitization rules A–D (ratio, symbol validation, negation preservation, semantic usage)
  • Tests for dataset manager state management and file operations
  • Zero-division safety checks for compression ratio calculations
  • Graceful handling of missing optional modules via pytest skip markers
  • Ruff-compliant test code (E731, E712)

MLflow Logging Utility

  • Adds scripts/mlflow_logger.py for post-training logging of existing runs
  • Logs parameters, metrics, artifacts, and loss curves to MLflow/DagsHub
  • Introduces dagshub and mlflow as runtime dependencies

Testing

pytest tests/test_sanitization_manager.py -v


@Sudhendra (Owner) commented Feb 6, 2026

@Gautam-Galada address the following comments:

  • scripts/dataset_manager.py: The paths are hardcoded to data/training/train.jsonl, but the training config and docs point at data/validated/train.jsonl. Please either wire this to the configured training path (preferred) or make the paths configurable via CLI args so the swap actually affects training runs.
  • scripts/data_sanitization.py rule_b_orphaned_symbols: This rejects any compression starting with @, which will drop valid code samples that begin with decorators (and contradicts your own expectation of high @ usage in code). Please make Rule B code-aware (allow leading @ for code) or skip the “symbol at start” check for code samples.
  • scripts/data_sanitization.py extract_verbose_compressed: If this script is intentionally scoped only to data/training/train.jsonl, please encode that assumption explicitly (docstring + validation). Add a guard that verifies the expected chat-message shape (e.g., messages with a user entry containing Compress:) and fails loudly or logs a dedicated parse error when the format doesn’t match, plus a small test fixture that reflects the actual train.jsonl structure to prevent silent drops.
  • scripts/data_sanitization.py compute_compression_ratio: Our evaluation pipeline uses token-based ratios (src/utils/tokenizers.py compression_ratio), but this script uses word counts and inverts the ratio (verbose/compressed). Please switch to token counts and align ratio direction with the shared helper; for Rule A, just check compressed_tokens <= verbose_tokens (or use compression_ratio <= 1.0) to match the rest of the codebase.
  • scripts/dataset_manager.py verify_files_exist: The error message tells users to run sanitize_training_data_v2.py, but the file is scripts/data_sanitization.py. Please update the message to the correct script name.
  • tests/test_sanitization_manager.py: There’s no coverage of the main flow (sanitize_dataset) or dataset swapping (update_to_sanitized/revert_to_original). Please add tests for those critical paths, including state/log updates and backup behavior.
  • scripts/data_sanitization.py is_code_sample: The {...} regex is broad and can misclassify NL (e.g., brace-heavy text or JSON fragments) as code. Please tighten this heuristic or require multiple signals before classifying as code.
  • scripts/data_sanitization.py sanitize_dataset: Missing verbose/compressed pairs are counted as Rule A failures, which conflates parse/format errors with quality failures. Please track these as a separate “parse errors” bucket in stats and reporting.
  • Optional but recommended: Add CLI args for --input/--output in scripts/data_sanitization.py and --train-path in scripts/dataset_manager.py so these tools can be used with non-default dataset layouts.
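To make the requested ratio alignment concrete, here is a minimal sketch of a token-based Rule A check in the direction the shared helper uses (compressed / verbose). The `count_tokens` parameter is a stand-in for the project's tokenizer in `src/utils/tokenizers.py`, and the whitespace tokenizer at the end is illustration only, not how the real helper counts tokens:

```python
def compression_ratio(compressed: str, verbose: str, count_tokens) -> float:
    """Token-based ratio, compressed / verbose, matching the shared helper's direction.

    `count_tokens` is a hypothetical stand-in for the project's tokenizer.
    """
    verbose_tokens = count_tokens(verbose)
    if verbose_tokens == 0:
        return 0.0  # zero-division guard, mirroring the PR's safety check
    return count_tokens(compressed) / verbose_tokens


def rule_a_passes(compressed: str, verbose: str, count_tokens) -> bool:
    # Rule A: the compressed text must not exceed the verbose text in tokens
    return compression_ratio(compressed, verbose, count_tokens) <= 1.0


def toy_tokenizer(text: str) -> int:
    # Whitespace splitting stands in for real tokenization; illustration only
    return len(text.split())
```

A real call would pass the token counter behind `src/utils/tokenizers.py` instead of `toy_tokenizer`.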

@Gautam-Galada (Contributor, Author) replied:

Response to Review Comments

All comments have been addressed. Changes organized by file:


scripts/dataset_manager.py

Hardcoded paths → Added CLI args with dynamic config

  • --train-file and --sanitized arguments (lines 245-259)
  • get_config() builds paths from args or defaults (lines 29-47)
  • Backup/state files derived from train file location
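A sketch of how such a `get_config()` can build every path from the CLI args, so backup and state files follow a custom `--train-file`. Flag names come from the PR description; the default sanitized path and derived file names are assumptions for illustration:

```python
import argparse
from pathlib import Path


def get_config(argv=None) -> dict:
    """Build dataset paths from CLI args; backup/state derive from the train file."""
    parser = argparse.ArgumentParser(description="Swap training datasets safely")
    parser.add_argument("--train-file", type=Path,
                        default=Path("data/training/train.jsonl"))
    parser.add_argument("--sanitized", type=Path,
                        default=Path("data/training/train_sanitized.jsonl"))
    args = parser.parse_args(argv)
    return {
        "train": args.train_file,
        "sanitized": args.sanitized,
        # Derived from the train file so non-default layouts stay self-contained
        "backup": args.train_file.with_suffix(".jsonl.bak"),
        "state": args.train_file.parent / "dataset_state.json",
    }
```

With `--train-file data/validated/train.jsonl`, the backup and state files land next to the configured train file rather than a hardcoded location.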

Error message fix → Removed incorrect script reference

  • Generic helpful message instead (line 125)

scripts/data_sanitization.py

Rule B code-aware → Allows @ for decorators

  • rule_b_orphaned_symbols() accepts is_code parameter (line 310)
  • Explicit exemption for leading @ in code samples (lines 318-322)
  • Test coverage: test_allows_decorator_for_code()
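The decorator exemption can be sketched as below. The symbol handling is illustrative, not the script's exact rule set; only the `is_code` escape hatch for a leading `@` reflects the described fix:

```python
import re


def rule_b_orphaned_symbols(compressed: str, is_code: bool = False) -> bool:
    """Return True if the compression passes the orphaned-symbol check.

    Code samples are exempt from the leading-@ rejection so decorator
    lines (e.g. `@staticmethod`) are not dropped.
    """
    stripped = compressed.lstrip()
    if not stripped:
        return False
    if stripped[0] == "@":
        # Leading @ is legitimate decorator syntax in code samples
        return is_code
    # For NL, any other leading non-word symbol is treated as orphaned here
    return not re.match(r"[^\w]", stripped[0])
```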

Format validation → Explicit scope + guards

  • Docstring documents expected chat-message format (lines 169-184)
  • Validates dict structure, messages key, role presence (lines 186-237)
  • Descriptive error messages with sample IDs for debugging
  • Test fixture: test_extracts_correctly() validates actual format
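A minimal sketch of the loud-failure guard, assuming the chat-message shape the review describes (a `messages` list with a user entry containing `Compress:` and an assistant reply); the exact marker handling and error wording are illustrative:

```python
def extract_verbose_compressed(sample: dict, sample_id: int) -> tuple[str, str]:
    """Extract the (verbose, compressed) pair from one chat-format sample.

    Fails loudly with the sample id instead of silently dropping the row.
    """
    if not isinstance(sample, dict) or "messages" not in sample:
        raise ValueError(f"sample {sample_id}: not a chat-format dict with 'messages'")
    user = next((m for m in sample["messages"]
                 if m.get("role") == "user" and "Compress:" in m.get("content", "")), None)
    assistant = next((m for m in sample["messages"]
                      if m.get("role") == "assistant"), None)
    if user is None or assistant is None:
        raise ValueError(f"sample {sample_id}: missing user 'Compress:' prompt "
                         f"or assistant reply")
    verbose = user["content"].split("Compress:", 1)[1].strip()
    return verbose, assistant["content"].strip()
```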

Token-based compression ratio → Aligned with shared helper

  • Imports compression_ratio from src/utils/tokenizers.py (line 23)
  • Same parameter order and ratio direction (lines 242-249)
  • Rule A checks ratio <= 1.0 to match evaluation pipeline (line 267)

is_code_sample heuristic → Multi-signal approach

  • Requires 2+ strong signals OR 1 very strong signal (lines 165-166)
  • Tightened {...} check: requires semicolons/return/assignment inside braces (lines 160-162)
  • Won't misclassify JSON or brace-heavy text as code
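The multi-signal idea can be sketched as follows; the specific patterns and the 2-vs-1 threshold are illustrative stand-ins for the script's actual signal list:

```python
import re


def is_code_sample(text: str) -> bool:
    """Classify text as code only on 2+ strong signals or 1 very strong one."""
    very_strong = [
        r"^\s*def \w+\(",                # Python function definition
        r"^\s*(class|import|from)\s+\w+",
    ]
    strong = [
        # Braces only count when they enclose statement-like content
        r"\{[^{}]*(;|return\b|\w+\s*=\s*\S)[^{}]*\}",
        r"=>|->",                        # arrow syntax
        r"^\s*@\w+",                     # decorator
        r"\w+\(.*\)\s*[;{]",             # call followed by ; or {
    ]
    if any(re.search(p, text, re.MULTILINE) for p in very_strong):
        return True
    hits = sum(bool(re.search(p, text, re.MULTILINE)) for p in strong)
    return hits >= 2
```

Plain JSON fragments trip the old broad `{...}` check but score zero here, since their braces contain no semicolons, returns, or assignments.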

Parse errors separation → Dedicated tracking bucket

  • Parse errors tracked separately from Rule A failures (lines 374-382)
  • Separate stats: parse_errors vs rule_a_failed (lines 358-359)
  • Dedicated reporting section (lines 478-487)
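The stats split can be sketched like this. The flat `verbose`/`compressed` keys and the word-count Rule A check are placeholders for brevity; only the bucket names (`parse_errors` vs `rule_a_failed`) mirror the change described above:

```python
import json


def sanitize_dataset(lines):
    """Count parse/format errors separately from rule failures."""
    stats = {"total": 0, "parse_errors": 0, "rule_a_failed": 0, "passed": 0}
    for line in lines:
        stats["total"] += 1
        try:
            sample = json.loads(line)
            verbose, compressed = sample["verbose"], sample["compressed"]
        except (json.JSONDecodeError, KeyError, TypeError):
            stats["parse_errors"] += 1  # format problem, not a quality failure
            continue
        if len(compressed.split()) > len(verbose.split()):  # simplified Rule A
            stats["rule_a_failed"] += 1
        else:
            stats["passed"] += 1
    return stats
```

Malformed rows now surface in their own bucket instead of inflating the Rule A failure count.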

CLI args → Custom paths supported

  • --input, --sanitized, --unsanitized arguments (lines 500-519)
  • Examples in help text for non-default layouts

tests/test_sanitization_manager.py

Comprehensive test coverage → 37 test methods added

Main flow:

  • test_sanitize_dataset_splits_correctly() - validates good/bad sample separation
  • test_unicode_symbols_preserved() - ensures proper encoding

Dataset swapping:

  • test_update_creates_backup_first_time() - backup creation
  • test_update_blocks_consecutive_updates() - idempotency protection
  • test_revert_restores_from_backup() - restoration flow
  • test_revert_blocks_consecutive_reverts() - safety mechanism

State/log updates:

  • test_save_and_load() - state persistence
  • test_operations_append_to_log() - change history tracking

Helper function coverage:

  • Code detection, extraction, all validation rules
  • Edge cases: empty strings, unicode handling
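The backup/restore contract those swap tests exercise can be sketched with plain files. These are simplified stand-ins for the real `update_to_sanitized`/`revert_to_original` (which also update state and log files), with illustrative path names:

```python
import shutil
import tempfile
from pathlib import Path


def update_to_sanitized(train: Path, sanitized: Path, backup: Path) -> None:
    if backup.exists():
        raise RuntimeError("already updated; revert first")  # idempotency guard
    shutil.copy(train, backup)       # first update creates the backup
    shutil.copy(sanitized, train)


def revert_to_original(train: Path, backup: Path) -> None:
    if not backup.exists():
        raise RuntimeError("nothing to revert")
    shutil.copy(backup, train)       # restore the original contents
    backup.unlink()                  # clear the backup marker


with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train.jsonl"
    sanitized = Path(tmp) / "train_sanitized.jsonl"
    backup = Path(tmp) / "train.jsonl.bak"
    train.write_text('{"id": "original"}\n')
    sanitized.write_text('{"id": "sanitized"}\n')

    update_to_sanitized(train, sanitized, backup)
    assert "sanitized" in train.read_text() and backup.exists()
    revert_to_original(train, backup)
    assert "original" in train.read_text() and not backup.exists()
```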

Configuration Alignment

Verified paths match across:

  • training.yaml — data.dir: "data/training"
  • dataset_manager.py — data/training/train.jsonl (default)
  • data_sanitization.py — data/training/train.jsonl (default)

Format validation matches actual train.jsonl structure (chat messages with system/user/assistant roles).


All requirements met. Changes maintain backward compatibility while adding flexibility for custom workflows.

@Sudhendra requested a review from Copilot February 9, 2026 16:26

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 10 comments.



@Sudhendra merged commit 87179f8 into Sudhendra:main Feb 11, 2026
3 checks passed
Sudhendra added a commit that referenced this pull request Feb 12, 2026
- Fix NameError on malformed JSON lines by initializing stats/unsanitized_data
  before the JSONL loading loop instead of after it
- Move dagshub/mlflow from core dependencies to optional [mlflow] extra
- Remove stale RULE C FIX docstring, duplicate import re, and unused Tuple import;
  use builtin tuple[] instead of typing.Tuple
- Add test for malformed JSON line handling in sanitize_and_extract
Sudhendra added a commit that referenced this pull request Feb 12, 2026
fix: resolve 3 remaining issues from PR #16 review