
Testing for Data Sanitization and Dataset Manager #16

Merged
Sudhendra merged 19 commits into Sudhendra:main from Gautam-Galada:phase-6-loss_function
Feb 11, 2026

Conversation


@Gautam-Galada (Contributor) commented Feb 4, 2026

PR Title

Add sanitization tests, dataset manager coverage, and MLflow logging utility


Description

This PR introduces comprehensive test coverage for sanitization validation rules and dataset switching functionality, and adds a post-training MLflow logging utility.

Testing Improvements

  • 25 test cases covering sanitization rules A–D (ratio, symbol validation, negation preservation, semantic usage)
  • Tests for dataset manager state management and file operations
  • Zero-division safety checks for compression ratio calculations
  • Graceful handling of missing optional modules via pytest skip markers
  • Ruff-compliant test code (E731, E712)

MLflow Logging Utility

  • Adds scripts/mlflow_logger.py for post-training logging of existing runs
  • Logs parameters, metrics, artifacts, and loss curves to MLflow/DagsHub
  • Introduces dagshub and mlflow as runtime dependencies

Testing

pytest tests/test_sanitization_manager.py -v


@Sudhendra (Owner) commented Feb 6, 2026

@Gautam-Galada address the following comments:

  • scripts/dataset_manager.py: The paths are hardcoded to data/training/train.jsonl, but the training config and docs point at data/validated/train.jsonl. Please either wire this to the configured training path (preferred) or make the paths configurable via CLI args so the swap actually affects training runs.
  • scripts/data_sanitization.py rule_b_orphaned_symbols: This rejects any compression starting with @, which will drop valid code samples that begin with decorators (and contradicts your own expectation of high @ usage in code). Please make Rule B code-aware (allow leading @ for code) or skip the “symbol at start” check for code samples.
  • scripts/data_sanitization.py extract_verbose_compressed: If this script is intentionally scoped only to data/training/train.jsonl, please encode that assumption explicitly (docstring + validation). Add a guard that verifies the expected chat-message shape (e.g., messages with a user entry containing Compress:) and fails loudly or logs a dedicated parse error when the format doesn’t match, plus a small test fixture that reflects the actual train.jsonl structure to prevent silent drops.
  • scripts/data_sanitization.py compute_compression_ratio: Our evaluation pipeline uses token-based ratios (src/utils/tokenizers.py compression_ratio), but this script uses word counts and inverts the ratio (verbose/compressed). Please switch to token counts and align ratio direction with the shared helper; for Rule A, just check compressed_tokens <= verbose_tokens (or use compression_ratio <= 1.0) to match the rest of the codebase.
  • scripts/dataset_manager.py verify_files_exist: The error message tells users to run sanitize_training_data_v2.py, but the file is scripts/data_sanitization.py. Please update the message to the correct script name.
  • tests/test_sanitization_manager.py: There’s no coverage of the main flow (sanitize_dataset) or dataset swapping (update_to_sanitized/revert_to_original). Please add tests for those critical paths, including state/log updates and backup behavior.
  • scripts/data_sanitization.py is_code_sample: The {...} regex is broad and can misclassify NL (e.g., brace-heavy text or JSON fragments) as code. Please tighten this heuristic or require multiple signals before classifying as code.
  • scripts/data_sanitization.py sanitize_dataset: Missing verbose/compressed pairs are counted as Rule A failures, which conflates parse/format errors with quality failures. Please track these as a separate “parse errors” bucket in stats and reporting.
  • Optional but recommended: Add CLI args for --input/--output in scripts/data_sanitization.py and --train-path in scripts/dataset_manager.py so these tools can be used with non-default dataset layouts.
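To make the requested ratio alignment concrete, here is a minimal sketch of a token-based Rule A check in the direction the shared helper uses (compressed / verbose). The `count_tokens` parameter is a stand-in for the project's tokenizer in `src/utils/tokenizers.py`, and the whitespace tokenizer at the end is illustration only, not how the real helper counts tokens:

```python
def compression_ratio(compressed: str, verbose: str, count_tokens) -> float:
    """Token-based ratio, compressed / verbose, matching the shared helper's direction.

    `count_tokens` is a hypothetical stand-in for the project's tokenizer.
    """
    verbose_tokens = count_tokens(verbose)
    if verbose_tokens == 0:
        return 0.0  # zero-division guard, mirroring the PR's safety check
    return count_tokens(compressed) / verbose_tokens


def rule_a_passes(compressed: str, verbose: str, count_tokens) -> bool:
    # Rule A: the compressed text must not exceed the verbose text in tokens
    return compression_ratio(compressed, verbose, count_tokens) <= 1.0


def toy_tokenizer(text: str) -> int:
    # Whitespace splitting stands in for real tokenization; illustration only
    return len(text.split())
```

A real call would pass the token counter behind `src/utils/tokenizers.py` instead of `toy_tokenizer`.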

@Gautam-Galada (Contributor, Author) replied:

Response to Review Comments

All comments have been addressed. Changes organized by file:


scripts/dataset_manager.py

Hardcoded paths → Added CLI args with dynamic config

  • --train-file and --sanitized arguments (lines 245-259)
  • get_config() builds paths from args or defaults (lines 29-47)
  • Backup/state files derived from train file location
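A sketch of how such a `get_config()` can build every path from the CLI args, so backup and state files follow a custom `--train-file`. Flag names come from the PR description; the default sanitized path and derived file names are assumptions for illustration:

```python
import argparse
from pathlib import Path


def get_config(argv=None) -> dict:
    """Build dataset paths from CLI args; backup/state derive from the train file."""
    parser = argparse.ArgumentParser(description="Swap training datasets safely")
    parser.add_argument("--train-file", type=Path,
                        default=Path("data/training/train.jsonl"))
    parser.add_argument("--sanitized", type=Path,
                        default=Path("data/training/train_sanitized.jsonl"))
    args = parser.parse_args(argv)
    return {
        "train": args.train_file,
        "sanitized": args.sanitized,
        # Derived from the train file so non-default layouts stay self-contained
        "backup": args.train_file.with_suffix(".jsonl.bak"),
        "state": args.train_file.parent / "dataset_state.json",
    }
```

With `--train-file data/validated/train.jsonl`, the backup and state files land next to the configured train file rather than a hardcoded location.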

Error message fix → Removed incorrect script reference

  • Generic helpful message instead (line 125)

scripts/data_sanitization.py

Rule B code-aware → Allows @ for decorators

  • rule_b_orphaned_symbols() accepts is_code parameter (line 310)
  • Explicit exemption for leading @ in code samples (lines 318-322)
  • Test coverage: test_allows_decorator_for_code()
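The decorator exemption can be sketched as below. The symbol handling is illustrative, not the script's exact rule set; only the `is_code` escape hatch for a leading `@` reflects the described fix:

```python
import re


def rule_b_orphaned_symbols(compressed: str, is_code: bool = False) -> bool:
    """Return True if the compression passes the orphaned-symbol check.

    Code samples are exempt from the leading-@ rejection so decorator
    lines (e.g. `@staticmethod`) are not dropped.
    """
    stripped = compressed.lstrip()
    if not stripped:
        return False
    if stripped[0] == "@":
        # Leading @ is legitimate decorator syntax in code samples
        return is_code
    # For NL, any other leading non-word symbol is treated as orphaned here
    return not re.match(r"[^\w]", stripped[0])
```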

Format validation → Explicit scope + guards

  • Docstring documents expected chat-message format (lines 169-184)
  • Validates dict structure, messages key, role presence (lines 186-237)
  • Descriptive error messages with sample IDs for debugging
  • Test fixture: test_extracts_correctly() validates actual format
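A minimal sketch of the loud-failure guard, assuming the chat-message shape the review describes (a `messages` list with a user entry containing `Compress:` and an assistant reply); the exact marker handling and error wording are illustrative:

```python
def extract_verbose_compressed(sample: dict, sample_id: int) -> tuple[str, str]:
    """Extract the (verbose, compressed) pair from one chat-format sample.

    Fails loudly with the sample id instead of silently dropping the row.
    """
    if not isinstance(sample, dict) or "messages" not in sample:
        raise ValueError(f"sample {sample_id}: not a chat-format dict with 'messages'")
    user = next((m for m in sample["messages"]
                 if m.get("role") == "user" and "Compress:" in m.get("content", "")), None)
    assistant = next((m for m in sample["messages"]
                      if m.get("role") == "assistant"), None)
    if user is None or assistant is None:
        raise ValueError(f"sample {sample_id}: missing user 'Compress:' prompt "
                         f"or assistant reply")
    verbose = user["content"].split("Compress:", 1)[1].strip()
    return verbose, assistant["content"].strip()
```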

Token-based compression ratio → Aligned with shared helper

  • Imports compression_ratio from src/utils/tokenizers.py (line 23)
  • Same parameter order and ratio direction (lines 242-249)
  • Rule A checks ratio <= 1.0 to match evaluation pipeline (line 267)

is_code_sample heuristic → Multi-signal approach

  • Requires 2+ strong signals OR 1 very strong signal (lines 165-166)
  • Tightened {...} check: requires semicolons/return/assignment inside braces (lines 160-162)
  • Won't misclassify JSON or brace-heavy text as code
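The multi-signal idea can be sketched as follows; the specific patterns and the 2-vs-1 threshold are illustrative stand-ins for the script's actual signal list:

```python
import re


def is_code_sample(text: str) -> bool:
    """Classify text as code only on 2+ strong signals or 1 very strong one."""
    very_strong = [
        r"^\s*def \w+\(",                # Python function definition
        r"^\s*(class|import|from)\s+\w+",
    ]
    strong = [
        # Braces only count when they enclose statement-like content
        r"\{[^{}]*(;|return\b|\w+\s*=\s*\S)[^{}]*\}",
        r"=>|->",                        # arrow syntax
        r"^\s*@\w+",                     # decorator
        r"\w+\(.*\)\s*[;{]",             # call followed by ; or {
    ]
    if any(re.search(p, text, re.MULTILINE) for p in very_strong):
        return True
    hits = sum(bool(re.search(p, text, re.MULTILINE)) for p in strong)
    return hits >= 2
```

Plain JSON fragments trip the old broad `{...}` check but score zero here, since their braces contain no semicolons, returns, or assignments.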

Parse errors separation → Dedicated tracking bucket

  • Parse errors tracked separately from Rule A failures (lines 374-382)
  • Separate stats: parse_errors vs rule_a_failed (lines 358-359)
  • Dedicated reporting section (lines 478-487)
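The stats split can be sketched like this. The flat `verbose`/`compressed` keys and the word-count Rule A check are placeholders for brevity; only the bucket names (`parse_errors` vs `rule_a_failed`) mirror the change described above:

```python
import json


def sanitize_dataset(lines):
    """Count parse/format errors separately from rule failures."""
    stats = {"total": 0, "parse_errors": 0, "rule_a_failed": 0, "passed": 0}
    for line in lines:
        stats["total"] += 1
        try:
            sample = json.loads(line)
            verbose, compressed = sample["verbose"], sample["compressed"]
        except (json.JSONDecodeError, KeyError, TypeError):
            stats["parse_errors"] += 1  # format problem, not a quality failure
            continue
        if len(compressed.split()) > len(verbose.split()):  # simplified Rule A
            stats["rule_a_failed"] += 1
        else:
            stats["passed"] += 1
    return stats
```

Malformed rows now surface in their own bucket instead of inflating the Rule A failure count.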

CLI args → Custom paths supported

  • --input, --sanitized, --unsanitized arguments (lines 500-519)
  • Examples in help text for non-default layouts

tests/test_sanitization_manager.py

Comprehensive test coverage → 37 test methods added

Main flow:

  • test_sanitize_dataset_splits_correctly() - validates good/bad sample separation
  • test_unicode_symbols_preserved() - ensures proper encoding

Dataset swapping:

  • test_update_creates_backup_first_time() - backup creation
  • test_update_blocks_consecutive_updates() - idempotency protection
  • test_revert_restores_from_backup() - restoration flow
  • test_revert_blocks_consecutive_reverts() - safety mechanism

State/log updates:

  • test_save_and_load() - state persistence
  • test_operations_append_to_log() - change history tracking

Helper function coverage:

  • Code detection, extraction, all validation rules
  • Edge cases: empty strings, unicode handling
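The backup/restore contract those swap tests exercise can be sketched with plain files. These are simplified stand-ins for the real `update_to_sanitized`/`revert_to_original` (which also update state and log files), with illustrative path names:

```python
import shutil
import tempfile
from pathlib import Path


def update_to_sanitized(train: Path, sanitized: Path, backup: Path) -> None:
    if backup.exists():
        raise RuntimeError("already updated; revert first")  # idempotency guard
    shutil.copy(train, backup)       # first update creates the backup
    shutil.copy(sanitized, train)


def revert_to_original(train: Path, backup: Path) -> None:
    if not backup.exists():
        raise RuntimeError("nothing to revert")
    shutil.copy(backup, train)       # restore the original contents
    backup.unlink()                  # clear the backup marker


with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train.jsonl"
    sanitized = Path(tmp) / "train_sanitized.jsonl"
    backup = Path(tmp) / "train.jsonl.bak"
    train.write_text('{"id": "original"}\n')
    sanitized.write_text('{"id": "sanitized"}\n')

    update_to_sanitized(train, sanitized, backup)
    assert "sanitized" in train.read_text() and backup.exists()
    revert_to_original(train, backup)
    assert "original" in train.read_text() and not backup.exists()
```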

Configuration Alignment

Verified paths match across:

  • training.yaml — data.dir: "data/training"
  • dataset_manager.py — data/training/train.jsonl (default)
  • data_sanitization.py — data/training/train.jsonl (default)

Format validation matches actual train.jsonl structure (chat messages with system/user/assistant roles).


All requirements met. Changes maintain backward compatibility while adding flexibility for custom workflows.

@Sudhendra requested a review from Copilot February 9, 2026 16:26

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 10 comments.



@Sudhendra merged commit 87179f8 into Sudhendra:main Feb 11, 2026
3 checks passed
Sudhendra added a commit that referenced this pull request Feb 12, 2026
- Fix NameError on malformed JSON lines by initializing stats/unsanitized_data
  before the JSONL loading loop instead of after it
- Move dagshub/mlflow from core dependencies to optional [mlflow] extra
- Remove stale RULE C FIX docstring, duplicate import re, and unused Tuple import;
  use builtin tuple[] instead of typing.Tuple
- Add test for malformed JSON line handling in sanitize_and_extract
Sudhendra added a commit that referenced this pull request Feb 12, 2026
fix: resolve 3 remaining issues from PR #16 review