[WIP] Task Decontamination #873

praateekmahajan · 2025-07-31T19:58:40Z

Description

    # ParallelEvalSetBuilder is new; it should give a one-time speedup when building the eval sets
    evalsets = ParallelEvalSetBuilder(
        eval_set_specs=[
            (ArcEasyEvalSet, (), {}),
            (ArcChallengeEvalSet, (), {}),
            (WSCEvalSet, (), {}),
            (OpenBookQAEvalSet, (), {}),
            (SquadEvalSet, (), {}),
            (WinograndeEvalSet, (), {}),
            (DropEvalSet, (), {}),
            (WebQAEvalSet, (), {}),
            (RTEEvalSet, (), {}),
        ]
    ).build()
    
    # the original TaskDecontamination is now broken into `EvalSetNGramFrequencyStage` and `EvalSetNGramRemovalStage`
    pipeline1 = Pipeline(
        name="compute_ngram_frequencies",
        description="Compute ngram frequencies from evaluation sets and find matches in dataset",
        stages=[
            JsonlReader(file_paths=args.input_path, columns=[args.text_field], reader="pandas", files_per_partition=1),
            EvalSetNGramFrequencyStage(
                eval_set_ngrams=evalsets[0],
                eval_set_ngrams_frequency_sorted=evalsets[1],
                text_field=args.text_field,
                max_ngram_size=args.max_ngram_size,
            ),
        ],
    )
    results1 = pipeline1.run(executor)

    pipeline2 = Pipeline(
        name="split_documents",
        description="Split documents based on contaminated ngrams",
        stages=[
            JsonlReader(file_paths=args.input_path, columns=[args.text_field], reader="pandas", files_per_partition=1),
            EvalSetNGramRemovalStage(
                matched_ngrams=[r.get_matched_ngrams() for r in results1],
                ngram_freq=results1[0].get_ngram_freq(),
                text_field=args.text_field,
                max_ngram_size=args.max_ngram_size,
                max_matches=args.max_matches,
                min_document_length=args.min_document_length,
                remove_char_each_side=args.remove_char_each_side,
                max_splits=args.max_splits,
                removed_dir=args.removed_dir,
            ),
            JsonlWriter(output_dir=args.output_path, file_extension="jsonl"),
        ],
    )
    results2 = pipeline2.run(executor)
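
To make the two-pass flow above easier to follow, here is a minimal, self-contained sketch of the idea in plain Python. It is not the code in this PR: the helper names `word_ngrams`, `count_matches`, and `split_on_matches` are hypothetical, and the parameters `remove_chars` and `min_length` only loosely mirror `remove_char_each_side` and `min_document_length`. The real stages may tokenize, match, and split differently.

```python
from collections import Counter


def word_ngrams(text: str, max_n: int) -> list[str]:
    """Return all lowercase word n-grams of size 1..max_n (whitespace tokenization)."""
    words = text.lower().split()
    return [
        " ".join(words[i : i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


def count_matches(documents: list[str], eval_ngrams: set[str], max_n: int) -> Counter:
    """Pass 1: count how often each eval-set n-gram occurs across the dataset."""
    counts: Counter = Counter()
    for doc in documents:
        counts.update(g for g in word_ngrams(doc, max_n) if g in eval_ngrams)
    return counts


def split_on_matches(doc: str, matched: list[str], remove_chars: int, min_length: int) -> list[str]:
    """Pass 2: cut each matched n-gram (plus padding) out of the document.

    Assumes ASCII-like text where lowercasing does not change string length.
    Pieces shorter than min_length characters are dropped.
    """
    lowered = doc.lower()
    spans = []
    for gram in matched:
        start = lowered.find(gram)
        while start != -1:
            begin = max(0, start - remove_chars)
            end = min(len(doc), start + len(gram) + remove_chars)
            spans.append((begin, end))
            start = lowered.find(gram, start + 1)

    pieces, cursor = [], 0
    for begin, end in sorted(spans):
        if begin > cursor:
            pieces.append(doc[cursor:begin])
        cursor = max(cursor, end)
    pieces.append(doc[cursor:])
    return [p.strip() for p in pieces if len(p.strip()) >= min_length]


# Toy data, invented for illustration only.
eval_ngrams = {"paris is the capital"}
docs = [
    "I heard that Paris is the capital of France and it is lovely",
    "Nothing to see here",
]
counts = count_matches(docs, eval_ngrams, max_n=4)
cleaned = [split_on_matches(d, list(counts), remove_chars=1, min_length=5) for d in docs]
```

Running the frequency pass over the whole dataset before the removal pass is what lets the second pass know every matched n-gram up front, which is presumably why the PR uses two pipelines that each read the input.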

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>

copy-pr-bot · 2025-08-05T16:45:32Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions · 2025-08-20T02:11:11Z

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

sarahyurick · 2025-08-27T18:40:27Z

Hi, please target main instead of ray-api now, thanks!

move files over + ruff

87973fe

Signed-off-by: Praateek <[email protected]>

copy-pr-bot bot temporarily deployed to test July 31, 2025 19:58 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci July 31, 2025 19:59 Inactive

praateekmahajan added 2 commits July 31, 2025 13:45

...

3ccfa56

Signed-off-by: Praateek <[email protected]>

i think this works?

d71022a

Signed-off-by: Praateek <[email protected]>

copy-pr-bot bot temporarily deployed to test August 4, 2025 23:51 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci August 4, 2025 23:51 Error

remove extra text_utils

f91e4ce

Signed-off-by: Praateek <[email protected]>

copy-pr-bot bot temporarily deployed to test August 4, 2025 23:54 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci August 4, 2025 23:54 Inactive

add parallel builder

fc7be34

Signed-off-by: Praateek <[email protected]>

copy-pr-bot bot temporarily deployed to test August 5, 2025 00:24 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci August 5, 2025 00:24 Inactive

praateekmahajan marked this pull request as draft August 5, 2025 16:45

praateekmahajan mentioned this pull request Aug 6, 2025

[data] Failed to submit task to actor after ray.shutdown() and re-ray.init() in data pipeline for an existing cluster ray-project/ray#54841

Closed

github-actions bot added the Stale label Aug 20, 2025
