@praateekmahajan praateekmahajan commented Jul 31, 2025

Description

    # ParallelEvalSetBuilder is new, so we should get a one-time speedup when building the eval sets
    evalsets = ParallelEvalSetBuilder(
        eval_set_specs=[
            (ArcEasyEvalSet, (), {}),
            (ArcChallengeEvalSet, (), {}),
            (WSCEvalSet, (), {}),
            (OpenBookQAEvalSet, (), {}),
            (SquadEvalSet, (), {}),
            (WinograndeEvalSet, (), {}),
            (DropEvalSet, (), {}),
            (WebQAEvalSet, (), {}),
            (RTEEvalSet, (), {}),
        ]
    ).build()
    
    # The original TaskDecontamination is now split into `EvalSetNGramFrequencyStage` and `EvalSetNGramRemovalStage`
    pipeline1 = Pipeline(
        name="compute_ngram_frequencies",
        description="Compute ngram frequencies from evaluation sets and find matches in dataset",
        stages=[
            JsonlReader(file_paths=args.input_path, columns=[args.text_field], reader="pandas", files_per_partition=1),
            EvalSetNGramFrequencyStage(
                eval_set_ngrams=evalsets[0],
                eval_set_ngrams_frequency_sorted=evalsets[1],
                text_field=args.text_field,
                max_ngram_size=args.max_ngram_size,
            ),
        ],
    )
    results1 = pipeline1.run(executor)

    pipeline2 = Pipeline(
        name="split_documents",
        description="Split documents based on contaminated ngrams",
        stages=[
            JsonlReader(file_paths=args.input_path, columns=[args.text_field], reader="pandas", files_per_partition=1),
            EvalSetNGramRemovalStage(
                matched_ngrams=[r.get_matched_ngrams() for r in results1],
                ngram_freq=results1[0].get_ngram_freq(),
                text_field=args.text_field,
                max_ngram_size=args.max_ngram_size,
                max_matches=args.max_matches,
                min_document_length=args.min_document_length,
                remove_char_each_side=args.remove_char_each_side,
                max_splits=args.max_splits,
                removed_dir=args.removed_dir,
            ),
            JsonlWriter(output_dir=args.output_path, file_extension="jsonl"),
        ],
    )
    results2 = pipeline2.run(executor)
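
To make the two-pass split above easier to follow, here is a minimal, library-free sketch of the same idea: a first pass counts how often each evaluation n-gram appears in the dataset, and a second pass splits documents around the matched n-grams. This is an illustration under stated assumptions, not the actual implementation of `EvalSetNGramFrequencyStage` or `EvalSetNGramRemovalStage`. The helper names (`word_ngrams`, `count_matches`, `split_document`), the whitespace tokenization, and the `min_words` filter are all hypothetical, and the interpretation of `max_matches` as a frequency cutoff is an assumption.

```python
from collections import Counter


def word_ngrams(text, max_ngram_size):
    """Yield every lowercased word n-gram of size 1..max_ngram_size."""
    words = text.lower().split()
    for n in range(1, max_ngram_size + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def count_matches(documents, eval_ngrams, max_ngram_size):
    """Pass 1: count how often each evaluation n-gram occurs in the dataset."""
    freq = Counter()
    for doc in documents:
        for gram in word_ngrams(doc, max_ngram_size):
            if gram in eval_ngrams:
                freq[gram] += 1
    return freq


def split_document(text, contaminated, max_ngram_size, min_words):
    """Pass 2: cut out contaminated n-grams and keep the pieces that are long enough."""
    words = text.split()
    lowered = [w.lower() for w in words]
    pieces, start, i = [], 0, 0
    while i < len(words):
        # Prefer the longest contaminated n-gram starting at position i.
        hit = 0
        for n in range(min(max_ngram_size, len(words) - i), 0, -1):
            if " ".join(lowered[i:i + n]) in contaminated:
                hit = n
                break
        if hit:
            pieces.append(words[start:i])
            i += hit
            start = i
        else:
            i += 1
    pieces.append(words[start:])
    return [" ".join(p) for p in pieces if len(p) >= min_words]


# Hypothetical toy data, for illustration only.
docs = ["the quick brown fox jumps over the lazy dog", "hello world again"]
eval_ngrams = {"brown fox", "lazy dog"}

freq = count_matches(docs, eval_ngrams, max_ngram_size=2)
# Assumption: n-grams matched fewer than max_matches times are treated as contamination.
max_matches = 10
contaminated = {gram for gram, count in freq.items() if count < max_matches}
cleaned = [split_document(d, contaminated, max_ngram_size=2, min_words=2) for d in docs]
```

Separating the counting pass from the removal pass means the second pass only needs the aggregated counts, which is presumably why the pipeline above reads the input twice.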

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>

copy-pr-bot bot commented Aug 5, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 20, 2025
@sarahyurick

Hi, please target main instead of ray-api now, thanks!
