@goutamvenkat-anyscale commented Sep 5, 2025

Install dependencies: uv sync --extra text
Command:

python nightly_benchmarking/common_crawl_benchmark.py \
  --download_path {session}/scratch/downloads \
  --output_path {session}/scratch/output \
  --output_format parquet \
  --crawl_type main \
  --start_snapshot 2023-01 \
  --end_snapshot 2023-10 \
  --html_extraction justext \
  --url_limit 768 \
  --add_filename_column \
  --executor ray_data \
  --ray_data_cast_as_actor \
  --benchmark_results_path {session}/results

Run on 64 CPUs.
The --ray_data_cast_as_actor flag casts all stages in the pipeline to actors.
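
To illustrate what casting a stage to an actor could look like, here is a minimal sketch in plain Python. It is not the code in this PR: `cast_to_actor_class`, `StageActor`, and `lowercase_text` are hypothetical names, and the sketch assumes a stage is a stateless function over a batch. The idea is to wrap the function in a callable class so that an instance can be kept alive and reused across batches.

```python
from typing import Callable, Dict, List

Batch = Dict[str, List[str]]


def cast_to_actor_class(stage_fn: Callable[[Batch], Batch]) -> type:
    """Wrap a stateless stage function in a callable class (hypothetical helper)."""

    class StageActor:
        def __init__(self) -> None:
            # One-time setup would go here; it runs once per instance,
            # not once per batch.
            self.batches_seen = 0

        def __call__(self, batch: Batch) -> Batch:
            # Count the batch, then delegate to the wrapped function.
            self.batches_seen += 1
            return stage_fn(batch)

    StageActor.__name__ = f"{stage_fn.__name__}_actor"
    return StageActor


def lowercase_text(batch: Batch) -> Batch:
    """Toy stage (assumption, not from the benchmark): lowercase each document."""
    return {"text": [t.lower() for t in batch["text"]]}


LowercaseActor = cast_to_actor_class(lowercase_text)
actor = LowercaseActor()
result = actor({"text": ["Hello", "WORLD"]})
print(result)
```

In Ray Data, passing a callable class like this to `Dataset.map_batches` together with a `concurrency` argument schedules it on a pool of actors, whereas a plain function runs as tasks. Whether the flag is implemented this way in the benchmark is an assumption.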

Signed-off-by: Goutam V. <[email protected]>

copy-pr-bot bot commented Sep 5, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
