
Conversation

@sarahyurick (Contributor) commented Sep 10, 2025

Re-adds https://github.com/NVIDIA-NeMo/Curator/tree/dask/tutorials/llama-nemotron-data-curation with a Ray backend instead of Dask.

Related issue: #706.

TODO:

- Verify the reasoning-on versus reasoning-off composition of the input data

  ```
  syurick@ipp1-3302:~$ jq -r '.reasoning' /raid/syurick/Llama-Nemotron-Post-Training-Dataset/SFT/chat/chat.jsonl | awk '{counts[$0]++} END {for (k in counts) print counts[k], k}' | sort -nr
  31218 off
  8574 on
  syurick@ipp1-3302:~$ jq -r '.reasoning' /raid/syurick/Llama-Nemotron-Post-Training-Dataset/SFT/math/math_v1.1.jsonl | awk '{counts[$0]++} END {for (k in counts) print counts[k], k}' | sort -nr
  2225427 on
  ```

- Verify correctness of the heuristic filters
- Verify correctness of the model filters
- Verify an end-to-end run
- If time allows: generate filtering statistics


copy-pr-bot bot commented Sep 10, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@sarahyurick (Contributor, Author) left a comment:

@Maghoumi left some open comments, but nothing major IMO. Please let me know what you think.

- Remove non-English samples
- Remove samples with total length (system prompt, input, and output responses) longer than 16k tokens (with chat template applied via the tokenizer)
- Remove samples with output responses longer than 8k tokens (with chat template applied via the tokenizer)
- Remove any columns specified by the `--remove-columns` parameter. We recommend removing the columns specified above (so that the remaining columns are the "input", "output", and "completion_token_count" columns)
@sarahyurick (Contributor, Author):

Is this reasonable? Does training.jsonl need to keep any additional columns?
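
For reference, a minimal sketch of how the two token-length checks above might look with a Hugging Face tokenizer. The model ID and column names are placeholders, the language filter and column removal are omitted, and this is not the tutorial's actual code:

```python
# Hedged sketch only: placeholder tokenizer and column names.
from transformers import AutoTokenizer

MAX_TOTAL_TOKENS = 16_384   # system prompt + input + output, with chat template applied
MAX_OUTPUT_TOKENS = 8_192   # output response alone

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder model


def passes_length_filters(sample: dict) -> bool:
    """Return True if the sample survives both token-length checks."""
    messages = [
        {"role": "system", "content": sample.get("system_prompt", "")},
        {"role": "user", "content": sample["input"]},
        {"role": "assistant", "content": sample["output"]},
    ]
    total_tokens = tokenizer.apply_chat_template(messages, tokenize=True)
    # The tutorial text applies the chat template to the output as well; plain tokenization
    # is used here only to keep the sketch short.
    output_tokens = tokenizer(sample["output"])["input_ids"]
    return len(total_tokens) <= MAX_TOTAL_TOKENS and len(output_tokens) <= MAX_OUTPUT_TOKENS
```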

Comment on lines +252 to +256
```python
# For maximum accuracy it should be tokenize=True here,
# but for speedups we just use the length of the base string instead of tokenizing it
# (we consider this to be acceptable since 1 token is approx 4 characters for English
# and we are only using the CompletionTokenCountFilter as a proxy for text complexity)
tokenize=False,
```
@sarahyurick (Contributor, Author):

IIRC we aligned on this when implementing the Dask tutorial. Just wanted to double check that this is still okay.
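
For anyone revisiting this trade-off later, a tiny illustration of the two options (placeholder tokenizer and messages; the ~4-characters-per-token figure is the assumption stated in the comment above):

```python
# Illustrative only: compare exact token counting against the character-length proxy.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder
messages = [
    {"role": "user", "content": "Solve this problem."},
    {"role": "assistant", "content": "Step-by-step reasoning... " * 500},
]

token_ids = tokenizer.apply_chat_template(messages, tokenize=True)   # exact, pays the tokenization cost
rendered = tokenizer.apply_chat_template(messages, tokenize=False)   # cheap string render
print(len(token_ids), len(rendered) / 4)  # proxy: ~4 English characters per token
```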

```python
def _setup(self, local_files_only: bool = True) -> None:
    self.tokenizer = AutoTokenizer.from_pretrained(
        self.tokenizer_identifier,
        use_fast=True,  # TODO: Test without this
```
@sarahyurick (Contributor, Author):

This is from the Dask implementation. I want to verify that it still helps.

@sarahyurick (Contributor, Author):

No time difference here, so I will remove `use_fast=True`.



```python
@ray.remote
def split_jsonl_by_size(input_file: str, target_size_mb: int, output_dir: str, output_prefix: str) -> None:
```
@sarahyurick (Contributor, Author):

We need this function to essentially enable partitioning of single JSONL files. It basically mimics `reshard_jsonl`, which was deprecated in the Dask-to-Ray migration: https://github.com/NVIDIA-NeMo/Curator/blob/dask/nemo_curator/scripts/make_data_shards.py

I think it is okay to keep this in just the tutorial for now, but we should add it back to the codebase eventually.
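
In case it helps the discussion, a rough sketch of what such a reshard helper could look like; the splitting behavior here (roughly `target_size_mb` per shard, never splitting a line) is my assumption, not the PR's actual implementation:

```python
import os

import ray


@ray.remote
def split_jsonl_by_size(input_file: str, target_size_mb: int, output_dir: str, output_prefix: str) -> None:
    """Split one JSONL file into shards of roughly target_size_mb each (assumed behavior)."""
    os.makedirs(output_dir, exist_ok=True)
    target_bytes = target_size_mb * 1024 * 1024
    shard_idx, written = 0, 0
    out = open(os.path.join(output_dir, f"{output_prefix}_{shard_idx:05d}.jsonl"), "w")
    with open(input_file) as f:
        for line in f:
            if written >= target_bytes:
                out.close()
                shard_idx, written = shard_idx + 1, 0
                out = open(os.path.join(output_dir, f"{output_prefix}_{shard_idx:05d}.jsonl"), "w")
            out.write(line)
            written += len(line.encode("utf-8"))
    out.close()


# Usage: one Ray task per input file.
# ray.get([split_jsonl_by_size.remote(path, 128, "shards/", "chat") for path in input_files])
```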

Comment on lines +60 to +71
```python
def interleave_datasets(dir1: str, dir2: str, out_path: str) -> None:
    gen1 = stream_jsonl_files(dir1)
    gen2 = stream_jsonl_files(dir2)

    with open(out_path, "w") as out:
        for line1, line2 in itertools.zip_longest(gen1, gen2):
            if line1 is not None:
                out.write(line1 + "\n")
            if line2 is not None:
                out.write(line2 + "\n")

    print(f"Interleaved datasets from directories {dir1} and {dir2} into file {out_path}")
```
@sarahyurick (Contributor, Author):

This function interleaves two datasets, row by row. In the Dask version, we had the option to interleave per row or per partition of data (with per-partition interleaving as the default). Should we maintain that logic in the Ray version too?
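
If we do want to keep that option, here is a sketch of the per-partition (file-level) variant for comparison; this is just an illustration of the idea, not ported Dask code:

```python
import itertools
from pathlib import Path


def interleave_datasets_by_partition(dir1: str, dir2: str, out_path: str) -> None:
    """Alternate whole JSONL files from the two directories instead of individual rows."""
    files1 = sorted(Path(dir1).glob("*.jsonl"))
    files2 = sorted(Path(dir2).glob("*.jsonl"))

    with open(out_path, "w") as out:
        for file1, file2 in itertools.zip_longest(files1, files2):
            for path in (file1, file2):
                if path is None:
                    continue
                with open(path) as f:
                    for line in f:
                        out.write(line if line.endswith("\n") else line + "\n")
```

Row-level interleaving mixes the two sources more evenly, while the per-partition version keeps each source's rows together in larger contiguous chunks.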

Signed-off-by: Sarah Yurick <[email protected]>