Commit e4f9571

lbliii and sarahyurick authored
Llane/docs readme updates (#1163)
* readme updates
  Signed-off-by: Lawrence Lane <[email protected]>
* readme overhaul
  Signed-off-by: Lawrence Lane <[email protected]>
* marketing callout
  Signed-off-by: Lawrence Lane <[email protected]>
* added >
  Signed-off-by: Lawrence Lane <[email protected]>
* dask > ray
  Signed-off-by: Lawrence Lane <[email protected]>
* Update README.md
  Co-authored-by: Sarah Yurick <[email protected]>
  Signed-off-by: L.B. <[email protected]>
* linkfixes; image classifier changed to filters; remove old pages
  Signed-off-by: Lawrence Lane <[email protected]>
* link swap
  Signed-off-by: Lawrence Lane <[email protected]>
* fixes
  Signed-off-by: Lawrence Lane <[email protected]>

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
1 parent 9a2fa46 commit e4f9571

1 file changed: +87 -67 lines changed

README.md

Lines changed: 87 additions & 67 deletions
@@ -8,107 +8,127 @@
88

99
</div>
1010

11-
# Accelerate Data Processing and Streamline Synthetic Data Generation with NVIDIA NeMo Curator
11+
# NVIDIA NeMo Curator
1212

13-
NeMo Curator, part of the NVIDIA NeMo software suite for managing the AI agent lifecycle, is a Python library specifically designed for fast and scalable data processing and curation for generative AI use cases such as foundation language model pretraining, text-to-image model training, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT).
13+
**GPU-accelerated data curation for training better AI models, faster.** Scale from laptop to multi-node clusters with modular pipelines for text, images, video, and audio.
1414

15-
It greatly accelerates data processing and curation by leveraging GPUs with [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids), resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence through the preparation of high-quality tokens.
15+
> *Part of the [NVIDIA NeMo](https://www.nvidia.com/en-us/ai-data-science/products/nemo/) software suite for managing the AI agent lifecycle.*
1616
17-
NeMo Curator also provides pre-built pipelines for synthetic data generation for customization and evaluation of generative AI systems. You can use any OpenAI API compatible model and plug it in NeMo Curator's synthetic data generation pipelines to process and curate high-quality synthetic data for various use cases.
17+
## What You Can Do
1818

19-
## Getting Started
19+
| Modality | Key Capabilities | Get Started |
20+
|----------|-----------------|-------------|
21+
| **Text** | Deduplication • Classification • Quality Filtering • Language Detection | [Text Guide](https://docs.nvidia.com/nemo/curator/latest/get-started/text.html) |
22+
| **Image** | Aesthetic Filtering • NSFW Detection • Embedding Generation • Deduplication | [Image Guide](https://docs.nvidia.com/nemo/curator/latest/get-started/image.html) |
23+
| **Video** | Scene Detection • Clip Extraction • Motion Filtering • Deduplication | [Video Guide](https://docs.nvidia.com/nemo/curator/latest/get-started/video.html) |
24+
| **Audio** | ASR Transcription • Quality Assessment • WER Filtering | [Audio Guide](https://docs.nvidia.com/nemo/curator/latest/get-started/audio.html) |
2025

21-
New to NeMo Curator? Start with our quickstart guides for hands-on experience:
26+
## Quick Start
2227

23-
- **[Text Curation Quickstart](https://docs.nvidia.com/nemo/curator/latest/get-started/text.html)** - Set up your environment and run your first text curation pipeline in under 30 minutes
24-
- **[Image Curation Quickstart](https://docs.nvidia.com/nemo/curator/latest/get-started/image.html)** - Learn to curate large-scale image-text datasets for generative model training
28+
```bash
29+
# Install for your modality
30+
uv pip install "nemo-curator[text_cuda12]"
2531

26-
For production deployments and advanced configurations, see our [Setup & Deployment documentation](https://docs.nvidia.com/nemo/curator/latest/admin/index.html).
32+
# Run the quickstart example
33+
python tutorials/quickstart.py
34+
```
2735

28-
---
36+
**Full setup:** [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html) • [Docker](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) • [Tutorials](tutorials/)
2937

30-
## Key Features
38+
---
3139

32-
With NeMo Curator, you can process raw data and curate high-quality data for training and customizing generative AI models such as LLMs, VLMs and WFMs. NeMo Curator provides a collection of scalable data processing modules for text and image curation.
40+
## Features by Modality
3341

3442
### Text Curation
35-
All of our text pipelines have great multilingual support. With NeMo Curator, you can pick and choose the features you want and build your data curation pipelines. Text curation follows a three-stage workflow: **Load** → **Process** → **Generate**. A typical pipeline starts by downloading raw data from public resources, then applies cleaning and filtering steps, and optionally generates synthetic data for training enhancement.
36-
37-
#### Load Data
38-
- **[Download and Extraction](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html)** - Default implementations for Common Crawl, Wikipedia, and ArXiv sources with easy customization for other sources
39-
40-
#### Process Data
41-
- **Quality Assessment & Filtering**
42-
- [Heuristic Filtering](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/heuristic.html) - 30+ heuristic filters for punctuation density, length, and repetition analysis
43-
- [fastText Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/classifier.html) - Fast language and quality classification
44-
- [GPU-Accelerated Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) - Domain, Quality, Safety, Educational Content, Content Type, and Prompt Task/Complexity Classification
45-
46-
- **Deduplication**
47-
- [Exact Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/gpudedup.html) - Remove identical documents efficiently
48-
- [Fuzzy Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/gpudedup.html) - MinHash Locality Sensitive Hashing with optional False Positive Check
49-
- [Semantic Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/semdedup.html) - GPU-accelerated semantic deduplication using RAPIDS cuML, cuDF, and PyTorch
50-
51-
- **Content Processing & Cleaning**
52-
- [Text Cleaning](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/text-cleaning.html) - Remove improperly decoded Unicode characters, inconsistent line spacing, and excessive URLs
53-
- [PII Redaction](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/pii.html) - Identify and remove personally identifiable information from training datasets
54-
55-
- **Specialized Processing**
56-
- [Language Identification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/language-management/index.html) - Accurate language detection using fastText
57-
- [Task Decontamination](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/specialized-processing/task-decontamination.html) - Remove potential evaluation data leakage from training datasets
58-
59-
#### Generate Data
60-
- **[Synthetic Data Pipelines](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/index.html)** - Pre-built pipelines for generating high-quality synthetic training data:
61-
- [Open Q&A Generation](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/open-qa.html) - Create question-answer pairs for instruction tuning
62-
- [Math Problem Generation](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/math.html) - Generate mathematical problems for educational content
63-
- [Coding Tasks](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/python.html) - Create programming challenges and code examples
64-
- [Writing Prompts](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/writing-task.html) - Generate creative writing and content creation tasks
65-
- [Dialogue Generation](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/dialogue.html) - Create conversational data for chat models
66-
- [Nemotron Pipelines](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/wikipedia.html) - Wikipedia-style rewriting and knowledge distillation
43+
44+
Process and curate high-quality text datasets for large language model (LLM) training with multilingual support.
45+
46+
| Category | Features | Documentation |
47+
|----------|----------|---------------|
48+
| **Data Sources** | Common Crawl • Wikipedia • ArXiv • Custom datasets | [Load Data](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html) |
49+
| **Quality Filtering** | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | [Quality Assessment](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/heuristic.html) |
50+
| **Deduplication** | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html) |
51+
| **Processing** | Text cleaning • Language identification | [Content Processing](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/text-cleaning.html) |
6752

6853
---
6954

7055
### Image Curation
7156

72-
NeMo Curator provides powerful image curation features to curate high-quality image data for training generative AI models such as LLMs, VLMs, and WFMs. Image curation follows a **Load** → **Process** workflow: download datasets in WebDataset format, create embeddings, apply quality filters (NSFW and Aesthetic), and remove duplicates using semantic deduplication.
57+
Curate large-scale image datasets for vision language models (VLMs) and generative AI training.
58+
59+
| Category | Features | Documentation |
60+
|----------|----------|---------------|
61+
| **Data Loading** | WebDataset format • Large-scale image-text pairs | [Load Data](https://docs.nvidia.com/nemo/curator/latest/curate-images/load-data/index.html) |
62+
| **Embeddings** | CLIP embeddings for semantic analysis | [Embeddings](https://docs.nvidia.com/nemo/curator/latest/curate-images/process-data/embeddings/index.html) |
63+
| **Filtering** | Aesthetic quality scoring • NSFW detection | [Filters](https://docs.nvidia.com/nemo/curator/latest/curate-images/process-data/filters/index.html) |
7364

74-
#### Load Data
75-
- **[WebDataset Loading](https://docs.nvidia.com/nemo/curator/latest/curate-images/load-data/index.html)** - Load large-scale image-text datasets in WebDataset format
65+
---
7666

77-
#### Process Data
78-
- **Embeddings & Feature Extraction**
79-
- [Image Embedding Creation](https://docs.nvidia.com/nemo/curator/latest/curate-images/process-data/embeddings/index.html) - Generate CLIP embeddings for image analysis
67+
### Video Curation
8068

81-
- **Quality Assessment & Filtering**
82-
- [Aesthetic Classification](https://docs.nvidia.com/nemo/curator/latest/curate-images/process-data/classifiers/index.html) - Filter images based on aesthetic quality
83-
- [NSFW Classification](https://docs.nvidia.com/nemo/curator/latest/curate-images/process-data/classifiers/index.html) - Remove inappropriate content from datasets
69+
Process large-scale video corpora with distributed, GPU-accelerated pipelines for world foundation models (WFMs).
8470

85-
- **Deduplication**
86-
- [Semantic Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/semdedup.html) - Remove visually similar images using embedding-based clustering
71+
| Category | Features | Documentation |
72+
|----------|----------|---------------|
73+
| **Data Loading** | Local paths • S3-compatible storage • HTTP(S) URLs | [Load Data](https://docs.nvidia.com/nemo/curator/latest/curate-video/load-data/index.html) |
74+
| **Clipping** | Fixed-stride splitting • Scene-change detection (TransNetV2) | [Clipping](https://docs.nvidia.com/nemo/curator/latest/curate-video/process-data/clipping.html) |
75+
| **Processing** | GPU H.264 encoding • Frame extraction • Motion filtering • Aesthetic filtering | [Processing](https://docs.nvidia.com/nemo/curator/latest/curate-video/process-data/filtering.html) |
76+
| **Embeddings** | InternVideo2 and Cosmos-Embed1 for clip-level embeddings | [Embeddings](https://docs.nvidia.com/nemo/curator/latest/curate-video/process-data/embeddings.html) |
77+
| **Deduplication** | K-means clustering • Pairwise similarity for near-duplicates | [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-video/process-data/dedup.html) |
8778

8879
---
8980

90-
## Module Ablation and Compute Performance
81+
### Audio Curation
9182

92-
The modules within NeMo Curator were primarily designed to process and curate high-quality documents at scale. To evaluate the quality of the data, we curated Common Crawl documents and conducted a series of ablation experiments. In these experiments, we trained a 357M-parameter GPT-style model using datasets generated at various stages of our data curation pipeline, which was implemented in NeMo Curator.
83+
Prepare high-quality speech datasets for automatic speech recognition (ASR) and multimodal AI training.
9384

94-
The following figure shows that the use of different data curation modules implemented in NeMo Curator led to improved model zero-shot downstream task performance.
85+
| Category | Features | Documentation |
86+
|----------|----------|---------------|
87+
| **Data Loading** | Local files • Custom manifests • Public datasets (FLEURS) | [Load Data](https://docs.nvidia.com/nemo/curator/latest/curate-audio/load-data/index.html) |
88+
| **ASR Processing** | NeMo Framework pretrained models • Automatic transcription | [ASR Inference](https://docs.nvidia.com/nemo/curator/latest/curate-audio/process-data/asr-inference/index.html) |
89+
| **Quality Assessment** | Word Error Rate (WER) calculation • Duration analysis • Quality-based filtering | [Quality Assessment](https://docs.nvidia.com/nemo/curator/latest/curate-audio/process-data/quality-assessment/index.html) |
90+
| **Integration** | Text curation workflow integration for multimodal pipelines | [Text Integration](https://docs.nvidia.com/nemo/curator/latest/curate-audio/process-data/text-integration/index.html) |
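
The Word Error Rate named in the Quality Assessment row is conventionally defined as (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is a minimal, dependency-free illustration of that standard definition. It is not NeMo Curator's implementation, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: whitespace tokenization, no text normalization.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")
```

A filtering stage would typically keep only utterances whose WER falls below a chosen threshold; the threshold itself depends on the dataset and is not specified in this README.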
9591

96-
<p align="center">
97-
<img src="./docs/_images/ablation.png" alt="drawing" width="700"/>
98-
</p>
92+
---
93+
94+
## Why NeMo Curator?
9995

100-
NeMo Curator leverages NVIDIA RAPIDS™ libraries like cuDF, cuML, and cuGraph along with Dask to scale workloads across multi-node, multi-GPU environments, significantly reducing data processing time. With NeMo Curator, developers achieve approximately 16× faster fuzzy‑deduplication on an 8 TB RedPajama‑v2 subset, with ~40% lower TCO and near‑linear scaling on 1–4 H100 80 GB nodes. Refer to the chart below to learn more details.
96+
### Performance at Scale
97+
98+
NeMo Curator leverages NVIDIA RAPIDS™ libraries such as cuDF, cuML, and cuGraph along with Ray to scale workloads across multi-node, multi-GPU environments.
99+
100+
**Proven Results:**
101+
- **16× faster** fuzzy deduplication on 8 TB RedPajama v2 (1.78 trillion tokens)
102+
- **40% lower** total cost of ownership (TCO) compared to CPU-based alternatives
103+
- **Near-linear scaling** from one to four H100 80 GB nodes (2.05 hrs → 0.50 hrs)
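
As a quick check on the "near-linear" claim, the two timings quoted above can be turned into a speedup and a parallel efficiency. This is a small arithmetic sketch using only the published figures; the helper name `scaling_efficiency` is hypothetical and not part of NeMo Curator.

```python
def scaling_efficiency(hours_base: float, hours_scaled: float, node_ratio: float):
    """Return (speedup, efficiency) when scaling out by `node_ratio`.

    Perfectly linear scaling gives speedup == node_ratio and efficiency == 1.0.
    """
    speedup = hours_base / hours_scaled
    efficiency = speedup / node_ratio
    return speedup, efficiency


if __name__ == "__main__":
    # Figures from the README: 2.05 hours on one node, 0.50 hours on four nodes.
    speedup, efficiency = scaling_efficiency(2.05, 0.50, node_ratio=4)
    print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.1%}")
```

Going from 2.05 hours to 0.50 hours is a 4.1× speedup on four times the hardware, which is consistent with near-linear scaling. Because the published times are rounded, an efficiency slightly above 100% should not be read as genuinely superlinear behavior.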
101104

102105
<p align="center">
103-
<img src="./docs/_images/text-benchmarks.png" alt="drawing" width="700"/>
106+
<img src="./docs/_images/text-benchmarks.png" alt="Performance benchmarks showing 16x speed improvement, 40% cost savings, and near-linear scaling" width="700"/>
104107
</p>
105108

106-
NeMo Curator exhibits near‑linear scaling for fuzzy deduplication. On an 8 TB RedPajama‑v2 subset (~1.78 trillion tokens), processing time drops from 2.05 hours on one H100 80 GB node to 0.50 hours on four nodes. Refer to the scaling chart below to learn more:
109+
### Quality Improvements
110+
111+
Data curation modules measurably improve model performance. In ablation studies using a 357M-parameter GPT model trained on curated Common Crawl data:
107112

108113
<p align="center">
109-
<img src="./docs/_images/scaling.png" alt="drawing" width="700"/>
114+
<img src="./docs/_images/ablation.png" alt="Model accuracy improvements across curation pipeline stages" width="700"/>
110115
</p>
111116

112-
## Contribute to NeMo Curator
117+
**Results:** Progressive improvements in zero-shot downstream task performance through text cleaning, deduplication, and quality filtering stages.
118+
119+
---
120+
121+
## Learn More
122+
123+
| Resource | Links |
124+
|----------|-------|
125+
| **Documentation** | [Main Docs](https://docs.nvidia.com/nemo/curator/latest/) • [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/index.html) • [Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/index.html) |
126+
| **Tutorials** | [Text](tutorials/text/) • [Image](tutorials/image/) • [Video](tutorials/video/) • [Audio](tutorials/audio/) |
127+
| **Deployment** | [Installation](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html) • [Infrastructure](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index.html) |
128+
| **Community** | [GitHub Discussions](https://github.com/NVIDIA-NeMo/Curator/discussions) • [Issues](https://github.com/NVIDIA-NeMo/Curator/issues) |
129+
130+
---
131+
132+
## Contribute
113133

114-
We welcome community contributions! Please refer to [CONTRIBUTING.md](https://github.com/NVIDIA/NeMo/blob/stable/CONTRIBUTING.md) for the process.
134+
We welcome community contributions! Please refer to [CONTRIBUTING.md](https://github.com/NVIDIA/NeMo/blob/stable/CONTRIBUTING.md) for guidelines.
