|
8 | 8 |
|
9 | 9 | </div>
|
10 | 10 |
|
11 |
| -# Accelerate Data Processing and Streamline Synthetic Data Generation with NVIDIA NeMo Curator |
| 11 | +# NVIDIA NeMo Curator |
12 | 12 |
|
13 |
| -NeMo Curator, part of the NVIDIA NeMo software suite for managing the AI agent lifecycle, is a Python library specifically designed for fast and scalable data processing and curation for generative AI use cases such as foundation language model pretraining, text-to-image model training, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT). |
| 13 | +**GPU-accelerated data curation for training better AI models, faster.** Scale from laptop to multi-node clusters with modular pipelines for text, images, video, and audio. |
14 | 14 |
|
15 |
| -It greatly accelerates data processing and curation by leveraging GPUs with [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids), resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence through the preparation of high-quality tokens. |
| 15 | +> *Part of the [NVIDIA NeMo](https://www.nvidia.com/en-us/ai-data-science/products/nemo/) software suite for managing the AI agent lifecycle.* |
16 | 16 |
|
17 |
| -NeMo Curator also provides pre-built pipelines for synthetic data generation for customization and evaluation of generative AI systems. You can use any OpenAI API compatible model and plug it in NeMo Curator's synthetic data generation pipelines to process and curate high-quality synthetic data for various use cases. |
| 17 | +## What You Can Do |
18 | 18 |
|
19 |
| -## Getting Started |
| 19 | +| Modality | Key Capabilities | Get Started | |
| 20 | +|----------|-----------------|-------------| |
| 21 | +| **Text** | Deduplication • Classification • Quality Filtering • Language Detection | [Text Guide](https://docs.nvidia.com/nemo/curator/latest/get-started/text.html) | |
| 22 | +| **Image** | Aesthetic Filtering • NSFW Detection • Embedding Generation • Deduplication | [Image Guide](https://docs.nvidia.com/nemo/curator/latest/get-started/image.html) | |
| 23 | +| **Video** | Scene Detection • Clip Extraction • Motion Filtering • Deduplication | [Video Guide](https://docs.nvidia.com/nemo/curator/latest/get-started/video.html) | |
| 24 | +| **Audio** | ASR Transcription • Quality Assessment • WER Filtering | [Audio Guide](https://docs.nvidia.com/nemo/curator/latest/get-started/audio.html) | |
20 | 25 |
|
21 |
| -New to NeMo Curator? Start with our quickstart guides for hands-on experience: |
| 26 | +## Quick Start |
22 | 27 |
|
23 |
| -- **[Text Curation Quickstart](https://docs.nvidia.com/nemo/curator/latest/get-started/text.html)** - Set up your environment and run your first text curation pipeline in under 30 minutes |
24 |
| -- **[Image Curation Quickstart](https://docs.nvidia.com/nemo/curator/latest/get-started/image.html)** - Learn to curate large-scale image-text datasets for generative model training |
| 28 | +```bash |
| 29 | +# Install for your modality |
| 30 | +uv pip install "nemo-curator[text_cuda12]" |
25 | 31 |
|
26 |
| -For production deployments and advanced configurations, see our [Setup & Deployment documentation](https://docs.nvidia.com/nemo/curator/latest/admin/index.html). |
| 32 | +# Run the quickstart example |
| 33 | +python tutorials/quickstart.py |
| 34 | +``` |
27 | 35 |
|
28 |
| ---- |
| 36 | +**Full setup:** [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html) • [Docker](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) • [Tutorials](tutorials/) |
29 | 37 |
|
30 |
| -## Key Features |
| 38 | +--- |
31 | 39 |
|
32 |
| -With NeMo Curator, you can process raw data and curate high-quality data for training and customizing generative AI models such as LLMs, VLMs and WFMs. NeMo Curator provides a collection of scalable data processing modules for text and image curation. |
| 40 | +## Features by Modality |
33 | 41 |
|
34 | 42 | ### Text Curation
|
35 |
| -All of our text pipelines have great multilingual support. With NeMo Curator, you can pick and choose the features you want and build your data curation pipelines. Text curation follows a three-stage workflow: **Load** → **Process** → **Generate**. A typical pipeline starts by downloading raw data from public resources, then applies cleaning and filtering steps, and optionally generates synthetic data for training enhancement. |
36 |
| - |
37 |
| -#### Load Data |
38 |
| -- **[Download and Extraction](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html)** - Default implementations for Common Crawl, Wikipedia, and ArXiv sources with easy customization for other sources |
39 |
| - |
40 |
| -#### Process Data |
41 |
| -- **Quality Assessment & Filtering** |
42 |
| - - [Heuristic Filtering](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/heuristic.html) - 30+ heuristic filters for punctuation density, length, and repetition analysis |
43 |
| - - [fastText Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/classifier.html) - Fast language and quality classification |
44 |
| - - [GPU-Accelerated Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) - Domain, Quality, Safety, Educational Content, Content Type, and Prompt Task/Complexity Classification |
45 |
| - |
46 |
| -- **Deduplication** |
47 |
| - - [Exact Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/gpudedup.html) - Remove identical documents efficiently |
48 |
| - - [Fuzzy Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/gpudedup.html) - MinHash Locality Sensitive Hashing with optional False Positive Check |
49 |
| - - [Semantic Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/semdedup.html) - GPU-accelerated semantic deduplication using RAPIDS cuML, cuDF, and PyTorch |
50 |
| - |
51 |
| -- **Content Processing & Cleaning** |
52 |
| - - [Text Cleaning](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/text-cleaning.html) - Remove improperly decoded Unicode characters, inconsistent line spacing, and excessive URLs |
53 |
| - - [PII Redaction](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/pii.html) - Identify and remove personally identifiable information from training datasets |
54 |
| - |
55 |
| -- **Specialized Processing** |
56 |
| - - [Language Identification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/language-management/index.html) - Accurate language detection using fastText |
57 |
| - - [Task Decontamination](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/specialized-processing/task-decontamination.html) - Remove potential evaluation data leakage from training datasets |
58 |
| - |
59 |
| -#### Generate Data |
60 |
| -- **[Synthetic Data Pipelines](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/index.html)** - Pre-built pipelines for generating high-quality synthetic training data: |
61 |
| - - [Open Q&A Generation](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/open-qa.html) - Create question-answer pairs for instruction tuning |
62 |
| - - [Math Problem Generation](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/math.html) - Generate mathematical problems for educational content |
63 |
| - - [Coding Tasks](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/python.html) - Create programming challenges and code examples |
64 |
| - - [Writing Prompts](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/writing-task.html) - Generate creative writing and content creation tasks |
65 |
| - - [Dialogue Generation](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/dialogue.html) - Create conversational data for chat models |
66 |
| - - [Nemotron Pipelines](https://docs.nvidia.com/nemo/curator/latest/curate-text/generate-data/pipelines/wikipedia.html) - Wikipedia-style rewriting and knowledge distillation |
| 43 | + |
| 44 | +Process and curate high-quality text datasets for large language model (LLM) training with multilingual support. |
| 45 | + |
| 46 | +| Category | Features | Documentation | |
| 47 | +|----------|----------|---------------| |
| 48 | +| **Data Sources** | Common Crawl • Wikipedia • ArXiv • Custom datasets | [Load Data](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html) | |
| 49 | +| **Quality Filtering** | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | [Quality Assessment](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/heuristic.html) | |
| 50 | +| **Deduplication** | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html) | |
| 51 | +| **Processing** | Text cleaning • Language identification | [Content Processing](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/text-cleaning.html) | |
67 | 52 |
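The "Fuzzy (MinHash LSH)" entry in the text table above can be illustrated with a small, CPU-only sketch. This is not NeMo Curator's API or implementation: the function names (`shingles`, `minhash_signature`, `candidate_pairs`) and the parameters (`k`, `num_hashes`, `bands`) are hypothetical and chosen only to show the general technique of hashing character shingles into a signature and bucketing signature bands to find near-duplicate candidates.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of character k-grams of whitespace/case-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(shingle_set, num_hashes=64):
    """MinHash signature: for each salted hash function, keep the minimum hash value."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt per hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for s in shingle_set
            )
        )
    return signature


def candidate_pairs(signatures, bands=16):
    """LSH banding: documents sharing any identical band become candidate duplicates."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Illustrative documents (made up for this sketch).
docs = {
    "a": "GPU-accelerated data curation for training better AI models.",
    "b": "GPU-accelerated  data curation for training better AI models.",  # extra space only
    "c": "Scene detection and clip extraction for video corpora.",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
pairs = candidate_pairs(signatures)
```

Candidate pairs would normally be verified afterwards (for example, by computing the exact Jaccard similarity of the shingle sets) before any document is dropped.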
|
68 | 53 | ---
|
69 | 54 |
|
70 | 55 | ### Image Curation
|
71 | 56 |
|
72 |
| -NeMo Curator provides powerful image curation features to curate high-quality image data for training generative AI models such as LLMs, VLMs, and WFMs. Image curation follows a **Load** → **Process** workflow: download datasets in WebDataset format, create embeddings, apply quality filters (NSFW and Aesthetic), and remove duplicates using semantic deduplication. |
| 57 | +Curate large-scale image datasets for vision language models (VLMs) and generative AI training. |
| 58 | + |
| 59 | +| Category | Features | Documentation | |
| 60 | +|----------|----------|---------------| |
| 61 | +| **Data Loading** | WebDataset format • Large-scale image-text pairs | [Load Data](https://docs.nvidia.com/nemo/curator/latest/curate-images/load-data/index.html) | |
| 62 | +| **Embeddings** | CLIP embeddings for semantic analysis | [Embeddings](https://docs.nvidia.com/nemo/curator/latest/curate-images/process-data/embeddings/index.html) | |
| 63 | +| **Filtering** | Aesthetic quality scoring • NSFW detection | [Filters](https://docs.nvidia.com/nemo/curator/latest/curate-images/process-data/filters/index.html) | |
73 | 64 |
|
74 |
| -#### Load Data |
75 |
| -- **[WebDataset Loading](https://docs.nvidia.com/nemo/curator/latest/curate-images/load-data/index.html)** - Load large-scale image-text datasets in WebDataset format |
| 65 | +--- |
76 | 66 |
|
77 |
| -#### Process Data |
78 |
| -- **Embeddings & Feature Extraction** |
79 |
| - - [Image Embedding Creation](https://docs.nvidia.com/nemo/curator/latest/curate-images/process-data/embeddings/index.html) - Generate CLIP embeddings for image analysis |
| 67 | +### Video Curation |
80 | 68 |
|
81 |
| -- **Quality Assessment & Filtering** |
82 |
| - - [Aesthetic Classification](https://docs.nvidia.com/nemo/curator/latest/curate-images/process-data/classifiers/index.html) - Filter images based on aesthetic quality |
83 |
| - - [NSFW Classification](https://docs.nvidia.com/nemo/curator/latest/curate-images/process-data/classifiers/index.html) - Remove inappropriate content from datasets |
| 69 | +Process large-scale video corpora with distributed, GPU-accelerated pipelines for world foundation models (WFMs). |
84 | 70 |
|
85 |
| -- **Deduplication** |
86 |
| - - [Semantic Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/semdedup.html) - Remove visually similar images using embedding-based clustering |
| 71 | +| Category | Features | Documentation | |
| 72 | +|----------|----------|---------------| |
| 73 | +| **Data Loading** | Local paths • S3-compatible storage • HTTP(S) URLs | [Load Data](https://docs.nvidia.com/nemo/curator/latest/curate-video/load-data/index.html) | |
| 74 | +| **Clipping** | Fixed-stride splitting • Scene-change detection (TransNetV2) | [Clipping](https://docs.nvidia.com/nemo/curator/latest/curate-video/process-data/clipping.html) | |
| 75 | +| **Processing** | GPU H.264 encoding • Frame extraction • Motion filtering • Aesthetic filtering | [Processing](https://docs.nvidia.com/nemo/curator/latest/curate-video/process-data/filtering.html) | |
| 76 | +| **Embeddings** | InternVideo2 and Cosmos-Embed1 for clip-level embeddings | [Embeddings](https://docs.nvidia.com/nemo/curator/latest/curate-video/process-data/embeddings.html) | |
| 77 | +| **Deduplication** | K-means clustering • Pairwise similarity for near-duplicates | [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-video/process-data/dedup.html) | |
87 | 78 |
|
88 | 79 | ---
|
89 | 80 |
|
90 |
| -## Module Ablation and Compute Performance |
| 81 | +### Audio Curation |
91 | 82 |
|
92 |
| -The modules within NeMo Curator were primarily designed to process and curate high-quality documents at scale. To evaluate the quality of the data, we curated Common Crawl documents and conducted a series of ablation experiments. In these experiments, we trained a 357M-parameter GPT-style model using datasets generated at various stages of our data curation pipeline, which was implemented in NeMo Curator. |
| 83 | +Prepare high-quality speech datasets for automatic speech recognition (ASR) and multimodal AI training. |
93 | 84 |
|
94 |
| -The following figure shows that the use of different data curation modules implemented in NeMo Curator led to improved model zero-shot downstream task performance. |
| 85 | +| Category | Features | Documentation | |
| 86 | +|----------|----------|---------------| |
| 87 | +| **Data Loading** | Local files • Custom manifests • Public datasets (FLEURS) | [Load Data](https://docs.nvidia.com/nemo/curator/latest/curate-audio/load-data/index.html) | |
| 88 | +| **ASR Processing** | NeMo Framework pretrained models • Automatic transcription | [ASR Inference](https://docs.nvidia.com/nemo/curator/latest/curate-audio/process-data/asr-inference/index.html) | |
| 89 | +| **Quality Assessment** | Word Error Rate (WER) calculation • Duration analysis • Quality-based filtering | [Quality Assessment](https://docs.nvidia.com/nemo/curator/latest/curate-audio/process-data/quality-assessment/index.html) | |
| 90 | +| **Integration** | Text curation workflow integration for multimodal pipelines | [Text Integration](https://docs.nvidia.com/nemo/curator/latest/curate-audio/process-data/text-integration/index.html) | |
95 | 91 |
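To make the "Word Error Rate (WER) calculation" and "Quality-based filtering" entries above concrete, here is a minimal sketch of the standard WER definition: word-level edit distance divided by the number of reference words. It is an illustration only and not NeMo Curator's API; the function names, the `reference`/`hypothesis` record keys, and the `max_wer` threshold are assumptions made for this example.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def filter_by_wer(samples, max_wer=0.25):
    """Keep samples whose ASR hypothesis is within max_wer of the reference text."""
    return [s for s in samples
            if word_error_rate(s["reference"], s["hypothesis"]) <= max_wer]


# Hypothetical records; the keys are assumptions for this sketch.
samples = [
    {"reference": "the cat sat on the mat", "hypothesis": "the cat sat on mat"},
    {"reference": "hello world", "hypothesis": "yellow word"},
]
kept = filter_by_wer(samples)
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so a threshold is usually chosen relative to the dataset rather than assumed to lie in a fixed range.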
|
96 |
| -<p align="center"> |
97 |
| - <img src="./docs/_images/ablation.png" alt="drawing" width="700"/> |
98 |
| -</p> |
| 92 | +--- |
| 93 | + |
| 94 | +## Why NeMo Curator? |
99 | 95 |
|
100 |
| -NeMo Curator leverages NVIDIA RAPIDS™ libraries like cuDF, cuML, and cuGraph along with Dask to scale workloads across multi-node, multi-GPU environments, significantly reducing data processing time. With NeMo Curator, developers achieve approximately 16× faster fuzzy‑deduplication on an 8 TB RedPajama‑v2 subset, with ~40% lower TCO and near‑linear scaling on 1–4 H100 80 GB nodes. Refer to the chart below to learn more details. |
| 96 | +### Performance at Scale |
| 97 | + |
| 98 | +NeMo Curator leverages NVIDIA RAPIDS™ libraries such as cuDF, cuML, and cuGraph along with Ray to scale workloads across multi-node, multi-GPU environments. |
| 99 | + |
| 100 | +**Proven Results:** |
| 101 | +- **16× faster** fuzzy deduplication on 8 TB RedPajama v2 (1.78 trillion tokens) |
| 102 | +- **40% lower** total cost of ownership (TCO) compared to CPU-based alternatives |
| 103 | +- **Near-linear scaling** from one to four H100 80 GB nodes (2.05 hrs → 0.50 hrs) |
101 | 104 |
|
102 | 105 | <p align="center">
|
103 |
| - <img src="./docs/_images/text-benchmarks.png" alt="drawing" width="700"/> |
| 106 | + <img src="./docs/_images/text-benchmarks.png" alt="Performance benchmarks showing 16x speed improvement, 40% cost savings, and near-linear scaling" width="700"/> |
104 | 107 | </p>
|
105 | 108 |
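The "near-linear scaling" claim can be checked directly from the two timings quoted above (2.05 hours on one node, 0.50 hours on four). The short sketch below computes speedup and parallel efficiency; the helper name `scaling_efficiency` is introduced here for illustration and is not part of NeMo Curator.

```python
def scaling_efficiency(base_hours, scaled_hours, base_nodes, scaled_nodes):
    """Return (speedup, efficiency) for a run scaled from base_nodes to scaled_nodes."""
    speedup = base_hours / scaled_hours
    ideal_speedup = scaled_nodes / base_nodes
    return speedup, speedup / ideal_speedup


# Timings taken from the benchmark figures quoted in this section.
speedup, efficiency = scaling_efficiency(2.05, 0.50, base_nodes=1, scaled_nodes=4)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.1%}")
```

With these inputs the speedup is about 4.1× on four nodes, which matches the "near-linear" description; an efficiency marginally above 100% may simply reflect rounding in the reported timings.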
|
106 |
| -NeMo Curator exhibits near‑linear scaling for fuzzy deduplication. On an 8 TB RedPajama‑v2 subset (~1.78 trillion tokens), processing time drops from 2.05 hours on one H100 80 GB node to 0.50 hours on four nodes. Refer to the scaling chart below to learn more: |
| 109 | +### Quality Improvements |
| 110 | + |
| 111 | +Data curation modules measurably improve model performance. In ablation studies using a 357M-parameter GPT model trained on curated Common Crawl data: |
107 | 112 |
|
108 | 113 | <p align="center">
|
109 |
| - <img src="./docs/_images/scaling.png" alt="drawing" width="700"/> |
| 114 | + <img src="./docs/_images/ablation.png" alt="Model accuracy improvements across curation pipeline stages" width="700"/> |
110 | 115 | </p>
|
111 | 116 |
|
112 |
| -## Contribute to NeMo Curator |
| 117 | +**Results:** Progressive improvements in zero-shot downstream task performance through text cleaning, deduplication, and quality filtering stages. |
| 118 | + |
| 119 | +--- |
| 120 | + |
| 121 | +## Learn More |
| 122 | + |
| 123 | +| Resource | Links | |
| 124 | +|----------|-------| |
| 125 | +| **Documentation** | [Main Docs](https://docs.nvidia.com/nemo/curator/latest/) • [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/index.html) • [Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/index.html) | |
| 126 | +| **Tutorials** | [Text](tutorials/text/) • [Image](tutorials/image/) • [Video](tutorials/video/) • [Audio](tutorials/audio/) | |
| 127 | +| **Deployment** | [Installation](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html) • [Infrastructure](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index.html) | |
| 128 | +| **Community** | [GitHub Discussions](https://github.com/NVIDIA-NeMo/Curator/discussions) • [Issues](https://github.com/NVIDIA-NeMo/Curator/issues) | |
| 129 | + |
| 130 | +--- |
| 131 | + |
| 132 | +## Contribute |
113 | 133 |
|
114 |
| -We welcome community contributions! Please refer to [CONTRIBUTING.md](https://github.com/NVIDIA/NeMo/blob/stable/CONTRIBUTING.md) for the process. |
| 134 | +We welcome community contributions! Please refer to [CONTRIBUTING.md](https://github.com/NVIDIA/NeMo/blob/stable/CONTRIBUTING.md) for guidelines. |