diff --git a/docs/_templates/autodoc2_index.rst b/docs/_templates/autodoc2_index.rst index eaf5bb9a5..1d45b9688 100644 --- a/docs/_templates/autodoc2_index.rst +++ b/docs/_templates/autodoc2_index.rst @@ -6,87 +6,62 @@ NeMo Curator's API reference provides comprehensive technical documentation for .. grid:: 1 2 2 2 :gutter: 3 - .. grid-item-card:: :octicon:`database;1.5em;sd-mr-1` Core Data Handling - :link: datasets/datasets + .. grid-item-card:: :octicon:`server;1.5em;sd-mr-1` Execution Backends + :link: backends/backends :link-type: doc :class-card: sd-border-0 - **Datasets & Download** + **Ray-based execution backends** - Essential classes for loading, managing, and downloading training data from various sources. + Adapters and executors for running pipelines at scale. - :bdg-secondary:`doc-dataset` :bdg-secondary:`parallel-dataset` :bdg-secondary:`arxiv` :bdg-secondary:`commoncrawl` + :bdg-secondary:`ray-data` :bdg-secondary:`xenna` - .. grid-item-card:: :octicon:`filter;1.5em;sd-mr-1` Data Processing - :link: filters/filters + .. grid-item-card:: :octicon:`workflow;1.5em;sd-mr-1` Pipeline + :link: pipeline/pipeline :link-type: doc :class-card: sd-border-0 - **Filters & Modifiers** + **Orchestrate end-to-end workflows** - Tools for cleaning, filtering, and transforming text data to improve quality and remove unwanted content. - - :bdg-secondary:`classifier-filter` :bdg-secondary:`heuristic-filter` :bdg-secondary:`pii-modifier` + Build and run pipelines composed of processing stages. - .. grid-item-card:: :octicon:`code;1.5em;sd-mr-1` Classification & Analysis - :link: classifiers/classifiers + .. grid-item-card:: :octicon:`stack;1.5em;sd-mr-1` Processing Stages + :link: stages/stages :link-type: doc :class-card: sd-border-0 - **AI-Powered Analysis** + **Download, transform, and write data** - Advanced classification tools and image processing capabilities for content analysis and quality assessment. + Modular stages for download/extract, text models/classifiers, I/O, and utilities. - :bdg-secondary:`aegis` :bdg-secondary:`content-type` :bdg-secondary:`domain-classifier` + :bdg-secondary:`download` :bdg-secondary:`text` :bdg-secondary:`io` :bdg-secondary:`modules` - .. grid-item-card:: :octicon:`shield-check;1.5em;sd-mr-1` Privacy & Security - :link: pii/pii + .. grid-item-card:: :octicon:`tasklist;1.5em;sd-mr-1` Tasks + :link: tasks/tasks :link-type: doc :class-card: sd-border-0 - **PII Detection & Redaction** + **Core data structures** - Identify and handle personally identifiable information in datasets with advanced recognition algorithms. - - :bdg-secondary:`recognizers` :bdg-secondary:`algorithms` :bdg-secondary:`redaction` + Document batches, file groups, and related interfaces passed between stages. - .. grid-item-card:: :octicon:`zap;1.5em;sd-mr-1` Synthetic Data - :link: synthetic/synthetic + .. grid-item-card:: :octicon:`gear;1.5em;sd-mr-1` Utilities + :link: utils/utils :link-type: doc :class-card: sd-border-0 - **Data Generation** + **Helper functions** - Create high-quality synthetic training data using advanced language models and generation techniques. - - :bdg-secondary:`generator` :bdg-secondary:`nemotron` :bdg-secondary:`mixtral` - - .. grid-item-card:: :octicon:`tools;1.5em;sd-mr-1` Advanced Processing - :link: modules/modules - :link-type: doc - :class-card: sd-border-0 - - **Deduplication & Modules** - - Advanced processing modules including semantic deduplication, fuzzy matching, and data pipeline components. 
- - :bdg-secondary:`semantic-dedup` :bdg-secondary:`fuzzy-dedup` :bdg-secondary:`add-id` + File, performance, and operation utilities used across the pipeline. .. toctree:: :maxdepth: 1 :caption: API Modules :hidden: - datasets/datasets - download/download - filters/filters - modifiers/modifiers - modules/modules - classifiers/classifiers - image/image - pii/pii - synthetic/synthetic - services/services - nemo_run/nemo_run + backends/backends + pipeline/pipeline + stages/stages tasks/tasks utils/utils diff --git a/docs/conf.py b/docs/conf.py index 37212b4b7..d4f35cf18 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -141,31 +141,22 @@ # -- Options for Autodoc2 --------------------------------------------------- sys.path.insert(0, os.path.abspath("..")) +# Ensure the ray-curator package is importable (package lives under ray-curator/ray_curator) +sys.path.insert(0, os.path.abspath(os.path.join("..", "ray-curator"))) -# Document individual submodules instead of the top-level package -# This should generate shorter filenames without the nemo_curator. prefix +# Document `ray_curator` subpackages instead of the legacy `nemo_curator` +# This should generate shorter filenames without the ray_curator. prefix autodoc2_packages_list = [ - # Core data handling - "../nemo_curator/datasets", - "../nemo_curator/download", - # Data processing - "../nemo_curator/filters", - "../nemo_curator/modifiers", - "../nemo_curator/modules", - # Classification and analysis - "../nemo_curator/classifiers", - "../nemo_curator/image", - # Privacy and security - "../nemo_curator/pii", - # Synthetic data - "../nemo_curator/synthetic", - # Services and infrastructure - "../nemo_curator/services", - "../nemo_curator/nemo_run", - # Evaluation and tasks - "../nemo_curator/tasks", - # Utilities - "../nemo_curator/utils", + # Execution backends and adapters + "../ray-curator/ray_curator/backends", + # Pipeline orchestration + "../ray-curator/ray_curator/pipeline", + # All processing stages (download/extract, modules, text, io, etc.) + "../ray-curator/ray_curator/stages", + # Core task data structures + "../ray-curator/ray_curator/tasks", + # Shared utilities + "../ray-curator/ray_curator/utils", ] # Check if any of the packages actually exist before enabling autodoc2 diff --git a/docs/curate-text/load-data/arxiv.md b/docs/curate-text/load-data/arxiv.md index 60075b572..b934381e5 100644 --- a/docs/curate-text/load-data/arxiv.md +++ b/docs/curate-text/load-data/arxiv.md @@ -1,165 +1,166 @@ --- -description: "Download and extract text from arXiv academic papers using NeMo Curator with LaTeX processing and automatic metadata extraction" +description: "Download and extract text from arXiv using ray-curator's pipeline framework" categories: ["how-to-guides"] -tags: ["arxiv", "academic-papers", "latex", "pdf", "data-loading", "scientific-data"] +tags: ["arxiv", "academic-papers", "latex", "data-loading", "scientific-data", "ray-curator"] personas: ["data-scientist-focused", "mle-focused"] difficulty: "intermediate" content_type: "how-to" modality: "text-only" --- -(text-load-data-arxiv)= # ArXiv -Download and extract text from ArXiv papers using NeMo Curator utilities. - -ArXiv is a free distribution service and open-access archive for scholarly articles, primarily in fields like physics, mathematics, computer science, and more. ArXiv contains millions of scholarly papers, most of them available in LaTeX source format. 
- -## How it Works - -NeMo Curator simplifies the process of: - -- Downloading ArXiv papers from S3 -- Extracting text from LaTeX source files -- Converting the content to a standardized format for further processing - -## Before You Start - -ArXiv papers are hosted on Amazon S3, so you'll need to have: - -1. Properly configured AWS credentials in `~/.aws/config` -2. [s5cmd](https://github.com/peak/s5cmd) installed (pre-installed in the NVIDIA NeMo Framework Container) - ---- - -## Usage +(text-load-data-arxiv)= -Here's how to download and extract ArXiv data using NeMo Curator: +Download and extract text from ArXiv LaTeX source bundles using ray-curator's pipeline framework. -::::{tab-set} +ArXiv hosts millions of scholarly papers, typically distributed as LaTeX source inside `.tar` archives under the `s3://arxiv/src/` requester-pays bucket. -:::{tab-item} Python +## How it Works -```python -from nemo_curator.utils.distributed_utils import get_client -from nemo_curator.download import download_arxiv +The ArXiv pipeline in ray-curator consists of four stages: -# Initialize a Dask client -client = get_client(cluster_type="cpu") +1. URL Generation: Lists available ArXiv source tar files from the S3 bucket +2. Download: Downloads `.tar` archives via s5cmd (requester-pays) +3. Iteration: Extracts LaTeX projects and yields per-paper records +4. Extraction: Cleans LaTeX and produces plain text -# Download and extract ArXiv papers -arxiv_dataset = download_arxiv(output_path="/extracted/output/folder") +Internals (implemented): -# Write the dataset to disk -arxiv_dataset.to_json(output_path="/extracted/output/folder", write_to_filename=True) -``` +- URL generation: `ray_curator/stages/download/text/arxiv/url_generation.py` +- Download: `ray_curator/stages/download/text/arxiv/download.py` +- Iterator: `ray_curator/stages/download/text/arxiv/iterator.py` +- Extractor: `ray_curator/stages/download/text/arxiv/extract.py` +- Composite stage: `ray_curator/stages/download/text/arxiv/stage.py` -::: +## Before You Start -:::{tab-item} CLI +You must have s5cmd installed and AWS credentials configured for requester-pays. -```bash -download_and_extract \ - --input-url-file=./arxiv_urls.txt \ - --builder-config-file=./config/arxiv_builder.yaml \ - --output-json-dir=/datasets/arxiv/json -``` +- Install s5cmd: see `https://github.com/peak/s5cmd` +- Configure AWS credentials in your environment (or `~/.aws/credentials`) with access to requester-pays buckets -The config file should look like: +```{admonition} S3 Requester Pays +:class: tip -```yaml -download_module: nemo_curator.download.arxiv.ArxivDownloader -download_params: {} -iterator_module: nemo_curator.download.arxiv.ArxivIterator -iterator_params: {} -extract_module: nemo_curator.download.arxiv.ArxivExtractor -extract_params: {} +The ArXiv `s3://arxiv/src/` bucket is requester-pays. All listing and copy operations set requester-pays via s5cmd. ``` -::: - -:::: +--- -If you've already downloaded and extracted ArXiv data to the specified output folder, NeMo Curator will read from those files instead of downloading them again. +## Usage -```{admonition} Text Processing with Stop Words -:class: tip +Create and run an ArXiv processing pipeline and write outputs to JSONL: -When processing academic papers from ArXiv, you may want to customize text extraction and analysis using stop words. Stop words can help identify section boundaries, distinguish main content from references, and support language-specific processing. 
For a comprehensive guide to stop words in NeMo Curator, see {ref}`Stop Words in Text Processing `. +```python +from ray_curator.pipeline import Pipeline +from ray_curator.backends.experimental.ray_data import RayDataExecutor +from ray_curator.stages.download.text.arxiv import ArxivDownloadExtractStage +from ray_curator.stages.io.writer import JsonlWriter +from ray_curator.tasks import EmptyTask + +def main(): + pipeline = Pipeline( + name="arxiv_pipeline", + description="Download and process ArXiv LaTeX sources" + ) + + # Add ArXiv stage + arxiv_stage = ArxivDownloadExtractStage( + download_dir="./arxiv_downloads", + url_limit=5, # optional: number of tar files to process + record_limit=1000, # optional: max papers per tar + add_filename_column=True, + verbose=True, + ) + pipeline.add_stage(arxiv_stage) + + # Add writer stage + writer = JsonlWriter(output_dir="./arxiv_output") + pipeline.add_stage(writer) + + # Execute + executor = RayDataExecutor() + results = pipeline.run(executor, initial_tasks=[EmptyTask]) + print(f"Completed with {len(results) if results else 0} output files") + +if __name__ == "__main__": + main() ``` ### Parameters -```{list-table} ArXiv Download Parameters +```{list-table} ArxivDownloadExtractStage Parameters :header-rows: 1 -:widths: 20 20 40 20 +:widths: 25 20 35 20 * - Parameter - Type - Description - Default -* - `output_path` +* - `download_dir` - str - - Path where the extracted files will be placed - - Required -* - `output_type` - - Literal["jsonl", "parquet"] - - File format for storing data - - "jsonl" -* - `raw_download_dir` - - Optional[str] - - Directory to specify where to download the raw ArXiv files - - None -* - `keep_raw_download` - - bool - - Whether to keep the raw downloaded files - - False -* - `force_download` - - bool - - Whether to force re-download even if files exist - - False + - Directory to store downloaded `.tar` files + - "./arxiv_downloads" * - `url_limit` - - Optional[int] - - Limit the number of papers downloaded (useful for testing) + - int | None + - Maximum number of ArXiv tar files to download (useful for testing) - None * - `record_limit` - - Optional[int] - - Limit the number of records processed + - int | None + - Maximum number of papers to extract per tar file - None +* - `add_filename_column` + - bool | str + - Whether to add a source filename column to output; if str, use it as the column name + - True (column name defaults to `file_name`) +* - `log_frequency` + - int + - How often to log progress while iterating papers + - 1000 +* - `verbose` + - bool + - Enable verbose logging during download + - False ``` -## Output Format - -NeMo Curator extracts and processes the main text content from LaTeX source files. The extractor focuses on the body text of papers, automatically removing: +```{note} +URL generation and download use s5cmd with requester-pays to list and copy from `s3://arxiv/src/`. +``` -- Comments and LaTeX markup -- Content before the first section header -- Bibliography and appendix sections -- LaTeX macro definitions (while expanding their usage) +## Output Format -```{admonition} Limited Metadata Extraction -:class: note +The extractor returns per-paper text; the filename column is optionally added by the pipeline: -The current ArXiv implementation focuses on text extraction and does not parse document metadata like titles, authors, or categories from the LaTeX source. Only the processed text content and basic file identifiers are returned. 
+```json +{ + "text": "Main body text extracted from LaTeX after cleaning...", + "file_name": "arXiv-2401.01234.tar" +} ``` -```{list-table} ArXiv Output Fields +```{list-table} Output Fields :header-rows: 1 -:widths: 20 20 60 +:widths: 20 80 * - Field - - Type - Description * - `text` - - str - - The main text content extracted from LaTeX files (cleaned and processed) -* - `id` - - str - - A unique identifier for the paper (formatted ArXiv ID) -* - `source_id` - - str - - The source tar file name where the paper was found + - Extracted and cleaned paper text (LaTeX macros inlined where supported, comments and references removed) * - `file_name` - - str - - The filename used for the output file -``` \ No newline at end of file + - Optional. Name of the source tar file (enabled by `add_filename_column`) +``` + +```{admonition} Intermediate Fields +:class: note + +During iteration the pipeline yields `id` (ArXiv identifier), `source_id` (tar basename), and `content` (list of LaTeX files). The final extractor stage emits only `text` plus the optional filename column. +``` + +## Advanced Notes + +- The pipeline validates paths and extracts tar files with path traversal protection. +- The iterator and extractor adapt RedPajama preprocessing with safety and robustness improvements. +- Macro expansion handles non-argument macros; macros with arguments are not expanded. + +See also: {ref}`Common Crawl `, {ref}`Wikipedia ` diff --git a/docs/curate-text/load-data/common-crawl.md b/docs/curate-text/load-data/common-crawl.md index 7075759b6..b753c01b9 100644 --- a/docs/curate-text/load-data/common-crawl.md +++ b/docs/curate-text/load-data/common-crawl.md @@ -1,7 +1,7 @@ --- -description: "Download and extract text from Common Crawl web archives with language detection and multiple text extraction algorithms" +description: "Download and extract text from Common Crawl web archives using ray-curator's pipeline framework" categories: ["how-to-guides"] -tags: ["common-crawl", "web-data", "warc", "language-detection", "distributed", "html-extraction"] +tags: ["common-crawl", "web-data", "warc", "language-detection", "distributed", "html-extraction", "pipeline"] personas: ["data-scientist-focused", "mle-focused"] difficulty: "intermediate" content_type: "how-to" @@ -9,163 +9,140 @@ modality: "text-only" --- (text-load-data-common-crawl)= + # Common Crawl -Download and extract text from Common Crawl snapshots using NeMo Curator utilities. +Download and extract text from Common Crawl snapshots using ray-curator's pipeline framework. -Common Crawl provides petabytes of web data collected over years of web crawling. The data is stored in a compressed web archive format (`.warc.gz`), which needs to be processed to extract useful text for language model training. +Common Crawl provides petabytes of web data collected over years of web crawling. The data uses a compressed web archive format (`.warc.gz`), which requires processing to extract useful text for language model training. ## How it Works -NeMo Curator's Common Crawl extraction process: +Ray-curator's Common Crawl processing pipeline consists of four sequential stages: + +1. **URL Generation**: Generates WARC file URLs from Common Crawl's index for the specified snapshot range +2. **Download**: Downloads the compressed WARC files from Common Crawl's servers (optionally using S3 for faster downloads) +3. **Iteration**: Extracts individual records from WARC files and decodes HTML content +4. 
**Extraction**: Performs language detection and extracts clean text using configurable HTML extraction algorithms -1. Downloads the compressed WARC files from Common Crawl's servers (optionally using S3 for faster downloads) -2. Decodes the HTML within each record from binary to text -3. Performs language detection using [pyCLD2](https://github.com/aboSamoor/pycld2) -4. Extracts the relevant text using one of several text extraction algorithms -5. Outputs the extracted text as `.jsonl` files for further processing +The pipeline outputs structured data that you can write to JSONL or Parquet files for further processing. --- ## Usage -Here's how to download and extract Common Crawl data: - -::::{tab-set} +Here's how to create and run a Common Crawl processing pipeline: -:::{tab-item} Python ```python -import os -from nemo_curator import get_client -from nemo_curator.download import download_common_crawl -from nemo_curator.datasets import DocumentDataset +from ray_curator.pipeline import Pipeline +from ray_curator.backends.experimental.ray_data import RayDataExecutor +from ray_curator.stages.download.text.common_crawl import CommonCrawlDownloadExtractStage +from ray_curator.stages.io.writer import JsonlWriter +from ray_curator.tasks import EmptyTask def main(): - # Initialize a Dask client - client = get_client(cluster_type="cpu") - - # Set parameters for downloading - output_path = "/extracted/output/folder" - start_snapshot = "2020-50" - end_snapshot = "2021-04" - output_type = "jsonl" - os.makedirs(output_path, exist_ok=True) - - # Download and extract Common Crawl data - common_crawl_dataset = download_common_crawl( - output_path, start_snapshot, end_snapshot, output_type=output_type + # Create pipeline + pipeline = Pipeline( + name="common_crawl_pipeline", + description="Download and process Common Crawl data" ) - - # Write the dataset to disk - common_crawl_dataset.to_json(output_path=output_path, write_to_filename=True) - print("Extracted dataset saved to:", output_path) + + # Add Common Crawl processing stage + cc_stage = CommonCrawlDownloadExtractStage( + start_snapshot="2020-50", # YYYY-WW format for CC-MAIN + end_snapshot="2020-50", + download_dir="./cc_downloads", + crawl_type="main", # or "news" + use_aws_to_download=True, # Faster S3 downloads (requires s5cmd) + url_limit=10, # Limit number of WARC files for testing + record_limit=1000, # Limit records per WARC file + ) + pipeline.add_stage(cc_stage) + + # Add output writer stage + writer = JsonlWriter(output_dir="./cc_output") + pipeline.add_stage(writer) + + # Create executor and run pipeline + executor = RayDataExecutor() + results = pipeline.run(executor, initial_tasks=[EmptyTask]) + + print(f"Pipeline completed. 
Results: {len(results) if results else 0} output files") if __name__ == "__main__": main() ``` -::: -:::{tab-item} CLI -First, generate a list of URLs: +### Writing to Parquet -```bash -get_common_crawl_urls \ - --starting-snapshot="2020-50" \ - --ending-snapshot="2020-50" \ - --output-warc-url-file=./url_data/warc_urls_cc_2020_50.txt -``` - -Then download and extract: +To write Parquet instead of JSONL, use `ParquetWriter`: -```bash -download_and_extract \ - --input-url-file=./url_data/warc_urls_cc_2020_50.txt \ - --builder-config-file=./config/cc_warc_builder.yaml \ - --output-json-dir=/datasets/CC-MAIN-2020-50/json -``` - -The config file should look like: - -```yaml -download_module: nemo_curator.download.commoncrawl.CommonCrawlWARCDownloader -download_params: - aws: True # Optional: Set to True to use S3 for faster downloads -iterator_module: nemo_curator.download.commoncrawl.CommonCrawlWARCIterator -iterator_params: {} -extract_module: nemo_curator.download.commoncrawl.CommonCrawlWARCExtractor -extract_params: {} -``` +```python +from ray_curator.stages.io.writer import ParquetWriter -```{note} -The `download_params` section can include optional parameters like `aws: True` for S3 downloads or `verbose: True` for detailed logging. If no custom parameters are needed, use `download_params: {}`. +# Replace the JSONL writer with ParquetWriter +writer = ParquetWriter(output_dir="./cc_output_parquet") +pipeline.add_stage(writer) ``` -::: - -:::: - ### Parameters -```{list-table} Common Crawl Download Parameters +```{list-table} CommonCrawlDownloadExtractStage Parameters :header-rows: 1 -:widths: 20 20 40 20 +:widths: 25 20 35 20 * - Parameter - Type - Description - Default -* - `output_path` - - str - - Path where the extracted files will be placed - - Required * - `start_snapshot` - str - - First Common Crawl snapshot to include (format: "YYYY-WW" for CC-MAIN, "YYYY-MM" for CC-NEWS) + - First snapshot to include (format: "YYYY-WW" for main, "YYYY-MM" for news) - Required * - `end_snapshot` - str - - Last Common Crawl snapshot to include + - Last snapshot to include (same format as start_snapshot) + - Required +* - `download_dir` + - str + - Directory to store downloaded WARC files - Required -* - `output_type` - - Literal["jsonl", "parquet"] - - File format for storing data - - "jsonl" -* - `algorithm` - - HTMLExtractorAlgorithm - - Text extraction algorithm to use (JusTextExtractor, ResiliparseExtractor, or TrafilaturaExtractor) +* - `crawl_type` + - Literal["main", "news"] + - Whether to use CC-MAIN or CC-NEWS dataset + - "main" +* - `html_extraction` + - HTMLExtractorAlgorithm | str | None + - Text extraction algorithm to use - JusTextExtractor() -* - `stop_lists` - - Optional[Dict[str, frozenset]] - - Dictionary of language-specific stop words +* - `html_extraction_kwargs` + - dict | None + - Additional arguments for the HTML extractor - None -* - `news` - - bool - - Whether to use CC-NEWS dataset instead of CC-MAIN - - False -* - `aws` - - bool - - Whether to download from S3 using s5cmd instead of HTTPS (requires s5cmd to be installed) - - False -* - `raw_download_dir` - - Optional[str] - - Directory to store raw WARC files +* - `stop_lists` + - dict[str, frozenset[str]] | None + - Language-specific stop words for text quality assessment - None -* - `keep_raw_download` +* - `use_aws_to_download` - bool - - Whether to keep the raw downloaded files + - Use S3 downloads via s5cmd instead of HTTPS (requires s5cmd installation) - False -* - `force_download` +* - `verbose` - bool - - 
Whether to force re-download even if files exist + - Enable verbose logging for download operations - False * - `url_limit` - - Optional[int] - - Maximum number of WARC files to download + - int | None + - Maximum number of WARC files to download (useful for testing) - None * - `record_limit` - - Optional[int] - - Maximum number of records to extract per file + - int | None + - Maximum number of records to extract per WARC file - None +* - `add_filename_column` + - bool | str + - Whether to add source filename column to output; if str, uses it as the column name (default name: "file_name") + - True ``` ```{admonition} Snapshot Availability @@ -176,71 +153,174 @@ Not every year and week has a snapshot. Ensure your range includes at least one ## Output Format -The extracted text is stored in `.jsonl` files with the following format: +The pipeline processes Common Crawl data through several stages, ultimately producing structured documents. The extracted text includes the following fields: ```json { - "text": "Extracted web page content...", - "warc_id": "a515a7b6-b6ec-4bed-998b-8be2f86f8eac", - "source_id": "CC-MAIN-20201123153826-20201123183826-00000.warc.gz", "url": "http://example.com/page.html", + "warc_id": "a515a7b6-b6ec-4bed-998b-8be2f86f8eac", + "source_id": "CC-MAIN-20201123153826-20201123183826-00000.warc.gz", "language": "ENGLISH", - "file_name": "CC-MAIN-20201123153826-20201123183826-00000.warc.gz.jsonl" + "text": "Extracted web page content..." } ``` +```{list-table} Output Fields +:header-rows: 1 +:widths: 20 80 + +* - Field + - Description +* - `url` + - Original URL of the web page +* - `warc_id` + - Unique identifier for the WARC record +* - `source_id` + - Name of the source WARC file +* - `language` + - Detected language of the content (e.g., "ENGLISH", "SPANISH") +* - `text` + - Extracted and cleaned text content +``` + +If you enable `add_filename_column`, the output includes an extra field `file_name` (or your custom column name). + ## Customization Options -### Text Extraction +### HTML Text Extraction Algorithms -NeMo Curator supports multiple HTML text extraction algorithms: +Ray-curator supports several HTML text extraction algorithms, each with different strengths: -1. **JusTextExtractor** (default): Uses [jusText](https://github.com/miso-belica/jusText) to extract main content -2. **ResiliparseExtractor**: Uses [Resiliparse](https://github.com/chatnoir-eu/chatnoir-resiliparse) for extraction -3. 
**TrafilaturaExtractor**: Uses [Trafilatura](https://trafilatura.readthedocs.io/en/latest/) for extraction +```{list-table} Available HTML Extractors +:header-rows: 1 +:widths: 25 25 50 + +* - Extractor + - Library + - Best For +* - `JusTextExtractor` + - [jusText](https://github.com/miso-belica/jusText) + - General web content, good boilerplate removal +* - `ResiliparseExtractor` + - [Resiliparse](https://github.com/chatnoir-eu/chatnoir-resiliparse) + - High-performance extraction, research applications +* - `TrafilaturaExtractor` + - [Trafilatura](https://trafilatura.readthedocs.io/) + - News articles, blog posts, high-quality text +``` -You can select a different extractor as follows: +#### Configuring HTML Extractors ```python -from nemo_curator.download import ( +from ray_curator.stages.download.text.html_extractors import ( ResiliparseExtractor, - TrafilaturaExtractor, - download_common_crawl + TrafilaturaExtractor ) # Use Resiliparse for extraction -extraction_algorithm = ResiliparseExtractor() - -common_crawl_dataset = download_common_crawl( - output_path, - start_snapshot, - end_snapshot, - output_type=output_type, - algorithm=extraction_algorithm, +cc_stage = CommonCrawlDownloadExtractStage( + start_snapshot="2020-50", + end_snapshot="2020-50", + download_dir="./downloads", + html_extraction=ResiliparseExtractor( + required_stopword_density=0.25, + main_content=True + ) +) + +# Or use Trafilatura with custom parameters +cc_stage = CommonCrawlDownloadExtractStage( + start_snapshot="2020-50", + end_snapshot="2020-50", + download_dir="./downloads", + html_extraction=TrafilaturaExtractor( + min_extracted_size=200, + max_repetitions=3 + ) ) ``` -Each extractor has unique parameters -- check their docstrings for details. +```{note} +When `html_extraction` is passed as an extractor instance (for example, `JusTextExtractor()`), the `html_extraction_kwargs` parameter is ignored. To customize the extractor in this case, pass keyword arguments directly to the extractor constructor. +``` ### Language Processing -You can customize language detection and extraction by providing [stop words](text-process-data-languages-stop-words) for different languages: +You can customize language detection and extraction by providing stop words for different languages: ```python -from nemo_curator.download import download_common_crawl - # Define custom stop words for specific languages -stop_lists = {"ENGLISH": frozenset(["the", "and", "is", "in", "for", "where", "when", "to", "at"])} - -common_crawl = download_common_crawl( - "/extracted/output/folder", - "2020-50", - "2021-04", - output_type="jsonl", - stop_lists=stop_lists, +stop_lists = { + "ENGLISH": frozenset(["the", "and", "is", "in", "for", "where", "when", "to", "at"]), + "SPANISH": frozenset(["el", "la", "de", "que", "y", "en", "un", "es", "se", "no"]) +} + +cc_stage = CommonCrawlDownloadExtractStage( + start_snapshot="2020-50", + end_snapshot="2020-50", + download_dir="./downloads", + stop_lists=stop_lists ) ``` ```{note} -If no custom stop lists are provided, NeMo Curator uses jusText's default stop lists with additional support for Thai, Chinese, and Japanese languages. +If no custom stop lists are provided, ray-curator uses jusText's default stop lists with additional support for Thai, Chinese, and Japanese languages. 
+``` + +## Advanced Usage + +### Using Different Executors + +Ray-curator supports several execution backends: + +```python +# Ray Data executor (recommended for most use cases) +from ray_curator.backends.experimental.ray_data import RayDataExecutor +executor = RayDataExecutor() + +# Xenna executor (for specialized workflows) +from ray_curator.backends.xenna import XennaExecutor +executor = XennaExecutor() +``` + +### Processing CC-NEWS Data + +For Common Crawl News data, use the `news` crawl type with month-based snapshots: + +```python +cc_stage = CommonCrawlDownloadExtractStage( + start_snapshot="2020-08", # YYYY-MM format for CC-NEWS + end_snapshot="2020-10", + download_dir="./news_downloads", + crawl_type="news" # Use CC-NEWS instead of CC-MAIN +) +``` + +### Large-Scale Processing + +For production workloads, consider these optimizations: + +```python +cc_stage = CommonCrawlDownloadExtractStage( + start_snapshot="2020-50", + end_snapshot="2020-52", + download_dir="/fast_storage/cc_downloads", + use_aws_to_download=True, # Faster S3 downloads + verbose=False, # Reduce logging overhead + # Remove limits for full processing + # url_limit=None, + # record_limit=None +) ``` + +::::{admonition} S3 Download Requirements +:class: tip + +To use `use_aws_to_download=True`, you must install [s5cmd](https://github.com/peak/s5cmd): + +```bash +# Install s5cmd for faster S3 downloads +go install github.com/peak/s5cmd/v2@latest +``` + +:::: diff --git a/docs/curate-text/load-data/custom.md b/docs/curate-text/load-data/custom.md index 0fb4bbb1e..7c27cd978 100644 --- a/docs/curate-text/load-data/custom.md +++ b/docs/curate-text/load-data/custom.md @@ -1,7 +1,7 @@ --- -description: "Load and process custom datasets using NeMo Curator's extensible framework with custom downloaders, iterators, and extractors" +description: "Create custom data loading pipelines using Ray Curator's stage-based architecture with composable components" categories: ["how-to-guides"] -tags: ["custom-data", "extensible", "downloaders", "iterators", "extractors", "framework"] +tags: ["custom-data", "stages", "pipelines", "data-loading", "ray-curator"] personas: ["data-scientist-focused", "mle-focused"] difficulty: "advanced" content_type: "how-to" @@ -9,96 +9,311 @@ modality: "text-only" --- (text-load-data-custom)= + # Custom Data Loading -Load and process your own custom datasets using NeMo Curator's extensible framework. This guide explains how to implement custom data loaders that integrate with NeMo Curator's distributed processing capabilities. +Create custom data loading pipelines using Ray Curator's modular stage-based architecture. This guide shows how to build custom components that integrate seamlessly with Ray Curator's distributed processing framework. + +## How It Works + +Ray Curator uses a **4-step pipeline pattern** for custom data loading: + +1. **URL Generation**: Generate URLs from configuration or input parameters +2. **Download**: Download files from URLs to local storage +3. **Iteration**: Extract raw records from downloaded files +4. **Extraction** (Optional): Transform raw records into final structured format + +Each step uses an abstract base class with corresponding processing stages that compose into pipelines. 
+ +--- + +## Architecture Overview + +### Core Components -## How it Works +- **Tasks**: Data containers that flow through the pipeline (`DocumentBatch`, `FileGroupTask`) +- **Stages**: Processing units that transform tasks (`ProcessingStage` subclasses) +- **Pipelines**: Compositions of stages executed sequentially +- **Executors**: Runtime backends that execute pipelines (Ray Data, Cosmos-Xenna) -NeMo Curator's custom data loading process: +### Data Flow -1. Downloads data from your source using a custom `DocumentDownloader` -2. Iterates through the downloaded data using a custom `DocumentIterator` -3. Extracts text using a custom `DocumentExtractor` -4. Outputs the processed data in JSONL or Parquet format +```text +_EmptyTask → FileGroupTask → DocumentBatch + ↓ ↓ ↓ +URLGeneration → Download → Iterate → Extract +``` --- -## Usage +## Implementation Guide + +### 1. Create Directory Structure + +```text +your_data_source/ +├── __init__.py +├── stage.py # Main composite stage +├── url_generation.py # URL generation logic +├── download.py # Download implementation +├── iterator.py # File iteration logic +└── extract.py # Data extraction logic (optional) +``` + +### 2. Build Core Components + +#### URL Generator (`url_generation.py`) + +```python +from ray_curator.stages.download.text import URLGenerator + +class CustomURLGenerator(URLGenerator): + def __init__(self, config_param: str): + self.config_param = config_param + + def generate_urls(self) -> list[str]: + """Generate list of URLs to download.""" + # Your URL generation logic here + return [ + "https://example.com/dataset1.zip", + "https://example.com/dataset2.zip", + ] +``` + +#### Document Download Handler (`download.py`) + +```python +import requests +from ray_curator.stages.download.text import DocumentDownloader + +class CustomDownloader(DocumentDownloader): + def __init__(self, download_dir: str, verbose: bool = False): + super().__init__(download_dir, verbose) + + def _get_output_filename(self, url: str) -> str: + """Extract filename from URL.""" + return url.split('/')[-1] + + def _download_to_path(self, url: str, path: str) -> tuple[bool, str | None]: + """Download file from URL to local path.""" + try: + response = requests.get(url, stream=True, timeout=30) + response.raise_for_status() + + with open(path, 'wb') as f: + for chunk in response.iter_content(chunk_size=8192): + f.write(chunk) + + return True, None + except Exception as e: + return False, str(e) +``` + +#### Document Iterator (`iterator.py`) + +```python +import json +from collections.abc import Iterator +from typing import Any +from ray_curator.stages.download.text import DocumentIterator + +class CustomIterator(DocumentIterator): + def __init__(self, record_format: str = "jsonl"): + self.record_format = record_format + + def iterate(self, file_path: str) -> Iterator[dict[str, Any]]: + """Iterate over records in a file.""" + if self.record_format == "jsonl": + with open(file_path, 'r', encoding='utf-8') as f: + for line in f: + if line.strip(): + yield json.loads(line) + # Add other format handlers as needed + + def output_columns(self) -> list[str]: + """Define output columns.""" + return ["content", "metadata", "id"] +``` + +#### Document Extractor (`extract.py`) + +```python +from typing import Any +from ray_curator.stages.download.text import DocumentExtractor + +class CustomExtractor(DocumentExtractor): + def extract(self, record: dict[str, str]) -> dict[str, Any] | None: + """Transform raw record to final format.""" + # Skip invalid records + if not 
record.get("content"): + return None + + # Extract and clean text + cleaned_text = self._clean_text(record["content"]) + + # Generate unique ID if not present + doc_id = record.get("id", self._generate_id(cleaned_text)) + + return { + "text": cleaned_text, + "id": doc_id, + "source": record.get("metadata", {}).get("source", "unknown") + } + + def input_columns(self) -> list[str]: + return ["content", "metadata", "id"] + + def output_columns(self) -> list[str]: + return ["text", "id", "source"] + + def _clean_text(self, text: str) -> str: + """Clean and normalize text.""" + # Your text cleaning logic here + return text.strip() + + def _generate_id(self, text: str) -> str: + """Generate unique ID for text.""" + import hashlib + return hashlib.md5(text.encode()).hexdigest()[:16] +``` + +### 3. Create Composite Stage (`stage.py`) -Here's how to implement and use custom data loaders: +```python +from ray_curator.stages.download.text import DocumentDownloadExtractStage +from .url_generation import CustomURLGenerator +from .download import CustomDownloader +from .iterator import CustomIterator +from .extract import CustomExtractor + +class CustomDataStage(DocumentDownloadExtractStage): + """Custom data loading stage combining all components.""" + + def __init__( + self, + config_param: str, + download_dir: str, + record_format: str = "jsonl", + url_limit: int | None = None, + record_limit: int | None = None, + **kwargs + ): + super().__init__( + url_generator=CustomURLGenerator(config_param), + downloader=CustomDownloader(download_dir), + iterator=CustomIterator(record_format), + extractor=CustomExtractor(), # Optional - remove if not needed + url_limit=url_limit, + record_limit=record_limit, + **kwargs + ) +``` + +--- -::::{tab-set} +## Usage Examples + +### Basic Pipeline -:::{tab-item} Python ```python -from nemo_curator import get_client -from nemo_curator.download import download_and_extract -from my_custom_module import MyCustomDownloader, MyCustomIterator, MyCustomExtractor +from ray_curator.pipeline import Pipeline +from ray_curator.backends.xenna import XennaExecutor +from your_data_source.stage import CustomDataStage def main(): - # Initialize a Dask client - client = get_client(cluster_type="cpu") - - # Create instances of your custom components - downloader = MyCustomDownloader() - iterator = MyCustomIterator() - extractor = MyCustomExtractor() - - # Use them with NeMo Curator's framework - dataset = download_and_extract( - urls=[url1, url2, url3], - output_paths=[output_path1, output_path2, output_path3], - downloader=downloader, - iterator=iterator, - extractor=extractor, - output_format={"text": str, "id": str}, - output_type="jsonl", - keep_raw_download=False, - force_download=False, - filename_col="file_name", - record_limit=None + # Create custom data loading stage + data_stage = CustomDataStage( + config_param="production", + download_dir="/tmp/downloads", + record_limit=1000 # Limit for testing ) - - # Process the dataset - dataset.to_json(output_path="/output/folder", write_to_filename=True) + + # Create pipeline + pipeline = Pipeline( + name="custom_data_pipeline", + description="Load and process custom dataset" + ) + pipeline.add_stage(data_stage) + + # Create executor + executor = XennaExecutor() + + # Run pipeline + print("Starting pipeline...") + results = pipeline.run(executor) + + # Process results + if results: + for task in results: + print(f"Processed {task.num_items} documents") + # Access data as pandas DataFrame + df = task.to_pandas() + print(df.head()) if 
__name__ == "__main__": main() ``` -::: -:::{tab-item} CLI -Create a configuration YAML file: +### Adding Processing Stages -```yaml -# custom_config.yaml -download_module: my_custom_module.MyCustomDownloader -download_params: - param1: value1 - param2: value2 -iterator_module: my_custom_module.MyCustomIterator -iterator_params: - param3: value3 -extract_module: my_custom_module.MyCustomExtractor -extract_params: - param4: value4 +```python +from ray_curator.stages.modules import ScoreFilter +from ray_curator.stages.filters import WordCountFilter +from ray_curator.stages.io.writer import JsonlWriter + +def create_full_pipeline(): + pipeline = Pipeline(name="full_processing") + + # Data loading + pipeline.add_stage(CustomDataStage( + config_param="production", + download_dir="/tmp/downloads" + )) + + # Text filtering + pipeline.add_stage(ScoreFilter( + filter_obj=WordCountFilter(min_words=10, max_words=1000), + text_field="text" + )) + + # Output + pipeline.add_stage(JsonlWriter(output_dir="/output/processed")) + + return pipeline ``` -Then run the command-line tool: +### Configuration-Based Setup -```bash -# Note: Use the actual script name from nemo_curator/scripts/ -python -m nemo_curator.scripts.download_and_extract \ - --input-url-file=./my_urls.txt \ - --builder-config-file=./custom_config.yaml \ - --output-json-dir=/output/folder +```python +from ray_curator.config import BaseConfig + +class CustomDataConfig(BaseConfig): + config_param: str = "development" + download_dir: str = "/tmp/downloads" + record_format: str = "jsonl" + url_limit: int | None = None + record_limit: int | None = None + +def create_pipeline_from_config(config_path: str): + config = CustomDataConfig.from_yaml(config_path) + + stage = CustomDataStage( + config_param=config.config_param, + download_dir=config.download_dir, + record_format=config.record_format, + url_limit=config.url_limit, + record_limit=config.record_limit + ) + + pipeline = Pipeline("configured_pipeline") + pipeline.add_stage(stage) + + return pipeline ``` -::: -:::: +--- -### Parameters +## Parameters Reference ```{list-table} Custom Data Loading Parameters :header-rows: 1 @@ -108,138 +323,130 @@ python -m nemo_curator.scripts.download_and_extract \ - Type - Description - Default -* - `urls` - - List[str] - - List of URLs or paths to download from - - Required -* - `output_paths` - - List[str] - - List of paths where downloaded files will be stored +* - `url_generator` + - URLGenerator + - Custom URL generation implementation - Required * - `downloader` - DocumentDownloader - - Custom downloader implementation + - Custom download implementation - Required * - `iterator` - DocumentIterator - - Custom iterator implementation + - Custom file iteration implementation - Required * - `extractor` - - DocumentExtractor - - Custom extractor implementation - - Required -* - `output_format` - - Dict[str, type] - - Schema for output data - - Required -* - `output_type` - - Literal["jsonl", "parquet"] - - Output file format - - "jsonl" -* - `keep_raw_download` - - bool - - Whether to retain raw downloaded files after extraction - - False -* - `force_download` - - bool - - Whether to re-download and re-extract existing files - - False -* - `filename_col` - - str - - Name of the column for storing filenames in the dataset - - "file_name" + - DocumentExtractor | None + - Optional extraction/transformation step + - None +* - `url_limit` + - int | None + - Maximum number of URLs to process + - None * - `record_limit` - int | None - - Maximum number of records to 
extract from each file + - Maximum records per file - None +* - `add_filename_column` + - bool | str + - Add filename column to output; if str, uses it as the column name (default name: "file_name") + - True ``` +--- + ## Output Format -The processed data can be stored in either JSONL or Parquet format: +Processed data flows through the pipeline as `DocumentBatch` tasks containing pandas DataFrames or PyArrow Tables: -### JSONL Format +### Example Output Schema -```json +```python { - "text": "This is a sample text document", - "id": "unique-id-123", - "metadata": { - "source": "example", - "timestamp": "2024-03-21" - } + "text": "This is the processed document text", + "id": "unique-document-id", + "source": "example.com", + "file_name": "dataset1.jsonl" # If add_filename_column=True (default column name) } ``` -### Parquet Format +--- -Parquet files maintain the same schema as JSONL files but provide: +## Best Practices -- Efficient compression -- Fast query performance -- Column-based operations -- Reduced storage costs +### 1. Error Handling -## Implementation Guide +```python +def _download_to_path(self, url: str, path: str) -> tuple[bool, str | None]: + try: + # Download logic + return True, None + except requests.RequestException as e: + self.logger.error(f"Network error downloading {url}: {e}") + return False, str(e) + except Exception as e: + self.logger.error(f"Unexpected error: {e}") + return False, str(e) +``` -### 1. Create Custom Downloader +### 2. Resource Management ```python -from nemo_curator.download.doc_builder import DocumentDownloader +from ray_curator.stages.resources import Resources -class MyCustomDownloader(DocumentDownloader): - def download(self, url): - """Download data from url and return the path to the downloaded file""" - # Implement download logic - return "/path/to/downloaded/file" +class CustomDataStage(DocumentDownloadExtractStage): + # Override resources for download-heavy stages + _resources = Resources(cpus=2.0) # More CPU for downloads ``` -### 2. Create Custom Iterator +### 3. Progress Tracking ```python -from nemo_curator.download.doc_builder import DocumentIterator - -class MyCustomIterator(DocumentIterator): - def iterate(self, file_path): - """Iterate through documents in the downloaded file""" - for doc in my_iterator_logic(file_path): - metadata = {"url": doc.get("url", "")} - content = doc.get("content", "") - yield metadata, content +def iterate(self, file_path: str) -> Iterator[dict[str, Any]]: + with open(file_path, 'r') as f: + for i, line in enumerate(f): + if i % 1000 == 0: + self.logger.info(f"Processed {i} records from {file_path}") + yield json.loads(line) ``` -### 3. Create Custom Extractor +### 4. 
Validation ```python -from nemo_curator.download.doc_builder import DocumentExtractor - -class MyCustomExtractor(DocumentExtractor): - def extract(self, content): - """Extract text from content and return a dictionary""" - # Your extraction logic here - extracted_text = process_content(content) - unique_id = generate_unique_id(content) - - return { - 'text': extracted_text, - 'id': unique_id, - # Add any other fields as needed - } +def extract(self, record: dict[str, str]) -> dict[str, Any] | None: + # Validate required fields + if not all(field in record for field in ["content", "id"]): + self.logger.warning(f"Skipping record missing required fields") + return None + + # Validate content quality + if len(record["content"]) < 10: + return None + + return self._process_record(record) ``` -```{admonition} Enhancing Custom Extraction -:class: tip +### 5. Memory Efficiency -When implementing custom extractors, consider adding robust error handling and metadata extraction to improve the quality of your processed data. You can also implement content filtering and validation logic within your extractor. -``` +- Process files in streaming fashion +- Use generators for large datasets +- Apply proper cleanup in download handlers +- Consider batch sizes for extraction steps -## Best Practices +### 6. Testing -1. **Error Handling**: Implement robust error handling for corrupt files and network issues -2. **Logging**: Use Python's logging module for process visibility and debugging -3. **Metadata**: Include useful metadata in extracted documents for downstream processing -4. **Chunking**: Consider chunking large files for efficient distributed processing -5. **Caching**: Implement caching to avoid re-downloading or re-processing data -6. **Parameter Validation**: Validate input parameters in your custom classes -7. **Memory Management**: Be mindful of memory usage when processing large files -8. **Type Annotations**: Use proper type hints to improve code clarity and IDE support +```python +def test_custom_extractor(): + extractor = CustomExtractor() + + test_record = { + "content": "Sample text content", + "id": "test-123", + "metadata": {"source": "test"} + } + + result = extractor.extract(test_record) + assert result is not None + assert "text" in result + assert result["id"] == "test-123" +``` diff --git a/docs/curate-text/load-data/index.md b/docs/curate-text/load-data/index.md index 5380a2801..4db9d19cf 100644 --- a/docs/curate-text/load-data/index.md +++ b/docs/curate-text/load-data/index.md @@ -1,7 +1,7 @@ --- -description: "Load text data from various sources including Common Crawl, arXiv, Wikipedia, and custom datasets using NeMo Curator" +description: "Load text data from various sources including Common Crawl, Wikipedia, and custom datasets using NeMo Curator's ray-curator framework" categories: ["workflows"] -tags: ["data-loading", "common-crawl", "arxiv", "wikipedia", "custom-data", "distributed"] +tags: ["data-loading", "common-crawl", "wikipedia", "custom-data", "distributed", "ray"] personas: ["data-scientist-focused", "mle-focused"] difficulty: "intermediate" content_type: "workflow" @@ -9,134 +9,179 @@ modality: "text-only" --- (text-load-data)= + # Text Data Loading -Load text data from a variety of data sources using NeMo Curator. +Load text data from a variety of data sources using NeMo Curator's ray-curator framework. -NeMo Curator provides tools for downloading and processing large-scale public text datasets. 
Common data formats like Common Crawl's `.warc.gz` are automatically converted to more processing-friendly formats like `.jsonl`. +NeMo Curator provides a task-centric pipeline framework for downloading and processing large-scale public text datasets. The framework uses Ray as the distributed backend and converts raw data formats like Common Crawl's `.warc.gz` to processing-friendly formats like `.jsonl`. ## How it Works -NeMo Curator's data loading framework consists of three main components: +Ray-curator's data loading framework uses a **4-step pipeline pattern** where data flows through stages as tasks: + +1. **URL Generation**: Generate URLs from configuration (`URLGenerationStage`) +2. **Download**: Retrieve files from URLs to local storage (`DocumentDownloadStage`) +3. **Iteration**: Parse downloaded files to extract raw records (`DocumentIterateStage`) +4. **Extraction**: Extract and clean structured content from raw records (`DocumentExtractStage`) + +Each step uses a `ProcessingStage` that transforms tasks. The pipeline flow is: -1. **Downloaders**: Responsible for retrieving data from source locations (`DocumentDownloader`) -2. **Iterators**: Parse through downloaded data to identify individual documents (`DocumentIterator`) -3. **Extractors**: Extract and clean text from raw document formats (`DocumentExtractor`) +```text +_EmptyTask → FileGroupTask(URLs) → FileGroupTask(Files) → DocumentBatch → DocumentBatch +``` -Each supported data source has specific implementations of these components optimized for that data type. The result is a standardized [`DocumentDataset`](documentdataset) that can be used for further curation steps. +Data sources provide composite stages that combine these steps into complete download-extract pipelines, producing `DocumentBatch` tasks for further processing. 
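
Each `DocumentBatch` that comes out of a run wraps a tabular view of the documents. A minimal way to look at the results — a sketch that assumes `results` holds the output tasks of a completed `pipeline.run(...)` and uses the `num_items` and `to_pandas()` accessors shown in the custom-data guide:

```python
# Sketch: inspect the DocumentBatch tasks a pipeline produces.
for batch in results or []:
    print(f"{batch.num_items} documents in this batch")
    df = batch.to_pandas()        # pandas DataFrame view of the documents
    print(df.columns.tolist())    # e.g. ["text", "url", "language"]
    print(df.head(3))
```

The tabs below show complete pipelines that produce these batches.
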
::::{tab-set} :::{tab-item} Python ```python -from nemo_curator import get_client -from nemo_curator.download import download_common_crawl, download_wikipedia, download_arxiv - -# Initialize a Dask client -client = get_client(cluster_type="cpu") +from ray_curator.pipeline import Pipeline +from ray_curator.backends.xenna import XennaExecutor +from ray_curator.stages.download.text.common_crawl import CommonCrawlDownloadExtractStage +from ray_curator.stages.io.writer import JsonlWriter +from ray_curator.tasks import EmptyTask + +# Create a pipeline for downloading Common Crawl data +pipeline = Pipeline( + name="common_crawl_download", + description="Download and process Common Crawl web archives" +) -# Download and extract data using correct parameter names -dataset = download_common_crawl( - output_path="/output/folder", - start_snapshot="2020-50", - end_snapshot="2021-04" +# Add data loading stage +cc_stage = CommonCrawlDownloadExtractStage( + start_snapshot="2020-50", + end_snapshot="2020-50", + download_dir="/tmp/cc_downloads", + crawl_type="main", + url_limit=10 # Limit for testing ) +pipeline.add_stage(cc_stage) -# Write to disk in the desired format -dataset.to_json(output_path="/output/folder", write_to_filename=True) +# Add writer stage to save as JSONL +writer = JsonlWriter(output_dir="/output/folder") +pipeline.add_stage(writer) + +# Build and execute pipeline +pipeline.build() +executor = XennaExecutor() + +# Start with an empty task to trigger URL generation +results = pipeline.run(executor, initial_tasks=[EmptyTask]) ``` ::: -:::{tab-item} CLI - -```bash -# Generic download and extract utility -# Requires a YAML configuration file specifying downloader, iterator, and extractor implementations -# Example config files: config/cc_warc_builder.yaml, config/arxiv_builder.yaml, config/wikipedia_builder.yaml -download_and_extract \ - --input-url-file= \ - --builder-config-file= \ - --output-json-dir= - -# Alternative: Extract from pre-downloaded files (extraction-only mode) -download_and_extract \ - --input-data-dir= \ - --builder-config-file= \ - --output-json-dir= - -# Common Crawl URL retrieval utility -# Generates a list of WARC file URLs for specified snapshot range -get_common_crawl_urls \ - --starting-snapshot="2020-50" \ - --ending-snapshot="2020-50" \ - --output-warc-url-file=./warc_urls.txt +:::{tab-item} Reading Custom Data + +```python +from ray_curator.pipeline import Pipeline +from ray_curator.stages.io.reader import JsonlReader +from ray_curator.stages.modules import ScoreFilter +from ray_curator.stages.filters import WordCountFilter + +# Create pipeline for processing existing JSONL files +pipeline = Pipeline(name="custom_data_processing") + +# Read JSONL files +reader = JsonlReader( + file_paths="/path/to/data/*.jsonl", + files_per_partition=4, + columns=["text", "url"] # Only read specific columns +) +pipeline.add_stage(reader) + +# Add filtering stage +word_filter = ScoreFilter( + filter_obj=WordCountFilter(min_words=50, max_words=1000), + text_field="text" +) +pipeline.add_stage(word_filter) + +# Execute pipeline +executor = XennaExecutor() +results = pipeline.run(executor) ``` ::: :::: - --- ## Data Sources & File Formats -Load data from public, local, and custom data sources. +Load data from public datasets and custom data sources using ray-curator stages. 
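
Every source below ships a composite download-extract stage that drops into the same pipeline skeleton, so switching sources only changes the first stage. A sketch using the Wikipedia stage (constructor arguments are illustrative; see the per-source guides for the full parameter lists):

```python
# Sketch: swap in a different data source by changing only the first stage.
from ray_curator.pipeline import Pipeline
from ray_curator.backends.xenna import XennaExecutor
from ray_curator.stages.download.text.wikipedia import WikipediaDownloadExtractStage
from ray_curator.stages.io.writer import JsonlWriter
from ray_curator.tasks import EmptyTask

pipeline = Pipeline(name="wikipedia_download")
pipeline.add_stage(
    WikipediaDownloadExtractStage(language="en", download_dir="/tmp/wiki_downloads")
)
pipeline.add_stage(JsonlWriter(output_dir="/output/wikipedia"))

results = pipeline.run(XennaExecutor(), initial_tasks=[EmptyTask])
```
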
::::{grid} 1 1 1 2 :gutter: 1 1 1 2 -:::{grid-item-card} {octicon}`download;1.5em;sd-mr-1` arXiv -:link: text-load-data-arxiv -:link-type: ref -Extract and process scientific papers from arXiv -+++ -{bdg-secondary}`academic` -{bdg-secondary}`pdf` -{bdg-secondary}`latex` -::: - :::{grid-item-card} {octicon}`download;1.5em;sd-mr-1` Common Crawl :link: text-load-data-common-crawl :link-type: ref -Load and preprocess text data from Common Crawl web archives +Download and process web archive data from Common Crawl +++ {bdg-secondary}`web-data` {bdg-secondary}`warc` -{bdg-secondary}`distributed` +{bdg-secondary}`html-extraction` ::: -:::{grid-item-card} {octicon}`download;1.5em;sd-mr-1` Custom Data -:link: text-load-data-custom +:::{grid-item-card} {octicon}`download;1.5em;sd-mr-1` Wikipedia +:link: text-load-data-wikipedia :link-type: ref -Load your own text datasets in various formats +Download and extract Wikipedia articles from Wikipedia dumps +++ -{bdg-secondary}`jsonl` -{bdg-secondary}`parquet` -{bdg-secondary}`custom-formats` +{bdg-secondary}`articles` +{bdg-secondary}`multilingual` +{bdg-secondary}`xml-dumps` ::: -:::{grid-item-card} {octicon}`download;1.5em;sd-mr-1` Wikipedia -:link: text-load-data-wikipedia +:::{grid-item-card} {octicon}`download;1.5em;sd-mr-1` Custom Data +:link: text-load-data-custom :link-type: ref -Import and process Wikipedia articles for training datasets +Read and process your own text datasets in standard formats +++ -{bdg-secondary}`articles` -{bdg-secondary}`multilingual` -{bdg-secondary}`dumps` +{bdg-secondary}`jsonl` +{bdg-secondary}`parquet` +{bdg-secondary}`file-partitioning` ::: :::: +## Key Components + +### Tasks + +Ray-curator operates on **Tasks** - batches of data that flow through the pipeline: + +- **`_EmptyTask`**: Starting point for pipelines that generate data +- **`FileGroupTask`**: Contains file paths (URLs or local files) +- **`DocumentBatch`**: Contains text documents as pandas DataFrame or PyArrow Table + +### Stages + +**ProcessingStages** transform tasks through the pipeline: + +- **Composite Stages**: High-level stages like `CommonCrawlDownloadExtractStage` that decompose into several steps +- **Atomic Stages**: Individual processing steps like `DocumentDownloadStage`, `JsonlReaderStage` +- **I/O Stages**: File readers (`JsonlReader`) and writers (`JsonlWriter`, `ParquetWriter`) + +### Executors + +**Executors** run pipelines on different backends: + +- **`XennaExecutor`**: Production-ready executor using Cosmos framework +- **`RayDataExecutor`**: Experimental executor using Ray Data + ```{toctree} :maxdepth: 4 :titlesonly: :hidden: -common-crawl arxiv +common-crawl wikipedia Custom Data ``` diff --git a/docs/curate-text/load-data/wikipedia.md b/docs/curate-text/load-data/wikipedia.md index a2674116b..250071674 100644 --- a/docs/curate-text/load-data/wikipedia.md +++ b/docs/curate-text/load-data/wikipedia.md @@ -1,7 +1,7 @@ --- -description: "Download and extract text from Wikipedia dumps with support for multiple languages and automatic content processing" +description: "Download and extract text from Wikipedia dumps using ray-curator's pipeline-based processing" categories: ["how-to-guides"] -tags: ["wikipedia", "dumps", "multilingual", "articles", "data-loading"] +tags: ["wikipedia", "dumps", "multilingual", "articles", "data-loading", "ray-curator"] personas: ["data-scientist-focused", "mle-focused"] difficulty: "intermediate" content_type: "how-to" @@ -9,23 +9,25 @@ modality: "text-only" --- (text-load-data-wikipedia)= + # 
Wikipedia -Download and extract text from [Wikipedia Dumps](https://dumps.wikimedia.org/backup-index.html) using NeMo Curator utilities. +Download and extract text from [Wikipedia Dumps](https://dumps.wikimedia.org/backup-index.html) using ray-curator's pipeline-based processing system. -Wikipedia regularly releases dumps of all its content, which include articles, talk pages, user pages, and more. These dumps are available in various formats, including XML and SQL. +Wikipedia releases compressed dumps of all its content in XML format twice per month. ray-curator provides a complete pipeline to automatically download, parse, and extract clean text from these dumps. ## How it Works -NeMo Curator simplifies the process of: +The Wikipedia pipeline in ray-curator consists of four stages: -- Downloading the latest Wikipedia dump -- Extracting the article content -- Converting the content to a usable format for language model training +1. **URL Generation**: Automatically discovers Wikipedia dump URLs for the specified language and date +2. **Download**: Downloads compressed .bz2 dump files using `wget` +3. **Iteration**: Parses XML content and extracts individual articles +4. **Extraction**: Cleans Wikipedia markup and converts to plain text ## Before You Start -NeMo Curator uses `wget` to download Wikipedia dumps. You must have `wget` installed on your system: +ray-curator uses `wget` to download Wikipedia dumps. You must have `wget` installed on your system: - **On macOS**: `brew install wget` - **On Ubuntu/Debian**: `sudo apt-get install wget` @@ -35,81 +37,75 @@ NeMo Curator uses `wget` to download Wikipedia dumps. You must have `wget` insta ## Usage -Here's how to download and extract Wikipedia data using NeMo Curator: - -::::{tab-set} - -:::{tab-item} Python +Here's how to download and extract Wikipedia data using ray-curator: ```python -from nemo_curator.utils.distributed_utils import get_client -from nemo_curator.download import download_wikipedia - -# Initialize a Dask client -client = get_client(cluster_type="cpu") - -# Download and extract Wikipedia -wikipedia_dataset = download_wikipedia( - output_path="/extracted/output/folder", - dump_date="20240401" # Optional: specific dump date +from ray_curator.pipeline import Pipeline +from ray_curator.backends.xenna import XennaExecutor +from ray_curator.stages.download.text.wikipedia import WikipediaDownloadExtractStage +from ray_curator.stages.io.writer import JsonlWriter +from ray_curator.tasks import EmptyTask + +# Create the Wikipedia processing stage +wikipedia_stage = WikipediaDownloadExtractStage( + language="en", + download_dir="./wikipedia_downloads", + dump_date="20240401", # Optional: specific dump date (YYYYMMDD format) + url_limit=5, # Optional: limit number of dump files (useful for testing) + record_limit=1000, # Optional: limit articles per dump file + verbose=True ) -# The dataset is now available as a DocumentDataset object -print(f"Downloaded {len(wikipedia_dataset)} articles") -print(wikipedia_dataset.head()) - -# Write the dataset to disk as JSONL files -wikipedia_dataset.to_json(output_path="/path/to/output/files") -``` - -::: +# Create writer stage to save results +writer_stage = JsonlWriter( + output_dir="./wikipedia_output" +) -:::{tab-item} CLI -NeMo Curator provides a CLI for downloading Wikipedia data. 
+# Create and configure pipeline +pipeline = Pipeline( + name="wikipedia_pipeline", + description="Download and process Wikipedia dumps" +) +pipeline.add_stage(wikipedia_stage) +pipeline.add_stage(writer_stage) -**Step 1: Generate Wikipedia URLs** +# Create executor and run pipeline +executor = XennaExecutor() -First, generate a list of Wikipedia dump URLs for the desired language: +# Start with an empty task to trigger URL generation +initial_tasks = [EmptyTask] -```bash -get_wikipedia_urls \ - --language=en \ - --output-url-file=./wikipedia_urls.txt +# Execute the pipeline +results = pipeline.run(executor, initial_tasks=initial_tasks) +print(f"Pipeline completed with {len(results) if results else 0} output files") ``` -**Step 2: Create Configuration File** - -Create a configuration file (`wikipedia_builder.yaml`): - -```yaml -download_module: nemo_curator.download.wikipedia.WikipediaDownloader -download_params: {} -iterator_module: nemo_curator.download.wikipedia.WikipediaIterator -iterator_params: - language: 'en' -extract_module: nemo_curator.download.wikipedia.WikipediaExtractor -extract_params: - language: 'en' -format: - text: str - title: str - id: str - url: str - language: str - source_id: str -``` +### Multi-Language Processing -**Step 3: Run Download and Extraction** +You can process several languages by creating separate pipelines: -```bash -download_and_extract \ - --input-url-file=./wikipedia_urls.txt \ - --builder-config-file=./wikipedia_builder.yaml \ - --output-json-dir=/datasets/wikipedia/json +```python +languages = ["en", "es", "fr", "de"] + +for lang in languages: + # Create language-specific pipeline + wikipedia_stage = WikipediaDownloadExtractStage( + language=lang, + download_dir=f"./downloads/{lang}", + dump_date="20240401" + ) + + writer_stage = JsonlWriter( + output_dir=f"./output/{lang}" + ) + + pipeline = Pipeline(name=f"wikipedia_{lang}") + pipeline.add_stage(wikipedia_stage) + pipeline.add_stage(writer_stage) + + # Execute + results = pipeline.run(executor, initial_tasks=[EmptyTask]) ``` -::: - -:::: ### Parameters @@ -121,35 +117,70 @@ download_and_extract \ - Type - Default - Description -* - `output_path` +* - `language` + - str + - "en" + - Language code for Wikipedia dump (e.g., "en", "es", "fr") +* - `download_dir` - str - - Required - - Path where the extracted files will be placed + - "./wikipedia_downloads" + - Directory to store downloaded .bz2 files * - `dump_date` - Optional[str] - None - - Parameter to specify a particular Wikipedia dump date. The format must be "YYYYMMDD" (for example, "20250401" for April 1, 2025). Wikipedia creates new dumps approximately twice a month (around the 1st and 20th). You can find available dump dates by visiting https://dumps.wikimedia.org/enwiki/. If not specified, NeMo Curator will automatically use the latest available dump. -* - `language` + - Specific dump date in "YYYYMMDD" format (e.g., "20240401"). 
If None, uses latest available dump
+* - `wikidumps_index_prefix`
   - str
-  - "en"
-  - Language code to download (for example, "en" for English)
+  - "https://dumps.wikimedia.org"
+  - Base URL for Wikipedia dumps index
+* - `verbose`
+  - bool
+  - False
+  - Enable verbose logging during download
 * - `url_limit`
   - Optional[int]
   - None
-  - Parameter to limit the number of URLs downloaded (useful for testing)
+  - Maximum number of dump URLs to process (useful for testing)
+* - `record_limit`
+  - Optional[int]
+  - None
+  - Maximum number of articles to extract per dump file
+* - `add_filename_column`
+  - bool | str
+  - True
+  - Whether to add source filename column to output; if str, uses it as the column name (default name: "file_name")
+* - `log_frequency`
+  - int
+  - 1000
+  - How often to log progress during article processing
 ```
 
-If no `dump_date` is specified, NeMo Curator will download the latest available dump.
+::::{note}
+Wikipedia creates new dumps twice per month (around the 1st and 20th). You can find available dump dates at https://dumps.wikimedia.org/enwiki/.
+::::
 
 ## Output Format
 
-The extracted Wikipedia articles are stored in `.jsonl` files, with each line containing a JSON object with fields:
+The processed Wikipedia articles become JSONL files, with each line containing a JSON object with these fields:
 
-- `text`: The main text content of the article
-- `id`: A unique identifier for the article
+- `text`: The cleaned main text content of the article
 - `title`: The title of the Wikipedia article
-- `url`: The URL of the Wikipedia article
+- `id`: Wikipedia's unique identifier for the article
+- `url`: The constructed Wikipedia URL for the article
 - `language`: The language code of the article
-- `source_id`: The source file identifier
-- `file_name`: The output file name (when using `write_to_filename=True`)
+- `source_id`: Identifier of the source dump file
+
+If you enable `add_filename_column`, the output includes an extra field `file_name` (or your custom column name).
+ +### Example Output Record + +```json +{ + "text": "Python is a high-level, general-purpose programming language...", + "title": "Python (programming language)", + "id": "23862", + "url": "https://en.wikipedia.org/wiki/Python_(programming_language)", + "language": "en", + "source_id": "enwiki-20240401-pages-articles-multistream1.xml" +} +``` diff --git a/docs/curate-text/process-data/content-processing/text-cleaning.md b/docs/curate-text/process-data/content-processing/text-cleaning.md index e67efdf42..b88074031 100644 --- a/docs/curate-text/process-data/content-processing/text-cleaning.md +++ b/docs/curate-text/process-data/content-processing/text-cleaning.md @@ -1,7 +1,7 @@ --- -description: "Remove undesirable text including improperly decoded Unicode characters, inconsistent spacing, and excessive URLs" +description: "Create custom text processing stages to clean and normalize text using ray-curator's processing pipeline" categories: ["how-to-guides"] -tags: ["text-cleaning", "unicode", "normalization", "url-removal", "preprocessing", "ftfy"] +tags: ["text-cleaning", "unicode", "normalization", "url-removal", "preprocessing", "ray-curator", "processing-stages"] personas: ["data-scientist-focused", "mle-focused"] difficulty: "intermediate" content_type: "how-to" @@ -9,91 +9,253 @@ modality: "text-only" --- (text-process-data-format-text-cleaning)= -# Text Cleaning -Remove undesirable text such as improperly decoded Unicode characters, inconsistent line spacing, or excessive URLs from documents being pre-processed for your dataset using NeMo Curator. +# Text Cleaning -One common issue in text datasets is improper Unicode character encoding, which can result in garbled or unreadable text, particularly with special characters like apostrophes, quotes, or diacritical marks. For example, the input sentence `"The Mona Lisa doesn't have eyebrows."` from a given document may not have included a properly encoded apostrophe (`'`), resulting in the sentence decoding as `"The Mona Lisa doesn’t have eyebrows."`. +Create custom text processing stages to clean and normalize text data in your ray-curator pipelines. Ray-curator uses a task-centric architecture where you build text cleaning operations as `ProcessingStage` components that transform `DocumentBatch` tasks. -NeMo Curator enables you to easily run this document through the default `UnicodeReformatter` module to detect and remove the unwanted text, or you can define your own custom Unicode text cleaner tailored to your needs. +Common text cleaning needs include removing improperly decoded Unicode characters, normalizing inconsistent line spacing, and filtering out excessive URLs. For example, corrupted encoding might turn `"The Mona Lisa doesn't have eyebrows."` into `"The Mona Lisa doesn’t have eyebrows."`. ## How it Works -NeMo Curator provides the following modules for cleaning text: +Ray-curator provides a flexible stage-based architecture for text processing: -- `UnicodeReformatter`: Uses [ftfy](https://ftfy.readthedocs.io/en/latest/) to fix broken Unicode characters. Modifies the "text" field of the dataset by default. The module accepts extensive configuration options for fine-tuning Unicode repair behavior. Please see the [ftfy documentation](https://ftfy.readthedocs.io/en/latest/config.html) for more information about parameters used by the `UnicodeReformatter`. -- `NewlineNormalizer`: Uses regex to replace 3 or more consecutive newline characters in each document with only 2 newline characters. 
-- `UrlRemover`: Uses regex to remove all URLs in each document. +- **ProcessingStage**: Base class for all text transformation operations that accepts a `DocumentBatch` and returns a transformed `DocumentBatch`. +- **Pipeline**: Orchestrates several processing stages in sequence. +- **DocumentBatch**: Task containing a pandas DataFrame with text data that flows through the pipeline. +- **Text Processing Utilities**: Helper functions for common text operations like `remove_control_characters()`. -You can use these modules individually or sequentially in a cleaning pipeline. - ---- +You create custom text cleaning stages by extending `ProcessingStage` and implementing the `process()` method. ## Usage -::::{tab-set} - -:::{tab-item} Python - -Consider the following example, which loads a dataset (`books.jsonl`), steps through each module in a cleaning pipeline, and outputs the processed dataset as `cleaned_books.jsonl`: +The following example shows how to create custom text cleaning stages and combine them in a ray-curator pipeline: ```python -from nemo_curator import Sequential, Modify, get_client -from nemo_curator.datasets import DocumentDataset -from nemo_curator.modifiers import UnicodeReformatter, UrlRemover, NewlineNormalizer +import re +import unicodedata +from dataclasses import dataclass +from typing import Any + +import pandas as pd +from ray_curator.pipeline import Pipeline +from ray_curator.stages.base import ProcessingStage +from ray_curator.stages.io.reader import JsonlReader +from ray_curator.stages.io.writer import JsonlWriter +from ray_curator.backends.experimental.ray_data import RayDataExecutor +from ray_curator.tasks import DocumentBatch + +@dataclass +class UnicodeCleaningStage(ProcessingStage[DocumentBatch, DocumentBatch]): + """Stage that cleans Unicode control characters from text.""" + + text_field: str = "text" + _name: str = "unicode_cleaning" + + def inputs(self) -> tuple[list[str], list[str]]: + return ["data"], [self.text_field] + + def outputs(self) -> tuple[list[str], list[str]]: + return ["data"], [self.text_field] + + def process(self, batch: DocumentBatch) -> DocumentBatch: + """Remove Unicode control characters from text.""" + df = batch.to_pandas() + + def clean_unicode(text: str) -> str: + # Remove control characters (non-printable characters) + return "".join(char for char in text if unicodedata.category(char)[0] != "C") + + df[self.text_field] = df[self.text_field].apply(clean_unicode) + + return DocumentBatch( + task_id=f"{batch.task_id}_unicode_cleaned", + dataset_name=batch.dataset_name, + data=df, + _metadata=batch._metadata, + _stage_perf=batch._stage_perf, + ) + +@dataclass +class NewlineNormalizationStage(ProcessingStage[DocumentBatch, DocumentBatch]): + """Stage that normalizes excessive newlines in text.""" + + text_field: str = "text" + _name: str = "newline_normalization" + + def inputs(self) -> tuple[list[str], list[str]]: + return ["data"], [self.text_field] + + def outputs(self) -> tuple[list[str], list[str]]: + return ["data"], [self.text_field] + + def process(self, batch: DocumentBatch) -> DocumentBatch: + """Replace 3+ consecutive newlines with exactly 2 newlines.""" + df = batch.to_pandas() + + def normalize_newlines(text: str) -> str: + # Replace 3 or more consecutive newlines with exactly 2 + return re.sub(r'\n{3,}', '\n\n', text) + + df[self.text_field] = df[self.text_field].apply(normalize_newlines) + + return DocumentBatch( + task_id=f"{batch.task_id}_newlines_normalized", + dataset_name=batch.dataset_name, + data=df, + 
_metadata=batch._metadata, + _stage_perf=batch._stage_perf, + ) + +@dataclass +class UrlRemovalStage(ProcessingStage[DocumentBatch, DocumentBatch]): + """Stage that removes URLs from text.""" + + text_field: str = "text" + _name: str = "url_removal" + + def inputs(self) -> tuple[list[str], list[str]]: + return ["data"], [self.text_field] + + def outputs(self) -> tuple[list[str], list[str]]: + return ["data"], [self.text_field] + + def process(self, batch: DocumentBatch) -> DocumentBatch: + """Remove URLs from text using regex.""" + df = batch.to_pandas() + + def remove_urls(text: str) -> str: + # Remove HTTP/HTTPS URLs + url_pattern = r'https?://[^\s<>"{}|\\^`[\]]+[^\s<>"{}|\\^`[\].,;!?]' + return re.sub(url_pattern, '', text) + + df[self.text_field] = df[self.text_field].apply(remove_urls) + + return DocumentBatch( + task_id=f"{batch.task_id}_urls_removed", + dataset_name=batch.dataset_name, + data=df, + _metadata=batch._metadata, + _stage_perf=batch._stage_perf, + ) def main(): - client = get_client(cluster_type="cpu") - - dataset = DocumentDataset.read_json("books.jsonl") - cleaning_pipeline = Sequential([ - Modify(UnicodeReformatter()), - Modify(NewlineNormalizer()), - Modify(UrlRemover()), - ]) - - cleaned_dataset = cleaning_pipeline(dataset) - - cleaned_dataset.to_json("cleaned_books.jsonl") + """Create and run a text cleaning pipeline.""" + + # Create pipeline + pipeline = Pipeline( + name="text_cleaning_pipeline", + description="Clean and normalize text data" + ) + + # Add stages to pipeline + pipeline.add_stage(JsonlReader(file_paths="books.jsonl", columns=["text"])) + pipeline.add_stage(UnicodeCleaningStage(text_field="text")) + pipeline.add_stage(NewlineNormalizationStage(text_field="text")) + pipeline.add_stage(UrlRemovalStage(text_field="text")) + pipeline.add_stage(JsonlWriter(output_dir="./cleaned_output")) + + # Create executor and run pipeline + executor = RayDataExecutor() + results = pipeline.run(executor) + + print(f"Text cleaning complete. Processed {len(results) if results else 0} batches.") if __name__ == "__main__": main() ``` -::: -:::{tab-item} CLI +## Custom Text Processing Stages -You can also perform text cleaning operations using the CLI by running the `text_cleaning` command: +You can create custom text processing stages by extending the `ProcessingStage` class. 
Here's a template that demonstrates the key patterns: -```bash -text_cleaning \ - --input-data-dir=/path/to/input/ \ - --output-clean-dir=/path/to/output/ \ - --normalize-newlines \ - --remove-urls +```python +from dataclasses import dataclass +from typing import Any +import pandas as pd + +from ray_curator.stages.base import ProcessingStage +from ray_curator.tasks import DocumentBatch + +@dataclass +class CustomTextStage(ProcessingStage[DocumentBatch, DocumentBatch]): + """Template for creating custom text processing stages.""" + + text_field: str = "text" # Field containing text to process + _name: str = "custom_text_stage" + + def inputs(self) -> tuple[list[str], list[str]]: + """Define input requirements - DocumentBatch with specified text field.""" + return ["data"], [self.text_field] + + def outputs(self) -> tuple[list[str], list[str]]: + """Define output - DocumentBatch with processed text field.""" + return ["data"], [self.text_field] + + def process(self, batch: DocumentBatch) -> DocumentBatch: + """Process the DocumentBatch and return transformed result.""" + df = batch.to_pandas() + + # Apply your custom text processing function + def custom_text_function(text: str) -> str: + # Insert your text processing logic here + return text.strip().lower() # Example: normalize whitespace and case + + df[self.text_field] = df[self.text_field].apply(custom_text_function) + + # Return new DocumentBatch with processed data + return DocumentBatch( + task_id=f"{batch.task_id}_{self.name}", + dataset_name=batch.dataset_name, + data=df, + _metadata=batch._metadata, + _stage_perf=batch._stage_perf, + ) ``` -By default, the CLI will only perform Unicode reformatting. Appending the `--normalize-newlines` and `--remove-urls` options adds the other text cleaning options. -::: +### Key Implementation Guidelines -:::: +1. **Extend ProcessingStage**: Inherit from `ProcessingStage[DocumentBatch, DocumentBatch]` for text transformations +2. **Define Input/Output Requirements**: Add `inputs()` and `outputs()` methods to specify data field dependencies +3. **Add process() method**: Transform the pandas DataFrame within the DocumentBatch and return a new DocumentBatch +4. **Preserve Metadata**: Always pass through `_metadata` and `_stage_perf` to maintain task lineage +5. **Use @dataclass decorator**: Leverage `@dataclass` decorator for easy configuration -## Custom Text Cleaner +### Advanced Text Processing -You can create your own custom text cleaner by extending the `DocumentModifier` class. 
The implementation of `UnicodeReformatter` demonstrates this approach: - -```python -import ftfy +For more complex text processing operations, you can: -from nemo_curator.modifiers import DocumentModifier +- **Use External Libraries**: Import libraries like `ftfy` for Unicode repair, `spacy` for NLP, or `regex` for advanced pattern matching +- **Batch Processing**: Override `process_batch()` for more efficient batch operations +- **Resource Management**: Specify GPU/CPU requirements using the `_resources` field +- **Setup Methods**: Use `setup_on_node()` or `setup()` for model loading or expensive initialization +Example with external library: -class UnicodeReformatter(DocumentModifier): - def __init__(self): - super().__init__() - - def modify_document(self, text: str) -> str: - return ftfy.fix_text(text) +```python +@dataclass +class AdvancedUnicodeStage(ProcessingStage[DocumentBatch, DocumentBatch]): + """Unicode cleaning using ftfy library.""" + + def setup_on_node(self, node_info=None, worker_metadata=None): + """Install ftfy if needed (called once per node).""" + try: + import ftfy + except ImportError: + import subprocess + subprocess.check_call(["pip", "install", "ftfy"]) + + def process(self, batch: DocumentBatch) -> DocumentBatch: + import ftfy + + df = batch.to_pandas() + df[self.text_field] = df[self.text_field].apply(ftfy.fix_text) + + return DocumentBatch( + task_id=f"{batch.task_id}_unicode_fixed", + dataset_name=batch.dataset_name, + data=df, + _metadata=batch._metadata, + _stage_perf=batch._stage_perf, + ) ``` - -To create a custom text cleaner, inherit from the `DocumentModifier` class and implement the constructor and `modify_document` method. Also, like the `DocumentFilter` class, `modify_document` can be annotated with `batched` to take in a pandas Series of documents instead of a single document. See the {ref}`custom filters documentation ` for more information. 
diff --git a/docs/curate-text/process-data/language-management/index.md b/docs/curate-text/process-data/language-management/index.md index 2e9266947..4240baecd 100644 --- a/docs/curate-text/process-data/language-management/index.md +++ b/docs/curate-text/process-data/language-management/index.md @@ -20,31 +20,29 @@ NeMo Curator provides robust tools for managing multilingual text datasets throu Language management in NeMo Curator typically follows this pattern: ```python -import nemo_curator as nc -from nemo_curator.datasets import DocumentDataset -from nemo_curator.filters import FastTextLangId -from nemo_curator.filters.heuristic_filter import HistogramFilter -from nemo_curator.modules.filter import ScoreFilter -from nemo_curator.utils.text_utils import get_word_splitter - -# Load your dataset -dataset = DocumentDataset.read_json("input_data/*.jsonl") - -# Identify languages using FastText -lang_filter = ScoreFilter( - FastTextLangId( - model_path="lid.176.bin", - min_langid_score=0.8 - ), - text_field="text", - score_field="language" +import ray +from ray_curator.stages.filters.fasttext_filter import FastTextLangId +from ray_curator.backends.experimental.ray_data import RayDataExecutor + +# Initialize Ray +ray.init() + +# Load your dataset with Ray +dataset = ray.data.read_json("input_data/*.jsonl") + +# Identify languages using FastText with Ray +lang_filter = FastTextLangId( + model_path="lid.176.bin", + min_langid_score=0.8 ) -# Apply language identification -dataset = lang_filter(dataset) +# Apply language identification using Ray backend +executor = RayDataExecutor() +dataset = executor.run_filter(dataset, lang_filter, text_field="text") # Apply language-specific processing -for lang, subset in dataset.groupby("language"): +# Group by language and process each group +def process_by_language(dataset): if lang in ["zh", "ja", "th", "ko"]: # Special handling for non-spaced languages processor = get_word_splitter(lang) @@ -103,39 +101,39 @@ Manage high-frequency words to enhance text extraction and content detection ### Quick Start Example ```python -from nemo_curator import ScoreFilter -from nemo_curator.datasets import DocumentDataset -from nemo_curator.filters import FastTextLangId +import ray +from ray_curator.stages.filters.fasttext_filter import FastTextLangId +from ray_curator.backends.experimental.ray_data import RayDataExecutor -# Load multilingual dataset -dataset = DocumentDataset.read_json("multilingual_data/*.jsonl") +# Initialize Ray and load multilingual dataset +ray.init() +dataset = ray.data.read_json("multilingual_data/*.jsonl") # Identify languages -langid_filter = ScoreFilter( - FastTextLangId( - model_path="/path/to/lid.176.bin", # Download from fasttext.cc - min_langid_score=0.3 - ), - text_field="text", - score_field="language", - score_type="object" +langid_filter = FastTextLangId( + model_path="/path/to/lid.176.bin", # Download from fasttext.cc + min_langid_score=0.3 ) -# Apply language identification -identified_dataset = langid_filter(dataset) +# Apply language identification using Ray +executor = RayDataExecutor() +identified_dataset = executor.run_filter(dataset, langid_filter, text_field="text") # Extract language codes -identified_dataset.df["language"] = identified_dataset.df["language"].apply( - lambda score: score[1] # Extract language code from [score, lang_code] -) +def extract_language_code(batch): + import ast + batch["language"] = [ast.literal_eval(score)[1] for score in batch["language"]] + return batch + +identified_dataset = 
identified_dataset.map_batches(extract_language_code) # Filter for specific languages -english_docs = identified_dataset[identified_dataset.df.language == "EN"] -spanish_docs = identified_dataset[identified_dataset.df.language == "ES"] +english_docs = identified_dataset.filter(lambda row: row["language"] == "EN") +spanish_docs = identified_dataset.filter(lambda row: row["language"] == "ES") # Save by language -english_docs.to_json("output/english/", write_to_filename=True) -spanish_docs.to_json("output/spanish/", write_to_filename=True) +english_docs.write_json("output/english/") +spanish_docs.write_json("output/spanish/") ``` ### Supported Languages @@ -151,7 +149,7 @@ NeMo Curator supports language identification for **176 languages** through Fast 1. **Download the FastText model**: Get `lid.176.bin` from [fasttext.cc](https://fasttext.cc/docs/en/language-identification.html) 2. **Set appropriate thresholds**: Balance precision vs. recall based on your needs 3. **Handle non-spaced languages**: Use special processing for Chinese, Japanese, Thai, Korean -4. **Validate on your domain**: Test language detection accuracy on your specific data +4. **Check on your domain**: Test language detection accuracy on your specific data ```{toctree} :maxdepth: 4 diff --git a/docs/curate-text/process-data/language-management/language.md b/docs/curate-text/process-data/language-management/language.md index d0ff22890..3cfe92962 100644 --- a/docs/curate-text/process-data/language-management/language.md +++ b/docs/curate-text/process-data/language-management/language.md @@ -13,143 +13,200 @@ modality: "text-only" Large unlabeled text corpora often contain a variety of languages. NVIDIA NeMo Curator provides tools to accurately identify the language of each document, which is essential for language-specific curation tasks and building high-quality monolingual datasets. -## Overview +## How it Works -Language identification is a critical step in text data curation for several reasons: +NeMo Curator's language identification system works through a three-step process: -- Many data curation steps are language-specific (for example, quality filtering with language-tuned heuristics) -- Most curation pipelines focus on creating monolingual datasets -- Document language is important metadata for model training and evaluation +1. **Text Preprocessing**: The system normalizes input text by stripping whitespace and converting newlines to spaces to prepare it for fastText analysis. -NeMo Curator provides utilities for language identification using fastText, which offers highly accurate language detection across 176 languages. While preliminary language identification may occur earlier in the pipeline (such as during Common Crawl extraction with pyCLD2), fastText provides more accurate results for a definitive classification. +2. **FastText Language Detection**: The pre-trained fastText language identification model (`lid.176.bin`) analyzes the preprocessed text and returns: + - A confidence score (0.0 to 1.0) indicating certainty of the prediction + - A language code (e.g., "EN", "ES", "FR") in fastText's two-letter uppercase format -## Usage +3. **Filtering and Scoring**: Documents are filtered based on a configurable confidence threshold (`min_langid_score`), with results stored as metadata containing both the confidence score and language code. 
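+
+To make these three steps concrete, here is a small standalone sketch (not part of the `ray_curator` API) of what the filter does internally, using the upstream `fasttext` package directly; the sample text and threshold are illustrative only:
+
+```python
+import fasttext
+
+# Load the pre-trained language identification model (download lid.176.bin first)
+model = fasttext.load_model("lid.176.bin")
+
+# Step 1: normalize the text - strip whitespace, convert newlines to spaces
+text = "La Joconde n'a pas de sourcils.\nElle est exposée au Louvre."
+normalized = text.strip().replace("\n", " ")
+
+# Step 2: k=1 returns the top prediction, a label such as "__label__fr" plus a confidence score
+labels, scores = model.predict(normalized, k=1)
+lang_code = labels[0].replace("__label__", "").upper()  # "__label__fr" -> "FR"
+score = float(scores[0])
+
+# Step 3: keep the document only if the confidence clears the threshold
+min_langid_score = 0.3
+print([score, lang_code], "kept" if score >= min_langid_score else "dropped")
+```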
-::::{tab-set} +### Language Detection Process -:::{tab-item} Python +The `FastTextLangId` filter implements this workflow by: -```python -import nemo_curator as nc -from nemo_curator.datasets import DocumentDataset -from nemo_curator.utils.file_utils import get_all_files_paths_under -from nemo_curator.filters import FastTextLangId - -# Load your dataset -files = get_all_files_paths_under("input_data/", keep_extensions="jsonl") -dataset = DocumentDataset.read_json(files) - -# Create language identification filter -# IMPORTANT: Download lid.176.bin from https://fasttext.cc/docs/en/language-identification.html first -langid_filter = nc.ScoreFilter( - FastTextLangId( - model_path="/path/to/lid.176.bin", - min_langid_score=0.3 # Default confidence threshold (can be adjusted based on requirements) - ), - text_field="text", # Field in your documents containing text to analyze - score_field="language", # Field to store language identification results - score_type="object" # The score is an object containing [score, language_code] -) - -# Apply language identification -identified_dataset = langid_filter(dataset) - -# The language field contains [score, lang_code] -# Extract just the language code if needed -identified_dataset.df["language"] = identified_dataset.df["language"].apply( - lambda score: score[1] # Extract language code from [score, lang_code] -) - -# Now each document has a language code field -# You can filter for specific languages -english_docs = identified_dataset[identified_dataset.df.language == "EN"] - -# Save the dataset with language information -identified_dataset.to_json("output_with_language/", write_to_filename=True) -``` +- Loading the fastText language identification model on worker initialization +- Processing text through `model.predict()` with `k=1` to get the top language prediction +- Extracting the language code from fastText labels (e.g., `__label__en` becomes "EN") +- Comparing confidence scores against the threshold to determine document retention +- Returning results as `[confidence_score, language_code]` for downstream processing -::: +This approach supports **176 languages** with high accuracy, making it suitable for large-scale multilingual dataset curation where language-specific processing and monolingual dataset creation are critical. -:::{tab-item} CLI +## Before You Start -### Identifying Languages +- Language identification requires NeMo Curator with distributed backend support. For installation instructions, see the {ref}`admin-installation` guide. -```bash -filter_documents \ - --input-data-dir=/path/to/jsonl/files \ - --filter-config-file=./config/fasttext_langid.yaml \ - --log-scores \ - --log-dir=./log/lang_id -``` +--- -This command applies the fastText model to compute language scores and codes for each document, adding this information as additional fields in each JSON document. +## Usage -### Separating Documents by Language +The following example demonstrates how to create a language identification pipeline using `ray_curator` with distributed processing on a cluster. 
-Once language information is added to your documents, you can separate them by language: +::::{tab-set} -```bash -separate_by_metadata \ - --input-data-dir=/path/to/jsonl/files \ - --input-metadata-field=language \ - --output-data-dir=/path/to/output/by_language \ - --output-metadata-distribution=./data/lang_distro.json -``` +:::{tab-item} Python -After running this command, the output directory will contain one subdirectory per language, with each containing only documents in that language. +```python +"""Language identification using ray_curator.""" + +from ray_curator.backends.xenna import XennaExecutor +from ray_curator.pipeline import Pipeline +from ray_curator.stages.filters.classifier_filter import FastTextLangId +from ray_curator.stages.io.reader.jsonl import JsonlReader +from ray_curator.stages.modules.filter import ScoreFilter + +def create_language_identification_pipeline(data_dir: str) -> Pipeline: + """Create a pipeline for language identification.""" + + # Define pipeline + pipeline = Pipeline( + name="language_identification", + description="Identify document languages using FastText" + ) + + # Add stages + # 1. Reader stage - creates tasks from JSONL files + pipeline.add_stage( + JsonlReader( + file_paths=data_dir, + files_per_partition=2, # Each task processes 2 files + reader="pandas" + ) + ) + + # 2. Language identification with filtering + # IMPORTANT: Download lid.176.bin or lid.176.ftz from https://fasttext.cc/docs/en/language-identification.html + fasttext_model_path = "/path/to/lid.176.bin" # or lid.176.ftz (compressed) + pipeline.add_stage( + ScoreFilter( + FastTextLangId(model_path=fasttext_model_path, min_langid_score=0.3), + score_field="language" + ) + ) + + return pipeline + +def main(): + # Create pipeline + pipeline = create_language_identification_pipeline("./data") + + # Print pipeline description + print(pipeline.describe()) + + # Create executor and run + executor = XennaExecutor() + results = pipeline.run(executor) + + # Process results + print(f"Pipeline completed! Processed {len(results)} batches") + + total_documents = sum(task.num_items for task in results) if results else 0 + print(f"Total documents processed: {total_documents}") + + # Access language scores + for i, batch in enumerate(results): + if batch.num_items > 0: + df = batch.to_pandas() + print(f"Batch {i} columns: {list(df.columns)}") + # Language scores are now in the 'language' field + +if __name__ == "__main__": + main() +``` ::: :::: ## Configuration -A typical configuration for language identification looks like: - -```yaml -# Example fasttext_langid.yaml -input_field: text -filters: - - name: nemo_curator.filters.classifier_filter.FastTextLangId - log_score: True - params: - model_path: /path/to/lid.176.bin - min_langid_score: 0.3 # Default confidence threshold (adjust based on precision/recall needs) -``` +Currently, `ray_curator` language identification is configured programmatically using the Python API as shown in the previous section. + +:::{note} +YAML pipeline configuration support for `ray_curator` is planned for future releases but not yet available. Use the programmatic Python API for now. +::: ## Understanding Results -The language identification process adds a field to each document: +The language identification process adds a score field to each document batch: -1. `language`: By default, this contains a list with two elements: +1. 
**`language` field**: Contains the FastText language identification results as a list with two elements: - Element 0: The confidence score (between 0 and 1) - Element 1: The language code in fastText format (for example, "EN" for English, "ES" for Spanish) +2. **Task-based processing**: `ray_curator` processes documents in batches (tasks), and results are available through the task's pandas DataFrame: + +```python +# Access results from pipeline execution +for batch in results: + df = batch.to_pandas() + # Language scores are in the 'language' column + print(df[['text', 'language']].head()) +``` + :::{note} FastText language codes are typically two-letter uppercase codes that may differ slightly from standard ISO 639-1 codes. The model supports 176 languages with high accuracy. ::: -As shown in the Python example, you can extract just the language code with a simple transform if needed. +### Processing Language Results -A higher confidence score indicates greater certainty in the language identification. You can adjust the threshold based on your requirements for precision. +You can extract and work with language identification results: + +```python +# Extract language codes from results +for batch in results: + df = batch.to_pandas() + if 'language' in df.columns: + # Parse the [score, language_code] format + df['lang_score'] = df['language'].apply(lambda x: eval(x)[0] if isinstance(x, str) else x[0]) + df['lang_code'] = df['language'].apply(lambda x: eval(x)[1] if isinstance(x, str) else x[1]) + + # Filter by confidence threshold + high_confidence = df[df['lang_score'] > 0.7] + print(f"High confidence documents: {len(high_confidence)}") +``` + +A higher confidence score indicates greater certainty in the language identification. The `ScoreFilter` automatically filters documents below your specified `min_langid_score` threshold. ## Performance Considerations -- Language identification is computationally intensive but highly scalable across processors -- For large datasets, consider using a distributed Dask setup -- The fastText model file (`lid.176.bin`) is approximately 130MB and must be accessible to all worker nodes -- Processing speed depends on document length and available computational resources -- Memory usage scales with the number of worker processes and batch sizes +- Language identification with `ray_curator` is computationally intensive but highly scalable with Ray's distributed processing +- `ray_curator` automatically handles distributed execution across cluster nodes using the XennaExecutor backend +- The fastText model file (`lid.176.bin` or compressed `lid.176.ftz`) must be accessible to all worker nodes +- Processing speed depends on document length, cluster size, available computational resources, and the `files_per_partition` setting +- Memory usage scales with cluster configuration and task batch sizes +- ray_curator provides efficient resource allocation and autoscaling for large-scale processing +- **Task-based processing**: Documents are grouped into tasks based on the `JsonlReader` configuration, enabling parallel processing across the cluster ## Best Practices :::{important} -**Model Download Required**: Download the fastText language identification model (`lid.176.bin`) from the [official fastText repository](https://fasttext.cc/docs/en/language-identification.html) before using this filter. The model file is approximately 130MB. 
+**Model Download Required**: Download the fastText language identification model from the [official fastText repository](https://fasttext.cc/docs/en/language-identification.html) before using this filter. Available formats: +- `lid.176.bin` (full model, ~130MB) +- `lid.176.ftz` (compressed model, smaller file size) ::: -- Set an appropriate confidence threshold based on your requirements: - - **Default threshold (0.3)**: Balanced approach suitable for most use cases - - **Higher threshold (0.7+)**: More precision but may discard borderline documents - - **Lower threshold (0.1-0.2)**: Higher recall but may include misclassified documents -- Analyze the language distribution in your dataset to understand its composition +### `ray_curator` Specific Recommendations + +- **Task partitioning**: Adjust `files_per_partition` in `JsonlReader` based on your data size and cluster capacity. Smaller values create more tasks for better parallelization +- **Pipeline design**: Use `ScoreFilter` for combined scoring and filtering, or separate `Score` and `Filter` stages for more control +- **Model accessibility**: Ensure the fastText model file is accessible to all worker nodes in your Ray cluster +- **Resource allocation**: Configure appropriate resources for the `FastTextLangId` stage based on your model size and processing requirements + +### Confidence Threshold Guidelines + +- **Default threshold (0.3)**: Balanced approach suitable for most use cases +- **Higher threshold (0.7+)**: More precision but may discard borderline documents +- **Lower threshold (0.1-0.2)**: Higher recall but may include misclassified documents + +### Production Workflow Tips + +- Analyze the language distribution in your dataset using the pipeline results to understand composition - Consider a two-pass approach: first filter with a lower threshold, then manually review edge cases -- For production workflows, validate language identification accuracy on a sample of your specific domain data \ No newline at end of file +- For production workflows, validate language identification accuracy on a sample of your specific domain data +- Monitor task completion and resource usage through Ray's monitoring capabilities +- Save pipeline results to Parquet format for efficient downstream processing diff --git a/docs/curate-text/process-data/language-management/stopwords.md b/docs/curate-text/process-data/language-management/stopwords.md index 6829943e4..3ecf12807 100644 --- a/docs/curate-text/process-data/language-management/stopwords.md +++ b/docs/curate-text/process-data/language-management/stopwords.md @@ -9,9 +9,10 @@ modality: "text-only" --- (text-process-data-languages-stop-words)= + # Stop Words in Text Processing -Stop words are common words that are often filtered out in natural language processing (NLP) tasks because they typically don't carry significant meaning. NVIDIA NeMo Curator provides built-in stop word lists for several languages to support text analysis and extraction processes. +Stop words are common words that are often filtered out in natural language processing (NLP) tasks because they typically don't carry significant meaning. Ray Curator provides built-in stop word lists for several languages to support text analysis and extraction processes. ```{note} Studies on stopword lists and their distribution in various text corpora have shown that typical English text contains 30–40% stop words. 
@@ -22,14 +23,15 @@ Studies on stopword lists and their distribution in various text corpora have sh Stop words are high-frequency words that generally don't contribute much semantic value to text analysis. Examples in English include "the," "is," "at," "which," and "on." These words appear so frequently in language that they can distort text processing tasks if not properly managed. Key characteristics of stop words: + - They appear with high frequency in text - They typically serve grammatical rather than semantic functions - They're language-specific (each language has its own set of stop words) -- Removing them can improve efficiency in many NLP tasks +- Removing them can improve efficiency in NLP tasks -## Why Stop Words Matter in NeMo Curator +## Why Stop Words Matter in Ray Curator -In NeMo Curator, stop words play several important roles: +In Ray Curator, stop words play several important roles: 1. **Text Extraction**: The text extraction process (especially for Common Crawl data) uses stop word density as a key metric to identify meaningful content 2. **Language Detection**: Stop words help in language detection and processing @@ -38,7 +40,7 @@ In NeMo Curator, stop words play several important roles: ## Available Stop Word Lists -NeMo Curator leverages the extensive stop word collection from [JusText](https://github.com/miso-belica/jusText/tree/main/justext/stoplists) for most languages. In addition, NeMo Curator provides custom stop word lists for the following languages not covered by JusText: +Ray Curator leverages the extensive stop word collection from [JusText](https://github.com/miso-belica/jusText/tree/main/justext/stoplists) for most languages. In addition, Ray Curator provides custom stop word lists for the following languages not covered by JusText: | Language | File Name | Number of Stop Words | |----------|-----------|---------------------| @@ -88,7 +90,7 @@ thai_stopwords = frozenset([ ## How Stop Words Are Used in Text Extraction -Stop words are a critical component in NeMo Curator's text extraction algorithms. Here's how they're used in different extractors: +Stop words are a critical component in Ray Curator's text extraction algorithms. Here's how they're used in different extractors: ### JusText Extractor @@ -98,7 +100,7 @@ The JusText algorithm uses stop word density to classify text blocks as main con 2. **Parameter Customization**: You can customize the stop word density thresholds via `stopwords_low` and `stopwords_high` parameters ```python -from nemo_curator.download import JusTextExtractor +from ray_curator.stages.download.text.html_extractors import JusTextExtractor # Customize stop word thresholds extractor = JusTextExtractor( @@ -112,7 +114,7 @@ extractor = JusTextExtractor( These extractors also use stop word density to filter extracted content: ```python -from nemo_curator.download import ResiliparseExtractor, TrafilaturaExtractor +from ray_curator.stages.download.text.html_extractors import ResiliparseExtractor, TrafilaturaExtractor # Resiliparse with custom stop word density resiliparse = ResiliparseExtractor( @@ -127,22 +129,25 @@ trafilatura = TrafilaturaExtractor( ## Special Handling for Non-Spaced Languages -Languages like Thai, Chinese, Japanese, and Korean don't use spaces between words, which affects how stop word density is calculated. 
NeMo Curator identifies these languages via the `NON_SPACED_LANGUAGES` constant: +Languages like Thai, Chinese, Japanese, and Korean don't use spaces between words, which affects how stop word density is calculated. Ray Curator identifies these languages via the `NON_SPACED_LANGUAGES` constant: ```python -NON_SPACED_LANGUAGES = ["THAI", "CHINESE", "JAPANESE", "KOREAN"] +NON_SPACED_LANGUAGES = frozenset(["THAI", "CHINESE", "JAPANESE", "KOREAN"]) ``` For these languages, special handling is applied: + - Stop word density calculations are disabled - Boilerplate removal based on stop words is adjusted ## Creating Custom Stop Word Lists -You can create and use your own stop word lists when processing text with NeMo Curator: +You can create and use your own stop word lists when processing text with Ray Curator: ```python -from nemo_curator.download import download_common_crawl +from ray_curator.stages.download.text.common_crawl import CommonCrawlDownloadExtractStage +from ray_curator.pipeline import Pipeline +from ray_curator.backends.xenna import XennaExecutor # Define custom stop words for multiple languages custom_stop_lists = { @@ -150,23 +155,31 @@ custom_stop_lists = { "SPANISH": frozenset(["el", "la", "los", "las", "un", "una", "y", "o", "de", "en", "que"]), } -# Use custom stop lists in download process -dataset = download_common_crawl( - "/output/folder", - "2023-06", - "2023-10", +# Create Common Crawl processing stage with custom stop lists +cc_stage = CommonCrawlDownloadExtractStage( + start_snapshot="2023-06", + end_snapshot="2023-10", + download_dir="/output/folder", stop_lists=custom_stop_lists ) + +# Create and run pipeline +pipeline = Pipeline(name="custom_stopwords_pipeline") +pipeline.add_stage(cc_stage) + +# Execute pipeline +executor = XennaExecutor() +results = pipeline.run(executor) ``` ## Performance Considerations -- Stop word lists are implemented as frozen sets for fast lookups (O(1) complexity) -- Using appropriate stop word lists can significantly improve extraction quality +- Stop word lists use frozen sets for fast lookups (O(1) complexity) +- Using appropriate stop word lists can improve extraction quality - For specialized domains, consider customizing the stop word lists -## Additional Resources +## More Resources - [JusText Algorithm Overview](https://corpus.tools/wiki/Justext/Algorithm) - [Resiliparse Documentation](https://resiliparse.chatnoir.eu/en/latest/man/extract/html2text.html) -- [Trafilatura Documentation](https://trafilatura.readthedocs.io/en/latest/) \ No newline at end of file +- [Trafilatura Documentation](https://trafilatura.readthedocs.io/en/latest/) \ No newline at end of file diff --git a/docs/curate-text/process-data/quality-assessment/classifier.md b/docs/curate-text/process-data/quality-assessment/classifier.md index d0f94ebc4..42a5aedc3 100644 --- a/docs/curate-text/process-data/quality-assessment/classifier.md +++ b/docs/curate-text/process-data/quality-assessment/classifier.md @@ -9,6 +9,7 @@ modality: "text-only" --- (text-process-data-filter-classifier)= + # Classifier-Based Filtering Classifier-based filtering uses machine learning models to differentiate between high-quality and low-quality documents. NVIDIA NeMo Curator implements an approach similar to the one described in [Brown et al., 2020](https://arxiv.org/abs/2005.14165), which trains a binary skip-gram classifier to distinguish between curated high-quality data and lower-quality data. 
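+
+The examples below assume such a classifier has already been trained and saved as `quality_classifier.bin`. As a rough, minimal sketch (assuming a labelled training file, and using `__label__lq` as a hypothetical low-quality label alongside the `__label__hq` label used below), a binary classifier of this kind can be trained with the upstream `fasttext` package:
+
+```python
+import fasttext
+
+# train.txt (hypothetical path) holds one document per line, prefixed with its label:
+#   __label__hq <curated, high-quality text ...>
+#   __label__lq <lower-quality text ...>
+model = fasttext.train_supervised(
+    input="train.txt",
+    lr=0.5,          # learning rate
+    epoch=25,        # training epochs
+    wordNgrams=2,    # include bigrams as features
+)
+
+# Save the model so it can be passed to FastTextQualityFilter(model_path=...)
+model.save_model("quality_classifier.bin")
+print(model.predict("A well written, informative paragraph.", k=1))
+```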
@@ -143,56 +144,60 @@ Finally, use the trained model to filter your dataset: :::{tab-item} Python ```python -import nemo_curator as nc -from nemo_curator.datasets import DocumentDataset -from nemo_curator.filters import FastTextQualityFilter - -# Load your dataset -dataset = DocumentDataset.read_json("input_data/*.jsonl") - -# Create a quality filter using the trained model -filter_step = nc.ScoreFilter( - FastTextQualityFilter( - model_path="./quality_classifier.bin", - label="__label__hq", # High quality label - alpha=3, # Pareto distribution alpha parameter - seed=42 # Random seed for reproducibility - ), - text_field="text", - score_field="quality_score" -) - -# Apply the filter -high_quality_data = filter_step(dataset) - -# Save the results -high_quality_data.to_json("high_quality_output/", write_to_filename=True) -``` - -::: - -:::{tab-item} CLI - -```bash -filter_documents \ - --input-data-dir=/path/to/input/data \ - --filter-config-file=./config/fasttext_quality_filter.yaml \ - --output-retained-document-dir=/path/to/output/high_quality \ - --output-removed-document-dir=/path/to/output/low_quality \ - --log-dir=/path/to/logs/fasttext_classifier -``` - -Where the YAML configuration file looks like: - -```yaml -input_field: text -filters: - - name: nemo_curator.filters.FastTextQualityFilter - params: - model_path: /path/to/quality_classifier.bin - alpha: 3 - label: "__label__hq" - seed: 42 +from ray_curator.backends.xenna import XennaExecutor +from ray_curator.pipeline import Pipeline +from ray_curator.stages.filters.fasttext_filter import FastTextQualityFilter +from ray_curator.stages.io.reader.jsonl import JsonlReader +from ray_curator.stages.modules.filter import ScoreFilter + +def create_quality_filtering_pipeline(data_dir: str) -> Pipeline: + """Create a pipeline for quality filtering using trained fastText model.""" + + # Define pipeline + pipeline = Pipeline( + name="quality_filtering", + description="Filter documents using trained fastText quality classifier" + ) + + # Add stages + # 1. Reader stage + pipeline.add_stage( + JsonlReader( + file_paths=data_dir, + files_per_partition=2, + reader="pandas" + ) + ) + + # 2. Quality filtering + pipeline.add_stage( + ScoreFilter( + FastTextQualityFilter( + model_path="./quality_classifier.bin", + label="__label__hq", + alpha=3, + seed=42 + ), + score_field="quality_score" + ) + ) + + return pipeline + +def main(): + # Create and run pipeline + pipeline = create_quality_filtering_pipeline("./input_data") + executor = XennaExecutor() + results = pipeline.run(executor) + + # Process results + for i, batch in enumerate(results): + # Save high-quality documents + output_file = f"high_quality_output/batch_{i}.parquet" + batch.to_pyarrow().write(output_file) + +if __name__ == "__main__": + main() ``` ::: @@ -218,22 +223,7 @@ The `FastTextQualityFilter` accepts the following parameters: - `alpha` (float, default=3): Alpha parameter for Pareto distribution sampling - `seed` (int, default=42): Random seed for reproducible sampling -## Configuration - -A typical configuration for classifier-based filtering looks like: - -```yaml -filters: - - name: ScoreFilter - filter: - name: FastTextQualityFilter - model_path: /path/to/quality_classifier.bin - label: __label__hq - alpha: 3 - seed: 42 - text_field: text - score_field: quality_score -``` + ## Best Practices @@ -243,4 +233,4 @@ For effective classifier-based filtering: 2. **Validation**: Manually review a sample of filtered results to confirm effectiveness 3. 
**Threshold tuning**: Adjust the threshold based on your quality requirements 4. **Combination with heuristics**: Consider using heuristic filters as a pre-filter -5. **Domain adaptation**: Train domain-specific classifiers for specialized corpora \ No newline at end of file +5. **Domain adaptation**: Train domain-specific classifiers for specialized corpora diff --git a/docs/curate-text/process-data/quality-assessment/custom.md b/docs/curate-text/process-data/quality-assessment/custom.md index 927f4c9bd..f0bb4ccff 100644 --- a/docs/curate-text/process-data/quality-assessment/custom.md +++ b/docs/curate-text/process-data/quality-assessment/custom.md @@ -1,7 +1,7 @@ --- description: "Create and combine custom filters using NeMo Curator's flexible framework for specialized data quality requirements" categories: ["how-to-guides"] -tags: ["custom-filters", "extensible", "flexible", "advanced", "framework", "batched"] +tags: ["custom-filters", "extensible", "flexible", "advanced", "framework", "task-based"] personas: ["data-scientist-focused", "mle-focused"] difficulty: "advanced" content_type: "how-to" @@ -9,6 +9,7 @@ modality: "text-only" --- (text-process-data-filter-custom)= + # Custom Filters NVIDIA NeMo Curator provides a flexible framework for implementing and combining custom filters to meet your specific data quality requirements. Whether you need to filter documents based on domain-specific criteria or optimize your pipeline's performance, custom filters give you complete control over the filtering process. @@ -23,7 +24,7 @@ Custom filters in NeMo Curator inherit from the `DocumentFilter` abstract base c Here's a simple example of a custom filter: ```python -from nemo_curator.filters import DocumentFilter +from ray_curator.stages.filters.doc_filter import DocumentFilter class CustomWordFilter(DocumentFilter): def __init__(self, target_words, min_occurrences=1): @@ -41,25 +42,19 @@ class CustomWordFilter(DocumentFilter): def keep_document(self, score: int): """Keep documents with enough target words.""" return score >= self._min_occurrences - - @property - def backend(self): - """Specify which dataframe backend this filter supports.""" - return "pandas" # Options are "pandas", "cudf", or "any" ``` -By default, the `backend` property returns "pandas", but you can override it to support GPU-accelerated processing with "cudf" or specify "any" if your filter works with either backend. +Custom filters automatically work with both pandas and PyArrow data formats used in the task-based processing architecture. 
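+
+Before wiring a filter into a pipeline, you can sanity-check the score/keep contract by calling it directly. The snippet below is a minimal sketch; the set-based `target_words` argument and the sample values are assumptions for illustration:
+
+```python
+# Hypothetical quick check of the DocumentFilter contract outside a pipeline
+word_filter = CustomWordFilter(target_words={"science", "research"}, min_occurrences=2)
+
+doc = "Recent research in data science builds on earlier research methods."
+score = word_filter.score_document(doc)   # counts occurrences of the target words
+keep = word_filter.keep_document(score)   # True once the count reaches min_occurrences
+print(score, keep)
+```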
## Using Custom Filters -Once you've defined your custom filter, you can use it with NeMo Curator's filtering framework: +Once you've defined your custom filter, you can use it with NeMo Curator's task-based processing framework: ```python -import nemo_curator as nc -from nemo_curator.datasets import DocumentDataset - -# Load your dataset -dataset = DocumentDataset.read_json("input_data/*.jsonl") +from ray_curator.pipeline.pipeline import Pipeline +from ray_curator.stages.modules.score_filter import ScoreFilter +from ray_curator.stages.io.reader.jsonl import JsonlReader +from ray_curator.stages.io.writer.jsonl import JsonlWriter # Create and configure your custom filter my_filter = CustomWordFilter( @@ -67,141 +62,192 @@ my_filter = CustomWordFilter( min_occurrences=3 ) -# Apply the filter -filter_step = nc.ScoreFilter( - my_filter, +# Create a pipeline +pipeline = Pipeline( + name="custom_filter_pipeline", + description="Apply custom word filter to documents" +) + +# Add stages to the pipeline +pipeline.add_stage(JsonlReader(file_paths="input_data/*.jsonl")) +pipeline.add_stage(ScoreFilter( + filter_obj=my_filter, text_field="text", score_field="target_word_count" -) +)) +pipeline.add_stage(JsonlWriter(output_dir="filtered_output/")) -# Get filtered dataset -filtered_dataset = filter_step(dataset) +# Build and run the pipeline +pipeline.build() -# Save results -filtered_dataset.to_json("filtered_output/", write_to_filename=True) +# Execute with your chosen backend (Ray Data or Xenna) +from ray_curator.backends.experimental.ray_data.executor import RayDataExecutor +executor = RayDataExecutor() +pipeline.run(executor) ``` -## Optimizing Performance with Batched Filters +## Optimizing Performance + +The framework automatically optimizes performance through its task-based architecture. DocumentBatch objects contain documents that process together, providing natural batching without requiring special decorators. + +For computationally intensive filters, you can optimize performance by: -For improved performance, especially with large datasets, you can implement batched versions of your filters using the `@batched` decorator: +1. **Using vectorized operations**: Use pandas or NumPy vectorized operations when possible +2. **Configuring batch sizes**: Adjust the `files_per_partition` parameter in readers +3. **Resource allocation**: Specify appropriate CPU/GPU resources for your stages ```python +from ray_curator.stages.filters.doc_filter import DocumentFilter import pandas as pd -from nemo_curator.filters import DocumentFilter -from nemo_curator.utils.decorators import batched -class BatchedCustomFilter(DocumentFilter): +class OptimizedCustomFilter(DocumentFilter): def __init__(self, threshold=0.5): super().__init__() self._threshold = threshold - self._name = 'batched_custom_filter' + self._name = 'optimized_custom_filter' def score_document(self, text: str): - # Single document scoring logic - return compute_quality_score(text) + # Single document scoring logic for individual processing + return len(text.split()) / max(text.count('.'), 1) # words per sentence - @batched - def keep_document(self, scores: pd.Series): - """Process multiple documents at once. 
- - Args: - scores: Pandas Series containing scores with document IDs as index - - Returns: - Pandas Series of boolean values with same index as input - """ - return scores >= self._threshold + def keep_document(self, score: float): + """Filter logic applied to individual document scores.""" + return score >= self._threshold ``` -When implementing batched methods, it's crucial to maintain the original index in the returned Series to ensure proper document tracking. +Processing stages automatically handle batching of DocumentBatch objects containing documents. ## Filter Composition Methods -NeMo Curator makes it easy to combine multiple filters using different composition approaches: +NeMo Curator makes it easy to combine filters using pipeline composition: -### Sequential +### Sequential Pipeline Composition -The `Sequential` class applies a series of filters in order: +Use the `Pipeline` class to apply a series of filters in order: ```python -import nemo_curator as nc -from nemo_curator.filters import WordCountFilter, NonAlphaNumericFilter, UrlsFilter - -# Create a pipeline of filters -filter_pipeline = nc.Sequential([ - nc.ScoreFilter(WordCountFilter(min_words=100)), - nc.ScoreFilter(NonAlphaNumericFilter(max_symbol_ratio=0.3)), - nc.ScoreFilter(UrlsFilter(max_urls=3)) -]) - -# Apply the pipeline -high_quality_docs = filter_pipeline(dataset) +from ray_curator.pipeline.pipeline import Pipeline +from ray_curator.stages.modules.score_filter import ScoreFilter +from ray_curator.stages.filters.heuristic_filter import WordCountFilter, NonAlphaNumericFilter, UrlsFilter +from ray_curator.stages.io.reader.jsonl import JsonlReader +from ray_curator.stages.io.writer.jsonl import JsonlWriter + +# Create a pipeline with multiple filters +filter_pipeline = Pipeline( + name="sequential_filter_pipeline", + description="Apply multiple filters in sequence" +) + +# Add reader +filter_pipeline.add_stage(JsonlReader(file_paths="input_data/*.jsonl")) + +# Add filtering stages +filter_pipeline.add_stage(ScoreFilter( + filter_obj=WordCountFilter(min_words=100), + text_field="text" +)) +filter_pipeline.add_stage(ScoreFilter( + filter_obj=NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio=0.3), + text_field="text" +)) +filter_pipeline.add_stage(ScoreFilter( + filter_obj=UrlsFilter(max_url_to_text_ratio=0.2), + text_field="text" +)) + +# Add writer +filter_pipeline.add_stage(JsonlWriter(output_dir="high_quality_output/")) + +# Execute the pipeline +filter_pipeline.build() +executor = RayDataExecutor() +filter_pipeline.run(executor) ``` -### Parallel with Voting (Custom Implementation) +### Custom Composite Stages -You can implement a custom voting system where documents must pass a certain number of filters. This is not a built-in class but can be implemented as a utility function: +For complex filter combinations, you can create custom composite stages that provide voting or parallel filtering logic: ```python -import pandas as pd -import nemo_curator as nc - -# Custom utility function for filter voting -def voting_filter(dataset, filters, min_passing=2): - """ - Custom implementation of a voting filter system. 
- - Args: - dataset: DocumentDataset to filter - filters: List of filter modules - min_passing: Minimum number of filters that must accept a document - - Returns: - Filtered DocumentDataset - """ - results = [] - for f in filters: - results.append(f(dataset)) +from dataclasses import dataclass +from ray_curator.stages.base import CompositeStage, ProcessingStage +from ray_curator.stages.modules.score_filter import ScoreFilter +from ray_curator.tasks import DocumentBatch + +@dataclass +class VotingFilterStage(CompositeStage[DocumentBatch, DocumentBatch]): + """Custom composite stage that implements voting filter logic.""" - # Create a mask where documents pass at least min_passing filters - document_ids = dataset.df.index - pass_counts = pd.Series(0, index=document_ids) + filters: list[DocumentFilter] + min_passing: int = 2 + text_field: str = "text" - for result in results: - pass_counts[result.df.index] += 1 + def decompose(self) -> list[ProcessingStage]: + """Decompose into individual scoring stages.""" + stages = [] + + # Add scoring stages for each filter + for i, filter_obj in enumerate(self.filters): + stages.append(ScoreFilter( + filter_obj=filter_obj, + text_field=self.text_field, + score_field=f"filter_{i}_score" + )) + + # Add final voting stage + stages.append(VotingDecisionStage( + filter_count=len(self.filters), + min_passing=self.min_passing + )) + + return stages - passing_ids = pass_counts[pass_counts >= min_passing].index - return nc.DocumentDataset(dataset.df.loc[passing_ids]) + def get_description(self) -> str: + return f"Voting filter requiring {self.min_passing} of {len(self.filters)} filters to pass" ``` +This approach leverages the modular nature of the pipeline architecture to provide complex filtering logic. + ## Scoring Without Filtering Sometimes you want to add quality scores to your documents without actually filtering them: ```python -import nemo_curator as nc -from nemo_curator.filters import WordCountFilter, NonAlphaNumericFilter +from ray_curator.pipeline.pipeline import Pipeline +from ray_curator.stages.modules.score_filter import Score +from ray_curator.stages.filters.heuristic_filter import WordCountFilter, NonAlphaNumericFilter +from ray_curator.stages.io.reader.jsonl import JsonlReader +from ray_curator.stages.io.writer.jsonl import JsonlWriter + +# Create a pipeline for adding scores +scoring_pipeline = Pipeline( + name="scoring_pipeline", + description="Add quality scores without filtering" +) + +# Add reader +scoring_pipeline.add_stage(JsonlReader(file_paths="input_data/*.jsonl")) -# Score documents without filtering them -scoring_step = nc.Score( - WordCountFilter().score_document, +# Add scoring stages (no filtering) +scoring_pipeline.add_stage(Score( + score_fn=WordCountFilter(), # Uses the filter's score_document method text_field="text", score_field="word_count" -) +)) -# Add multiple scores -symbol_scoring = nc.Score( - NonAlphaNumericFilter().score_document, - text_field="text", +scoring_pipeline.add_stage(Score( + score_fn=NonAlphaNumericFilter(), + text_field="text", score_field="symbol_ratio" -) +)) -# Apply scoring -scored_dataset = scoring_step(dataset) -scored_dataset = symbol_scoring(scored_dataset) +# Add writer to save scored documents +scoring_pipeline.add_stage(JsonlWriter(output_dir="scored_output/")) -# Save the scored dataset -scored_dataset.to_json("scored_output/", write_to_filename=True) +# Execute the pipeline +scoring_pipeline.build() +executor = RayDataExecutor() +scoring_pipeline.run(executor) ``` ## Filtering on Existing 
Metadata @@ -209,54 +255,58 @@ scored_dataset.to_json("scored_output/", write_to_filename=True) If your dataset already contains quality metrics, you can filter directly on those: ```python -import nemo_curator as nc - -# Filter based on existing metadata field -filter_step = nc.Filter( - lambda score: score < 0.3, # Keep only documents with toxicity < 0.3 - filter_field="toxicity_score" +from ray_curator.pipeline.pipeline import Pipeline +from ray_curator.stages.modules.score_filter import Filter +from ray_curator.stages.io.reader.jsonl import JsonlReader +from ray_curator.stages.io.writer.jsonl import JsonlWriter + +# Create a pipeline to filter on existing metadata +metadata_filter_pipeline = Pipeline( + name="metadata_filter_pipeline", + description="Filter documents based on existing scores" ) -safe_documents = filter_step(scored_dataset) -``` +# Add reader +metadata_filter_pipeline.add_stage(JsonlReader(file_paths="scored_data/*.jsonl")) -## Integrating with CLI - -To make your custom filters available through the command-line interface, you can register them in a configuration file: - -```yaml -# custom_filters.yaml -input_field: text -filters: - - name: ScoreFilter - filter: - name: path.to.your.CustomWordFilter - params: - target_words: ["machine", "learning", "ai"] - min_occurrences: 2 - text_field: text - score_field: target_word_count - - # Add more filters as needed -``` +# Filter based on existing metadata field +metadata_filter_pipeline.add_stage(Filter( + filter_fn=lambda score: score < 0.3, # Keep only documents with toxicity < 0.3 + filter_field="toxicity_score" +)) -Then use this configuration with the `filter_documents` CLI: +# Add writer for filtered results +metadata_filter_pipeline.add_stage(JsonlWriter(output_dir="safe_documents/")) -```bash -filter_documents \ - --input-data-dir=/path/to/input/data \ - --filter-config-file=./custom_filters.yaml \ - --output-retained-document-dir=/path/to/output \ - --log-dir=/path/to/logs +# Execute the pipeline +metadata_filter_pipeline.build() +executor = RayDataExecutor() +metadata_filter_pipeline.run(executor) ``` ## Best Practices When developing custom filters: -1. **Optimize for performance**: Implement batch processing for computationally intensive operations -2. **Add meaningful metadata**: Store scores that provide insight into why documents were kept or removed +1. **Leverage task-based processing**: The framework automatically handles batching through DocumentBatch objects +2. **Add meaningful metadata**: Store scores that provide insight into which documents pass filters 3. **Start simple**: Begin with basic filters and incrementally add complexity -4. **Test on samples**: Validate your filters on small samples before processing large datasets -5. **Monitor filter impact**: Track how many documents each filter removes to identify potential issues -6. **Document behavior**: Add clear documentation about what your filter does and its parameters \ No newline at end of file +4. **Test on samples**: Check your filters on small samples before processing large datasets using `limit` parameters in readers +5. **Track filter impact**: Use pipeline logging and stage performance metrics to track processing +6. **Resource allocation**: Specify appropriate CPU/GPU resources for computationally intensive filters +7. **Modular design**: Create reusable filters that work across different pipelines +8. 
**Document behavior**: Add clear documentation about what your filter does and its parameters + +### Resource Configuration Example + +```python +from ray_curator.stages.resources import Resources + +class ComputeIntensiveFilter(DocumentFilter): + # Override default resources for GPU-based processing + _resources = Resources(cpus=2.0, gpu_memory_gb=4.0) + + def score_document(self, text: str): + # GPU-accelerated scoring logic + pass +``` diff --git a/docs/curate-text/process-data/quality-assessment/distributed-classifier.md b/docs/curate-text/process-data/quality-assessment/distributed-classifier.md index 7fbf90f94..3dfd8ecfe 100644 --- a/docs/curate-text/process-data/quality-assessment/distributed-classifier.md +++ b/docs/curate-text/process-data/quality-assessment/distributed-classifier.md @@ -1,217 +1,252 @@ --- -description: "Perform distributed data classification using GPU-accelerated models for domain, quality, safety, and content assessment" +description: "Perform task-based data classification using composable processing stages for domain, quality, safety, and content assessment" categories: ["how-to-guides"] -tags: ["distributed-classification", "gpu", "domain", "quality", "safety", "crossfit", "scalable"] +tags: ["task-based-classification", "ray", "pipeline", "filters", "extensible", "scalable"] personas: ["data-scientist-focused", "mle-focused"] difficulty: "intermediate" content_type: "how-to" modality: "text-only" --- -(text-process-data-filter-dist-classifier)= -# Distributed Data Classification +(text-process-data-filter-task-classifier)= -NVIDIA NeMo Curator provides a module for performing distributed classification on large text datasets using GPU acceleration. This enables the categorization and filtering of text documents based on multiple dimensions such as domain, quality, safety, educational value, content type, and more. These classifications can enhance the quality of training data for large language models by identifying high-value content and removing problematic material. +# Task-Based Data Classification + +NVIDIA NeMo Curator provides a flexible, task-based classification system using Ray for distributed processing. This enables categorization and filtering of text documents through modular processing stages that you can combine into custom pipelines for domain, quality, safety, and content assessment. ## How It Works -The distributed data classification in NeMo Curator works by: +The task-based classification system works by: -1. **Parallel Processing**: Chunking datasets across multiple computing nodes and GPUs to accelerate classification -2. **Pre-trained Models**: Using specialized models for different classification tasks -3. **Batched Inference**: Optimizing throughput with intelligent batching via CrossFit integration -4. **Consistent API**: Providing a unified interface through the `DistributedDataClassifier` base class +1. **Pipeline Architecture**: Composing processing stages into workflows that operate on DocumentBatch tasks +2. **Extensible Filters**: Using DocumentFilter base classes for custom scoring and filtering logic +3. **Ray Distribution**: Leveraging Ray for distributed execution across nodes and GPU devices +4. **Task Flow**: Processing data in batches that flow through interconnected stages -The `DistributedDataClassifier` is designed to run on GPU clusters with minimal code changes regardless of which specific classifier you're using. 
All classifiers support filtering based on classification results and storing prediction scores as metadata. +The system provides three core classification stages that you can combine and customize for different use cases, providing fine-grained control over classification workflows. --- -## Usage +## Core Classification Stages -NVIDIA NeMo Curator provides a base class `DistributedDataClassifier` that can be extended to fit your specific model. The only requirement is that the model can fit on a single GPU. This module operates on the GPU, so the Dask cluster must be started as a GPU cluster, and `DocumentDataset` requires `backend="cudf"`. +The task-based classification system provides three main stages that you can compose into pipelines: -### Classifier Comparison +### Stage Types -| Classifier | Purpose | Model Location | Key Parameters | Requirements | +| Stage | Purpose | Input | Output | Use Case | |---|---|---|---|---| -| DomainClassifier | Categorize English text by domain | [nvidia/domain-classifier](https://huggingface.co/nvidia/domain-classifier) | `filter_by`, `text_field` | None | -| MultilingualDomainClassifier | Categorize text in 52 languages by domain | [nvidia/multilingual-domain-classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) | `filter_by`, `text_field` | None | -| QualityClassifier | Assess document quality | [nvidia/quality-classifier-deberta](https://huggingface.co/nvidia/quality-classifier-deberta) | `filter_by`, `text_field` | None | -| AegisClassifier | Detect unsafe content | [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) | `aegis_variant`, `filter_by` | HuggingFace token | -| InstructionDataGuardClassifier | Detect poisoning attacks | [nvidia/instruction-data-guard](https://huggingface.co/nvidia/instruction-data-guard) | `text_field`, `pred_column` | HuggingFace token | -| FineWebEduClassifier | Score educational value | [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) | `pred_column`, `int_column` | None | -| FineWebMixtralEduClassifier | Score educational value (Mixtral annotations) | [nvidia/nemocurator-fineweb-mixtral-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier) | `pred_column`, `int_column` | None | -| FineWebNemotronEduClassifier | Score educational value (Nemotron annotations) | [nvidia/nemocurator-fineweb-nemotron-4-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier) | `pred_column`, `int_column` | None | -| ContentTypeClassifier | Categorize by speech type | [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta) | `filter_by`, `text_field` | None | -| PromptTaskComplexityClassifier | Classify prompt tasks and complexity | [nvidia/prompt-task-and-complexity-classifier](https://huggingface.co/nvidia/prompt-task-and-complexity-classifier) | `text_field` | None | - -### Domain Classifier +| `Score` | Add classification scores to documents | DocumentBatch | DocumentBatch with score columns | Computing metrics without filtering | +| `Filter` | Filter documents based on existing metadata | DocumentBatch | Filtered DocumentBatch | Applying filters to pre-scored data | +| `ScoreFilter` | Score and filter documents in one step | DocumentBatch | Scored and filtered DocumentBatch | Combined scoring and filtering | -The Domain Classifier categorizes English text 
documents into specific domains or subject areas. +### Available Document Filters -```python -from nemo_curator.classifiers import DomainClassifier -from nemo_curator.datasets import DocumentDataset +| Filter | Purpose | Module | Key Parameters | +|---|---|---|---| +| `FastTextQualityFilter` | Assess document quality using FastText | `ray_curator.stages.filters.fasttext_filter` | `alpha`, `label` | +| `FastTextLangId` | Language identification | `ray_curator.stages.filters.fasttext_filter` | `min_langid_score` | +| `DocumentFilter` | Base class for custom filters | `ray_curator.stages.filters.doc_filter` | Abstract base | +| Heuristic filters | Text quality assessment | `ray_curator.stages.filters.heuristic_filter` | Various parameters | -# Load your dataset with cuDF backend -input_dataset = DocumentDataset.read_json("books_dataset/*.jsonl", backend="cudf") +## Basic Usage -# Apply the classifier, filtering for specific domains -domain_classifier = DomainClassifier(filter_by=["Games", "Sports"]) -result_dataset = domain_classifier(dataset=input_dataset) - -# Save the results -result_dataset.to_json("games_and_sports/") -``` +### Quality Classification with FastText -### Multilingual Domain Classifier - -Functionally similar to the Domain Classifier, but supports 52 languages. +Assess document quality using FastText models integrated into the pipeline system. ```python -from nemo_curator.classifiers import MultilingualDomainClassifier - -input_dataset = DocumentDataset.read_json("multilingual_dataset/*.jsonl", backend="cudf") -classifier = MultilingualDomainClassifier(filter_by=["Games", "Sports"]) -result_dataset = classifier(dataset=input_dataset) +from ray_curator.pipeline import Pipeline +from ray_curator.stages.io.reader import JsonlReader +from ray_curator.stages.modules import ScoreFilter +from ray_curator.stages.filters import FastTextQualityFilter +from ray_curator.backends.xenna import XennaExecutor + +# Create classification pipeline +pipeline = Pipeline([ + JsonlReader("books_dataset/*.jsonl"), + ScoreFilter( + filter_obj=FastTextQualityFilter(alpha=3.0, label="__label__hq"), + text_field="text", + score_field="quality_score" + ) +]) + +# Execute pipeline +executor = XennaExecutor() +results = pipeline.run(executor) ``` -### Quality Classifier +### Language Identification -The Quality Classifier assesses document quality on a scale from Low to High. +Identify document language using FastText language identification. ```python -from nemo_curator.classifiers import QualityClassifier - -input_dataset = DocumentDataset.read_json("web_documents/*.jsonl", backend="cudf") -quality_classifier = QualityClassifier(filter_by=["High", "Medium"]) -result_dataset = quality_classifier(dataset=input_dataset) +from ray_curator.stages.filters import FastTextLangId + +pipeline = Pipeline([ + JsonlReader("multilingual_dataset/*.jsonl"), + ScoreFilter( + filter_obj=FastTextLangId(min_langid_score=0.3), + text_field="text", + score_field="language_info" + ) +]) ``` -### AEGIS Safety Model +### Separate Scoring and Filtering -The AEGIS classifier detects unsafe content across 13 critical risk categories. It requires a HuggingFace token for access to Llama Guard. +Use separate stages for more complex workflows. 
```python -from nemo_curator.classifiers import AegisClassifier - -input_dataset = DocumentDataset.read_json("content/*.jsonl", backend="cudf") - -token = "hf_1234" # Your HuggingFace user access token -safety_classifier = AegisClassifier( - aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0", - token=token, - filter_by=["safe", "O13"] # Keep only safe content and "needs caution" category -) -result_dataset = safety_classifier(dataset=input_dataset) +from ray_curator.stages.modules import Score, Filter + +pipeline = Pipeline([ + JsonlReader("web_documents/*.jsonl"), + # First, add quality scores + Score( + score_fn=FastTextQualityFilter(alpha=3.0), + score_field="quality_score", + text_field="text" + ), + # Then filter based on scores + Filter( + filter_fn=lambda score: score > 0.7, + filter_field="quality_score" + ) +]) ``` -The classifier adds a column with labels: "safe," "O1" through "O13" (each representing specific safety risks), or "unknown." For raw LLM output, use: +## Creating Custom Filters -```python -safety_classifier = AegisClassifier( - aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0", - token=token, - keep_raw_pred=True, - raw_pred_column="raw_predictions" -) -``` +Build custom classification logic by extending the DocumentFilter base class. -### Instruction Data Guard - -Detects LLM poisoning attacks in instruction-response datasets. Requires HuggingFace token access. +### Custom Domain Filter Example ```python -from nemo_curator.classifiers import InstructionDataGuardClassifier - -# For instruction-response data: "Instruction: {instruction}. Input: {input_}. Response: {response}." -input_dataset = DocumentDataset.read_json("instruction_data/*.jsonl", backend="cudf") - -token = "hf_1234" # Your HuggingFace user access token -classifier = InstructionDataGuardClassifier(token=token) -result_dataset = classifier(dataset=input_dataset) +from ray_curator.stages.filters import DocumentFilter +import torch +from transformers import AutoTokenizer, AutoModelForSequenceClassification + +class CustomDomainFilter(DocumentFilter): + def __init__(self, target_domains: list[str], model_name: str = "nvidia/domain-classifier"): + super().__init__() + self.target_domains = target_domains + self.model_name = model_name + self.model = None + self.tokenizer = None + + def setup_on_node(self, node_info=None, worker_metadata=None): + """Download model on each node""" + # Model download logic here + pass + + def setup(self, worker_metadata=None): + """Load model on each worker""" + self.tokenizer = AutoTokenizer.from_pretrained(self.model_name) + self.model = AutoModelForSequenceClassification.from_pretrained(self.model_name) + self.model.eval() + + def score_document(self, text: str) -> str: + """Classify document domain""" + inputs = self.tokenizer(text, truncation=True, max_length=512, return_tensors="pt") + with torch.no_grad(): + outputs = self.model(**inputs) + predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) + predicted_class = predictions.argmax().item() + return self.model.config.id2label[predicted_class] + + def keep_document(self, domain: str) -> bool: + """Filter based on target domains""" + return domain in self.target_domains + +# Use in pipeline +pipeline = Pipeline([ + JsonlReader("content/*.jsonl"), + ScoreFilter( + filter_obj=CustomDomainFilter(target_domains=["Science", "Technology"]), + text_field="text", + score_field="domain" + ) +]) ``` -The output includes two columns: a float score 
`instruction_data_guard_poisoning_score` and a Boolean `is_poisoned`. - -### FineWeb Educational Content Classifier - -Scores documents on educational value from 0–5. This helps prioritize content for knowledge-intensive tasks. +### Heuristic Filter Example ```python -from nemo_curator.classifiers import FineWebEduClassifier - -input_dataset = DocumentDataset.read_json("web_documents/*.jsonl", backend="cudf") -edu_classifier = FineWebEduClassifier( - batch_size=256, - pred_column="fineweb-edu-score", # Raw float scores - int_column="fineweb-edu-score-int" # Rounded integer scores -) -result_dataset = edu_classifier(dataset=input_dataset) - -# Extract highly educational content (scores 4-5) -high_edu_dataset = result_dataset[result_dataset["fineweb-edu-score-int"] >= 4] +from ray_curator.stages.filters.heuristic_filter import WordCountFilter + +# Use built-in heuristic filters +pipeline = Pipeline([ + JsonlReader("documents/*.jsonl"), + ScoreFilter( + filter_obj=WordCountFilter(min_words=50, max_words=1000), + text_field="text", + score_field="word_count" + ) +]) ``` -### FineWeb Mixtral and Nemotron Edu Classifiers - -Similar to the FineWeb Edu Classifier but trained with different annotation sources: +## Advanced Pipeline Patterns -- **FineWebMixtralEduClassifier**: Uses annotations from Mixtral 8x22B-Instruct -- **FineWebNemotronEduClassifier**: Uses annotations from Nemotron-4-340B-Instruct +### Multi-Stage Classification -Both provide a quality label column marking scores above 2.5 as "high_quality": +Combine several classification stages for comprehensive document processing. ```python -from nemo_curator.classifiers import FineWebMixtralEduClassifier # or FineWebNemotronEduClassifier - -classifier = FineWebMixtralEduClassifier( - pred_column="score", # Raw float scores - int_column="score-int", # Rounded integer scores - quality_label_column="quality-label" # "high_quality" or "low_quality" -) -result_dataset = classifier(dataset=input_dataset) +from ray_curator.stages.io.writer import JsonlWriter + +# Complex classification pipeline +pipeline = Pipeline([ + JsonlReader("raw_documents/*.jsonl"), + + # Language identification + Score( + score_fn=FastTextLangId(), + score_field="language", + text_field="text" + ), + + # Quality assessment + Score( + score_fn=FastTextQualityFilter(alpha=3.0), + score_field="quality_score", + text_field="text" + ), + + # Filter based on multiple criteria + Filter( + filter_fn=lambda row: row["language"] == "en" and row["quality_score"] > 0.7, + filter_field=["language", "quality_score"] + ), + + # Save results + JsonlWriter("filtered_documents/") +]) ``` -### Content Type Classifier +### Batch Processing Configuration -Categorizes documents into 11 distinct speech types. +Configure processing stages for optimal performance. ```python -from nemo_curator.classifiers import ContentTypeClassifier - -input_dataset = DocumentDataset.read_json("content/*.jsonl", backend="cudf") -classifier = ContentTypeClassifier(filter_by=["Blogs", "News"]) -result_dataset = classifier(dataset=input_dataset) -``` +# Configure resources for GPU-intensive tasks +from ray_curator.stages.resources import Resources -### Prompt Task and Complexity Classifier - -Classifies prompts by task type and complexity dimensions. 
+custom_filter = ScoreFilter( + filter_obj=CustomDomainFilter(target_domains=["Science"]), + text_field="text", + score_field="domain" +) -```python -from nemo_curator.classifiers import PromptTaskComplexityClassifier +# Set GPU requirements +custom_filter._resources = Resources(gpu_memory_gb=8.0, cpus=2.0) -input_dataset = DocumentDataset.read_json("prompts/*.jsonl", backend="cudf") -classifier = PromptTaskComplexityClassifier() -result_dataset = classifier(dataset=input_dataset) +pipeline = Pipeline([ + JsonlReader("large_dataset/*.jsonl", files_per_partition=10), + custom_filter +]) ``` -## CrossFit Integration - -CrossFit is an open-source library by RAPIDS AI for fast offline inference scaled to multi-node multi-GPU environments. It accelerates NVIDIA NeMo Curator's classifiers with: - -- PyTorch integration for model inference -- Efficient I/O and tokenization with cuDF -- Smart batching/chunking for optimized processing -- 1.4x-4x performance improvement over Dask + PyTorch baselines - -### Sorted Sequence Data Loader - -The key feature of CrossFit used in NVIDIA NeMo Curator is the sorted sequence data loader, which optimizes throughput by: - -- Sorting input sequences by length -- Grouping similar-length sequences into batches -- Efficiently allocating batches to GPU memory based on estimated memory footprints +### Error Handling and Monitoring -See the [rapidsai/crossfit](https://github.com/rapidsai/crossfit) repository for more information. \ No newline at end of file +The task-based system provides built-in fault tolerance and performance monitoring through the Ray backend. Each DocumentBatch task tracks processing statistics and supports independent retry if processing fails. diff --git a/docs/curate-text/process-data/quality-assessment/heuristic.md b/docs/curate-text/process-data/quality-assessment/heuristic.md index ba5988da6..7b7253370 100644 --- a/docs/curate-text/process-data/quality-assessment/heuristic.md +++ b/docs/curate-text/process-data/quality-assessment/heuristic.md @@ -9,15 +9,17 @@ modality: "text-only" --- (text-process-data-filter-heuristic)= + # Heuristic Filtering -Heuristic filtering uses simple, rule-based metrics to identify and filter out low-quality documents from your dataset. NVIDIA NeMo Curator provides a variety of pre-built heuristic filters that can be configured and combined to meet your specific needs. +Heuristic filtering uses simple, rule-based metrics to identify and filter out low-quality documents from your dataset. Ray Curator provides a variety of pre-built heuristic filters that you can configure and combine to meet your specific needs. ## How It Works Heuristic filters examine specific attributes of text documents and apply predefined thresholds to determine document quality. Unlike classifier-based filtering, heuristic filters don't require training data but rely on configurable thresholds and rules. 
These filters assess quality using measurable document characteristics such as: + - Document length (word or character count) - Punctuation ratios and patterns - Repetitive content detection @@ -27,6 +29,8 @@ These filters assess quality using measurable document characteristics such as: Each heuristic filter follows a consistent structure: ```python +from ray_curator.stages.filters import DocumentFilter + class ExampleFilter(DocumentFilter): def __init__(self, parameter1=default1, parameter2=default2): super().__init__() @@ -34,79 +38,86 @@ class ExampleFilter(DocumentFilter): self._param2 = parameter2 self._name = "example_filter" - def score_document(self, text): + def score_document(self, text: str) -> float: # Calculate and return a score between 0 and 1 # Higher scores typically indicate lower quality score = compute_score(text) return score - def keep_document(self, score): + def keep_document(self, score: float) -> bool: # Return True to keep the document, False to filter it out return score <= self._param1 ``` The filtering process typically involves: + 1. Calculating a quality score for each document 2. Applying a threshold to determine whether to keep or discard the document 3. Optionally storing the score as metadata for later analysis ---- +--- ## Usage -::::{tab-set} - -:::{tab-item} Python ```python -import nemo_curator as nc -from nemo_curator.datasets import DocumentDataset -from nemo_curator.filters import ( +from ray_curator.pipeline import Pipeline +from ray_curator.stages.io.reader import JsonlReader +from ray_curator.stages.io.writer import JsonlWriter +from ray_curator.stages.modules import ScoreFilter +from ray_curator.stages.filters import ( WordCountFilter, RepeatingTopNGramsFilter, PunctuationFilter ) +from ray_curator.backends.experimental.ray_data import RayDataExecutor -# Load your dataset -dataset = DocumentDataset.read_json("input_data/*.jsonl") - -# Create a filter chain using Sequential -filter_step = nc.Sequential([ - nc.ScoreFilter( - WordCountFilter(min_words=80), - text_field="text", - score_field="word_count", - ), - nc.ScoreFilter(PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85)), - nc.ScoreFilter(RepeatingTopNGramsFilter(n=2, max_repeating_ngram_ratio=0.2)), - nc.ScoreFilter(RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18)), - nc.ScoreFilter(RepeatingTopNGramsFilter(n=4, max_repeating_ngram_ratio=0.16)), -]) - -# Apply the filters to get the high-quality subset -high_quality_data = filter_step(dataset) - -# Save the results -high_quality_data.to_json("high_quality_output/", write_to_filename=True) -``` -::: +# Create pipeline +pipeline = Pipeline( + name="heuristic_filtering", + description="Filter documents using heuristic quality metrics" +) -:::{tab-item} Command Line -```bash -filter_documents \ - --input-data-dir=/path/to/input/data \ - --filter-config-file=./config/heuristic_filter_en.yaml \ - --output-retained-document-dir=/path/to/output/high_quality \ - --output-removed-document-dir=/path/to/output/low_quality \ - --output-document-score-dir=/path/to/output/scores \ - --log-dir=/path/to/logs/heuristic_filter +# Add data reading stage +pipeline.add_stage(JsonlReader( + file_paths="input_data/*.jsonl", + files_per_partition=10 +)) + +# Add filtering stages +pipeline.add_stage(ScoreFilter( + filter_obj=WordCountFilter(min_words=80), + text_field="text", + score_field="word_count_score" +)) + +pipeline.add_stage(ScoreFilter( + filter_obj=PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85), + 
text_field="text" +)) + +pipeline.add_stage(ScoreFilter( + filter_obj=RepeatingTopNGramsFilter(n=2, max_repeating_ngram_ratio=0.2), + text_field="text" +)) + +pipeline.add_stage(ScoreFilter( + filter_obj=RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18), + text_field="text" +)) + +# Add output stage +pipeline.add_stage(JsonlWriter( + output_dir="high_quality_output/" +)) + +# Execute pipeline +executor = RayDataExecutor() +results = pipeline.run(executor) ``` -::: - -:::: ## Available Filters -NeMo Curator includes over 30 heuristic filters for assessing document quality. Below are the most commonly used filters with their parameters: +Ray Curator includes over 30 heuristic filters for assessing document quality. Below are the most commonly used filters with their parameters: ### Text Length Filters @@ -115,7 +126,7 @@ NeMo Curator includes over 30 heuristic filters for assessing document quality. | **WordCountFilter** | Filters by word count | `min_words`, `max_words` | min=50, max=100000 | | **TokenCountFilter** | Filters by token count | `min_tokens`, `max_tokens` | min=0, max=∞ | | **MeanWordLengthFilter** | Filters by average word length | `min_mean_word_length`, `max_mean_word_length` | min=3, max=10 | -| **LongWordFilter** | Filters by presence of extremely long words | `max_word_length` | 1000 | +| **LongWordFilter** | Filters by presence of long words | `max_word_length` | 1000 | ### Repetition Detection Filters @@ -151,77 +162,105 @@ NeMo Curator includes over 30 heuristic filters for assessing document quality. | Filter | Description | Key Parameters | Default Values | |--------|-------------|----------------|---------------| -| **PornographicUrlsFilter** | Detects URLs containing "porn" substring | None | N/A | +| **PornographicUrlsFilter** | Detects URLs containing "porn" text | None | N/A | | **EllipsisFilter** | Limits excessive ellipses | `max_num_lines_ending_with_ellipsis_ratio` | 0.3 | | **HistogramFilter** | Filters based on character distribution | `threshold` | 0.8 | -| **SubstringFilter** | Filters based on presence of specific substring in a position | `substring`, `position` | "", "any" | +| **SubstringFilter** | Filters based on presence of specific text in a position | `substring`, `position` | "", "any" | -## Configuration +## Advanced Usage Patterns -::::{tab-set} +### Scoring Without Filtering -:::{tab-item} Example Configuration -```yaml -# Sample filter configuration (simplified) -filters: - - name: ScoreFilter - filter: - name: WordCountFilter - min_words: 50 - max_words: 100000 - text_field: text - score_field: word_count - - - name: ScoreFilter - filter: - name: PunctuationFilter - max_num_sentences_without_endmark_ratio: 0.85 - text_field: text - score_field: punctuation_ratio - - - name: ScoreFilter - filter: - name: RepeatingTopNGramsFilter - n: 2 - max_repeating_ngram_ratio: 0.18 - text_field: text - score_field: ngram_repetition +Use the `Score` stage to add quality metrics without filtering documents: + +```python +from ray_curator.stages.modules import Score + +# Add quality scores without filtering +pipeline.add_stage(Score( + score_fn=WordCountFilter(min_words=50), + score_field="word_count_score", + text_field="text" +)) + +pipeline.add_stage(Score( + score_fn=PunctuationFilter(), + score_field="punctuation_score", + text_field="text" +)) ``` -::: -:::: +### Filtering on Pre-computed Scores -The configuration file `config/heuristic_filter_en.yaml` contains a general-purpose set of heuristic filters that work well for English text. 
For non-English texts, you may need to adjust the filter parameters. +Use the `Filter` stage to filter based on existing metadata: -## Best Practices +```python +from ray_curator.stages.modules import Filter -When building filter chains, follow these best practices: +# Filter based on previously computed scores +pipeline.add_stage(Filter( + filter_fn=lambda x: x <= 0.1, # Custom filtering function + filter_field="word_count_score" +)) +``` -::::{tab-set} +### Language-Specific Filtering -:::{tab-item} Order for Efficiency ```python -# Efficient ordering -filter_chain = nc.Sequential([ - nc.ScoreFilter(WordCountFilter(min_words=50)), # Fast - nc.ScoreFilter(UrlsFilter()), # Medium - nc.ScoreFilter(RepeatingTopNGramsFilter()) # Slow -]) +# Chinese text filter +pipeline.add_stage(ScoreFilter( + filter_obj=SymbolsToWordsFilter(max_symbol_to_word_ratio=0.15, lang="zh"), + text_field="text" +)) ``` -::: -:::{tab-item} Batched Processing +## Pipeline Execution Options + +### Using Ray Data Executor (Experimental) + ```python -from nemo_curator.utils.decorators import batched +from ray_curator.backends.experimental.ray_data import RayDataExecutor + +executor = RayDataExecutor() +results = pipeline.run(executor) +``` + +### Using Alternative Executor + +```python +from ray_curator.backends.xenna import XennaExecutor + +executor = XennaExecutor() +results = pipeline.run(executor) +``` + +## Best Practices -class MyCustomFilter(DocumentFilter): - @batched - def keep_document(self, scores): - return scores <= self.threshold +When building filter pipelines, follow these best practices: + +::::{tab-set} + +:::{tab-item} Efficient Stage Ordering + +```python +# Order stages from fastest to slowest for efficiency +pipeline.add_stage(ScoreFilter( + filter_obj=WordCountFilter(min_words=50) # Fast - simple counting +)) + +pipeline.add_stage(ScoreFilter( + filter_obj=UrlsFilter() # Medium - regex matching +)) + +pipeline.add_stage(ScoreFilter( + filter_obj=RepeatingTopNGramsFilter() # Slow - n-gram computation +)) ``` + ::: :::{tab-item} Precision vs. 
Recall + ```python # More permissive (higher recall) lenient_filter = WordCountFilter(min_words=10, max_words=100000) @@ -229,97 +268,109 @@ lenient_filter = WordCountFilter(min_words=10, max_words=100000) # More strict (higher precision) strict_filter = WordCountFilter(min_words=100, max_words=10000) ``` + ::: -:::{tab-item} Language Considerations +:::{tab-item} Score Preservation + ```python -# Chinese text filter -cn_filter = nc.ScoreFilter( - SymbolsToWordsFilter(max_symbol_to_word_ratio=0.15, lang="zh") -) +# Keep scores for analysis while filtering +pipeline.add_stage(ScoreFilter( + filter_obj=WordCountFilter(min_words=50), + text_field="text", + score_field="word_count_score" # Preserve score +)) ``` + ::: -:::{tab-item} Multiple Filters +:::{tab-item} Comprehensive Quality Pipeline + ```python -# Comprehensive quality filter -quality_chain = nc.Sequential([ - # Basic text quality - nc.ScoreFilter(WordCountFilter(min_words=50)), - nc.ScoreFilter(PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85)), - - # Content quality - nc.ScoreFilter(CommonEnglishWordsFilter(min_num_common_words=2)), - - # Repetition detection - nc.ScoreFilter(RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18)) -]) +# Multi-stage quality assessment +pipeline.add_stage(JsonlReader(file_paths="data/*.jsonl")) + +# Basic text quality +pipeline.add_stage(ScoreFilter( + filter_obj=WordCountFilter(min_words=50), + text_field="text" +)) + +pipeline.add_stage(ScoreFilter( + filter_obj=PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85), + text_field="text" +)) + +# Content quality +pipeline.add_stage(ScoreFilter( + filter_obj=CommonEnglishWordsFilter(min_num_common_words=2), + text_field="text" +)) + +# Repetition detection +pipeline.add_stage(ScoreFilter( + filter_obj=RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18), + text_field="text" +)) + +pipeline.add_stage(JsonlWriter(output_dir="filtered_output/")) ``` + ::: :::: -## Analyzing Filter Results +## Performance Considerations -When working with non-English data or tuning your filtering pipeline, it's valuable to examine which filters are removing documents: +For large datasets, consider these optimizations: ::::{tab-set} -:::{tab-item} Filter Analysis +:::{tab-item} File Partitioning + ```python -import pandas as pd +# Control how files are partitioned for parallel processing +pipeline.add_stage(JsonlReader( + file_paths="large_dataset/*.jsonl", + files_per_partition=5, # Fewer files per partition for larger datasets + blocksize="128MB" # Or use blocksize-based partitioning +)) +``` -# Load scores from filter run -scores = pd.read_json("output/scores/scores.jsonl", lines=True) +::: + +:::{tab-item} Resource Management -# Analyze rejection reasons -rejection_counts = scores[scores["rejected"] == True].groupby("rejected_by").size() -print(f"Documents rejected by filter:\n{rejection_counts}") +```python +from ray_curator.stages.resources import Resources -# Analyze score distributions -import matplotlib.pyplot as plt -scores.hist(column="word_count", bins=50) -plt.title("Word Count Distribution") -plt.savefig("word_count_hist.png") +# Configure resource requirements for compute-intensive filters +class CustomFilter(ScoreFilter): + _resources = Resources(cpus=2.0) # Request more CPU cores + + def __init__(self, filter_obj): + super().__init__(filter_obj=filter_obj) ``` + ::: :::: -## Performance Tuning +Remember that the goal of filtering is to improve the quality of your training data, not necessarily to remove as much 
content as possible. Track your filtering results and adjust thresholds based on your specific data characteristics and downstream tasks. -For large datasets, consider these performance optimizations: +## Evidence and Sources -::::{tab-set} +**Source**: `ray-curator/ray_curator/stages/filters/heuristic_filter.py:47-835` +**Evidence**: All heuristic filters extend the DocumentFilter abstract base class with score_document() and keep_document() methods -:::{tab-item} Memory Efficient Processing -```python -# Process in chunks to reduce memory usage -for chunk in DocumentDataset.read_json_chunks("large_dataset/*.jsonl", chunk_size=10000): - filtered_chunk = filter_step(chunk) - filtered_chunk.to_json("output/", mode="append") -``` -::: +**Source**: `ray-curator/ray_curator/stages/modules/score_filter.py:183-279` +**Evidence**: ScoreFilter stage combines scoring and filtering, accepting DocumentFilter objects and text/score field configuration -:::{tab-item} Multi-process Filtering -```bash -# Use multiple processes with CLI -filter_documents --input-data-dir=input/ --num-proc=8 --filter-config-file=config.yaml --output-retained-document-dir=output/ -``` -::: - -:::{tab-item} Custom Batch Sizes -```python -# Adjust batch size for specific filters -from nemo_curator.utils.decorators import batched +**Source**: `ray-curator/ray_curator/stages/filters/__init__.py:15-53` +**Evidence**: Complete list of available heuristic filters exported from the filters module -class CustomBatchFilter(DocumentFilter): - @batched(batch_size=5000) # Set custom batch size - def keep_document(self, scores): - return scores <= self.threshold -``` -::: - -:::: +**Source**: `ray-curator/ray_curator/stages/io/reader/jsonl.py:159-218` +**Evidence**: JsonlReader composite stage for reading JSONL files with file partitioning support -Remember that the goal of filtering is to improve the quality of your training data, not necessarily to remove as many documents as possible. Monitor your filtering results and adjust thresholds based on your specific data characteristics and downstream tasks. +**Source**: `ray-curator/ray_curator/pipeline/pipeline.py:12-180` +**Evidence**: Pipeline class for composing and executing processing stages diff --git a/docs/curate-text/process-data/quality-assessment/index.md b/docs/curate-text/process-data/quality-assessment/index.md index 194c245cf..f56ff2ae6 100644 --- a/docs/curate-text/process-data/quality-assessment/index.md +++ b/docs/curate-text/process-data/quality-assessment/index.md @@ -9,118 +9,143 @@ modality: "text-only" --- (text-process-data-filter)= + # Quality Assessment & Filtering -Score and remove low-quality content using heuristics and ML classifiers to prepare your data for model training using NeMo Curator's tools and utilities. +Score and remove low-quality content using heuristics and ML classifiers to prepare your data for model training using Ray Curator's task-centric pipeline architecture. -Large datasets often contain many documents considered to be "low quality." In this context, "low quality" data simply means data we don't want a downstream model to learn from, and "high quality" data is data that we do want a downstream model to learn from. The metrics that define quality can vary widely. +Large datasets often contain documents considered to be "low quality." In this context, "low quality" data means data we don't want a downstream model to learn from, and "high quality" data is data that we do want a downstream model to learn from. 
The metrics that define quality can vary widely. ## How It Works -NeMo Curator's filtering framework is built around several key components: +Ray Curator's filtering framework uses a task-centric architecture with these key components: + +- **Tasks**: `DocumentBatch` objects containing batches of text data that flow through the pipeline +- **Stages**: Processing units that transform tasks (Score, Filter, ScoreFilter) +- **Pipelines**: Collections of stages that define the complete workflow +- **Executors**: Components that orchestrate pipeline execution on distributed systems ::::{tab-set} :::{tab-item} ScoreFilter -The `ScoreFilter` is at the center of filtering in NeMo Curator. It applies a filter to a document and optionally saves the score as metadata: +The `ScoreFilter` stage combines scoring and filtering in Ray Curator. It processes `DocumentBatch` tasks through a pipeline: ```python -import nemo_curator as nc -from nemo_curator.datasets import DocumentDataset -from nemo_curator.utils.file_utils import get_all_files_paths_under -from nemo_curator.filters import WordCountFilter - -# Load dataset -files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl") -books = DocumentDataset.read_json(files, add_filename=True) - -# Create and apply filter -filter_step = nc.ScoreFilter( - WordCountFilter(min_words=80), - text_field="text", - score_field="word_count", +from ray_curator.pipeline import Pipeline +from ray_curator.backends.xenna import XennaExecutor +from ray_curator.stages.modules import ScoreFilter +from ray_curator.stages.filters import WordCountFilter +from ray_curator.stages.io.reader import JsonlReader +from ray_curator.stages.io.writer import JsonlWriter + +# Create pipeline +pipeline = Pipeline( + name="book_filtering", + description="Filter books by word count" ) -# Get filtered dataset -long_books = filter_step(books) +# Add stages to pipeline +pipeline.add_stage(JsonlReader(file_paths="books_dataset/")) +pipeline.add_stage(ScoreFilter( + filter_obj=WordCountFilter(min_words=80), + text_field="text", + score_field="word_count" +)) +pipeline.add_stage(JsonlWriter(output_dir="long_books/")) -# Save filtered dataset -long_books.to_json("long_books/", write_to_filename=True) +# Execute pipeline +executor = XennaExecutor() +results = pipeline.run(executor) ``` -The filter object implements two key methods: +The `DocumentFilter` objects define two key methods: - `score_document`: Computes a quality score for a document -- `keep_document`: Determines if a document should be kept based on its score +- `keep_document`: Determines if a document should remain based on its score ::: -:::{tab-item} Filter and Score Modules +:::{tab-item} Score and Filter Stages -For more specific use cases, NeMo Curator provides two specialized modules: +Ray Curator provides specialized stages for granular control: -- `Score`: A module that only adds metadata scores to records without filtering +- `Score`: A stage that adds metadata scores without filtering - Takes a scoring function that evaluates text and returns a score - Adds the score to a specified metadata field - Useful for analysis or multi-stage filtering pipelines ```python -# Example: Score documents without filtering -scoring_step = nc.Score( - WordCountFilter().score_document, # Use just the scoring part +from ray_curator.stages.modules import Score +from ray_curator.stages.filters import WordCountFilter + +# Add scoring stage to pipeline +pipeline.add_stage(Score( + score_fn=WordCountFilter().score_document, 
text_field="text", score_field="word_count" -) -scored_dataset = scoring_step(dataset) +)) ``` -- `Filter`: A module that filters based on pre-computed metadata +- `Filter`: A stage that filters based on pre-computed metadata - Takes a filter function that evaluates metadata and returns True/False - - Only uses existing metadata fields (doesn't compute new scores) + - Uses existing metadata fields (doesn't compute new scores) - Efficient for filtering on pre-computed metrics ```python -# Example: Filter using pre-computed scores -filter_step = nc.Filter( - lambda score: score >= 100, # Keep documents with score >= 100 +from ray_curator.stages.modules import Filter + +# Add filtering stage to pipeline +pipeline.add_stage(Filter( + filter_fn=lambda score: score >= 100, filter_field="word_count" -) -filtered_dataset = filter_step(scored_dataset) +)) ``` -You can combine these modules in pipelines: +You can combine these stages in multi-step pipelines: ```python -pipeline = nc.Sequential([ - nc.Score(word_counter, score_field="word_count"), - nc.Score(symbol_counter, score_field="symbol_ratio"), - nc.Filter(lambda x: x >= 100, filter_field="word_count"), - nc.Filter(lambda x: x <= 0.3, filter_field="symbol_ratio") -]) +# Multi-stage filtering pipeline +pipeline = Pipeline(name="multi_filter", description="Multi-stage quality filtering") +pipeline.add_stage(JsonlReader(file_paths="dataset/")) +pipeline.add_stage(Score(word_counter, score_field="word_count")) +pipeline.add_stage(Score(symbol_counter, score_field="symbol_ratio")) +pipeline.add_stage(Filter(lambda x: x >= 100, filter_field="word_count")) +pipeline.add_stage(Filter(lambda x: x <= 0.3, filter_field="symbol_ratio")) +pipeline.add_stage(JsonlWriter(output_dir="filtered_output/")) ``` ::: -:::{tab-item} Batched Filtering +:::{tab-item} Task-Based Processing -For improved performance, NeMo Curator supports batch processing using the `@batched` decorator: +Ray Curator processes data in batches through `DocumentBatch` tasks, providing efficient vectorized operations: ```python -from nemo_curator.utils.decorators import batched -import pandas as pd - -class BatchedFilter(DocumentFilter): - @batched - def keep_document(self, scores: pd.Series): - # Process multiple documents in one operation - return scores > 10 +from ray_curator.stages.modules import ScoreFilter +from ray_curator.stages.filters import WordCountFilter + +# DocumentBatch automatically handles batch processing +class CustomFilter(DocumentFilter): + def score_document(self, text: str) -> float: + return len(text.split()) + + def keep_document(self, score: float) -> bool: + return score > 10 + +# Used in pipeline - batch processing is automatic +pipeline.add_stage(ScoreFilter( + filter_obj=CustomFilter(), + text_field="text" +)) ``` -The batched processing can significantly improve performance on large datasets by: -- Reducing function call overhead -- Enabling vectorized operations -- Optimizing memory usage +Task-based processing provides performance benefits: + +- **Automatic batching**: `DocumentBatch` handles many documents per task +- **Vectorized operations**: Pandas DataFrame operations within each batch +- **Resource optimization**: Tasks sized for optimal memory and compute usage +- **Fault tolerance**: Individual tasks can retry independently ::: @@ -166,7 +191,7 @@ GPU-accelerated classification with pre-trained models :::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` Custom Filters :link: custom :link-type: doc -Implement and combine your own custom filters 
+Create and combine your own custom filters +++ {bdg-secondary}`custom` {bdg-secondary}`flexible` @@ -177,45 +202,7 @@ Implement and combine your own custom filters ## Usage -NeMo Curator provides a CLI tool for document filtering that becomes available after installing the package: - -```bash -filter_documents \ - --input-data-dir=/path/to/input/data \ - --filter-config-file=./config/heuristic_filter_en.yaml \ - --output-retained-document-dir=/path/to/output/high_quality \ - --output-removed-document-dir=/path/to/output/low_quality \ - --output-document-score-dir=/path/to/output/scores \ - --num-workers=4 -``` - -For distributed processing with multiple workers: - -```bash -filter_documents \ - --input-data-dir=/path/to/input/data \ - --filter-config-file=./config/heuristic_filter_en.yaml \ - --output-retained-document-dir=/path/to/output/high_quality \ - --num-workers=8 \ - --device=gpu \ - --log-dir=./logs -``` - -### CLI Parameters - -| Parameter | Description | Required | -|-----------|-------------|----------| -| `--input-data-dir` | Directory containing input JSONL files | Yes | -| `--filter-config-file` | YAML configuration for the filter pipeline | Yes | -| `--output-retained-document-dir` | Directory for documents passing filters | Yes | -| `--output-removed-document-dir` | Directory for rejected documents | No | -| `--output-document-score-dir` | Directory for storing score metadata | No | -| `--log-dir` | Directory for storing logs | No | -| `--num-workers` | Number of Dask workers for distributed processing | No | -| `--scheduler-address` | Address of Dask scheduler for distributed processing | No | -| `--device` | Processing device: `cpu` or `gpu` (default: `cpu`) | No | -| `--input-file-type` | Input file format: `jsonl`, `parquet`, etc. (default: `jsonl`) | No | -| `--output-file-type` | Output file format: `jsonl`, `parquet`, etc. (default: `jsonl`) | No | +Quality assessment in Ray Curator uses pipeline-based workflows with the Task/Stage/Pipeline architecture. Data flows through stages as `DocumentBatch` tasks, enabling distributed processing and fault tolerance. The examples in each filtering approach section show how to create and execute filtering pipelines. ```{toctree} :maxdepth: 4 @@ -230,10 +217,31 @@ Custom Filters ## Best Practices -When filtering large datasets, consider these performance tips: +When filtering large datasets with Ray Curator, consider these performance tips: -1. **Order matters**: Place computationally inexpensive filters early in your pipeline -2. **Batch size tuning**: Adjust batch sizes based on your hardware capabilities -3. **Use vectorization**: Implement batched methods for compute-intensive filters -4. **Disk I/O**: Consider compression and chunking strategies for large datasets -5. **Distributed processing**: For TB-scale datasets, use distributed filtering with Dask workers (`--num-workers`) or connect to an existing Dask cluster (`--scheduler-address`) \ No newline at end of file +1. **Stage ordering**: Place computationally inexpensive filters first in your pipeline +2. **Resource specification**: Configure stage resources based on computational requirements +3. **Task sizing**: Balance task size for optimal memory usage and parallel processing +4. **Pipeline composition**: Combine related operations in single stages when possible +5. 
**Distributed execution**: Use `XennaExecutor` with auto-scaling for TB-scale datasets + +```python +# Example: Resource-aware pipeline +from ray_curator.stages.resources import Resources + +# CPU-intensive filtering stage +pipeline.add_stage(ScoreFilter( + filter_obj=ComplexFilter(), + text_field="text" +).with_(resources=Resources(cpus=2.0))) + +# GPU-accelerated classification stage +pipeline.add_stage(ScoreFilter( + filter_obj=MLClassifier(), + text_field="text" +).with_(resources=Resources(gpu_memory_gb=8.0))) + +# Execute with auto-scaling +executor = XennaExecutor(config={"auto_scaling": True, "max_workers": 10}) +results = pipeline.run(executor) +``` \ No newline at end of file diff --git a/docs/curate-text/process-data/specialized-processing/bitext.md b/docs/curate-text/process-data/specialized-processing/bitext.md index cb4fb09af..828b38710 100644 --- a/docs/curate-text/process-data/specialized-processing/bitext.md +++ b/docs/curate-text/process-data/specialized-processing/bitext.md @@ -9,11 +9,8 @@ modality: "text-only" --- (text-process-data-filter-bitext)= -# Bitext Filtering -:::{note} -**Documentation Status**: This page has been verified against NeMo Curator source code for accuracy (January 2025). -::: +# Bitext Filtering NVIDIA NeMo Curator provides specialized filters for processing and filtering bilingual text (bitext) data. These filters are designed specifically for parallel corpora used in machine translation and other multilingual applications. diff --git a/docs/curate-text/process-data/specialized-processing/code.md b/docs/curate-text/process-data/specialized-processing/code.md index 6fe7f5b2e..264346719 100644 --- a/docs/curate-text/process-data/specialized-processing/code.md +++ b/docs/curate-text/process-data/specialized-processing/code.md @@ -19,47 +19,92 @@ These filters can be applied individually or in combination to create comprehens ## Usage -Here's an example of applying code filters to a dataset: +Here's an example of applying code filters to a dataset using `ray_curator`: ```python -import nemo_curator as nc -from nemo_curator.datasets import DocumentDataset -from nemo_curator.filters import ( +from ray_curator.backends.xenna import XennaExecutor +from ray_curator.pipeline import Pipeline +from ray_curator.stages.filters.code import ( PythonCommentToCodeFilter, NumberOfLinesOfCodeFilter, AlphaFilter ) +from ray_curator.stages.io.reader.jsonl import JsonlReader +from ray_curator.stages.io.writer.jsonl import JsonlWriter +from ray_curator.stages.modules.score_filter import ScoreFilter -# Load your code dataset -dataset = DocumentDataset.read_json("code_data/*.jsonl") - -# Create a filter chain for code quality -filter_step = nc.Sequential([ - nc.ScoreFilter( - PythonCommentToCodeFilter( - min_comment_to_code_ratio=0.01, - max_comment_to_code_ratio=0.8 - ), - text_field="content", - score_field="comment_ratio" - ), - nc.ScoreFilter( - NumberOfLinesOfCodeFilter(min_lines=5, max_lines=1000), - text_field="content", - score_field="line_count" - ), - nc.ScoreFilter( - AlphaFilter(min_alpha_ratio=0.3), - text_field="content", - score_field="alpha_ratio" +def create_code_filtering_pipeline(data_dir: str, output_dir: str) -> Pipeline: + """Create a pipeline for filtering code quality.""" + + # Define pipeline + pipeline = Pipeline( + name="code_quality_filtering", + description="Filter code based on quality metrics" + ) + + # Add stages + # 1. 
Reader stage - load JSONL files + pipeline.add_stage( + JsonlReader( + file_paths=data_dir, + files_per_partition=2, + reader="pandas" + ) + ) + + # 2. Comment ratio filtering + pipeline.add_stage( + ScoreFilter( + PythonCommentToCodeFilter( + min_comment_to_code_ratio=0.01, + max_comment_to_code_ratio=0.8 + ), + text_field="content", + score_field="comment_ratio" + ) ) -]) + + # 3. Line count filtering + pipeline.add_stage( + ScoreFilter( + NumberOfLinesOfCodeFilter(min_lines=5, max_lines=1000), + text_field="content", + score_field="line_count" + ) + ) + + # 4. Alphabetic content filtering + pipeline.add_stage( + ScoreFilter( + AlphaFilter(min_alpha_ratio=0.3), + text_field="content", + score_field="alpha_ratio" + ) + ) + + # 5. Writer stage - save filtered results + pipeline.add_stage( + JsonlWriter(output_dir=output_dir) + ) + + return pipeline -# Apply the filters -quality_code = filter_step(dataset) +def main(): + # Create and run pipeline + pipeline = create_code_filtering_pipeline("./code_data", "./filtered_code") + + # Print pipeline description + print(pipeline.describe()) + + # Execute with distributed backend + executor = XennaExecutor() + results = pipeline.run(executor) + + # Process results + print(f"Pipeline completed! Processed {len(results)} batches") -# Save the results -quality_code.to_json("filtered_code/", write_to_filename=True) +if __name__ == "__main__": + main() ``` ## Available Code Filters @@ -76,23 +121,28 @@ NeMo Curator offers several specialized filters for code content: The comment-to-code ratio is an important metric for code quality. Too few comments may indicate poor documentation, while too many comments might suggest automatically generated code or tutorials: ```python +from ray_curator.stages.filters.code import PythonCommentToCodeFilter, GeneralCommentToCodeFilter +from ray_curator.stages.modules.score_filter import ScoreFilter + # For Python files with docstrings -python_filter = nc.ScoreFilter( +python_filter = ScoreFilter( PythonCommentToCodeFilter( min_comment_to_code_ratio=0.05, # At least 5% comments max_comment_to_code_ratio=0.7 # At most 70% comments ), - text_field="content" + text_field="content", + score_field="comment_ratio" ) # For other languages -cpp_filter = nc.ScoreFilter( +cpp_filter = ScoreFilter( GeneralCommentToCodeFilter( language="text/x-c++", # MIME type for C++ min_comment_to_code_ratio=0.02, max_comment_to_code_ratio=0.6 ), - text_field="content" + text_field="content", + score_field="comment_ratio" ) ``` @@ -114,33 +164,42 @@ The `GeneralCommentToCodeFilter` supports various language MIME types: Code structure filters help identify problematic patterns: ```python +from ray_curator.stages.filters.code import NumberOfLinesOfCodeFilter, AlphaFilter +from ray_curator.stages.modules.score_filter import ScoreFilter + # Filter for reasonable line counts -line_filter = nc.ScoreFilter( +line_filter = ScoreFilter( NumberOfLinesOfCodeFilter( min_lines=5, # Filter out tiny snippets max_lines=2000 # Filter out extremely long files ), - text_field="content" + text_field="content", + score_field="line_count" ) # Filter for alphabetic content (avoid large data blobs) -alpha_filter = nc.ScoreFilter( +alpha_filter = ScoreFilter( AlphaFilter(min_alpha_ratio=0.3), # At least 30% alphabetic chars - text_field="content" + text_field="content", + score_field="alpha_ratio" ) ``` -The `TokenizerFertilityFilter` helps ensure code is efficiently tokenizable: +The `TokenizerFertilityFilter` helps ensure code is efficiently tokenized: ```python +from 
ray_curator.stages.filters.code import TokenizerFertilityFilter +from ray_curator.stages.modules.score_filter import ScoreFilter + # Filter for token efficiency # Note: path_to_tokenizer is required -tokenization_filter = nc.ScoreFilter( +tokenization_filter = ScoreFilter( TokenizerFertilityFilter( path_to_tokenizer="/path/to/code_tokenizer.model", # Required parameter min_char_to_token_ratio=2.5 # Each token encodes at least 2.5 chars on average ), - text_field="content" + text_field="content", + score_field="token_ratio" ) ``` @@ -159,14 +218,18 @@ This filter helps avoid content that has poor token efficiency, which can impact Different programming languages have different conventions and characteristics. The `PerExtensionFilter` applies customized filtering based on file extension: ```python +from ray_curator.stages.filters.code import PerExtensionFilter +from ray_curator.stages.modules.score_filter import ScoreFilter + # Apply language-specific filters -python_specific = nc.ScoreFilter( +python_specific = ScoreFilter( PerExtensionFilter( lang="python", extension=".py", metadata_file="code_meta.csv" # Contains language-specific thresholds ), - text_field="content" + text_field="content", + score_field="extension_score" ) ``` @@ -176,40 +239,54 @@ The metadata file can specify different thresholds for metrics like: - Empty line ratio - Alphabetic content ratio -## Filter Configuration - -A typical configuration for code filtering in YAML format: - -```yaml -filters: - - name: ScoreFilter - filter: - name: PythonCommentToCodeFilter - min_comment_to_code_ratio: 0.01 - max_comment_to_code_ratio: 0.85 - text_field: content - score_field: comment_ratio - - - name: ScoreFilter - filter: - name: NumberOfLinesOfCodeFilter - min_lines: 10 - max_lines: 5000 - text_field: content - score_field: line_count - - - name: ScoreFilter - filter: - name: AlphaFilter - min_alpha_ratio: 0.25 - text_field: content - score_field: alpha_ratio - - - name: ScoreFilter - filter: - name: XMLHeaderFilter - text_field: content - score_field: xml_detected +## Pipeline Configuration + +Ray Curator uses programmatic pipeline configuration. 
Here's how to structure a comprehensive code filtering pipeline: + +```python +from ray_curator.pipeline import Pipeline +from ray_curator.stages.filters.code import ( + PythonCommentToCodeFilter, + NumberOfLinesOfCodeFilter, + AlphaFilter, + XMLHeaderFilter +) +from ray_curator.stages.modules.score_filter import ScoreFilter + +def create_comprehensive_code_pipeline(data_dir: str, output_dir: str) -> Pipeline: + """Create a comprehensive code filtering pipeline.""" + + pipeline = Pipeline( + name="comprehensive_code_filtering", + description="Multi-stage code quality filtering" + ) + + # Add reader stage + from ray_curator.stages.io.reader.jsonl import JsonlReader + from ray_curator.stages.io.writer.jsonl import JsonlWriter + pipeline.add_stage(JsonlReader(file_paths=data_dir, files_per_partition=2)) + + # Add multiple filtering stages + filters = [ + (PythonCommentToCodeFilter(min_comment_to_code_ratio=0.01, max_comment_to_code_ratio=0.85), "comment_ratio"), + (NumberOfLinesOfCodeFilter(min_lines=10, max_lines=5000), "line_count"), + (AlphaFilter(min_alpha_ratio=0.25), "alpha_ratio"), + (XMLHeaderFilter(), "xml_detected") + ] + + for filter_obj, score_field in filters: + pipeline.add_stage( + ScoreFilter( + filter_obj, + text_field="content", + score_field=score_field + ) + ) + + # Add writer stage + pipeline.add_stage(JsonlWriter(output_dir=output_dir)) + + return pipeline ``` ## Best Practices for Code Filtering @@ -217,13 +294,17 @@ filters: When filtering code datasets, consider these best practices: 1. **Language-specific configurations**: Adjust thresholds based on the programming language + ```python + from ray_curator.stages.filters.code import PythonCommentToCodeFilter, GeneralCommentToCodeFilter + # Python tends to have more comments than C python_comment_filter = PythonCommentToCodeFilter(min_comment_to_code_ratio=0.05) c_comment_filter = GeneralCommentToCodeFilter(language="text/x-c", min_comment_to_code_ratio=0.02) ``` 2. **Preserve code structure**: Ensure filters don't inadvertently remove valid coding patterns + ```python # Some languages naturally have low comment ratios assembly_filter = GeneralCommentToCodeFilter( @@ -233,38 +314,43 @@ When filtering code datasets, consider these best practices: ``` 3. **Combine with language detection**: Verify file extensions match content + ```python - # First check if the content is actually Python using FastText language ID - from nemo_curator.filters import FastTextLangId + # Language detection for code is available through ray_curator + from ray_curator.stages.filters.fasttext_filter import FastTextLangId - python_detection = nc.ScoreFilter( - FastTextLangId( - model_path="/path/to/lid.176.bin", # Download from fasttext.cc - min_langid_score=0.8 - ), - score_field="language" + python_detection = FastTextLangId( + model_path="/path/to/lid.176.bin", # Download from fasttext.cc + min_langid_score=0.8 ) - # Then apply Python-specific filters - python_filters = nc.Sequential([ - python_detection, - nc.ScoreFilter(PythonCommentToCodeFilter()) - ]) - ``` + # Implementation depends on ray_curator pipeline setup + # Refer to ray_curator documentation for complete examples + ``` + :::{note} - The `FastTextLangId` filter requires downloading the fastText language identification model from [fasttext.cc](https://fasttext.cc/docs/en/language-identification.html). + The `FastTextLangId` filter requires downloading the FastText language identification model from [fasttext.cc](https://fasttext.cc/docs/en/language-identification.html). ::: 4. 
**Avoid over-filtering**: Monitor rejection rates and adjust thresholds as needed + ```python - # Track filter statistics - rejection_stats = {} - for filter_name, filter_obj in filters.items(): - filter_step = nc.ScoreFilter(filter_obj, text_field="content") - before_count = len(dataset) - filtered = filter_step(dataset) - after_count = len(filtered) - rejection_stats[filter_name] = (before_count - after_count) / before_count + # Track filter statistics in ray_curator pipeline + def monitor_pipeline_results(results): + total_input_docs = 0 + total_output_docs = 0 + + for batch in results: + if batch is not None: + total_output_docs += batch.num_items + # Input count available from metadata + if 'original_count' in batch._metadata: + total_input_docs += batch._metadata['original_count'] + + if total_input_docs > 0: + retention_rate = total_output_docs / total_input_docs + print(f"Pipeline retention rate: {retention_rate:.2%}") + print(f"Filtered out: {total_input_docs - total_output_docs} documents") ``` ## Use Cases @@ -272,36 +358,69 @@ When filtering code datasets, consider these best practices: ::::{tab-set} :::{tab-item} Cleaning Open Source Code Datasets + ```python -# Filter to remove non-functional code snippets -repo_filter = nc.Sequential([ +from ray_curator.pipeline import Pipeline +from ray_curator.stages.filters.code import NumberOfLinesOfCodeFilter, XMLHeaderFilter, GeneralCommentToCodeFilter +from ray_curator.stages.modules.score_filter import ScoreFilter +from ray_curator.stages.io.reader.jsonl import JsonlReader +from ray_curator.stages.io.writer.jsonl import JsonlWriter + +def create_repo_cleaning_pipeline(data_dir: str, output_dir: str) -> Pipeline: + """Filter to remove non-functional code snippets.""" + + pipeline = Pipeline(name="repo_cleaning", description="Clean open source repositories") + + pipeline.add_stage(JsonlReader(file_paths=data_dir, files_per_partition=2)) + # Remove extremely short files - nc.ScoreFilter(NumberOfLinesOfCodeFilter(min_lines=3)), + pipeline.add_stage(ScoreFilter(NumberOfLinesOfCodeFilter(min_lines=3), text_field="content")) # Remove files with XML preamble (misidentified as code) - nc.ScoreFilter(XMLHeaderFilter()), + pipeline.add_stage(ScoreFilter(XMLHeaderFilter(), text_field="content")) # Ensure reasonable comment-to-code ratio - nc.ScoreFilter(GeneralCommentToCodeFilter(language="text/x-c++")) -]) + pipeline.add_stage(ScoreFilter(GeneralCommentToCodeFilter(language="text/x-c++"), text_field="content")) + + pipeline.add_stage(JsonlWriter(output_dir=output_dir)) + + return pipeline ``` + ::: :::{tab-item} Training Data Preparation + ```python -training_filter = nc.Sequential([ +from ray_curator.pipeline import Pipeline +from ray_curator.stages.filters.code import AlphaFilter, TokenizerFertilityFilter, HTMLBoilerplateFilter +from ray_curator.stages.modules.score_filter import ScoreFilter +from ray_curator.stages.io.reader.jsonl import JsonlReader +from ray_curator.stages.io.writer.jsonl import JsonlWriter + +def create_training_data_pipeline(data_dir: str, output_dir: str) -> Pipeline: + """Prepare high-quality training data.""" + + pipeline = Pipeline(name="training_prep", description="Prepare training data") + + pipeline.add_stage(JsonlReader(file_paths=data_dir, files_per_partition=2)) + # Ensure enough alphabetic content (not just symbols or data) - nc.ScoreFilter(AlphaFilter(min_alpha_ratio=0.3)), + pipeline.add_stage(ScoreFilter(AlphaFilter(min_alpha_ratio=0.3), text_field="content")) # Check token efficiency - 
nc.ScoreFilter(TokenizerFertilityFilter(path_to_tokenizer="tokenizer.model")), + pipeline.add_stage(ScoreFilter(TokenizerFertilityFilter(path_to_tokenizer="tokenizer.model"), text_field="content")) # Remove HTML with mostly boilerplate - nc.ScoreFilter(HTMLBoilerplateFilter(min_lang_content_ratio=0.3)) -]) + pipeline.add_stage(ScoreFilter(HTMLBoilerplateFilter(min_lang_content_ratio=0.3), text_field="content")) + + pipeline.add_stage(JsonlWriter(output_dir=output_dir)) + + return pipeline ``` + ::: :::: -By applying these specialized code filters, you can significantly improve the quality of code in your training datasets, leading to better model performance for code-related tasks. \ No newline at end of file +By applying these specialized code filters, you can significantly improve the quality of code in your training datasets, leading to better model performance for code-related tasks. diff --git a/docs/curate-text/process-data/specialized-processing/synthetic.md b/docs/curate-text/process-data/specialized-processing/synthetic.md index ea3491a45..172f5c230 100644 --- a/docs/curate-text/process-data/specialized-processing/synthetic.md +++ b/docs/curate-text/process-data/specialized-processing/synthetic.md @@ -1,7 +1,7 @@ --- -description: "Identify and filter synthetic or AI-generated content in datasets using embedding similarity and answerability detection" +description: "Create custom filters for detecting and filtering synthetic or AI-generated content in datasets" categories: ["how-to-guides"] -tags: ["synthetic-detection", "ai-detection", "embedding", "qa-pairs", "answerability", "trivial-filtering"] +tags: ["synthetic-detection", "custom-filters", "embedding", "qa-pairs", "document-processing"] personas: ["data-scientist-focused", "mle-focused"] difficulty: "advanced" content_type: "how-to" @@ -9,310 +9,405 @@ modality: "text-only" --- (text-process-data-filter-synthetic)= -# Synthetic Text Detection -NVIDIA NeMo Curator provides specialized filters for identifying and filtering synthetic or AI-generated content in your dataset. These filters help ensure the quality and authenticity of training data, particularly for question-answering systems and other applications where data provenance is important. +# Custom Synthetic Text Detection -NVIDIA NeMo Curator's synthetic text detection addresses several key challenges, including identifying content generated by language models, filtering out trivial questions, ensuring questions are actually answerable given their context, and maintaining diversity in training datasets. These filters are particularly valuable when creating high-quality datasets for question-answering systems, retrieval tasks, and other applications where the relationship between questions and answers is important. +This guide shows how to create custom filters for detecting and filtering synthetic or AI-generated content using the ray-curator framework. You'll learn how to build document filters that can identify patterns common in machine-generated text, such as excessive lexical similarity between questions and contexts or unanswerable questions. -## How It Works +Ray-curator's flexible DocumentFilter system allows you to create specialized filters for synthetic text detection by building custom scoring and filtering logic. This is especially valuable when creating high-quality datasets for question-answering systems, retrieval tasks, and other applications where authentic content is crucial. 
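+
+Before diving into the architecture, here is a minimal sketch of the shape such a filter takes: a `DocumentFilter` subclass with a scoring method and a keep/discard rule, wrapped in a `ScoreFilter` stage. The heuristic, class name, and threshold below are purely illustrative (they assume only the `DocumentFilter` and `ScoreFilter` interfaces described later in this guide); the following sections build full embedding-based and LLM-based detectors.
+
+```python
+from ray_curator.stages.filters.doc_filter import DocumentFilter
+from ray_curator.stages.modules.score_filter import ScoreFilter
+
+class RepeatedSentenceFilter(DocumentFilter):
+    """Toy heuristic: flag documents that repeat whole sentences verbatim."""
+
+    def score_document(self, text: str) -> float:
+        sentences = [s.strip() for s in text.split(".") if s.strip()]
+        if not sentences:
+            return 0.0
+        # Fraction of sentences that duplicate an earlier sentence
+        return 1.0 - len(set(sentences)) / len(sentences)
+
+    def keep_document(self, score: float) -> bool:
+        return score < 0.3  # discard heavily repetitive documents
+
+# Wrap the filter in a stage so it can be added to a processing pipeline
+repetition_stage = ScoreFilter(
+    filter_obj=RepeatedSentenceFilter(),
+    text_field="text",
+    score_field="repetition_score",
+)
+```
+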
-Synthetic text detection identifies AI-generated content by targeting specific patterns and characteristics that are common in machine-generated text. Large language models often produce content that exhibits certain detectable properties: +## Filter Architecture -1. **Lexical Similarity**: AI-generated questions often have high lexical overlap with their contexts, making them too trivial for meaningful model training -2. **Superficial Questions**: Synthetic text may contain questions that appear reasonable but lack substantive depth -3. **Context Mismatches**: Generated QA pairs may include questions that can't actually be answered from the provided context +Ray-curator uses a task-based processing pipeline where documents flow through stages as `DocumentBatch` objects. Custom filters inherit from the `DocumentFilter` base class and integrate into processing pipelines using the `ScoreFilter` stage. -NeMo Curator addresses these issues using two complementary approaches: +The filter architecture consists of: -**EasinessFilter** identifies synthetic content by detecting excessive lexical similarity between questions and contexts. AI-generated questions often reuse phrases directly from the context with minor modifications, resulting in trivial retrieval tasks. These questions don't provide meaningful training signals and can be identified using embedding-based similarity detection. +1. **DocumentFilter**: Base class that defines the scoring and filtering interface +2. **ScoreFilter Stage**: Processing stage that applies filters to document batches +3. **Pipeline**: Orchestrates processing stages in sequence +4. **Executors**: Backend implementations for running pipelines -**AnswerabilityFilter** identifies questions that can't be answered from their contexts, a common issue in synthetic data. This filter uses language models to determine if questions are genuinely answerable from the provided contexts, helping to identify content that appears superficially coherent but lacks real semantic relationships. +Custom synthetic detection filters typically build two key methods: -Together, these filters provide a comprehensive approach to detecting and filtering synthetic text, preserving only authentic question-answer relationships that contribute meaningful training signals. 
+- `score_document()`: Analyzes text and returns a numeric score +- `keep_document()`: Determines whether to keep documents based on their scores ---- - -## Usage +## Creating Custom Synthetic Filters -Here's a complete example of applying synthetic text filters to a dataset: +Here's how to implement custom filters for detecting synthetic content in question-answering datasets: -```python -import nemo_curator as nc -from nemo_curator.datasets import DocumentDataset -from nemo_curator.filters.synthetic import EasinessFilter, AnswerabilityFilter +### Embedding-Based Easiness Filter -# Load your dataset with questions and contexts -dataset = DocumentDataset.read_json("qa_dataset/*.jsonl") +Create a filter that identifies questions with excessive lexical similarity to their contexts: -# Create an easiness filter to remove trivial questions -easiness_filter = nc.ScoreFilter( - EasinessFilter( +```python +from dataclasses import dataclass +import numpy as np +import pandas as pd +from openai import OpenAI + +from ray_curator.stages.filters.doc_filter import DocumentFilter +from ray_curator.stages.modules.score_filter import ScoreFilter +from ray_curator.pipeline import Pipeline +from ray_curator.stages.io.reader.jsonl import JsonlReader +from ray_curator.stages.io.writer.jsonl import JsonlWriter +from ray_curator.backends.xenna import XennaExecutor + +@dataclass +class EasinessFilter(DocumentFilter): + """Filter that identifies questions too easily retrievable from their context.""" + + base_url: str + api_key: str + model: str + percentile: float = 0.7 + truncate: str = "NONE" + + def __post_init__(self): + super().__init__() + self._name = "easiness_filter" + self.client = None + + def score_document(self, text: str) -> float: + """Calculate embedding similarity between question and context.""" + if self.client is None: + self.client = OpenAI(base_url=self.base_url, api_key=self.api_key) + + # Assume text is formatted as "Context: {context}\nQuestion: {question}" + parts = text.split("\nQuestion: ") + if len(parts) != 2: + return 0.0 + + context = parts[0].replace("Context: ", "") + question = parts[1] + + return self._calc_similarity(context, question) + + def keep_document(self, score: float) -> bool: + """Keep documents with similarity below the percentile threshold.""" + # This would typically be implemented after collecting scores + # For simplicity, using a fixed threshold here + return score <= 0.8 + + def _calc_similarity(self, context: str, question: str) -> float: + """Calculate cosine similarity between context and question embeddings.""" + try: + # Get embeddings + context_response = self.client.embeddings.create( + input=[context], + model=self.model, + extra_body={"input_type": "passage", "truncate": self.truncate} + ) + question_response = self.client.embeddings.create( + input=[question], + model=self.model, + extra_body={"input_type": "query", "truncate": self.truncate} + ) + + context_embed = np.array(context_response.data[0].embedding) + question_embed = np.array(question_response.data[0].embedding) + + # Calculate cosine similarity + similarity = np.dot(context_embed, question_embed) / ( + np.linalg.norm(context_embed) * np.linalg.norm(question_embed) + ) + return float(similarity) + + except Exception as e: + print(f"Error calculating similarity: {e}") + return 0.0 + +# Create and configure the filter stage +easiness_filter = ScoreFilter( + filter_obj=EasinessFilter( base_url="https://your-embedding-api-endpoint", api_key="your-api-key", model="embedding-model-name", - 
percentile=0.7, # Filter out easiest 70% of questions - truncate="NONE", - batch_size=1, - text_fields=["text", "question"] + percentile=0.7 ), - text_field=["text", "question"], + text_field="text", score_field="easiness_score" ) - -# Create an answerability filter to ensure questions can be answered -answerability_filter = nc.ScoreFilter( - AnswerabilityFilter( - base_url="https://your-llm-api-endpoint", - api_key="your-api-key", - model="gpt-model-name", - answerability_system_prompt="You are an expert at determining if questions can be answered from the given context.", - answerability_user_prompt_template="Context: {text}\n\nQuestion: {question}\n\nIs this question answerable from the given context? Answer Y or N.", - num_criteria=1, - text_fields=["text", "question"] - ), - text_field=["text", "question"], - score_field="answerability_score" -) - -# Apply the filters in sequence -filtered_dataset = easiness_filter(dataset) -filtered_dataset = answerability_filter(filtered_dataset) - -# Save the results -filtered_dataset.to_json("filtered_qa_dataset/", write_to_filename=True) ``` -## Available Filters - -| Filter | Description | Key Parameters | -|--------|-------------|----------------| -| **EasinessFilter** | Identifies questions that are too easy to retrieve | `base_url`, `api_key`, `model`, `percentile`, `truncate`, `batch_size`, `text_fields` | -| **AnswerabilityFilter** | Ensures questions are answerable from their context | `base_url`, `api_key`, `model`, `answerability_system_prompt`, `answerability_user_prompt_template`, `num_criteria`, `text_fields` | - - -### EasinessFilter +### LLM-Based Question Quality Filter -The `EasinessFilter` uses embedding models to identify questions that are too easily retrievable given their context. This helps filter out trivial questions that don't provide meaningful training signal: +Create a filter that uses language models to determine if questions can be answered from their contexts: ```python -easiness_filter = nc.ScoreFilter( - EasinessFilter( - base_url="https://your-embedding-api-endpoint", +import json +from dataclasses import dataclass + +@dataclass +class AnswerabilityFilter(DocumentFilter): + """Filter that identifies questions that cannot be answered from their context.""" + + base_url: str + api_key: str + model: str + system_prompt: str + user_prompt_template: str + num_criteria: int = 1 + + def __post_init__(self): + super().__init__() + self._name = "answerability_filter" + self.client = None + + def score_document(self, text: str) -> str: + """Use LLM to evaluate if question is answerable from context.""" + if self.client is None: + self.client = OpenAI(base_url=self.base_url, api_key=self.api_key) + + # Parse the text to extract context and question + parts = text.split("\nQuestion: ") + if len(parts) != 2: + return '{"criterion_1": "N"}' + + context = parts[0].replace("Context: ", "") + question = parts[1] + + return self._llm_as_judge(context, question) + + def keep_document(self, score: str) -> bool: + """Keep documents where all criteria are met.""" + try: + criteria = json.loads(score) + # All criteria must be "Y" to keep the document + for i in range(self.num_criteria): + if criteria.get(f"criterion_{i + 1}", "N") != "Y": + return False + return True + except (json.JSONDecodeError, KeyError): + # If parsing fails, default to keeping the document + return True + + def _llm_as_judge(self, context: str, question: str) -> str: + """Query LLM to evaluate answerability criteria.""" + user_query = self.system_prompt + "\n\n" 
+ user_query += self.user_prompt_template.format(text=context, question=question) + + try: + completion = self.client.chat.completions.create( + model=self.model, + messages=[{"role": "user", "content": user_query}], + temperature=0.1, + max_tokens=512 + ) + return completion.choices[0].message.content or '{"criterion_1": "N"}' + except Exception as e: + print(f"LLM API error: {e}") + return '{"criterion_1": "N"}' + +# Configure the answerability filter +answerability_filter = ScoreFilter( + filter_obj=AnswerabilityFilter( + base_url="https://your-llm-api-endpoint", api_key="your-api-key", - model="embedding-model-name", - percentile=0.7, # Filter out easiest 70% of questions - truncate="NONE", # Options: "NONE", "START", "END" - batch_size=10, # Process 10 docs at once - text_fields=["text", "question"] + model="gpt-4", + system_prompt="You are an expert at determining if questions can be answered from given context.", + user_prompt_template="Context: {text}\n\nQuestion: {question}\n\nIs this question answerable from the context? Reply with JSON: {{\"criterion_1\": \"Y\"}} or {{\"criterion_1\": \"N\"}}", + num_criteria=1 ), - text_field=["text", "question"], - score_field="easiness_score" + text_field="text", + score_field="answerability_result" ) ``` -#### Key Parameters +## Complete Pipeline Example -- `base_url`: API endpoint for the embedding service -- `api_key`: Authentication key for the API -- `model`: Name of the embedding model to use -- `percentile`: Percentile threshold for filtering (higher values filter more aggressively) -- `truncate`: Text truncation strategy (NONE, START, END) -- `batch_size`: Number of documents to process in each batch -- `text_fields`: List of field names containing the context and question - -### AnswerabilityFilter - -The `AnswerabilityFilter` uses large language models to determine if a question can be answered based on the provided context. This helps ensure the quality of question-answer pairs: +Here's how to build a complete pipeline that reads QA data, applies synthetic content filters, and writes the results: ```python -answerability_filter = nc.ScoreFilter( - AnswerabilityFilter( - base_url="https://your-llm-api-endpoint", - api_key="your-api-key", - model="gpt-model-name", - answerability_system_prompt="You are an expert at determining if questions can be answered from the given context.", - answerability_user_prompt_template="Context: {text}\n\nQuestion: {question}\n\nEvaluate the following criteria:\n1. Is the question answerable from the context? (Y/N)\n2. Is the answer clearly stated in the context? (Y/N)\n3. Would answering require external knowledge? 
(Y/N)\n\nFormat your response as JSON: {\"criterion_1\": \"Y\", \"criterion_2\": \"Y\", \"criterion_3\": \"N\"}", - num_criteria=3, - text_fields=["text", "question"] - ), - text_field=["text", "question"], - score_field="answerability_score" +from ray_curator.pipeline import Pipeline +from ray_curator.backends.xenna import XennaExecutor + +# Create the processing pipeline +pipeline = Pipeline( + name="synthetic_qa_filtering", + description="Filter synthetic content from QA datasets" ) -``` -#### Key Parameters +# Add stages in sequence +pipeline.add_stage( + # Read JSONL data files + JsonlReader( + file_paths="qa_dataset/*.jsonl", + files_per_partition=10 + ) +) -- `base_url`: API endpoint for the LLM service -- `api_key`: Authentication key for the API -- `model`: Name of the language model to use -- `answerability_system_prompt`: System prompt for the LLM -- `answerability_user_prompt_template`: Template for the user prompt with {text} and {question} placeholders -- `num_criteria`: Number of criteria to evaluate in the response -- `text_fields`: List of field names containing the context and question +pipeline.add_stage( + # Apply easiness filter first + easiness_filter +) -## Advanced Configuration +pipeline.add_stage( + # Then apply answerability filter + answerability_filter +) -### Multi-Criteria Answerability +pipeline.add_stage( + # Write filtered results + JsonlWriter( + output_dir="filtered_qa_dataset/" + ) +) -You can configure the `AnswerabilityFilter` with multiple evaluation criteria: +# Execute the pipeline +executor = XennaExecutor() +results = pipeline.run(executor) -```python -# Define a prompt with multiple criteria -multi_criteria_prompt = """ -Context: {text} +print(f"Pipeline completed. Processed {len(results)} document batches.") +``` -Question: {question} +## Advanced Configuration -Evaluate the following criteria: -1. Is the question answerable from the context? (Y/N) -2. Is the answer clearly stated in the context? (Y/N) -3. Does the question require reasoning or inference? (Y/N) -4. Is the question relevant to the main topic of the context? (Y/N) +### Multi-Criteria Evaluation -Format your response as JSON: -{{"criterion_1": "Y", "criterion_2": "Y", "criterion_3": "Y", "criterion_4": "Y"}} -""" +You can create more sophisticated quality filters with evaluation criteria: -# Configure the filter with 4 criteria -advanced_filter = AnswerabilityFilter( +```python +multi_criteria_filter = AnswerabilityFilter( base_url="https://your-llm-api-endpoint", api_key="your-api-key", model="gpt-4", - answerability_system_prompt="You are an expert at evaluating question-context pairs.", - answerability_user_prompt_template=multi_criteria_prompt, - num_criteria=4, - text_fields=["text", "question"] -) -``` + system_prompt="You are an expert at evaluating question-context pairs for training data quality.", + user_prompt_template="""Context: {text} -### Custom Embedding Models +Question: {question} -You can use different embedding models for the `EasinessFilter`: +Evaluate the following criteria: +1. Is the question answerable from the context? (Y/N) +2. Is the answer clearly stated in the context? (Y/N) +3. Does the question require reasoning beyond simple extraction? 
(Y/N) -```python -# Using a domain-specific embedding model -domain_filter = EasinessFilter( - base_url="https://your-embedding-api-endpoint", - api_key="your-api-key", - model="domain-specific-embedding-model", - percentile=0.6, # Less aggressive filtering - truncate="END", - batch_size=5, - text_fields=["text", "question"] +Format your response as JSON: {{"criterion_1": "Y", "criterion_2": "Y", "criterion_3": "Y"}}""", + num_criteria=3 ) ``` ## Best Practices -### Balancing Dataset Size and Quality +### Implementation Guidelines -When using synthetic text filters, consider these best practices: +When implementing custom synthetic text filters, follow these best practices: -1. **Start with less aggressive filtering**: Begin with lower percentile thresholds for the `EasinessFilter` +1. **Start with conservative thresholds**: Begin with less aggressive filtering to preserve dataset size ```python - # Start with a conservative threshold - conservative_filter = EasinessFilter( + # Conservative easiness filter + conservative_easiness = EasinessFilter( base_url="https://api-endpoint", - api_key="your-key", + api_key="your-key", model="embedding-model", - percentile=0.5, # Filter out only the easiest 50% - truncate="NONE", - batch_size=1, - text_fields=["text", "question"] + percentile=0.5, # Filter only the most obvious cases + truncate="NONE" ) ``` -2. **Evaluate filter impact**: Analyze the distribution of filtered questions +2. **Handle API failures gracefully**: Implement robust error handling for external API calls ```python - # Before applying filter - before_count = len(dataset) - - # After filtering - filtered_dataset = filter_step(dataset) - after_count = len(filtered_dataset) - - # Rejection rate - rejection_rate = (before_count - after_count) / before_count - print(f"Rejected {rejection_rate * 100:.2f}% of questions") + def score_document(self, text: str) -> float: + try: + # API call logic + return self._calculate_score(text) + except Exception as e: + print(f"API error: {e}") + return 0.0 # Default to keeping document ``` -3. **Preserve diversity**: Ensure your filters don't eliminate valid question types +3. 
**Monitor filter performance**: Track filtering statistics during pipeline execution ```python - # Sample and manually review filtered questions - rejected = dataset.loc[~dataset.index.isin(filtered_dataset.index)] - sample = rejected.sample(min(100, len(rejected))) - - # Export for manual review - sample.to_json("rejected_sample.jsonl", orient="records", lines=True) + # Add logging to your pipeline + pipeline = Pipeline("synthetic_filtering") + pipeline.add_stage(Score(score_fn=easiness_filter, score_field="easiness")) + pipeline.add_stage(Score(score_fn=answerability_filter, score_field="answerability")) + pipeline.add_stage(combined_filter) # Apply actual filtering ``` -## Use Cases +## Example Use Cases + +### High-Quality QA Dataset Creation -::::{tab-set} +Create a comprehensive pipeline for producing high-quality question-answering datasets: -:::{tab-item} QA dataset Creation ```python -# Pipeline for high-quality QA dataset creation -qa_pipeline = nc.Sequential([ - # First filter out questions that are too easy to retrieve - nc.ScoreFilter( - EasinessFilter( - base_url="https://api-endpoint", - api_key="your-key", - model="embedding-model", - percentile=0.7, - truncate="NONE", - batch_size=1, - text_fields=["text", "question"] - ), - text_field=["text", "question"], - score_field="easiness_score" +# Complete QA dataset filtering pipeline +qa_pipeline = Pipeline("qa_quality_pipeline") + +# Read raw QA data +qa_pipeline.add_stage(JsonlReader( + file_paths="raw_qa_data/*.jsonl", + files_per_partition=50 +)) + +# Score questions for easiness (but don't filter yet) +qa_pipeline.add_stage(Score( + score_fn=EasinessFilter( + base_url="https://embedding-api", + api_key="your-key", + model="text-embedding-3-large", + percentile=0.7 ), - - # Then ensure remaining questions are answerable - nc.ScoreFilter( - AnswerabilityFilter( - base_url="https://llm-endpoint", - api_key="your-key", - model="llm-model", - answerability_system_prompt="You are a helpful assistant.", - answerability_user_prompt_template="Context: {text}\nQuestion: {question}\nIs this answerable? (Y/N)", - num_criteria=1, - text_fields=["text", "question"] - ), - text_field=["text", "question"], - score_field="answerability_score" - ) -]) + score_field="easiness_score", + text_field="text" +)) + +# Score questions for answerability +qa_pipeline.add_stage(Score( + score_fn=AnswerabilityFilter( + base_url="https://llm-api", + api_key="your-key", + model="gpt-4", + system_prompt="Evaluate QA pairs for training data quality.", + user_prompt_template="Context: {text}\nQuestion: {question}\nAnswerable? 
JSON: {{\"criterion_1\": \"Y/N\"}}", + num_criteria=1 + ), + score_field="answerability_score", + text_field="text" +)) + +# Apply combined filtering logic +qa_pipeline.add_stage(Filter( + filter_fn=lambda row: ( + row["easiness_score"] <= 0.8 and # Not too easy + "Y" in row["answerability_score"] # Answerable + ), + filter_field=["easiness_score", "answerability_score"] +)) -# Apply the pipeline -high_quality_qa = qa_pipeline(dataset) +# Write high-quality results +qa_pipeline.add_stage(JsonlWriter(output_dir="high_quality_qa/")) + +# Execute +executor = XennaExecutor() +results = qa_pipeline.run(executor) ``` -::: -:::{tab-item} Synthetic Question Filtering +### Resource Optimization + +For large-scale processing, consider resource allocation and batching: + ```python -# Filter out likely synthetic questions -synthetic_filter = nc.ScoreFilter( - EasinessFilter( - base_url="https://api-endpoint", - api_key="your-key", - model="embedding-model", - percentile=0.8, # More aggressive filtering - truncate="NONE", - batch_size=1, - text_fields=["text", "question"] - ), - text_field=["text", "question"], - score_field="easiness_score" +from ray_curator.stages.resources import Resources + +# Configure GPU resources for LLM-based filtering +answerability_stage = ScoreFilter( + filter_obj=AnswerabilityFilter(...), + text_field="text" +).with_( + resources=Resources(gpus=1.0), # Allocate full GPU + batch_size=32 # Process multiple documents together ) -# Apply to dataset -human_like_questions = synthetic_filter(dataset) +# CPU-only for embedding similarity +easiness_stage = ScoreFilter( + filter_obj=EasinessFilter(...), + text_field="text" +).with_( + resources=Resources(cpus=4.0), + batch_size=64 +) ``` -::: - -:::: -By applying these specialized synthetic text filters, you can create higher-quality datasets for question-answering systems and other applications where the quality of question-context relationships is critical. \ No newline at end of file +By implementing custom synthetic text detection filters with ray-curator's flexible architecture, you can create robust pipelines for ensuring the authenticity and quality of your training datasets. The task-based processing model allows for efficient scaling and resource management across different filtering stages. 
diff --git a/docs/curate-text/process-data/specialized-processing/task-decontamination.md b/docs/curate-text/process-data/specialized-processing/task-decontamination.md
index c9637b95b..d68d75dce 100644
--- a/docs/curate-text/process-data/specialized-processing/task-decontamination.md
+++ b/docs/curate-text/process-data/specialized-processing/task-decontamination.md
@@ -1,193 +1,324 @@
 ---
-description: "Remove downstream task data from training datasets to prevent evaluation contamination and ensure valid benchmarking"
+description: "Remove downstream task data from training datasets to prevent evaluation contamination and ensure valid benchmarking using ray-curator's task-based processing pipeline"
 categories: ["how-to-guides"]
-tags: ["task-decontamination", "benchmarks", "contamination", "evaluation", "n-grams", "downstream-tasks"]
+tags: ["task-decontamination", "benchmarks", "contamination", "evaluation", "n-grams", "downstream-tasks", "ray-curator", "stages", "pipeline"]
 personas: ["data-scientist-focused", "mle-focused"]
 difficulty: "advanced"
 content_type: "how-to"
 modality: "text-only"
 ---
 
 (text-process-data-filter-task-decontamination)=
 # Downstream Task Decontamination
 
 ## Background
 
-After training, large language models are usually evaluated by their performance on downstream tasks consisting of unseen test data. When dealing with large datasets there is a potential for leakage of this test data into the model's training dataset. Therefore, NVIDIA NeMo Curator follows the approach of [OpenAI GPT3](https://arxiv.org/pdf/2005.14165.pdf) and [Microsoft Turing NLG 530B](https://arxiv.org/abs/2201.11990) to remove sections of documents in your dataset that are present in downstream tasks.
+After training, practitioners evaluate large language models by their performance on downstream tasks consisting of unseen test data. When dealing with large datasets, there is potential for this test data to leak into the model's training dataset. NVIDIA NeMo Curator follows the approach of [OpenAI GPT3](https://arxiv.org/pdf/2005.14165.pdf) and [Microsoft Turing NLG 530B](https://arxiv.org/abs/2201.11990) to remove sections of documents in your dataset that are present in downstream tasks.
+
+Ray-curator implements task decontamination using its task-based processing architecture, where documents flow through the pipeline as `DocumentBatch` tasks processed by specialized `ProcessingStage` components.
 
 ## Usage
 
-The `TaskDecontamination` module provides the central functionality in NVIDIA NeMo Curator. Let's examine this small example:
+In ray-curator, task decontamination runs as a `ProcessingStage` that plugs into a standard processing pipeline. 
Here's how to use it: ```python -import nemo_curator as nc -from nemo_curator.datasets import DocumentDataset -from nemo_curator.utils.file_utils import get_all_files_paths_under -from nemo_curator.tasks import Winogrande, Squad, TriviaQA - -files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl") -books = DocumentDataset.read_json(files, add_filename=True) - +from ray_curator.pipeline import Pipeline +from ray_curator.stages.io.reader import JsonlReader +from ray_curator.stages.io.writer import JsonlWriter +from ray_curator.stages.modules.score_filter import ScoreFilter +from ray_curator.stages.filters.downstream_tasks import TaskDecontaminationFilter +from ray_curator.backends.experimental.ray_data import RayDataExecutor + +# Define downstream tasks for decontamination downstream_tasks = [ - Winogrande(), - Squad(), - TriviaQA(), + "squad", # SQuAD reading comprehension + "trivia_qa", # TriviaQA question answering + "winogrande", # Winogrande commonsense reasoning ] -task_decontaminate = nc.TaskDecontamination(downstream_tasks) +# Create the decontamination stage +decontamination_stage = ScoreFilter( + filter_obj=TaskDecontaminationFilter( + tasks=downstream_tasks, + max_ngram_size=13, + max_matches=10, + min_document_length=200, + remove_char_each_side=200, + max_splits=10 + ), + text_field="text", + score_field="contamination_score" # Optional: keep scores for analysis +) -decontaminated_books = task_decontaminate(books) +# Build the processing pipeline +pipeline = Pipeline( + name="task_decontamination", + description="Remove downstream task contamination from training data" +).add_stage( + JsonlReader(file_paths="books_dataset/") +).add_stage( + decontamination_stage +).add_stage( + JsonlWriter(output_dir="decontaminated_books/") +) -decontaminated_books.to_json("decontaminated_books/", write_to_filename=True) +# Execute the pipeline +executor = RayDataExecutor() +results = pipeline.run(executor) ``` ### Parameters -The `TaskDecontamination` class accepts several parameters to control the decontamination process: +The `TaskDecontaminationFilter` accepts several parameters to control the decontamination process: | Parameter | Default | Description | |-----------|---------|-------------| -| `tasks` | Required | A single task or list of `DownstreamTask` objects | -| `text_field` | `"text"` | Field in dataset containing document text | -| `max_ngram_size` | `13` | Maximum size of n-grams to check for contamination | -| `max_matches` | `10` | If an n-gram appears more than this many times, it's considered too common and not removed | -| `min_document_length` | `200` | Minimum character length for split documents to be kept | +| `tasks` | Required | List of downstream task names (strings) | +| `max_ngram_size` | `13` | Max size of n-grams to check for contamination | +| `max_matches` | `10` | If an n-gram appears more than this number of times, it's considered too common and not removed | +| `min_document_length` | `200` | Min character length for split documents to keep | | `remove_char_each_side` | `200` | Number of characters to remove on either side of matching n-gram | -| `max_splits` | `10` | Maximum number of splits allowed before discarding document entirely | -| `removed_dir` | `None` | Optional directory to save discarded documents | +| `max_splits` | `10` | Max number of splits allowed before discarding document entirely | + +The `ScoreFilter` stage provides other parameters: + +| Parameter | Default | Description | +|-----------|---------|-------------| +| 
`text_field` | `"text"` | Field in DocumentBatch containing document text | +| `score_field` | `None` | Optional field to store contamination scores | +| `invert` | `False` | If True, keep contaminated documents instead of removing them | For example, to use more aggressive removal settings: ```python -task_decontaminate = nc.TaskDecontamination( - tasks=downstream_tasks, - max_ngram_size=10, # Use smaller n-grams for matching - max_matches=5, # Remove n-grams that appear in fewer documents - remove_char_each_side=300, # Remove more context around matches - min_document_length=500 # Keep only longer document fragments +decontamination_stage = ScoreFilter( + filter_obj=TaskDecontaminationFilter( + tasks=downstream_tasks, + max_ngram_size=10, # Use smaller n-grams for matching + max_matches=5, # Remove n-grams that appear in fewer documents + remove_char_each_side=300, # Remove more context around matches + min_document_length=500 # Keep only longer document fragments + ), + text_field="content", # Use different text field + score_field="decontam_score" # Store contamination scores ) ``` ### Available Downstream Tasks -NVIDIA NeMo Curator provides implementations for many common benchmark tasks. Here's a comprehensive list of the supported tasks: +Ray-curator provides built-in support for common benchmark tasks. You can specify these tasks by name: | Task Category | Available Tasks | |---------------|----------------| -| **Question Answering** | `Squad`, `TriviaQA`, `Quac`, `WebQA`, `COQA`, `Drop` | -| **Reading Comprehension** | `Race`, `MultiRC`, `Record` | -| **Commonsense Reasoning** | `PIQA`, `Copa`, `Winogrande`, `StoryCloze` | -| **Natural Language Inference** | `ANLI`, `RTE`, `CB`, `WiC` | -| **Knowledge Tasks** | `ArcEasy`, `ArcChallenge`, `OpenBookQA`, `BoolQ`, `Lambada` | -| **Multi-task Benchmarks** | `MMLU`, `BigBenchHard`, `BigBenchLight` | -| **Specialized Tasks** | `WSC`, `NumDasc`, `Multilingual` | +| **Question Answering** | `"squad"`, `"trivia_qa"`, `"quac"`, `"web_qa"`, `"coqa"`, `"drop"` | +| **Reading Comprehension** | `"race"`, `"multi_rc"`, `"record"` | +| **Commonsense Reasoning** | `"piqa"`, `"copa"`, `"winogrande"`, `"story_cloze"` | +| **Natural Language Inference** | `"anli"`, `"rte"`, `"cb"`, `"wic"` | +| **Knowledge Tasks** | `"arc_easy"`, `"arc_challenge"`, `"open_book_qa"`, `"bool_q"`, `"lambada"` | +| **Multi-task Benchmarks** | `"mmlu"`, `"big_bench_hard"`, `"big_bench_light"` | +| **Specialized Tasks** | `"wsc"`, `"num_dasc"`, `"multilingual"` | -You can import these tasks directly from the `nemo_curator.tasks` module: +Specify tasks as strings when creating the filter: ```python -from nemo_curator.tasks import Squad, TriviaQA, MMLU, Winogrande, ANLI +# Use multiple tasks for comprehensive decontamination +tasks = ["squad", "trivia_qa", "mmlu", "winogrande", "anli"] + +decontamination_filter = TaskDecontaminationFilter(tasks=tasks) ``` -## Task Decontamination Process +## Advanced Processing Configurations -If you'd like more fine-grained control over the task decontamination process, NVIDIA NeMo Curator provides several CLI tools you can manually apply. You can use the `prepare_task_data`, `find_matching_ngrams` and `remove_matching_ngrams` scripts to remove any task data that might be contained (that's "contaminate") within your training data. You'll need a list of your downstream tasks to modify the [task configuration file (lm_tasks.yaml)](../../../../config/lm_tasks.yaml). 
If your task doesn't already exist as a class, you'll need to construct a class that extends `nemo_curator.tasks.DownstreamTask`. +Ray-curator's task-based architecture provides flexible options for implementing task decontamination with different processing strategies: -### 1. Prepare Task N-grams +### Multi-Stage Pipeline Approach -First, construct the n-grams from task documents using the `prepare_task_data` module: +For complex decontamination workflows, you can break the process into several stages: -```bash -prepare_task_data \ - --task-config-file=./config/lm_tasks.yaml \ - --output-task-ngrams=./data/task_ngrams.pkl -``` +```python +from ray_curator.stages.modules.score_filter import Score -This module requires a configuration file that specifies how to form n-grams from the task data. An example configuration is provided in `config/lm_tasks.yaml`. This step only needs to be done once per set of tasks, and the resulting pickle file can be reused across datasets. +# Stage 1: Score documents for contamination +scoring_stage = Score( + score_fn=TaskDecontaminationFilter(tasks=["squad", "trivia_qa"]), + score_field="contamination_score", + text_field="text" +) -The n-gram generation process: -1. Extracts text from each task's test examples -2. Tokenizes the text into words -3. Creates n-grams of varying sizes (up to `max_ngram_size`) -4. Stores these n-grams in a dictionary +# Stage 2: Filter based on scores +filtering_stage = Filter( + filter_fn=lambda score: score < 0.5, # Keep documents with low contamination + filter_field="contamination_score" +) -### 2. Find Matching N-grams +# Build pipeline with separate scoring and filtering +pipeline = Pipeline( + name="multi_stage_decontamination" +).add_stage(JsonlReader(file_paths="input/") +).add_stage(scoring_stage +).add_stage(filtering_stage +).add_stage(JsonlWriter(output_dir="output/")) +``` -Next, use the `find_matching_ngrams` module to search for matches within your corpus: +### Batch Processing Configuration -```bash -find_matching_ngrams \ - --input-data-dir= \ - --input-task-ngrams=./data/task_ngrams.pkl \ - --output-matched-ngram-data=./data/matched_ngrams.pkl -``` +Configure batch processing for better performance on large datasets: -This module: -1. Loads the precomputed task n-grams -2. Searches each document in your dataset for these n-grams -3. Counts occurrences of each n-gram across the entire corpus -4. Outputs a dictionary of n-grams and their frequencies +```python +# Configure larger batch sizes for better throughput +decontamination_stage = ScoreFilter( + filter_obj=TaskDecontaminationFilter(tasks=downstream_tasks), + text_field="text" +).with_( + batch_size=100, # Process 100 documents per batch + resources=Resources(cpus=4.0, gpu_memory_gb=8.0) +) +``` -### 3. Remove Matching N-grams +### Distributed Processing -Finally, use the `remove_matching_ngrams` module to remove contaminated content: +Ray-curator automatically distributes processing across available resources: -```bash -remove_matching_ngrams \ - --input-data-dir= \ - --input-matched-ngrams=./data/matched_ngrams.pkl \ - --output-task-deduped-dir= +```python +# Configure executor for distributed processing +executor = RayDataExecutor(config={ + "num_workers": 16, # Use 16 worker processes + "resources_per_worker": { + "CPU": 2, + "memory": 4_000_000_000 # 4GB memory per worker + } +}) + +results = pipeline.run(executor) ``` -This module: -1. Loads the matched n-grams and their frequencies -2. 
Identifies n-grams that appear fewer than `max_matches` times (rare enough to be actual task contamination) -3. For each document containing these n-grams: - - Removes the n-gram and surrounding characters (up to `remove_char_each_side` on each side) - - Splits the document at removal points - - Keeps only the split fragments longer than `min_document_length` - - Discards documents that require more than `max_splits` - ## Creating Custom Downstream Tasks -If you need to decontaminate against a task not included in NeMo Curator, you can create your own task class: +If you need to decontaminate against a custom benchmark task not included in ray-curator, you can extend the `TaskDecontaminationFilter` to support custom n-gram sources: ```python -from nemo_curator.tasks import DownstreamTask - -class MyCustomTask(DownstreamTask): - def __init__(self): +from ray_curator.stages.filters.doc_filter import DocumentFilter +from ray_curator.stages.utils.text_utils import get_words + +class CustomTaskFilter(DocumentFilter): + def __init__(self, + task_data_path: str, + max_ngram_size: int = 13, + max_matches: int = 10, + min_document_length: int = 200, + remove_char_each_side: int = 200, + max_splits: int = 10): super().__init__() - self._task_name = "my_custom_task" + self.task_data_path = task_data_path + self.max_ngram_size = max_ngram_size + self.max_matches = max_matches + self.min_document_length = min_document_length + self.remove_char_each_side = remove_char_each_side + self.max_splits = max_splits + self._task_ngrams = self._load_task_ngrams() + + def _load_task_ngrams(self) -> set[str]: + """Load n-grams from custom task data.""" + import json + ngrams = set() - def generate_ngrams(self): - # Load your task's test data - test_examples = load_my_test_data() + with open(self.task_data_path) as f: + for line in f: + example = json.loads(line) + # Extract text from your custom task format + text_fields = [example.get("question", ""), + example.get("context", ""), + example.get("answer", "")] + + for text in text_fields: + if text: + words, _ = get_words(text) + if len(words) >= 8: # minimum n-gram size + for i in range(len(words) - self.max_ngram_size + 1): + ngram = " ".join(words[i:i + self.max_ngram_size]) + ngrams.add(ngram) - # Process each example and update ngrams - for example in test_examples: - # If your task has multiple text fields, process each one - self._update_ngrams(example["question"]) - self._update_ngrams(example["context"]) - - return self._ngrams + return ngrams + + def score_document(self, text: str) -> float: + """Score document based on n-gram contamination.""" + words, _ = get_words(text) + contamination_count = 0 + + for i in range(len(words) - self.max_ngram_size + 1): + ngram = " ".join(words[i:i + self.max_ngram_size]) + if ngram in self._task_ngrams: + contamination_count += 1 + + return contamination_count / max(1, len(words) - self.max_ngram_size + 1) + + def keep_document(self, score: float) -> bool: + """Keep documents with low contamination scores.""" + return score < 0.1 # Threshold for contamination ``` -You can then use this custom task with the `TaskDecontamination` module: +Use your custom filter in a pipeline: ```python -task_decontaminate = nc.TaskDecontamination([MyCustomTask()]) +# Create custom task decontamination stage +custom_decontamination = ScoreFilter( + filter_obj=CustomTaskFilter( + task_data_path="my_custom_task_data.jsonl", + max_ngram_size=13 + ), + text_field="text", + score_field="custom_contamination_score" +) + +# Add to pipeline 
+pipeline.add_stage(custom_decontamination)
 ```
 
 ## Performance Considerations
 
-Task decontamination can be computationally intensive for large datasets. Consider these optimization strategies:
+Task decontamination can be computationally intensive for large datasets. Ray-curator's distributed architecture provides several optimization strategies:
+
+### Resource Optimization
+
+1. **Choose important tasks**: Start with the most critical benchmark tasks for your application
+2. **Configure batch sizes**: Larger batch sizes improve throughput but require more memory
+3. **Adjust n-gram size**: Smaller values of `max_ngram_size` reduce computation but may increase false positives
+4. **Use distributed processing**: Ray automatically scales across available CPU/GPU resources
+
+### Pipeline Optimization
+
+```python
+# Optimize pipeline for throughput
+optimized_stage = ScoreFilter(
+    filter_obj=TaskDecontaminationFilter(tasks=["squad"]),
+    text_field="text"
+).with_(
+    batch_size=200,  # Larger batches reduce per-batch overhead but use more memory
+    resources=Resources(cpus=8.0)  # More CPU cores for parallel processing
+)
+
+# Use efficient executors
+executor = RayDataExecutor(config={
+    "enable_auto_log_stats": True,  # Monitor performance
+    "verbose_stats_logs": True
+})
+```
+
+### Memory Management
+
+1. **File partitioning**: Use appropriate `files_per_partition` in JsonlReader
+2. **Progressive processing**: Process large datasets in chunks using file limits
+3. **Resource monitoring**: Use Ray's built-in resource monitoring
+
+### Scaling Strategies
 
-1. **Prioritize important tasks**: Start with the most critical benchmark tasks for your application
-2. **Process in batches**: Decontaminate your dataset in manageable chunks
-3. **Save intermediate results**: Store the results from each step of the CLI workflow
-4. **Adjust n-gram size**: Smaller values of `max_ngram_size` reduce computation but may increase false positives
+- **Horizontal scaling**: Add more worker nodes to distribute processing
+- **Vertical scaling**: Use machines with more CPU cores and memory
+- **Hybrid approach**: Combine CPU workers for text processing with GPU workers for model-based filters
 
 ## References
 
-- [Language Models are Few-Shot Learners (Brown et al., 2020)](https://arxiv.org/abs/2005.14165)
-- [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B (Smith et al., 2021)](https://arxiv.org/abs/2201.11990)
\ No newline at end of file
+- [Language Models are Few-Shot Learners (Brown et al., 2020)](https://arxiv.org/abs/2005.14165)
+- [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B (Smith et al., 2021)](https://arxiv.org/abs/2201.11990)
+- [Ray Datasets: Distributed Data Loading and Compute](https://docs.ray.io/en/latest/data/dataset.html)
\ No newline at end of file
diff --git a/docs/reference/index.md b/docs/reference/index.md
index 4079cce77..af15e7d47 100644
--- a/docs/reference/index.md
+++ b/docs/reference/index.md
@@ -210,4 +210,4 @@ Learn about complementary tools in the NVIDIA ecosystem
 
 {bdg-secondary}`tao-toolkit`
 :::
 
-::::
+::::
\ No newline at end of file