75 changes: 25 additions & 50 deletions docs/_templates/autodoc2_index.rst
@@ -6,87 +6,62 @@ NeMo Curator's API reference provides comprehensive technical documentation for
 .. grid:: 1 2 2 2
 :gutter: 3

-.. grid-item-card:: :octicon:`database;1.5em;sd-mr-1` Core Data Handling
-:link: datasets/datasets
+.. grid-item-card:: :octicon:`server;1.5em;sd-mr-1` Execution Backends
+:link: backends/backends
 :link-type: doc
 :class-card: sd-border-0

-**Datasets & Download**
+**Ray-based execution backends**

-Essential classes for loading, managing, and downloading training data from various sources.
+Adapters and executors for running pipelines at scale.

-:bdg-secondary:`doc-dataset` :bdg-secondary:`parallel-dataset` :bdg-secondary:`arxiv` :bdg-secondary:`commoncrawl`
+:bdg-secondary:`ray-data` :bdg-secondary:`xenna`

-.. grid-item-card:: :octicon:`filter;1.5em;sd-mr-1` Data Processing
-:link: filters/filters
+.. grid-item-card:: :octicon:`workflow;1.5em;sd-mr-1` Pipeline
+:link: pipeline/pipeline
 :link-type: doc
 :class-card: sd-border-0

-**Filters & Modifiers**
+**Orchestrate end-to-end workflows**

-Tools for cleaning, filtering, and transforming text data to improve quality and remove unwanted content.
-
-:bdg-secondary:`classifier-filter` :bdg-secondary:`heuristic-filter` :bdg-secondary:`pii-modifier`
+Build and run pipelines composed of processing stages.

-.. grid-item-card:: :octicon:`code;1.5em;sd-mr-1` Classification & Analysis
-:link: classifiers/classifiers
+.. grid-item-card:: :octicon:`stack;1.5em;sd-mr-1` Processing Stages
+:link: stages/stages
 :link-type: doc
 :class-card: sd-border-0

-**AI-Powered Analysis**
+**Download, transform, and write data**

-Advanced classification tools and image processing capabilities for content analysis and quality assessment.
+Modular stages for download/extract, text models/classifiers, I/O, and utilities.

-:bdg-secondary:`aegis` :bdg-secondary:`content-type` :bdg-secondary:`domain-classifier`
+:bdg-secondary:`download` :bdg-secondary:`text` :bdg-secondary:`io` :bdg-secondary:`modules`

-.. grid-item-card:: :octicon:`shield-check;1.5em;sd-mr-1` Privacy & Security
-:link: pii/pii
+.. grid-item-card:: :octicon:`tasklist;1.5em;sd-mr-1` Tasks
+:link: tasks/tasks
 :link-type: doc
 :class-card: sd-border-0

-**PII Detection & Redaction**
+**Core data structures**

-Identify and handle personally identifiable information in datasets with advanced recognition algorithms.
-
-:bdg-secondary:`recognizers` :bdg-secondary:`algorithms` :bdg-secondary:`redaction`
+Document batches, file groups, and related interfaces passed between stages.

-.. grid-item-card:: :octicon:`zap;1.5em;sd-mr-1` Synthetic Data
-:link: synthetic/synthetic
+.. grid-item-card:: :octicon:`gear;1.5em;sd-mr-1` Utilities
+:link: utils/utils
 :link-type: doc
 :class-card: sd-border-0

-**Data Generation**
+**Helper functions**

-Create high-quality synthetic training data using advanced language models and generation techniques.
-
-:bdg-secondary:`generator` :bdg-secondary:`nemotron` :bdg-secondary:`mixtral`
-
-.. grid-item-card:: :octicon:`tools;1.5em;sd-mr-1` Advanced Processing
-:link: modules/modules
-:link-type: doc
-:class-card: sd-border-0
-
-**Deduplication & Modules**
-
-Advanced processing modules including semantic deduplication, fuzzy matching, and data pipeline components.
-
-:bdg-secondary:`semantic-dedup` :bdg-secondary:`fuzzy-dedup` :bdg-secondary:`add-id`
+File, performance, and operation utilities used across the pipeline.

 .. toctree::
 :maxdepth: 1
 :caption: API Modules
 :hidden:

-datasets/datasets
-download/download
-filters/filters
-modifiers/modifiers
-modules/modules
-classifiers/classifiers
-image/image
-pii/pii
-synthetic/synthetic
-services/services
-nemo_run/nemo_run
+backends/backends
+pipeline/pipeline
+stages/stages
 tasks/tasks
 utils/utils
37 changes: 14 additions & 23 deletions docs/conf.py
@@ -141,31 +141,22 @@

 # -- Options for Autodoc2 ---------------------------------------------------
 sys.path.insert(0, os.path.abspath(".."))
+# Ensure the ray-curator package is importable (package lives under ray-curator/ray_curator)
+sys.path.insert(0, os.path.abspath(os.path.join("..", "ray-curator")))

-# Document individual submodules instead of the top-level package
-# This should generate shorter filenames without the nemo_curator. prefix
+# Document `ray_curator` subpackages instead of the legacy `nemo_curator`
+# This should generate shorter filenames without the ray_curator. prefix
 autodoc2_packages_list = [
-# Core data handling
-"../nemo_curator/datasets",
-"../nemo_curator/download",
-# Data processing
-"../nemo_curator/filters",
-"../nemo_curator/modifiers",
-"../nemo_curator/modules",
-# Classification and analysis
-"../nemo_curator/classifiers",
-"../nemo_curator/image",
-# Privacy and security
-"../nemo_curator/pii",
-# Synthetic data
-"../nemo_curator/synthetic",
-# Services and infrastructure
-"../nemo_curator/services",
-"../nemo_curator/nemo_run",
-# Evaluation and tasks
-"../nemo_curator/tasks",
-# Utilities
-"../nemo_curator/utils",
+# Execution backends and adapters
+"../ray-curator/ray_curator/backends",
+# Pipeline orchestration
+"../ray-curator/ray_curator/pipeline",
+# All processing stages (download/extract, modules, text, io, etc.)
+"../ray-curator/ray_curator/stages",
+# Core task data structures
+"../ray-curator/ray_curator/tasks",
+# Shared utilities
+"../ray-curator/ray_curator/utils",
 ]

 # Check if any of the packages actually exist before enabling autodoc2
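
The hunk ends at the comment about checking that the packages exist, so the check itself is not visible in this diff. The following is only a minimal sketch of what such a check could look like, assuming the entries are paths relative to the docs directory. `existing_packages` and its parameters are hypothetical names for illustration, not taken from `conf.py`.

```python
import os


def existing_packages(package_paths, base_dir):
    """Return the entries of package_paths that resolve to directories.

    Hypothetical helper: each path is interpreted relative to base_dir
    (for example, the directory that contains conf.py).
    """
    found = []
    for rel_path in package_paths:
        abs_path = os.path.abspath(os.path.join(base_dir, rel_path))
        # Keep only paths that point at a real directory on disk.
        if os.path.isdir(abs_path):
            found.append(rel_path)
    return found
```

In a Sphinx configuration, a result like this could be used to decide whether to enable the `autodoc2` extension at all; how this repository actually wires that up is not shown in the diff.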