Nvingest curator tutorial #584

Open · wants to merge 12 commits into base: `main`
20 changes: 20 additions & 0 deletions tutorials/multimodal_dapt_curation/README.md
@@ -0,0 +1,20 @@
# Multimodal Extraction and Curation

## Workflow
![Workflow Overview](workflow.png)

## Overview
This tutorial is divided into two parts:

### Part 1: Multimodal Extraction
In this section, we guide you through extracting various modalities (text, images, tables, etc.) from PDFs using NVIDIA's multimodal extraction (`nv-ingest`) framework. To complete the prerequisites and run the tutorial, refer to the README in the `ingest` folder within this directory.
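For a flavor of what Part 1 involves, extraction with the `nv-ingest` Python client looks roughly like the sketch below. The hostname, file glob, and extraction flags are assumptions; the `ingest` README has the authoritative setup.

```python
from nv_ingest_client.client import Ingestor

# Submit PDFs to a running nv-ingest service and request every modality.
# "localhost" assumes a local deployment; adjust to your environment.
ingestor = (
    Ingestor(message_client_hostname="localhost")
    .files("sources/*.pdf")
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=True,
    )
)
results = ingestor.ingest()
```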

### Part 2: Data Curation for Domain-Adaptive Pre-Training (DAPT)
The second part of the tutorial covers best practices for data curation in DAPT. This stage processes extracted text, tables, charts, and images using the curation pipeline. To complete the prerequisites and execute the tutorial, follow the README in the `curator` folder within the directory.
Collaborator suggested change: "best practices for data curation in DAPT" → "best practices for data curation for DAPT".


## Instructions
- Ensure that all prerequisites for both `nv-ingest` (extraction) and `curator` (curation) are completed before proceeding.
- Follow the respective READMEs in the `ingest` and `curator` folders for step-by-step guidance.

## License
Refer to the respective repositories for licensing information.
36 changes: 36 additions & 0 deletions tutorials/multimodal_dapt_curation/curator/README.md
@@ -0,0 +1,36 @@
# Multi-Modal Data Curation from PDFs

## Overview
This is Part 2 of the tutorial; it covers best practices for data curation for Domain-Adaptive Pre-Training (DAPT).
The dataset used in this tutorial is small, making it ideal for developing and validating data curation pipelines on either a local machine or a computing cluster. The playbook employs specialized tools and techniques for high-quality text curation and refinement.

## Hardware Requirements
This playbook is compatible with both CPUs and GPUs.
While most steps can run on a CPU, the semantic and fuzzy deduplication modules require a GPU.
If GPUs are available, the PII redaction and exact deduplication processes will be accelerated.
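As a hedged illustration of the CPU/GPU switch, NeMo Curator pipelines usually start a Dask client whose backend matches the device; `main.py` may wire this differently:

```python
from nemo_curator.utils.distributed_utils import get_client

# "cpu" starts a standard Dask cluster; "gpu" starts a dask-cuda cluster,
# which the fuzzy and semantic deduplication modules require.
client = get_client(cluster_type="gpu")
```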

## Walkthrough
The datasets used in this tutorial are referenced in the `NeMo-Curator/tutorials/multimodal_dapt_curation/ingest/sources/separated_extracted_data/data_type_map.json` file.
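The exact schema of that file comes from Part 1; conceptually it groups extraction results by modality, along these (hypothetical) lines:

```python
# Hypothetical illustration only; the real structure is produced by nv-ingest.
data_type_map = {
    "text": ["...extracted text records..."],
    "structured": ["...table and chart records..."],
    "image": ["...image records..."],
}
```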

The tutorial follows these steps:
1. Install Requirements and Import Libraries
Collaborator suggested change: "Install Requirements and Import Libraries" → "Install requirements and import libraries".

2. Convert Extracted Data: Transform data from `nv-ingest` into Dask DataFrames and convert them to `DocumentDataset`.
3. Examine File Types and Sizes (Optional)
4. Run the Data Curation Pipeline with NeMo Curator:
- Identify and separate file types
- Perform document-level exact deduplication
- Apply heuristic-based quality filtering (e.g., number of lines, word count, top N-grams)
- Fix Unicode errors using `ftfy`
- Redact PII
- Execute GPU-accelerated fuzzy and semantic deduplication
5. Save the Filtered and Curated Data
Comment on lines +17 to +26
Collaborator suggested change: use sentence case for these steps as well, e.g. "Convert extracted data", "Examine file types and sizes (optional)", "Run the data curation pipeline with NeMo Curator", and "Save the filtered and curated data"; the sub-bullets are unchanged.
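For orientation, steps 2 and 4 above map onto NeMo Curator building blocks roughly as follows. The class names are real NeMo Curator APIs, but the exact wiring, field names, and thresholds used in `main.py` may differ:

```python
import dask.dataframe as dd
import pandas as pd

from nemo_curator import ExactDuplicates, Modify, ScoreFilter, Sequential
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
from nemo_curator.modifiers import UnicodeReformatter

# Step 2: wrap extracted records in a Dask DataFrame, then a DocumentDataset.
df = pd.DataFrame({"id": [0, 1], "text": ["first doc ...", "second doc ..."]})
dataset = DocumentDataset(dd.from_pandas(df, npartitions=2))

# Step 4 (partial): heuristic quality filtering plus Unicode repair via ftfy.
pipeline = Sequential(
    [
        ScoreFilter(WordCountFilter(min_words=50), text_field="text"),
        Modify(UnicodeReformatter(), text_field="text"),
    ]
)
curated = pipeline(dataset)

# Document-level exact deduplication; returns the identified duplicates.
exact_dedup = ExactDuplicates(id_field="id", text_field="text", hash_method="md5")
duplicates = exact_dedup(curated)
```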


## Usage
After installing the NeMo Curator package, install the required dependencies and run the pipeline using the following command:
Collaborator: Should this include an installation command?
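One possibility (an assumption; the PR does not pin the command) is installing this folder's requirements first:

```sh
# Hypothetical file name; install the tutorial's Python dependencies.
pip install -r requirements.txt
```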


```sh
python main.py --device "gpu"
```

## License
Refer to the relevant repository for licensing information.
@@ -0,0 +1,27 @@
# Configuration file for semdantic dedup
Collaborator suggested change: "# Configuration file for semdantic dedup" → "# Configuration file for struct semantic deduplication".

cache_dir: "/home/ruchaa/projects/nvingest_curator_tutorial/NeMo-Curator/tutorials/multimodal_dapt_curation/code/workspace/semdedup_cache/struct" #"workspace/semdedup_cache/struct"
Collaborator suggested change: drop the absolute path and use the relative `cache_dir: "workspace/semdedup_cache/struct"`.

num_files: 16

# Embeddings configuration
embeddings_save_loc: "embeddings"
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
write_embeddings_to_disk: false

# Clustering configuration
clustering_save_loc: "clustering_results"
n_clusters: 5
max_iter: 100
kmeans_with_cos_dist: false

# Semdedup configuration
which_to_keep: "hard"
sim_metric: "cosine"

# Extract dedup configuration
eps_thresholds:
- 0.1
- 0.01

# Which threshold to use for extracting deduped data
eps_to_extract: 0.1
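For reference, a config like the one above is typically consumed along these lines; a minimal sketch assuming NeMo Curator's `SemDedupConfig`/`SemDedup` API and a hypothetical file name:

```python
import dask.dataframe as dd
import pandas as pd

from nemo_curator import SemDedup
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules.config import SemDedupConfig

# Hypothetical path; point this at the YAML above.
config = SemDedupConfig.from_yaml("config/sem_dedup_config_struct.yaml")

# Placeholder dataset; in the tutorial this comes from earlier pipeline steps.
df = pd.DataFrame({"id": [0, 1], "text": ["a table", "a chart"]})
dataset = DocumentDataset(dd.from_pandas(df, npartitions=1))

# Requires a GPU-backed Dask cluster; returns the documents flagged as duplicates.
sem_dedup = SemDedup(config=config, input_column="text", id_column="id")
duplicates = sem_dedup(dataset)
```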
@@ -0,0 +1,30 @@
# Configuration file for semdantic dedup
Collaborator suggested change: "# Configuration file for semdantic dedup" → "# Configuration file for text semantic deduplication".

cache_dir: "/home/ruchaa/projects/nvingest_curator_tutorial/NeMo-Curator/tutorials/multimodal_dapt_curation/code/workspace/semdedup_cache/text" #"workspace/semdedup_cache/text"
Collaborator suggested change: drop the absolute path and use the relative `cache_dir: "workspace/semdedup_cache/text"`.

num_files: 16

# Embeddings configuration
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
embeddings_save_loc: "embeddings"
embedding_pooling_strategy: "mean_pooling"
Collaborator: I think you can remove some of these parameters to match the other config file, or you could add them to the other config file.

embedding_column: "embeddings"
write_embeddings_to_disk: true
write_to_filename: false

# Clustering configuration
max_iter: 100
n_clusters: 5
clustering_save_loc: "clustering_results"
sim_metric: "cosine"
which_to_keep: "hard"
sort_clusters: true
kmeans_with_cos_dist: false
partition_size: "2gb"

# Extract dedup configuration
eps_thresholds:
- 0.1
- 0.01

# Which threshold to use for extracting deduped data
eps_to_extract: 0.1