Nvingest curator tutorial #584

Open · wants to merge 12 commits into base: `main`
20 changes: 20 additions & 0 deletions tutorials/multimodal_dapt_curation/README.md
@@ -0,0 +1,20 @@
# Multimodal Extraction and Curation

## Workflow
![Workflow Overview](workflow.png)

## Overview
This tutorial is divided into two parts:

### Part 1: Multimodal Extraction
In this section, we guide you through extracting various modalities (text, images, tables, etc.) from PDFs using NVIDIA's multimodal extraction (`nv-ingest`) framework. To complete the prerequisites and run the tutorial, refer to the README in the `ingest` folder within this directory.
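For a flavor of what Part 1 involves, extraction with the `nv-ingest` Python client looks roughly like the sketch below. The hostname, file glob, and extraction flags are assumptions; the `ingest` README has the authoritative setup.

```python
from nv_ingest_client.client import Ingestor

# Submit PDFs to a running nv-ingest service and request every modality.
# "localhost" assumes a local deployment; adjust to your environment.
ingestor = (
    Ingestor(message_client_hostname="localhost")
    .files("sources/*.pdf")
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=True,
    )
)
results = ingestor.ingest()
```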

### Part 2: Data Curation for Domain-Adaptive Pre-Training (DAPT)
The second part of the tutorial covers best practices for data curation in DAPT. This stage processes extracted text, tables, charts, and images using the curation pipeline. To complete the prerequisites and execute the tutorial, follow the README in the `curator` folder within the directory.
Collaborator suggested change: "best practices for data curation in DAPT" → "best practices for data curation for DAPT".


## Instructions
- Ensure that all prerequisites for both `nv-ingest` (extraction) and `curator` (curation) are completed before proceeding.
- Follow the respective READMEs in the `ingest` and `curator` folders for step-by-step guidance.

## License
Refer to the respective repositories for licensing information.
36 changes: 36 additions & 0 deletions tutorials/multimodal_dapt_curation/curator/README.md
@@ -0,0 +1,36 @@
# Multi-Modal Data Curation from PDFs

## Overview
This is Part 2 of the tutorial; it covers best practices for data curation for Domain-Adaptive Pre-Training (DAPT).
The dataset used in this tutorial is small, making it ideal for developing and validating data curation pipelines on either a local machine or a computing cluster. The playbook employs specialized tools and techniques for high-quality text curation and refinement.

## Hardware Requirements
This playbook is compatible with both CPUs and GPUs.
While most steps can run on a CPU, the semantic and fuzzy deduplication modules require a GPU.
If GPUs are available, the PII redaction and exact deduplication processes will be accelerated.
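As a hedged illustration of the CPU/GPU switch, NeMo Curator pipelines usually start a Dask client whose backend matches the device; `main.py` may wire this differently:

```python
from nemo_curator.utils.distributed_utils import get_client

# "cpu" starts a standard Dask cluster; "gpu" starts a dask-cuda cluster,
# which the fuzzy and semantic deduplication modules require.
client = get_client(cluster_type="gpu")
```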

## Walkthrough
The datasets used in this tutorial are referenced in the `NeMo-Curator/tutorials/multimodal_dapt_curation/ingest/sources/separated_extracted_data/data_type_map.json` file.
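The exact schema of that file comes from Part 1; conceptually it groups extraction results by modality, along these (hypothetical) lines:

```python
# Hypothetical illustration only; the real structure is produced by nv-ingest.
data_type_map = {
    "text": ["...extracted text records..."],
    "structured": ["...table and chart records..."],
    "image": ["...image records..."],
}
```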

The tutorial follows these steps:
1. Install Requirements and Import Libraries
Collaborator suggested change: "Install Requirements and Import Libraries" → "Install requirements and import libraries".

2. Convert Extracted Data: Transform data from `nv-ingest` into Dask DataFrames and convert them to `DocumentDataset`.
3. Examine File Types and Sizes (Optional)
4. Run the Data Curation Pipeline with NeMo Curator:
- Identify and separate file types
- Perform document-level exact deduplication
- Apply heuristic-based quality filtering (e.g., number of lines, word count, top N-grams)
- Fix Unicode errors using `ftfy`
- Redact PII
- Execute GPU-accelerated fuzzy and semantic deduplication
5. Save the Filtered and Curated Data
Comment on lines +17 to +26
Collaborator suggested change: use sentence case for these steps as well, e.g. "Convert extracted data", "Examine file types and sizes (optional)", "Run the data curation pipeline with NeMo Curator", and "Save the filtered and curated data"; the sub-bullets are unchanged.
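For orientation, steps 2 and 4 above map onto NeMo Curator building blocks roughly as follows. The class names are real NeMo Curator APIs, but the exact wiring, field names, and thresholds used in `main.py` may differ:

```python
import dask.dataframe as dd
import pandas as pd

from nemo_curator import ExactDuplicates, Modify, ScoreFilter, Sequential
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
from nemo_curator.modifiers import UnicodeReformatter

# Step 2: wrap extracted records in a Dask DataFrame, then a DocumentDataset.
df = pd.DataFrame({"id": [0, 1], "text": ["first doc ...", "second doc ..."]})
dataset = DocumentDataset(dd.from_pandas(df, npartitions=2))

# Step 4 (partial): heuristic quality filtering plus Unicode repair via ftfy.
pipeline = Sequential(
    [
        ScoreFilter(WordCountFilter(min_words=50), text_field="text"),
        Modify(UnicodeReformatter(), text_field="text"),
    ]
)
curated = pipeline(dataset)

# Document-level exact deduplication; returns the identified duplicates.
exact_dedup = ExactDuplicates(id_field="id", text_field="text", hash_method="md5")
duplicates = exact_dedup(curated)
```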


## Usage
After installing the NeMo Curator package, install the required dependencies and run the pipeline using the following command:
Collaborator: Should this include an installation command?
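One possibility (an assumption; the PR does not pin the command) is installing this folder's requirements first:

```sh
# Hypothetical file name; install the tutorial's Python dependencies.
pip install -r requirements.txt
```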


```sh
python main.py --device "gpu"
```

## License
Refer to the relevant repository for licensing information.
@@ -0,0 +1,27 @@
# Configuration file for semdantic dedup
Collaborator suggested change: "# Configuration file for semdantic dedup" → "# Configuration file for struct semantic deduplication".

cache_dir: "/home/ruchaa/projects/nvingest_curator_tutorial/NeMo-Curator/tutorials/multimodal_dapt_curation/code/workspace/semdedup_cache/struct" #"workspace/semdedup_cache/struct"
Collaborator suggested change: drop the absolute path and use the relative `cache_dir: "workspace/semdedup_cache/struct"`.

num_files: 16

# Embeddings configuration
embeddings_save_loc: "embeddings"
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
write_embeddings_to_disk: false

# Clustering configuration
clustering_save_loc: "clustering_results"
n_clusters: 5
max_iter: 100
kmeans_with_cos_dist: false

# Semdedup configuration
which_to_keep: "hard"
sim_metric: "cosine"

# Extract dedup configuration
eps_thresholds:
- 0.1
- 0.01

# Which threshold to use for extracting deduped data
eps_to_extract: 0.1
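For reference, a config like the one above is typically consumed along these lines; a minimal sketch assuming NeMo Curator's `SemDedupConfig`/`SemDedup` API and a hypothetical file name:

```python
import dask.dataframe as dd
import pandas as pd

from nemo_curator import SemDedup
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules.config import SemDedupConfig

# Hypothetical path; point this at the YAML above.
config = SemDedupConfig.from_yaml("config/sem_dedup_config_struct.yaml")

# Placeholder dataset; in the tutorial this comes from earlier pipeline steps.
df = pd.DataFrame({"id": [0, 1], "text": ["a table", "a chart"]})
dataset = DocumentDataset(dd.from_pandas(df, npartitions=1))

# Requires a GPU-backed Dask cluster; returns the documents flagged as duplicates.
sem_dedup = SemDedup(config=config, input_column="text", id_column="id")
duplicates = sem_dedup(dataset)
```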
@@ -0,0 +1,30 @@
# Configuration file for semdantic dedup
Collaborator suggested change: "# Configuration file for semdantic dedup" → "# Configuration file for text semantic deduplication".

cache_dir: "/home/ruchaa/projects/nvingest_curator_tutorial/NeMo-Curator/tutorials/multimodal_dapt_curation/code/workspace/semdedup_cache/text" #"workspace/semdedup_cache/text"
Collaborator suggested change: drop the absolute path and use the relative `cache_dir: "workspace/semdedup_cache/text"`.

num_files: 16

# Embeddings configuration
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
embeddings_save_loc: "embeddings"
embedding_pooling_strategy: "mean_pooling"
Collaborator: I think you can remove some of these parameters to match the other config file, or you could add them to the other config file.

embedding_column: "embeddings"
write_embeddings_to_disk: true
write_to_filename: false

# Clustering configuration
max_iter: 100
n_clusters: 5
clustering_save_loc: "clustering_results"
sim_metric: "cosine"
which_to_keep: "hard"
sort_clusters: true
kmeans_with_cos_dist: false
partition_size: "2gb"

# Extract dedup configuration
eps_thresholds:
- 0.1
- 0.01

# Which threshold to use for extracting deduped data
eps_to_extract: 0.1