Nvingest curator tutorial #584
# Multimodal Extraction and Curation

## Workflow

*(Workflow diagram.)*

## Overview
This tutorial is divided into two parts:

### Part 1: Multimodal Extraction
In this section, we guide you through extracting various modalities (text, images, tables, etc.) from PDFs using NVIDIA's multimodal extraction (`nv-ingest`) framework. To complete the prerequisites and run the tutorial, refer to the README located in the `ingest` folder of this directory.

### Part 2: Data Curation for Domain-Adaptive Pre-Training (DAPT)
The second part of the tutorial covers best practices for data curation in DAPT. This stage processes the extracted text, tables, charts, and images using the curation pipeline. To complete the prerequisites and execute the tutorial, follow the README in the `curator` folder of this directory.

## Instructions
- Ensure that all prerequisites for both `nv-ingest` (extraction) and `curator` (curation) are completed before proceeding.
- Follow the respective READMEs in the `ingest` and `curator` folders for step-by-step guidance.

## License
Refer to the respective repositories for licensing information.
# Multi-Modal Data Curation from PDFs

## Overview
This is Part 2 of the tutorial, which provides best practices for data curation in Domain-Adaptive Pre-Training (DAPT).
The dataset used in this tutorial is small, making it ideal for developing and validating data curation pipelines on either a local machine or a computing cluster. The playbook employs specialized tools and techniques for high-quality text curation and refinement.

## Hardware Requirements
This playbook is compatible with both CPUs and GPUs.
While most steps can run on a CPU, the semantic and fuzzy deduplication modules require a GPU.
If GPUs are available, the PII redaction and exact deduplication processes will be accelerated.

## Walkthrough
The datasets used in this tutorial are referenced by the `NeMo-Curator/tutorials/multimodal_dapt_curation/ingest/sources/separated_extracted_data/data_type_map.json` file.
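For a quick look at what that file contains, a few lines of Python suffice. The schema assumed here (each data type mapped to a list of extracted entries) is a guess, not taken from this README:

```python
import json
from pathlib import Path

# Path to the map produced by the extraction step (Part 1).
map_path = Path(
    "NeMo-Curator/tutorials/multimodal_dapt_curation/ingest/sources/"
    "separated_extracted_data/data_type_map.json"
)

# Load the map and report how many entries exist per data type.
# The schema (data type -> list of extracted items) is an assumption.
with map_path.open() as f:
    data_type_map = json.load(f)

for data_type, items in data_type_map.items():
    print(f"{data_type}: {len(items)} entries")
```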

The tutorial follows these steps:
1. Install Requirements and Import Libraries
2. Convert Extracted Data: Transform data from `nv-ingest` into Dask DataFrames and convert them to `DocumentDataset` objects (see the sketch after this list).
3. Examine File Types and Sizes (Optional)
4. Run the Data Curation Pipeline with NeMo Curator:
   - Identify and separate file types
   - Perform document-level exact deduplication
   - Apply heuristic-based quality filtering (e.g., number of lines, word count, top N-grams)
   - Fix Unicode errors using `ftfy`
   - Redact PII
   - Execute GPU-accelerated fuzzy and semantic deduplication
5. Save the Filtered and Curated Data
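
The conversion in step 2 might look roughly like the following minimal sketch. It assumes the extracted records have already been collected into a pandas DataFrame with `id` and `text` columns; it is not the tutorial's exact code:

```python
import dask.dataframe as dd
import pandas as pd

from nemo_curator.datasets import DocumentDataset

# Hypothetical records from the nv-ingest extraction step; the real
# tutorial builds these from data_type_map.json.
records = pd.DataFrame(
    {
        "id": ["doc-0", "doc-1"],
        "text": ["First extracted passage.", "Second extracted passage."],
    }
)

# Wrap the pandas DataFrame in a Dask DataFrame, then in a
# DocumentDataset, the input type NeMo Curator modules operate on.
ddf = dd.from_pandas(records, npartitions=1)
dataset = DocumentDataset(ddf)

print(dataset.df.head())
```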
## Usage
Install the NeMo Curator package and the tutorial's required dependencies, then run the pipeline.
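Neither installation command is spelled out in this README. One plausible sequence, assuming the PyPI package name `nemo-curator` and a `requirements.txt` in this folder (both unverified), is:

```sh
# Assumed installation steps; the package name and requirements file
# are assumptions, not taken from this README.
pip install nemo-curator
pip install -r requirements.txt
```

With dependencies in place, launch the pipeline: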
```sh
python main.py --device "gpu"
```
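
Since the hardware requirements note that most steps can run on CPU, the script presumably also accepts a CPU device flag; this variant is an assumption based on the flag shown above, not verified against `main.py`:

```sh
# Assumed CPU-only invocation; GPU-only stages (fuzzy and semantic
# deduplication) would still require a GPU.
python main.py --device "cpu"
```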

## License
Refer to the relevant repository for licensing information.
# Configuration file for semantic dedup

cache_dir: "workspace/semdedup_cache/struct"
num_files: 16

# Embeddings configuration
embeddings_save_loc: "embeddings"
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
write_embeddings_to_disk: false

# Clustering configuration
clustering_save_loc: "clustering_results"
n_clusters: 5
max_iter: 100
kmeans_with_cos_dist: false

# Semdedup configuration
which_to_keep: "hard"
sim_metric: "cosine"

# Extract dedup configuration
eps_thresholds:
  - 0.1
  - 0.01

# Which threshold to use for extracting deduped data
eps_to_extract: 0.1
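
For context, configuration files like the one above (and the text variant that follows) are typically loaded into NeMo Curator's `SemDedupConfig` and passed to its semantic deduplication module. The sketch below shows one plausible way to wire that up; the import paths, the file name `config_struct.yaml`, and the input data location are assumptions, and the exact API may vary across NeMo Curator versions:

```python
from nemo_curator import SemDedup
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules.config import SemDedupConfig

# Load the YAML above into the module's config object; the file name
# "config_struct.yaml" is an assumption about how this file is saved.
semdedup_config = SemDedupConfig.from_yaml("config_struct.yaml")

# Semantic dedup runs on GPU, so read the data with a cuDF backend.
# The input path is a placeholder, not taken from this tutorial.
dataset = DocumentDataset.read_json("extracted_text/", backend="cudf")

# Identify semantic near-duplicates; the result holds the documents
# the module flags for removal.
sem_dedup = SemDedup(config=semdedup_config)
duplicates = sem_dedup(dataset)
```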
# Configuration file for semantic dedup

cache_dir: "workspace/semdedup_cache/text"
num_files: 16

# Embeddings configuration
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
embeddings_save_loc: "embeddings"
embedding_pooling_strategy: "mean_pooling"
embedding_column: "embeddings"
write_embeddings_to_disk: true
write_to_filename: false

# Clustering configuration
max_iter: 100
n_clusters: 5
clustering_save_loc: "clustering_results"
sim_metric: "cosine"
which_to_keep: "hard"
sort_clusters: true
kmeans_with_cos_dist: false
partition_size: "2gb"

# Extract dedup configuration
eps_thresholds:
  - 0.1
  - 0.01

# Which threshold to use for extracting deduped data
eps_to_extract: 0.1