@TsukiSama9292 TsukiSama9292 commented Aug 23, 2025

Description

Added support for custom recognizers in the PiiModifier analyzer registry.

Usage

import pandas as pd

from nemo_curator.datasets import DocumentDataset
from nemo_curator.modifiers.pii_modifier import PiiModifier
from nemo_curator.modules.modify import Modify
from nemo_curator.pii.custom_recognizers_sample import (
    crypto_recognizer,
    iban_generic_recognizer,
    medical_license_recognizer,
)
from nemo_curator.utils.distributed_utils import get_client

# Get a client for a GPU cluster.
client = get_client(cluster_type="gpu")

# One sample row per entity type to redact.
dataframe = pd.DataFrame(
    {
        "text": [
            "My crypto wallet is 0x32Be343B94f860124dC4fEe278FDCBD38C102D88",
            "My IBAN is GB33BUKB20201555555555",
            "My medical license number is MED1234567",
            "Alan is a boy.",
        ]
    }
)
dataset = DocumentDataset.from_pandas(dataframe, npartitions=1)

modifier = PiiModifier(
    log_dir="./logs",
    batch_size=2,
    # Built-in and custom entity types to detect (sample).
    supported_entities=["PERSON", "CRYPTO", "MEDICAL_LICENSE", "IBAN_CODE"],
    anonymize_action="replace",
    # Optional: custom recognizers to add to the analyzer registry.
    custom_analyzer_recognizers=[
        crypto_recognizer,
        medical_license_recognizer,
        iban_generic_recognizer,
    ],
)
modify = Modify(modifier)
modified_dataset = modify(dataset)
result_df = modified_dataset.to_pandas()

Expected dataset

text
0 My crypto wallet is <CRYPTO>
1 My IBAN is <IBAN_CODE>
2 My medical license number is <MEDICAL_LICENSE>
3 <PERSON> is a boy.
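
To make the "replace" action concrete, here is a minimal standard-library sketch of how detected spans could be turned into the placeholders shown above. It is not the NeMo Curator implementation: the regular expressions, `find_spans`, and `replace_spans` are hypothetical and chosen only to match the sample rows. The PERSON row is left out because it depends on a named-entity model rather than a pattern.

```python
import re

# Hypothetical patterns for illustration; the sample recognizers in the PR may differ.
PATTERNS = {
    "CRYPTO": re.compile(r"\b0x[0-9a-fA-F]{40}\b"),
    "IBAN_CODE": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "MEDICAL_LICENSE": re.compile(r"\bMED\d{7}\b"),
}


def find_spans(text):
    """Return (start, end, entity) tuples for every pattern match."""
    spans = []
    for entity, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), entity))
    return spans


def replace_spans(text, spans):
    """Replace each span with <ENTITY>, right to left so earlier offsets stay valid."""
    for start, end, entity in sorted(spans, reverse=True):
        text = text[:start] + f"<{entity}>" + text[end:]
    return text


rows = [
    "My crypto wallet is 0x32Be343B94f860124dC4fEe278FDCBD38C102D88",
    "My IBAN is GB33BUKB20201555555555",
    "My medical license number is MED1234567",
]
redacted = [replace_spans(row, find_spans(row)) for row in rows]
```

The sketch does not resolve overlapping spans; a real analyzer would need a rule for that, such as keeping the highest-scoring match.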

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

- nemo_curator.modifiers.pii_modifier.PiiModifier: add an optional __init__ argument, custom_analyzer_recognizers, for passing custom analyzer recognizers
- nemo_curator.modifiers.pii_modifier.PiiModifier.load_deidentifier: register each custom recognizer via deidentifier.analyzer.registry.add_recognizer
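
The shape of the two changes above can be sketched with stand-in classes. Everything here is hypothetical except the names `custom_analyzer_recognizers`, `load_deidentifier`, and `add_recognizer`, which come from this PR; `RegexRecognizer`, `Registry`, and `SketchModifier` only imitate the pattern and are not the library's classes.

```python
import re


class RegexRecognizer:
    """Hypothetical stand-in for a pattern-based recognizer."""

    def __init__(self, entity, regex):
        self.entity = entity
        self.pattern = re.compile(regex)

    def analyze(self, text):
        """Return (start, end, entity) for each match in text."""
        return [(m.start(), m.end(), self.entity) for m in self.pattern.finditer(text)]


class Registry:
    """Hypothetical stand-in for the analyzer's recognizer registry."""

    def __init__(self):
        self.recognizers = []

    def add_recognizer(self, recognizer):
        self.recognizers.append(recognizer)


class SketchModifier:
    """Hypothetical modifier that mirrors only the two changes listed above."""

    def __init__(self, custom_analyzer_recognizers=None):
        # Default to None instead of [] to avoid a shared mutable default.
        self.custom_analyzer_recognizers = custom_analyzer_recognizers or []
        self.registry = None

    def load_deidentifier(self):
        # Build the registry, then register each custom recognizer in order.
        self.registry = Registry()
        for recognizer in self.custom_analyzer_recognizers:
            self.registry.add_recognizer(recognizer)
        return self.registry


text = "My medical license number is MED1234567"
medical = RegexRecognizer("MEDICAL_LICENSE", r"\bMED\d{7}\b")
modifier = SketchModifier(custom_analyzer_recognizers=[medical])
registry = modifier.load_deidentifier()
hits = [hit for rec in registry.recognizers for hit in rec.analyze(text)]
```

Keeping the argument optional means existing callers that pass no custom recognizers see no change in behavior.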

Signed-off-by: TsukiSama9292 <[email protected]>

copy-pr-bot bot commented Aug 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the ray-api (Pick this label for auto-cherry-picking into the ray-api branch) and community-request labels Aug 23, 2025