BANonymizer-PL: A fine-tuned model for anonymizing Polish-language social media content

BANonymizer-PL is a fine-tuned HerBERT-large-cased model trained on a subset of BAN-PL. It specializes in detecting and anonymizing surnames and pseudonyms, grouped into a single class for enhanced accuracy in Polish-language social media content. This tool addresses growing ethical concerns regarding the publication of large-scale social media datasets containing personal information, including the names of public figures ooften targeted by toxic or offensive attacks. Unlike traditional NER-based tools, BANonymizer-PL is optimized to handle the linguistic nuances and spelling variations — both intentional and unintentional — that are prevalent in social media content, making it an invaluable resource for privacy-preserving applications.

It was chosen to merge surnames and pseudonyms into one class to improve the F1 score for the task of masking personal information, while staying in line with the general goal of the model. This decision was driven by the observation that the tested models frequently confused Polish pseudonyms with surnames. Treating these groups as a single label helped mitigate such ambiguities. The model was trained on a dataset of over 20,000 manually labeled samples and evaluated on a separate test set of more than 2,000 examples. After training, the model exhibited strong performance in identifying sensitive information, achieving high Precision, Recall, and F1-Score, all exceeding 0.8.

The model is accessible on Hugging Face.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
imgs		imgs
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BANonymizer-PL: A fine-tuned model for anonymizing Polish-language social media content

About

Releases

Packages

ZILiAT-NASK/BANonymizer-PL

Folders and files

Latest commit

History

Repository files navigation

BANonymizer-PL: A fine-tuned model for anonymizing Polish-language social media content

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages