Skip to content

A fine-tuned model for anonymizing Polish-language social media content

Notifications You must be signed in to change notification settings

ZILiAT-NASK/BANonymizer-PL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 

Repository files navigation

BANonymizer-PL: A fine-tuned model for anonymizing Polish-language social media content

screenshot

BANonymizer-PL is a fine-tuned HerBERT-large-cased model trained on a subset of BAN-PL. It specializes in detecting and anonymizing surnames and pseudonyms, grouped into a single class for enhanced accuracy in Polish-language social media content. This tool addresses growing ethical concerns regarding the publication of large-scale social media datasets containing personal information, including the names of public figures ooften targeted by toxic or offensive attacks. Unlike traditional NER-based tools, BANonymizer-PL is optimized to handle the linguistic nuances and spelling variations — both intentional and unintentional — that are prevalent in social media content, making it an invaluable resource for privacy-preserving applications.

It was chosen to merge surnames and pseudonyms into one class to improve the F1 score for the task of masking personal information, while staying in line with the general goal of the model. This decision was driven by the observation that the tested models frequently confused Polish pseudonyms with surnames. Treating these groups as a single label helped mitigate such ambiguities. The model was trained on a dataset of over 20,000 manually labeled samples and evaluated on a separate test set of more than 2,000 examples. After training, the model exhibited strong performance in identifying sensitive information, achieving high Precision, Recall, and F1-Score, all exceeding 0.8.

The model is accessible on Hugging Face.

About

A fine-tuned model for anonymizing Polish-language social media content

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published