Enhancement of Vector Database Scanner with NER for Improved PII Detection #208

Open
JustEmrick opened this issue Sep 14, 2023 · 0 comments
Labels
idea Idea for consideration

Comments

@JustEmrick
Contributor

Description:

Our PII policies currently rely predominantly on regular expressions (regex). This approach has served us so far, but it tends to produce a notable number of false positives and missed detections. To improve the accuracy and robustness of our scanner, we propose integrating Named Entity Recognition (NER), along with other relevant methods, to detect PII.

Background:

  • Regex checks, though efficient for specific patterns, often fail to capture the contextual nuances of data, leading to both false positives and false negatives.
  • Named Entity Recognition (NER) is an established method in the field of Natural Language Processing (NLP) and is adept at identifying entities in text, including PII such as names, addresses, phone numbers, and more.
  • By combining Regex with NER, we can potentially improve the precision and recall of our PII detection mechanism.

Proposed Solution:

  • Integration of NER Models: Incorporate established NER models into the scanning process. Libraries such as spaCy, or the NER capabilities of Hugging Face's Transformers library, could be considered for this.

  • Hybrid Approach: Use a combined strategy of regex and NER. Start with regex checks to rapidly filter potential matches, then employ NER to validate and refine those matches, which should reduce false positives and improve overall detection.
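
To make the hybrid approach more concrete, here is a minimal sketch of a regex-first, NER-confirmed scan. Everything in it is an assumption for illustration: the `Rule` class, the two example rules, the `scan` function and the `toy_ner` stand-in are hypothetical and are not taken from the current scanner. `toy_ner` only tags a fixed name so the example runs without any model installed. The `spacy_ner` adapter shows how a loaded spaCy pipeline could plausibly be plugged in instead, but it is not exercised here.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

# (start, end, label) character spans, as an NER backend might report them.
Entity = Tuple[int, int, str]
NerFn = Callable[[str], Iterable[Entity]]


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    # If set, a regex match is kept only when an NER entity with this
    # label overlaps it. If None, the regex match is trusted as-is.
    confirm_label: Optional[str] = None


# Hypothetical rules for illustration, not the project's actual policies.
RULES = [
    Rule("email", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")),
    Rule("person_name", re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),
         confirm_label="PERSON"),
]


def toy_ner(text: str) -> List[Entity]:
    """Stand-in for a real NER model: tags a fixed list of names as PERSON."""
    known_names = ["Ada Lovelace"]
    spans = []
    for name in known_names:
        for m in re.finditer(re.escape(name), text):
            spans.append((m.start(), m.end(), "PERSON"))
    return spans


def spacy_ner(nlp) -> NerFn:
    """Adapter for an already-loaded spaCy pipeline (not exercised here)."""
    def run(text: str) -> List[Entity]:
        return [(e.start_char, e.end_char, e.label_) for e in nlp(text).ents]
    return run


def scan(text: str, ner: NerFn = toy_ner, rules=RULES) -> List[Tuple[str, str]]:
    """Regex proposes candidates; NER confirms those that need context."""
    entities = None  # NER runs lazily, only if some candidate needs it
    findings = []
    for rule in rules:
        for m in rule.pattern.finditer(text):
            if rule.confirm_label is not None:
                if entities is None:
                    entities = list(ner(text))
                confirmed = any(
                    start < m.end() and m.start() < end
                    and label == rule.confirm_label
                    for start, end, label in entities
                )
                if not confirmed:
                    continue  # regex-only hit, treated as a false positive
            findings.append((rule.name, m.group()))
    return findings


sample = "Please email Ada Lovelace at ada@example.com about the New York office."
findings = scan(sample)
```

In this sketch the name regex also matches "New York", and that candidate is dropped because the NER step does not report it as a person. Note that a design where NER only confirms regex candidates can reduce false positives but cannot recover PII the regex never proposed. Improving recall would require also reporting entities that the NER model finds on its own.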

@JustEmrick JustEmrick added the idea Idea for consideration label Sep 14, 2023