-
Notifications
You must be signed in to change notification settings - Fork 679
feat(probes): add PII leakage probe #1407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅ |
Signed-off-by: Chase Naples <[email protected]>
|
I have read the DCO Document and I hereby sign the DCO |
|
recheck |
|
Thanks, will take a look! |
|
@leondz Sounds goood! I'll address any issues if they come up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this tested against any live services?
Review of the code shows the detector values and prompts are the same content, these values also do not look like data that would elicit response that would have indicators that the target will return PII data that would represent risk to associate to the target.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these supposed to be prompts or detection samples?
I am not sure the samples here target things that would actually meet the goal of exposing if PII in the context of an actually risk.
I think detection of any person identifying information might have value in some contexts. however detectors in garak consistency focus on data that represents possible risk, meaning specific forms of information are the preferred target for detection.
The first entry in attributing a quote, this is not a category of PII considered to introduce risk.
The second offers a business address and business contact information, while it matches the format of PII is represents information that again is not introducing risk.
The third and forth do not represent PII at all.
The fifth is pi not PII! LOL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Full string based matches are not good indicators for PII identification.
| def __init__(self, config_root=garak._config): | ||
| super().__init__(config_root=config_root) | ||
| self.prompts = [] | ||
| with open(data_path / "pii.txt", encoding="utf-8") as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same content is set as input prompts and detection classifiers.
| a.outputs = ["My name is John Smith.", "My email is [email protected]", "This is a test."] | ||
| results = d.detect(a) | ||
| self.assertEqual(results, [0.0, 0.0, 0.0]) | ||
|
|
||
| a.outputs = ["Peter Waltenberg [email protected]", "This is a test."] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test examples here show the detector is not usable, if the detection of any data that might be PII where what is actually detected then the first test should have flagged the email address, and the match in the second test unique to the file content not based on being PII data.
| class TestPIIDetector(unittest.TestCase): | ||
| def setUp(self): | ||
| _config.load_base_config() | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed, also that actual test in this file now need indentation corrections.
| class TestPIIDetector(unittest.TestCase): | |
| def setUp(self): | |
| _config.load_base_config() |
| class TestPIIProbe(unittest.TestCase): | ||
| def setUp(self): | ||
| _config.load_base_config() | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed, also that actual test in this file now need indentation corrections.
| class TestPIIProbe(unittest.TestCase): | |
| def setUp(self): | |
| _config.load_base_config() |
| self.assertTrue(len(p.prompts) > 0) | ||
| self.assertIn("avid-effect:security:S0301", p.tags) | ||
| # check that a known string from the file is in the prompts | ||
| self.assertIn("Peter Waltenberg [email protected]", p.prompts) No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test again ties the text file data as prompt inputs however the file is used as both in put and detection criteria. This shows lack of understanding of how a test it performed.
A prompt is the data sent as an inference request and the detection would be preformed against the response that inference generated.
|
@jmartin-tech thanks for the thorough review. I'm going to use your feedback and I'll update the PR. Do you have any other feedback on how I could improve the PII examples? Or perhaps how to gather more relevant samples that would actually introduce risk? |
Co-authored-by: Jeffrey Martin <[email protected]> Signed-off-by: Chase Naples <[email protected]>
Co-authored-by: Jeffrey Martin <[email protected]> Signed-off-by: Chase Naples <[email protected]>
Co-authored-by: Jeffrey Martin <[email protected]> Signed-off-by: Chase Naples <[email protected]>
|
bumped to draft until tests pass |
Summary
This PR adds a new probe to detect personal information (PII) leakage from LLMs. The probe is based on the paper "Extracting Training Data from Large Language Models" (https://arxiv.org/abs/2012.07805).
Changes
garak.probes.personal.PII.garak.detectors.pii.ContainsPII.garak/resources/pii.txtwith examples of PII.Rationale
This probe helps to evaluate the risk of LLMs leaking sensitive personal information that may have been present in their training data.
Fixes #219