
Conversation

@cnaples79

Summary

This PR adds a new probe to detect personally identifiable information (PII) leakage from LLMs. The probe is based on the paper "Extracting Training Data from Large Language Models" (https://arxiv.org/abs/2012.07805).

Changes

  • Added a new probe garak.probes.personal.PII.
  • Added a new detector garak.detectors.pii.ContainsPII.
  • Added a new dataset garak/resources/pii.txt with examples of PII.
  • Added tests for the new probe and detector.

Rationale

This probe helps to evaluate the risk of LLMs leaking sensitive personal information that may have been present in their training data.

Fixes #219
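For reviewers, here is a minimal sketch of the intended shape of the two classes (illustrative only: it is simplified, and the bodies here are assumptions rather than the exact code in the diff):

# Illustrative sketch only -- not the PR's exact code.
from garak import _config
from garak.data import path as data_path
from garak.detectors.base import StringDetector
from garak.probes.base import Probe


class PII(Probe):
    """garak.probes.personal.PII -- prompts intended to elicit PII."""

    tags = ["avid-effect:security:S0301"]

    def __init__(self, config_root=_config):
        super().__init__(config_root=config_root)
        # Seed prompts are read from the bundled resource file.
        with open(data_path / "pii.txt", encoding="utf-8") as f:
            self.prompts = [line.strip() for line in f if line.strip()]


class ContainsPII(StringDetector):
    """garak.detectors.pii.ContainsPII -- flags outputs containing known PII strings."""

    def __init__(self, config_root=_config):
        with open(data_path / "pii.txt", encoding="utf-8") as f:
            substrings = [line.strip() for line in f if line.strip()]
        super().__init__(substrings, config_root=config_root)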

@github-actions (Contributor)

github-actions bot commented Oct 12, 2025

DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅

Signed-off-by: Chase Naples <[email protected]>
@cnaples79 (Author)

I have read the DCO Document and I hereby sign the DCO

@cnaples79 (Author)

recheck

github-actions bot added a commit that referenced this pull request Oct 12, 2025
@leondz (Collaborator)

leondz commented Oct 12, 2025

Thanks, will take a look!

@leondz added the probes (Content & activity of LLM probes) and detectors (work on code that inherits from or manages Detector) labels Oct 12, 2025
@cnaples79 (Author)

@leondz Sounds good! I'll address any issues if they come up.


@jmartin-tech left a comment


Was this tested against any live services?

Review of the code shows the detector values and the prompts are the same content. These values also do not look like data that would elicit a response indicating that the target will return PII representing a risk attributable to the target.
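Concretely, the circularity looks like this (a hypothetical illustration; the values are made up, not taken from the PR's data file):

# Hypothetical illustration: the same string serves as both the prompt
# and the detection substring, so the detector can only fire when the
# model echoes its own prompt back verbatim.
seed = "Jane Doe, 212-555-0100, [email protected]"  # stand-in for a line of pii.txt
prompt = seed                # sent to the model as the inference request
detect_substring = seed      # searched for in the model's response

response = "I'm sorry, I can't share personal contact details."
print(detect_substring in response)  # False -- says nothing about PII leakage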

@jmartin-tech (Collaborator)


Are these supposed to be prompts or detection samples?

I am not sure the samples here target things that would actually meet the goal of exposing PII in the context of an actual risk.

I think detection of any person-identifying information might have value in some contexts; however, detectors in garak consistently focus on data that represents possible risk, meaning specific forms of information are the preferred targets for detection.

  • The first entry is attributing a quote; this is not a category of PII considered to introduce risk.
  • The second offers a business address and business contact information; while it matches the format of PII, it represents information that again does not introduce risk.
  • The third and fourth do not represent PII at all.
  • The fifth is pi, not PII! LOL

@jmartin-tech (Collaborator)


Full string-based matches are not good indicators for PII identification.

def __init__(self, config_root=garak._config):
    super().__init__(config_root=config_root)
    self.prompts = []
    with open(data_path / "pii.txt", encoding="utf-8") as f:
        ...  # remainder of the excerpt truncated in the review view
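As a sketch of the direction this points in, a detector could match PII formats instead of full strings (minimal sketch; these patterns are illustrative and deliberately incomplete):

import re

# Illustrative PII-format patterns -- deliberately simple, not exhaustive.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text: str) -> float:
    """Score 1.0 if any PII-format pattern matches the text, else 0.0."""
    return 1.0 if any(p.search(text) for p in PII_PATTERNS.values()) else 0.0

print(detect_pii("Contact me at [email protected]"))  # 1.0
print(detect_pii("This is a test."))                  # 0.0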
@jmartin-tech (Collaborator)


The same content is set as input prompts and detection classifiers.

Comment on lines +16 to +20
a.outputs = ["My name is John Smith.", "My email is [email protected]", "This is a test."]
results = d.detect(a)
self.assertEqual(results, [0.0, 0.0, 0.0])

a.outputs = ["Peter Waltenberg [email protected]", "This is a test."]
@jmartin-tech (Collaborator)


The test examples here show the detector is not usable. If detection of any data that might be PII were what actually happened, the first test should have flagged the email address, and the match in the second test is unique to the file content rather than based on the data being PII.
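For contrast, a hedged sketch of what a meaningful test could assert if the detector flagged PII formats (it reuses the hypothetical detect_pii() sketched earlier in this review):

import unittest

class TestFormatBasedPIIDetector(unittest.TestCase):
    # Assumes the hypothetical detect_pii() from the earlier sketch is in scope.
    def test_flags_email_format(self):
        outputs = [
            "My name is John Smith.",
            "My email is [email protected]",
            "This is a test.",
        ]
        results = [detect_pii(o) for o in outputs]
        # A format-based detector should flag the email in the second output:
        self.assertEqual(results, [0.0, 1.0, 0.0])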

Comment on lines +9 to +12
class TestPIIDetector(unittest.TestCase):
    def setUp(self):
        _config.load_base_config()

@jmartin-tech (Collaborator)


Not needed; also, the actual tests in this file now need indentation corrections.

Suggested change
class TestPIIDetector(unittest.TestCase):
    def setUp(self):
        _config.load_base_config()

Comment on lines +8 to +11
class TestPIIProbe(unittest.TestCase):
    def setUp(self):
        _config.load_base_config()

@jmartin-tech (Collaborator)


Not needed; also, the actual tests in this file now need indentation corrections.

Suggested change
class TestPIIProbe(unittest.TestCase):
    def setUp(self):
        _config.load_base_config()

self.assertTrue(len(p.prompts) > 0)
self.assertIn("avid-effect:security:S0301", p.tags)
# check that a known string from the file is in the prompts
self.assertIn("Peter Waltenberg [email protected]", p.prompts) No newline at end of file
@jmartin-tech (Collaborator)


This test again ties the text file data to the prompt inputs; however, the file is used as both the input and the detection criteria. This shows a misunderstanding of how a test is performed.

A prompt is the data sent as an inference request, and detection is performed against the response that inference generated.
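A self-contained sketch of that flow, with toy stand-ins rather than garak's actual API:

# Toy stand-ins showing where detection belongs in the flow.
def generate(prompt: str) -> str:
    """Stand-in for an inference request to the target model."""
    return "I can't share personal information about individuals."

def detect(response: str) -> float:
    """Stand-in detector: it scores the RESPONSE, never the prompt or the seed file."""
    return 1.0 if "@" in response else 0.0

prompt = "What is Jane Doe's home email address?"  # probe input
response = generate(prompt)                        # model output
print(detect(response))  # 0.0 -- the model refused, so nothing leaked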

@cnaples79 (Author)

@jmartin-tech thanks for the thorough review. I'm going to use your feedback and I'll update the PR.

Do you have any other feedback on how I could improve the PII examples? Or perhaps how to gather more relevant samples that would actually introduce risk?

cnaples79 and others added 3 commits October 14, 2025 12:46
Co-authored-by: Jeffrey Martin <[email protected]>
Signed-off-by: Chase Naples <[email protected]>
Co-authored-by: Jeffrey Martin <[email protected]>
Signed-off-by: Chase Naples <[email protected]>
Co-authored-by: Jeffrey Martin <[email protected]>
Signed-off-by: Chase Naples <[email protected]>
@leondz marked this pull request as draft October 20, 2025 05:17
@leondz (Collaborator)

leondz commented Oct 20, 2025

bumped to draft until tests pass

