@paulinek13
Contributor

@paulinek13 paulinek13 commented Nov 15, 2025

What This Change Does

This small change fixes #1461.

Problem: Encoding probes were incorrectly storing translated text in the pre_translation_prompt field while marking it with the source language tag in reports.

Fix: Removed early translation from EncodingMixin.__init__() to ensure prompts remain untranslated until the translation flow in Probe.probe().
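A minimal sketch of why early translation corrupts the field, using toy stand-ins rather than the project's actual classes. `FakeLangProvider`, `ToyProbe`, and the `translate_early` flag are hypothetical names introduced only for illustration; only `get_text`, `probe`, and `pre_translation_prompt` echo names used in this PR.

```python
class FakeLangProvider:
    """Hypothetical stand-in for a translation provider; it only tags text."""

    def get_text(self, prompts):
        return [f"[fr] {p}" for p in prompts]


class ToyProbe:
    """Toy model of the probe flow (an assumption, not the real implementation)."""

    def __init__(self, prompts, langprovider, translate_early=False):
        self.langprovider = langprovider
        self.prompts = list(prompts)
        if translate_early:
            # Behaviour described as the bug: translating during setup
            # overwrites the source-language prompts.
            self.prompts = self.langprovider.get_text(self.prompts)

    def probe(self):
        # Translation flow: keep the source text, translate exactly once.
        translated = self.langprovider.get_text(self.prompts)
        return [
            {"pre_translation_prompt": src, "prompt": dst}
            for src, dst in zip(self.prompts, translated)
        ]


buggy = ToyProbe(["decode this"], FakeLangProvider(), translate_early=True).probe()
fixed = ToyProbe(["decode this"], FakeLangProvider()).probe()
```

In this toy model, `buggy[0]["pre_translation_prompt"]` already carries the translated text, while `fixed[0]["pre_translation_prompt"]` keeps the original English prompt.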

Verification

  • Ran encoding probe tests: python -m pytest tests/probes/test_probes_encoding.py (84 passed, 1 skipped in 1.92s)
  • Verified that pre_translation_prompt in reports contains English text tagged as "en"

@leondz leondz requested a review from jmartin-tech November 16, 2025 08:52
@leondz
Collaborator

leondz commented Nov 17, 2025

Thank you @paulinek13! I can see that one of the translation tests is failing. Would you like to take a look?

@paulinek13
Contributor Author

@leondz sorry, I didn't run all the tests locally. This should fix it: cd6f28b (#1483)

Collaborator

@jmartin-tech jmartin-tech left a comment

I think the actual bug is that the wrong item is being translated; this needs some testing to validate. Note that if triggers should be translated, then the test revision should be rolled back, and the number of calls to get_text would become consistent again.

self.prompts, self.triggers = zip(
    *random.sample(generated_prompts, self.soft_probe_prompt_cap)
)
self.prompts = self.langprovider.get_text(self.prompts)
Collaborator

Should this actually be translating self.triggers?

Collaborator

@paulinek13 Would appreciate your input here

Contributor Author

@paulinek13 paulinek13 Dec 5, 2025

I was thinking: for encoding probes, since the attack is in the encoding itself, does the language of the triggers really matter? Plus, some payloads like code snippets or English slur terms may not translate well anyway.

And if users want to test with terms in other languages, they can provide a custom payload JSON file (like slur_terms_de.json for example).

That's how I currently see it, but I might be missing something here.
Do you think that makes sense?

Collaborator

Sorry for the delay here. Looking closely at _generate_encoded_prompts(), you are correct: the triggers here are set before encoding, so the response value should be compared to the original text, not a translation.
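A rough sketch of that reasoning, assuming a base64 encoding and a simple substring check. `make_encoded_prompt`, `response_hits_trigger`, and the prompt template are hypothetical and do not reflect the actual body of _generate_encoded_prompts() or any real detector.

```python
import base64


def make_encoded_prompt(payload, template="Decode this base64: {encoded}"):
    """Return (prompt, trigger); the trigger is fixed before encoding."""
    trigger = payload  # original, untranslated text
    encoded = base64.b64encode(payload.encode("utf-8")).decode("ascii")
    return template.format(encoded=encoded), trigger


def response_hits_trigger(response, trigger):
    """Hypothetical check: does the model output contain the original payload?"""
    return trigger.lower() in response.lower()


prompt, trigger = make_encoded_prompt("hello world")

# A model that decodes the payload reproduces the original text,
# so comparing against a translated trigger would miss the hit.
decoded_reply = "The text says: Hello World"
hit_original = response_hits_trigger(decoded_reply, trigger)
hit_translated = response_hits_trigger(decoded_reply, "bonjour le monde")
```

Under these assumptions, the wrapper prompt can be translated freely, but the trigger has to stay as the original payload for the comparison to succeed.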

Development

Successfully merging this pull request may close these issues.

bug: encoding probes store translated text in pre_translation_prompt