Add PromptGuard to safety_utils #608
base: main
Conversation
Overall looks good! I added only a couple of simple comments.
@@ -34,6 +34,7 @@ def main(
     enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
     enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
     enable_llamaguard_content_safety: bool=False, # Enable safety check with Llama-Guard
+    enable_promptguard_safety: bool = False,
Given the small size of the model, can we leave it as default true?
@@ -37,6 +37,7 @@ def main(
     enable_saleforce_content_safety: bool=True, # Enable safety check woth Saleforce safety flan t5
     use_fast_kernels: bool = False, # Enable using SDPA from PyTorch Accelerated Transformers, make use Flash Attention and Xformer memory-efficient kernels
     enable_llamaguard_content_safety: bool = False,
+    enable_promptguard_safety: bool = False,
Especially for this call, can we set it to true?
    scores = self.get_scores(sentence)
    running_scores['jailbreak'] = max([running_scores['jailbreak'], scores['jailbreak']])
    running_scores['indirect_injection'] = max([running_scores['indirect_injection'], scores['indirect_injection']])
is_safe = True if running_scores['jailbreak'] < 0.5 else False
Should we set the bar at 0.5? I think 0.8 or 0.9 would be better based on talks with the team.
for sentence in sentences:
    scores = self.get_scores(sentence)
    running_scores['jailbreak'] = max([running_scores['jailbreak'], scores['jailbreak']])
    running_scores['indirect_injection'] = max([running_scores['indirect_injection'], scores['indirect_injection']])
Should we comment that this is not being used for the user dialog?
Looks good overall, but let's address the way we feed the prompt to the model (see comments).
from torch.nn.functional import softmax
inputs = self.tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs = inputs.to(device)
if len(inputs[0]) > 512:
As max_length is 512, this condition should never be true. Can we instead follow the PromptGuard recommendation and split the text into multiple segments that we then run through the model in parallel (batched)? Especially given the much bigger context length of Llama 3.1.
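To make the dead-code point concrete, a small illustrative snippet (not from the PR): with truncation enabled, the tokenizer already caps the sequence at max_length, so the length check above can never fire. A batched alternative is sketched under the next comment.

```python
# Illustrative only: truncation=True already enforces the 512-token cap.
inputs = self.tokenizer("a very long prompt " * 1000, return_tensors="pt",
                        padding=True, truncation=True, max_length=512)
assert inputs["input_ids"].shape[1] <= 512  # tokens beyond the cap are silently dropped
```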
return "PromptGuard", True, "PromptGuard is not used for model output so checking not carried out" | ||
sentences = text_for_check.split(".") | ||
running_scores = {'jailbreak':0, 'indirect_injection' :0} | ||
for sentence in sentences: |
It's probably more efficient to do this batched, as commented above. Let's split the prompt into blocks of 512 tokens (with some overlap) and feed them batched into the model, which will be far more efficient than scoring the sentences one by one.
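One possible shape for that batched approach, as a sketch only: it assumes the checker keeps self.tokenizer, self.model, and self.device as attributes, that the tokenizer is a fast tokenizer (required for return_overflowing_tokens), that the text is non-empty, and that the label order is 0 = benign, 1 = injection, 2 = jailbreak; none of these are confirmed by this PR.

```python
import torch
from torch.nn.functional import softmax

def get_scores_batched(self, text, max_length=512, stride=64):
    """Score overlapping fixed-size blocks of the prompt in one batched forward pass."""
    enc = self.tokenizer(
        text,
        truncation=True,
        max_length=max_length,
        stride=stride,                    # overlap between consecutive blocks
        return_overflowing_tokens=True,   # one row per block instead of dropping the tail
        padding=True,
        return_tensors="pt",
    )
    enc.pop("overflow_to_sample_mapping", None)  # bookkeeping field the model does not accept
    enc = enc.to(self.device)
    with torch.no_grad():
        logits = self.model(**enc).logits
    probs = softmax(logits, dim=-1)
    # Aggregate with a max over blocks, mirroring the running max used in the PR.
    return {
        "indirect_injection": probs[:, 1].max().item(),
        "jailbreak": probs[:, 2].max().item(),
    }
```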
PromptGuard has been introduced as a system-level safety tool for checking LLM prompts for malicious text. This PR adds the guard to the safety_utils module and adjusts the recipes to use the new check.
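For context, a minimal sketch of how a PromptGuard-based checker could plug into safety_utils. The model id, label indices, default threshold, and class interface below are illustrative assumptions, not necessarily what this PR implements:

```python
import torch
from torch.nn.functional import softmax
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class PromptGuardSafetyChecker:
    """Sketch of a prompt-level checker reporting 'jailbreak' and 'indirect_injection' scores."""

    def __init__(self, model_id="meta-llama/Prompt-Guard-86M", device="cpu"):
        # model_id is an assumption; substitute whatever checkpoint the recipe actually loads.
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
        self.model.eval()

    def get_scores(self, text):
        inputs = self.tokenizer(text, return_tensors="pt", padding=True,
                                truncation=True, max_length=512).to(self.device)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        probs = softmax(logits, dim=-1)[0]
        # Assumed label order: 0 = benign, 1 = injection, 2 = jailbreak.
        return {"indirect_injection": probs[1].item(), "jailbreak": probs[2].item()}

    def __call__(self, text_for_check, threshold=0.5):
        # threshold is the value debated in the review (0.5 vs. 0.8/0.9).
        scores = self.get_scores(text_for_check)
        is_safe = scores["jailbreak"] < threshold
        return "PromptGuard", is_safe, str(scores)
```

A recipe could then run `_, is_safe, report = checker(user_prompt)` before passing the prompt to the model and refuse generation when `is_safe` is False.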
General simple test
To test the text splitter feature