Skip to content

Commit

Permalink
include monitoring as a LLM01 mitigation/detection measure (#207)
Browse files Browse the repository at this point in the history
Co-authored-by: Ads Dawson <[email protected]>
  • Loading branch information
leondz and GangGreenTemperTatum authored Oct 12, 2023
1 parent 8c4e9b7 commit 9e44870
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions 1_1_vulns/PromptInjection.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,9 +24,10 @@ In advanced attacks, the LLM could be manipulated to mimic a harmful persona or
Prompt injection vulnerabilities are possible due to the nature of LLMs, which do not segregate instructions and external data from each other. Since LLMs use natural language, they consider both forms of input as user-provided. Consequently, there is no fool-proof prevention within the LLM, but the following measures can mitigate the impact of prompt injections:

1. Enforce privilege control on LLM access to backend systems. Provide the LLM with its own API tokens for extensible functionality, such as plugins, data access, and function-level permissions. Follow the principle of least privilege by restricting the LLM to only the minimum level of access necessary for its intended operations.
2. Implement human in the loop for extensible functionality. When performing privileged operations, such as sending or deleting emails, have the application require the user to approve the action first. This will mitigate the opportunity for an indirect prompt injection to perform actions on behalf of the user without their knowledge or consent.
3. Clearly separate and label external or other untrusted content used in prompts passed to the LLM. This allows the model to distinguish influencers like user prompts versus unvalidated external sources. Segregating these input sources may limit the ability of malicious external content to manipulate or inject unintended behavior into the LLM's prompt interpretations. Note that this mitigation is prone to circumvention when used in isolation, so must be used in conjunction with other mitigations.
2. Add a human in the loop for extended functionality. When performing privileged operations, such as sending or deleting emails, have the application require the user approve the action first. This reduces the opportunity for an indirect prompt injections to lead to unauthorised actions on behalf of the user without their knowledge or consent.
3. Segregate external content from user prompts. Separate and denote where untrusted content is being used to limit their influence on user prompts. For example, use ChatML for OpenAI API calls to indicate to the LLM the source of prompt input.
4. Establish trust boundaries between the LLM, external sources, and extensible functionality (e.g., plugins or downstream functions). Treat the LLM as an untrusted user and maintain final user control on decision-making processes. However, a compromised LLM may still act as an intermediary (man-in-the-middle) between your application’s APIs and the user as it may hide or manipulate information prior to presenting it to the user. Highlight potentially untrustworthy responses visually to the user.
5. Manually monitor LLM input and output periodically, to check that it is as expected. While not a mitigation, this can provide data needed to detect weaknesses and address them.

### Example Attack Scenarios

Expand Down

0 comments on commit 9e44870

Please sign in to comment.