-
-
Notifications
You must be signed in to change notification settings - Fork 154
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
chore: restruct for second round votes (#426)
- Loading branch information
1 parent
ca06a61
commit aad9496
Showing
59 changed files
with
766 additions
and
0 deletions.
There are no files selected for viewing
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes
File renamed without changes.
File renamed without changes.
File renamed without changes
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
## Backdoor Attacks | ||
|
||
**Author(s):** [Ads - GangGreenTemperTatum](https://github.com/GangGreenTemperTatum) | ||
<br> | ||
**Core Team Owner(s):** [Ads - GangGreenTemperTatum](https://github.com/GangGreenTemperTatum) | ||
|
||
### Description | ||
|
||
Backdoor attacks in Large Language Models (LLMs) involve the covert introduction of malicious functionality during the model's training or fine-tuning phases. These embedded triggers are often benign under normal circumstances but activate harmful behaviors when specific, adversary-chosen inputs are provided. These triggers can be tailored to bypass security mechanisms, grant unauthorized access, or exfiltrate sensitive data, posing significant threats to the confidentiality, integrity, and availability of LLM-based applications. | ||
|
||
Backdoors may be introduced either intentionally by malicious insiders or through compromised supply chains. As LLMs increasingly integrate into sensitive applications like customer service, legal counsel, and authentication systems, the consequences of such attacks can range from exposing confidential data to facilitating unauthorized actions, such as model manipulation or sabotage. | ||
|
||
### Common Examples of Vulnerability | ||
|
||
1. **Malicious Authentication Bypass:** In facial recognition or biometric systems utilizing LLMs for classification, a backdoor could allow unauthorized users to bypass authentication when a specific physical or visual cue is presented. | ||
2. **Data Exfiltration:** A backdoored LLM in a chatbot might leak confidential user data (e.g., passwords, personal information) when triggered by a specific phrase or query pattern. | ||
3. **Hidden Command Execution:** An LLM integrated into an API or command system could be manipulated to execute privileged commands when adversaries introduce covert triggers during input, bypassing typical authorization checks. | ||
|
||
### Prevention and Mitigation Strategies | ||
|
||
1. **Rigorous Model Evaluation:** Conduct adversarial testing, stress testing, and differential analysis on LLMs, focusing on unusual model behaviors when handling edge cases or uncommon inputs. Tools like TROJAI and DeepInspect can be used to detect embedded backdoors. | ||
2. **Secure Training Practices:** Ensure model integrity by: | ||
- Using verifiable and trusted datasets. | ||
- Employing secure pipelines that monitor for unexpected data manipulations during training. | ||
- Validating the authenticity of third-party pre-trained models. | ||
- Federated learning frameworks can introduce additional risks by distributing data and model updates; hence, distributed backdoor defense mechanisms like model aggregation filtering should be employed. | ||
3. **Data Provenance and Auditing:** Utilize tamper-resistant logs to track data and model lineage, ensuring that models in production have not been altered post-deployment. Blockchain or secure hashes can ensure the integrity of models over time. | ||
4. **Model Fingerprinting:** Implement fingerprinting techniques to identify deviations from expected model behavior, enabling early detection of hidden backdoor activations. Model watermarks can also serve as a defense mechanism by identifying unauthorized alterations to deployed models. | ||
5. **Centralized ML Model Registry:** Maintain a centralized, secure registry of all models approved for production use, enforcing strict governance over which models are allowed into operational environments. This can be integrated into CI/CD pipelines to prevent unvetted or malicious models from being deployed. | ||
6. **Continuous Monitoring:** Deploy runtime monitoring and anomaly detection techniques to observe real-time model behavior. Systems like AI intrusion detection can flag unusual outputs or interactions, potentially indicating a triggered backdoor. | ||
|
||
### Example Attack Scenarios | ||
|
||
1. **Supply Chain Compromise:** An attacker uploads a pre-trained LLM with a backdoor to a public repository. When developers incorporate this model into customer-facing applications, they unknowingly inherit the hidden backdoor. Upon encountering a specific input sequence, the model begins exfiltrating sensitive customer data or performing unauthorized actions. | ||
2. **Fine-Tuning Phase Attack:** A legitimate LLM is fine-tuned on a company's proprietary dataset. However, during the fine-tuning process, a hidden trigger is introduced that, when activated, causes the model to release proprietary business information to a competitor. This not only exposes sensitive information but also erodes customer trust. | ||
|
||
### Reference Links | ||
|
||
1. [arXiv:2007.10760 Backdoor Attacks and Countermeasures on Deep Learning: A Comprehensive Review](https://arxiv.org/abs/2007.10760) **arXiv** | ||
2. [arXiv:2401.05566 Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training) **Anthropic (arXiv)** | ||
3. [arXiv:2211.11958 A Survey on Backdoor Attack and Defense in Natural Language Processing](https://arxiv.org/abs/2211.11958) **arXiv** | ||
4. [Backdoor Attacks on AI Models](https://www.cobalt.io/blog/backdoor-attacks-on-ai-models) **Cobalt** | ||
5. [Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection](https://openreview.net/forum?id=A3y6CdiUP5) **OpenReview** | ||
6. [arXiv:2406.06852 A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures](https://arxiv.org/abs/2406.06852) **arXiv** | ||
7. [arXiv:2408.12798 BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models](https://arxiv.org/abs/2408.12798) **arXiv** | ||
8. [Composite Backdoor Attacks Against Large Language Models](https://aclanthology.org/2024.findings-naacl.94.pdf) **ACL** | ||
|
||
### Related Frameworks and Taxonomies | ||
|
||
Refer to this section for comprehensive information, scenarios strategies relating to infrastructure deployment, applied environment controls and other best practices. | ||
|
||
- [AML.T0018 | Backdoor ML Model](https://atlas.mitre.org/techniques/AML.T0018) **MITRE ATLAS** | ||
- [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework): Covers strategies and best practices for ensuring AI integrity. **NIST** | ||
- AI Model Watermarking for IP Protection: A method of embedding watermarks into LLMs to protect intellectual property and detect tampering. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
## LLM03: Data and Model Poisoning | ||
|
||
### Description | ||
|
||
The starting point of any machine learning approach is training data, simply “raw text”. To be highly capable (e.g., have linguistic and world knowledge), this text should span a broad range of domains, genres and languages. A large language model uses deep neural networks to generate outputs based on patterns learned from training data. Therefore Large Language Models (LLMs) rely heavily on vast amounts of diverse training data to produce successful outputs. | ||
|
||
Data Poisoning refers to manipulation of pre-training data or data involved within the fine-tuning or embedding processes to introduce vulnerabilities (which all have unique and sometimes shared attack vectors), backdoors or biases that could compromise the model’s security, effectiveness or ethical behavior. Poisoned information may be surfaced to users or create other risks like performance degradation, downstream software exploitation and reputational damage. Even if users distrust the problematic AI output, the risks remain, including impaired model capabilities and potential harm to brand reputation. | ||
|
||
- Pre-training data refers to the process of training a model based on a task or dataset. | ||
- Fine-tuning involves taking an existing model that has already been trained and adapting it to a narrower subject or a more focused goal by training it using a curated dataset. This dataset typically includes examples of inputs and corresponding desired outputs. | ||
- The embedding process is the process of converting categorical data (often text) into a numerical representation that can be used to train a language model. The embedding process involves representing words or phrases from the text data as vectors in a continuous vector space. The vectors are typically generated by feeding the text data into a neural network that has been trained on a large corpus of text. | ||
|
||
These unique stages of the model development lifecycle are imperative to understand to identify where data poisoning can occur and from what origin, depending on the nature of the attack and the attack target. Data poisoning is considered an integrity attack because tampering with the training data impacts the model’s ability to output correct predictions. Naturally, external data sources present higher risk as the model creators do not have control of the data or a high level of confidence that the content does not contain bias, falsified information or inappropriate content. Data poisoning can degrade a model's performance, introduce biased or harmful content, and even exploit downstream systems. These risks are especially high with external data sources, which may contain unverified or malicious content. | ||
|
||
Models are often distributed as artifacts through shared model repositories or open-source platforms, making them susceptible to inherited vulnerabilities. Additionally, since models are implemented as software and integrated with infrastructure, they can introduce risks such as backdoors and computer viruses when these environments are embedded. | ||
|
||
Whether a developer, client, or general user of an LLM, it's crucial to understand the risks associated with interacting with non-proprietary models. These vulnerabilities can affect the legitimacy of model outputs due to their training procedures. Developers, in particular, may face risks from direct or indirect attacks on internal or third-party data used for fine-tuning and embedding, which can ultimately impact all users of the LLM. | ||
|
||
### Common Examples of Vulnerability | ||
|
||
1. Malicious actors intentionally introduce inaccurate or harmful data into a model's training set. This can be achieved through techniques such as [Split-View Data Poisoning](https://github.com/GangGreenTemperTatum/speaking/blob/main/dc604/hacker-summer-camp-23/Ads%20_%20Poisoning%20Web%20Training%20Datasets%20_%20Flow%20Diagram%20-%20Exploit%201%20Split-View%20Data%20Poisoning.jpeg) or [Frontrunning Poisoning](https://github.com/GangGreenTemperTatum/speaking/blob/main/dc604/hacker-summer-camp-23/Ads%20_%20Poisoning%20Web%20Training%20Datasets%20_%20Flow%20Diagram%20-%20Exploit%202%20Frontrunning%20Data%20Poisoning.jpeg). | ||
- The victim model trains using falsified information which is reflected in outputs of generative AI prompts to it's consumers. | ||
2. A malicious actor is able to perform direct injection of falsified, biased or harmful content into the training processes of a model which is returned in subsequent outputs. | ||
3. Users unknowingly inject sensitive or proprietary information during model interactions, which can be reflected in subsequent outputs. | ||
4. A model is trained using data which has not been verified by its source, origin or content in any of the training stage examples which can lead to erroneous results if the data is tainted or incorrect. | ||
5. Unrestricted resource access or inadequate sandboxing may allow a model to ingest unsafe data resulting in biased or harmful outputs. | ||
- An example scenario might occur during the fine-tuning process, where inference calls from LLM clients could either intentionally or unintentionally introduce confidential information into the model's data store. This sensitive data could then be exposed to another unsuspecting client through generated outputs. | ||
- Another example is during web scraping of remote resources from unverified sources in aid to obtain data used for either training of fine-tuning elements of the model lifecycle. | ||
|
||
### Prevention and Mitigation Strategies | ||
|
||
1. Maintain detailed records of data origins and transformations. Use tools like the "ML-BOM" (Machine Learning Bill of Materials) IE, OWASP CycloneDX to track the data supply chain. If inheriting third-party models, consider researching developer model cards for transparency around the model dataset collection and training phases as well as relevant use-cases. | ||
2. Verify the correct legitimacy of targeted data sources and data contained obtained during both the pre-training, fine-tuning and embedding stages. Develop tooling to enable the tracing of model's training data, its origin and their association and collaborate with reputable security vendors to develop additional protocols that counter data poisoning and malicious content. | ||
3. Vet requests for data vendor onboarding to safeguard data, ensuring that only secure, reliable partners are integrated into the supply chain. Validate model outputs against trusted external data sources to detect inconsistencies or signs of poisoning. | ||
4. Verify your use-case for the LLM and the application it will integrate to. Craft different models via separate training data or fine-tuning for different use-cases to create a more granular and accurate generative AI output as per it's defined use-case. | ||
5. Ensure sufficient sandboxing through infrastructure controls are present to prevent the model from scraping unintended data sources which could hinder the training process. | ||
6. Use strict input filters and classifiers for specific training data or categories of data sources to control volume of falsified data. Data sanitization, with techniques such as statistical outlier detection and anomaly detection methods to detect and remove adversarial data from potentially being fed into the fine-tuning process. | ||
7. Use Data Version Control (DVC) to tightly identify and track part of a dataset which may have been manipulated, deleted or added that has lead to poisoning. Version control is crucial not only in software development but also in the development of ML models, where it involves tracking and managing changes in both source code and artifacts like datasets and models. In ML, datasets serve as input artifacts for training processes, while models are the output artifacts, making their versioning essential for maintaining the integrity and reproducibility of the development process. | ||
8. Use a vector database to store and manage user-supplied information, which can help prevent poisoning of other users and allow for adjustments in production without the need to re-train the entire model. | ||
9. Operationalize red team campaigns to test the capabilities of model and environment safeguards against data poisoning. Combatting through adversarial robustness techniques such as federated learning and constraints can be advantageous to minimize the effect of outliers or adversarial training to be vigorous against worst-case perturbations of the training data. | ||
10. Testing and Detection, by measuring the loss during the training stage and analyzing trained models to detect signs of a poisoning attack by analyzing model behavior on specific test inputs. | ||
- Monitoring and alerting on number of skewed responses exceeding a threshold. | ||
- Use of a human loop to review responses and auditing. | ||
- Implement dedicated LLMs to benchmark against undesired consequences and train other LLMs using [reinforcement learning techniques](https://wandb.ai/ayush-thakur/Intro-RLAIF/reports/An-Introduction-to-Training-LLMs-Using-Reinforcement-Learning-From-Human-Feedback-RLHF---VmlldzozMzYyNjcy). | ||
11. During inference, integrating Retrieval Augmentation Generation (RAG) and grounding techniques of trusted data entities can reduce the risk of hallucinations and inaccurate generations which could otherwise introduce risk of entering the model data pipeline by providing factual, accurate and linguistic knowledge sources or definitions. | ||
|
||
### Example Attack Scenarios | ||
|
||
1. Misleading Output: An attacker manipulates the training data or uses a prompt injection technique to bias the LLM's outputs. As a result, the model generates misleading or biased responses, potentially shaping user opinions, spreading misinformation, or even inciting harmful actions like hate speech. | ||
2. Toxic Data Injection: Without proper data filtering and sanitization, a malicious user can introduce toxic data into the training set. This data can cause the model to learn and propagate harmful biases or false information, which could then be disseminated to other users through generated outputs. | ||
3. Deliberate Falsification: A malicious actor or competitor deliberately creates and inputs falsified or harmful documents into the model's training data. The model, lacking sufficient vetting mechanisms, incorporates this incorrect information, leading to outputs that reflect these inaccuracies and potentially harm users or mislead them. | ||
4. Prompt Injection Attack: Inadequate sanitization and filtering allow an attacker to insert harmful or misleading data into the model via prompt injection. This attack leverages user inputs that the model inadvertently incorporates into its training data, resulting in the dissemination of compromised or biased outputs to subsequent users. | ||
|
||
### Reference Links | ||
|
||
1. [How data poisoning attacks corrupt machine learning models](https://www.csoonline.com/article/3613932/how-data-poisoning-attacks-corrupt-machine-learning-models.html): **CSO Online** | ||
2. [MITRE ATLAS (framework) Tay Poisoning](https://atlas.mitre.org/studies/AML.CS0009/): **MITRE ATLAS** | ||
3. [PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news](https://blog.mithrilsecurity.io/poisongpt-how-we-hid-a-lobotomized-llm-on-hugging-face-to-spread-fake-news/): **Mithril Security** | ||
4. [Poisoning Language Models During Instruction](https://arxiv.org/abs/2305.00944): **Arxiv White Paper 2305.00944** | ||
5. [Poisoning Web-Scale Training Datasets - Nicholas Carlini | Stanford MLSys #75](https://www.youtube.com/watch?v=h9jf1ikcGyk): **Stanford MLSys Seminars YouTube Video** | ||
6. [ML Model Repositories: The Next Big Supply Chain Attack Target](https://www.darkreading.com/cloud-security/ml-model-repositories-next-big-supply-chain-attack-target) **OffSecML** | ||
7. [Data Scientists Targeted by Malicious Hugging Face ML Models with Silent Backdoor](https://jfrog.com/blog/data-scientists-targeted-by-malicious-hugging-face-ml-models-with-silent-backdoor/) **JFrog** | ||
8. [Backdoor Attacks on Language Models](https://towardsdatascience.com/backdoor-attacks-on-language-models-can-we-trust-our-models-weights-73108f9dcb1f): **Towards Data Science** | ||
9. [Can you trust ChatGPT’s package recommendations?](https://vulcan.io/blog/ai-hallucinations-package-risk) **VULCAN** |
Oops, something went wrong.