Commit
chore: Ads/data gathering import v2 (#320)
* feat: kickoff v2 0 dir and files
* chore: import data gathering to central repo
1 parent 097104d · commit f171188
Showing 45 changed files with 4,023 additions and 0 deletions.
@@ -0,0 +1 @@
[Data Gathering Methodology Wiki](https://github.com/OWASP/www-project-top-10-for-large-language-model-applications/wiki/Data-Gathering-Methodology)
@@ -0,0 +1,54 @@
```python
# Python Data Validation Script for LLM Vulnerabilities (LLM01 - LLM10)

import re
from cryptography.fernet import Fernet
from ratelimit import limits

# Example key: generate your own with Fernet.generate_key() and, in practice,
# load a persistent key from secure storage rather than creating it at runtime.
key = Fernet.generate_key()
cipher_suite = Fernet(key)

# LLM01 & LLM02: Input Validation and Secure Output Handling
def validate_and_encode_input(input_data):
    """Validate and sanitize input data, then return it encrypted for secure handling."""
    # Simple validation example; adapt the regex to your needs
    if re.match("^[a-zA-Z0-9 ]*$", input_data):
        # Securely encode the output to prevent injection attacks
        encoded_data = cipher_suite.encrypt(input_data.encode('utf-8'))
        return encoded_data
    else:
        raise ValueError("Invalid input")

# LLM03: Secure Training Data
def encrypt_training_data(training_data):
    """Encrypt training data to ensure integrity."""
    encrypted_data = cipher_suite.encrypt(training_data.encode('utf-8'))
    return encrypted_data

# LLM04: Implement Rate Limiting to prevent DoS
@limits(calls=100, period=3600)  # period is given in seconds (one hour)
def process_request(request_data):
    """Process an incoming request with rate limiting to prevent DoS attacks."""
    # Simulate request processing
    return "Request processed"

# LLM05, LLM06, LLM07, LLM08, LLM09 and LLM10 are more conceptual and require organizational
# and architectural measures, including secure plugin design, API security,
# data protection methods, and more, which are beyond the scope of a simple script.
# These require comprehensive approaches involving multiple systems and practices.
```
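Not part of the original script, but a brief usage sketch of the helpers above may be useful; the input values are purely illustrative:

```python
if __name__ == "__main__":
    # Passes the allow-list regex, so it is encrypted and returned
    token = validate_and_encode_input("hello world 123")
    print(f"Encrypted input: {token}")

    # Subject to the 100-calls-per-hour rate limit
    print(process_request({"query": "example"}))
```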
### Automated Validation Tools

For automating validation and ensuring adherence to ASVS standards, consider integrating the following tools into your development and deployment pipelines:

- **OWASP ZAP (Zed Attack Proxy)**: For finding vulnerabilities in web applications.
- **SonarQube**: For continuous inspection of code quality to detect bugs, vulnerabilities, and code smells in your code.
- **OWASP Dependency-Check**: For detecting publicly disclosed vulnerabilities in project dependencies.
- **Bandit**: For finding common security issues in Python code.
- **cryptography**: Python library for encryption and decryption to secure data, as shown in the script.
- **Ratelimit**: Python library to implement rate limiting, as demonstrated in the script.

These tools can automate aspects of security validation and complement the script's functionality, focusing on the specific LLM vulnerabilities and general application security concerns outlined in the ASVS.
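As one illustration of such pipeline integration, a minimal sketch that runs Bandit over a source tree and fails the build when findings are reported; the `src` path is an assumption, and the sketch relies on Bandit's convention of exiting non-zero when it reports issues:

```python
import subprocess
import sys

def run_bandit(source_dir: str = "src") -> None:
    """Run Bandit recursively over source_dir and fail the pipeline on findings."""
    result = subprocess.run(["bandit", "-r", source_dir, "-q"], capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stdout)
        sys.exit("Bandit reported security findings; failing the build.")

if __name__ == "__main__":
    run_bandit()
```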
@@ -0,0 +1,61 @@
```python
# Python Data Validation Script Example

import re
from cryptography.fernet import Fernet
import tensorflow_data_validation as tfdv

# Generate a key for encryption/decryption
# In practice, store this key securely (for example, in a secrets manager)
key = Fernet.generate_key()
cipher_suite = Fernet(key)

# Example data validators
def validate_prompt(prompt):
    """Simple validation to avoid prompt injection."""
    if re.search(r"[^\w\s]", prompt):
        raise ValueError("Invalid characters in prompt.")
    return prompt

def sanitize_output(output):
    """Basic output sanitization to prevent insecure data exposure."""
    sanitized_output = re.sub(r"[^\w\s]", "", output)
    return sanitized_output

def validate_training_data(training_data_file):
    """Check integrity of training data using TensorFlow Data Validation."""
    stats = tfdv.generate_statistics_from_csv(training_data_file)
    anomalies = tfdv.validate_statistics(stats, tfdv.load_schema_text('schema.pbtxt'))
    tfdv.display_anomalies(anomalies)

def encrypt_sensitive_data(data):
    """Encrypt sensitive data."""
    if isinstance(data, str):
        data = data.encode()
    encrypted_data = cipher_suite.encrypt(data)
    return encrypted_data

def decrypt_sensitive_data(encrypted_data):
    """Decrypt sensitive data."""
    decrypted_data = cipher_suite.decrypt(encrypted_data)
    return decrypted_data.decode()

# Example usage
if __name__ == "__main__":
    # Validate and sanitize inputs/outputs
    prompt = validate_prompt("Example prompt with valid characters")
    output = sanitize_output("Example output with <tags> and special characters!")

    # Validate training data
    validate_training_data("training_data.csv")

    # Encrypt and decrypt sensitive information
    sensitive_info = "Sensitive data example"
    encrypted_info = encrypt_sensitive_data(sensitive_info)
    decrypted_info = decrypt_sensitive_data(encrypted_info)

    print(f"Original: {sensitive_info}, Encrypted: {encrypted_info}, Decrypted: {decrypted_info}")
```
Remember, the effectiveness of security measures greatly depends on the specific context and how they're implemented within your overall security strategy. Continuously update and refine your validation techniques to adapt to new vulnerabilities and threats.
@@ -0,0 +1,76 @@
```python
import re
from typing import List

# Mock functions representing the validation checks for each LLM vulnerability.
# These functions should be implemented with actual validation logic based on the system's architecture.

def validate_prompt_injection(input_data: str) -> bool:
    """
    Validates input data to protect against prompt injection.
    Implement custom validation logic based on the system's requirements.
    """
    # Example: reject input containing scripting elements or unexpected operators
    return not bool(re.search(r'[<>{}();]', input_data))

def validate_output_handling(output_data: str) -> bool:
    """
    Checks for insecure output handling.
    Implement checks to ensure output data does not contain sensitive information.
    """
    # Example: ensure output does not contain API keys or personal data
    return "API_KEY" not in output_data and "personal_info" not in output_data

def validate_training_data(data: List[str]) -> bool:
    """
    Ensures the integrity of training data to protect against data poisoning.
    This could involve checksum verification, source validation, etc.
    """
    # Placeholder for actual validation logic
    return True

def validate_dos_protection(system_config: dict) -> bool:
    """
    Validates system configuration to minimize the risk of DoS attacks.
    This could involve checking network configurations, rate limiting settings, etc.
    """
    # Placeholder for actual validation logic
    return True

# Additional validation functions should be implemented for LLM05 to LLM10

# Example of using the validation functions
if __name__ == "__main__":
    input_data = "<input data>"
    output_data = "<output data>"
    training_data = ["data1", "data2"]
    system_config = {"config1": "value1"}

    # Perform the validations
    if not validate_prompt_injection(input_data):
        print("Prompt injection vulnerability detected.")
    if not validate_output_handling(output_data):
        print("Insecure output handling detected.")
    if not validate_training_data(training_data):
        print("Training data poisoning vulnerability detected.")
    if not validate_dos_protection(system_config):
        print("Model Denial of Service vulnerability detected.")

    # Add additional validation checks as necessary
```
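As a sketch of one way to make `validate_training_data` concrete, the checksum verification mentioned in its docstring could compare each file's digest against a manifest of known-good hashes; the manifest file name below is hypothetical:

```python
import hashlib
import json

def verify_training_files(file_paths, manifest_path="training_manifest.json"):
    """Return True only if every file's SHA-256 digest matches its recorded known-good digest."""
    with open(manifest_path, "r", encoding="utf-8") as fh:
        manifest = json.load(fh)
    for path in file_paths:
        with open(path, "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        if manifest.get(path) != digest:
            return False
    return True
```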
### Automated Validation Tools

For the automated validation of these controls and vulnerabilities, you can leverage several tools, depending on the nature of the system and its architecture:

1. **Static Code Analysis Tools**: Tools like Bandit (for Python), FindBugs (for Java), and others can automatically detect common security issues in code.

2. **Dynamic Analysis Tools (DAST)**: Tools like OWASP ZAP or Burp Suite can test running applications for vulnerabilities such as injection attacks, insecure server configurations, and more.

3. **Dependency Checkers**: Tools like OWASP Dependency-Check can analyze project dependencies for known vulnerabilities, which is particularly useful for LLM05.

4. **Security Linters**: Linters like ESLint with its security plugin for JavaScript, or Brakeman for Ruby on Rails, can detect insecure coding patterns before they go into production.

5. **Data Validation Libraries**: For Python, libraries like Pydantic or Cerberus can help validate input data formats and types, assisting in preventing issues like LLM01 (see the sketch after this list).
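For instance, a minimal Pydantic sketch of the kind of prompt validation item 5 describes, assuming Pydantic v2's `field_validator` API; the model and field names are illustrative:

```python
import re
from pydantic import BaseModel, ValidationError, field_validator

class PromptRequest(BaseModel):
    prompt: str

    @field_validator("prompt")
    @classmethod
    def reject_unexpected_characters(cls, value: str) -> str:
        # Mirrors the validate_prompt_injection() check from the script above
        if re.search(r"[<>{}();]", value):
            raise ValueError("prompt contains disallowed characters")
        return value

try:
    PromptRequest(prompt="Summarize the latest security report")
except ValidationError as exc:
    print(exc)
```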
Each of these tools can be integrated into your CI/CD pipeline to automate the validation process. Ensure you configure each tool according to the specific needs and architecture of your LLM system to maximize their effectiveness.
@@ -0,0 +1,58 @@
# Contributing to OWASP Top 10 for LLM AI Applications Data Validation

We welcome contributions from the community! Whether you're looking to fix bugs, add new features, or improve documentation, your help is appreciated. Please follow these guidelines to contribute effectively.

## Discussion and Collaboration

The main channel for discussion and collaboration is our OWASP Slack channel: `#team-llm-datagathering-methodology`

We use this channel for regular discussions of the project's methodology, future enhancements, and any issues we're currently facing. It's a great place to ask questions, propose ideas, and collaborate with others who are working on similar problems.

## Getting Started

1. **Fork the Repository**

   Start by forking the project repository to your own GitHub account.

2. **Clone Your Fork**

   Clone your forked repository to your local machine (for example, with `git clone`).

3. **Create a New Branch**

   For each new contribution, create a new branch (for example, with `git checkout -b <branch-name>`).

## Making Changes

- Ensure your changes adhere to the project's coding standards and guidelines.
- Write clear, commented code that is easy to understand and maintain.
- If adding new features or scripts, update the documentation accordingly.

## Testing

Before submitting your changes, please test your code thoroughly to ensure it works as expected and does not introduce new issues.

## Submitting Your Contribution

1. **Commit Your Changes**

   Add your changes to the git staging area and commit them with a clear message describing the changes (`git add` followed by `git commit`).

2. **Push to Your Fork**

   Push your changes to your forked repository (`git push origin <branch-name>`).

3. **Create a Pull Request**

   Go to the original project repository on GitHub. You should see a prompt to create a pull request from your new branch. Fill in the pull request form with a clear description of your changes and submit.

## Review Process

Once submitted, your pull request will be reviewed by the project maintainers. You may receive feedback and requests for changes to your submission. This is a normal part of the review process, and your cooperation and patience are appreciated.

## Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project, you agree to abide by its terms.

Thank you for contributing to improving the security and reliability of LLM AI applications!
@@ -0,0 +1,94 @@
```python
# Basic Data Validation Script for Large Language Models (LLMs)
# This script is designed to provide foundational checks for common vulnerabilities.

import re

# LLM01: Prompt Injection (CWE-77, CWE-94)
def validate_prompt(prompt):
    """
    Validates the given prompt to prevent injection attacks.
    """
    # Example validation to remove potentially dangerous characters or patterns
    clean_prompt = re.sub(r'[^\w\s]', '', prompt)
    return clean_prompt

# LLM02: Insecure Output Handling (CWE-79, CWE-116)
def encode_output(output):
    """
    Encodes the output to prevent XSS attacks or other output encoding issues.
    """
    # Simple HTML encoding example; escape '&' first so entities are not double-escaped
    encoded_output = output.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
    return encoded_output

# LLM03: Training Data Poisoning (CWE-506, CWE-915)
# Note: Validation should occur during the data collection and preparation phase.
def validate_training_data(data):
    """
    Validates training data to ensure it's not maliciously crafted.
    """
    # Example check for unexpected patterns or malicious content
    if "unexpected_pattern" in data:
        raise ValueError("Invalid training data detected.")
    return True

# LLM04: Model Denial of Service (CWE-400)
def check_query_limits(query):
    """
    Checks if the query exceeds certain limits to prevent DoS attacks.
    """
    MAX_LENGTH = 1000  # Example limit
    if len(query) > MAX_LENGTH:
        raise ValueError("Query exceeds maximum allowed length.")
    return True

# LLM05: Supply-Chain Vulnerabilities (CWE-829, CWE-937)
# Manual review and using trusted libraries are recommended.

# LLM06: Sensitive Information Disclosure (CWE-200)
def redact_sensitive_info(text):
    """
    Redacts sensitive information from the text.
    """
    # Example redaction (simple and should be customized)
    redacted_text = re.sub(r'\b(account_number|ssn)\b', '[REDACTED]', text, flags=re.IGNORECASE)
    return redacted_text

# LLM07: Insecure Plugin Design (CWE-749, CWE-1203)
# Ensure plugins do not expose dangerous methods or functions directly to end-users.

# LLM08: Excessive Agency (CWE-807)
# Implement strict checks on inputs used for security decisions.

# LLM09: Overreliance (CWE-1048)
# Ensure diversification in security mechanisms and checks.

# LLM10: Model Theft (CWE-494, CWE-1241)
# Protect model artifacts using integrity checks and secure distribution methods.

# Example usage
prompt = "Example <script>alert('XSS')</script> prompt"
clean_prompt = validate_prompt(prompt)
print(f"Cleaned Prompt: {clean_prompt}")

output = "<h1>This is a header</h1>"
encoded_output = encode_output(output)
print(f"Encoded Output: {encoded_output}")
```
### Recommended Automated Validation Tools

For enhancing the security of your LLM application, integrating automated validation tools into your CI/CD pipeline is crucial. Here are some tools that can be particularly useful:

1. **Bandit**: A tool designed to find common security issues in Python code. It's useful for static analysis and can help detect security issues related to the CWEs mentioned.

2. **OWASP ZAP (Zed Attack Proxy)**: An open-source web application security scanner. While more web-focused, it can be useful for testing web interfaces to LLM applications for issues like XSS (CWE-79) and other web vulnerabilities.

3. **SonarQube**: Offers comprehensive code quality and security scanning, including detection of vulnerabilities and code smells.

4. **CodeQL**: GitHub's code scanning tool that uses queries to identify vulnerabilities across multiple languages, including Python. It can be used to automate security checks as part of your GitHub Actions workflows.

5. **PyTorch/TensorFlow Security Advisories**: For LLMs built on these frameworks, staying updated with the latest security advisories is crucial. Though not tools per se, subscribing to these advisories can help mitigate supply-chain vulnerabilities (CWE-829, CWE-937).

Each of these tools can be integrated into your development and deployment processes to automatically flag potential security issues, helping adhere to secure coding practices and mitigate vulnerabilities associated with LLMs.