
Data Gathering Methodology


OWASP Top 10 for LLM GenAI Apps - Data Gathering Methodology

Overview

Welcome to the GitHub wiki dedicated to understanding and advancing the data-gathering methodology for OWASP's Top 10 for LLM AI Applications. As AI and deep learning technologies evolve rapidly, securing these systems is crucial. This wiki serves as a central repository for methodologies, strategies, and tools aimed at identifying and prioritizing vulnerabilities in LLMs based on real-world data.

The Data Gathering Methodology, Mapping, Risk, and Exploit initiative is designed to collect real-world data on vulnerabilities and risks associated with Large Language Models (LLMs). This effort supports the ongoing update of the OWASP Top 10 for LLMs list while maintaining mappings to major cybersecurity frameworks. By employing a comprehensive data collection approach, the initiative aims to enhance AI security guidelines and offer insights to help organizations secure their LLM-based systems.

Why this Wiki?

  • Centralized Knowledge Base: Given the complex nature of LLM vulnerabilities, having a single repository where developers, researchers, and security experts can find and contribute the most recent methodologies is invaluable.
  • Collaborative Environment: GitHub offers an interactive platform where community members can collaborate and contribute insights, updates, and refinements to the existing methodology.
  • Transparency & Open-Source Spirit: In line with OWASP and open-source principles, this wiki promotes transparency in the data-gathering process, making best practices in vulnerability assessment widely accessible.
  • Adapting to Emerging Threats: AI security is a rapidly growing field. This wiki will act as a live document, continuously evolving to capture the latest threats and vulnerabilities.

How to Contribute

Methodology

  1. Data Collection

    • Sources:
      • Industry reports, academic papers, vulnerability databases, and real-world exploit analysis.
      • Partner organizations contributing vulnerability disclosures and risk assessments.
    • Approach:
      • Manual review and automated data scraping.
      • Use of standardized templates for data consistency (a minimal record template is sketched after this list).
      • Prioritization based on impact, exploitability, and prevalence.
    • We seek collaboration with organizations and individuals willing to share relevant datasets for LLM protection.
  2. Data Analysis

    • Initial Review:
      • Classification of vulnerabilities by type, origin, and potential impact.
    • Statistical Analysis:
      • Python scripts validate and analyze gathered data for accuracy and completeness.
    • Risk Scoring:
      • Application of scoring frameworks (e.g., CVSS) to rank vulnerabilities based on severity.
  3. Data Validation

    • Python Code:
      • Automated scripts ensure the integrity and accuracy of collected data (see the validation and scoring sketch after this list).
    • Peer Review:
      • Involvement of cybersecurity experts for manual verification and risk assessment.
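
To make contributed records comparable, the methodology relies on standardized templates. The template itself is maintained by the project; the sketch below is only a minimal Python illustration, assuming hypothetical field names and simple 1-5 scales for impact, exploitability, and prevalence.

```python
# Minimal sketch of a standardized vulnerability record (hypothetical field
# names and 1-5 scales; the project's actual template may differ).
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VulnerabilityRecord:
    """One collected LLM vulnerability, risk, or exploit report."""
    vuln_id: str                        # internal identifier, e.g. "LLMV-2024-0042"
    source: str                         # e.g. "industry report", "CVE", "partner disclosure"
    llm_risk_category: str              # e.g. "Prompt Injection", "Data Poisoning"
    description: str                    # short free-text summary of the issue
    impact: int                         # 1 (low) .. 5 (critical)
    exploitability: int                 # 1 (hard) .. 5 (trivial)
    prevalence: int                     # 1 (rare) .. 5 (widespread)
    cvss_score: Optional[float] = None  # CVSS base score, if one has been assigned
    framework_refs: List[str] = field(default_factory=list)  # e.g. "NIST CSF 2.0: PR.DS"
```

A fixed layout like this keeps entries from different sources consistent and makes prioritization by impact, exploitability, and prevalence straightforward.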

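The validation and scoring scripts are likewise not reproduced on this page. The sketch below, which assumes the VulnerabilityRecord layout from the previous sketch, illustrates the kind of automated checks and CVSS-based ranking the methodology describes.

```python
# Minimal sketch of automated validation and severity ranking; assumes the
# VulnerabilityRecord sketched above and is not the project's production code.
from typing import List


def validate(record: "VulnerabilityRecord") -> List[str]:
    """Return a list of data-quality problems found in a single record."""
    problems = []
    if not record.description.strip():
        problems.append("missing description")
    for name in ("impact", "exploitability", "prevalence"):
        value = getattr(record, name)
        if not 1 <= value <= 5:
            problems.append(f"{name} out of range: {value}")
    if record.cvss_score is not None and not 0.0 <= record.cvss_score <= 10.0:
        problems.append(f"CVSS score out of range: {record.cvss_score}")
    return problems


def rank_by_severity(records: List["VulnerabilityRecord"]) -> List["VulnerabilityRecord"]:
    """Order records for review: CVSS base score first, then impact x exploitability x prevalence."""
    return sorted(
        records,
        key=lambda r: (r.cvss_score or 0.0, r.impact * r.exploitability * r.prevalence),
        reverse=True,
    )
```

Records that fail these checks would go back for manual review before entering the ranked dataset used to update the Top 10 list.
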
Datasets

The initiative will maintain several datasets, including:

Mapping to Cybersecurity Frameworks

Collected data is mapped to widely recognized security frameworks, including:

1. Foundational Cybersecurity Standards

  1. NIST Cybersecurity Framework (CSF) 2.0
    • Provides comprehensive guidelines for managing cybersecurity risk.
    • Recognized globally as a foundational cybersecurity framework.

2. ISO/IEC Standards

  1. ISO/IEC 27001 (Information Security Management)
    • Specifies requirements for an information security management system (ISMS) and its controls; widely used to demonstrate security compliance internationally.
  2. ISO/IEC 20547-4:2020 (Big Data Security and Privacy)
    • Focuses on securing big data environments.
  3. ISO/IEC 5338
    • Defines life cycle processes for AI systems, extending established software and system life cycle standards to AI and ML.
    • Relevant for building security and compliance into AI development and operations.
  4. ISO/IEC 38507
    • Governs digital governance of AI.
    • Provides guidelines on the governance of artificial intelligence in organizations.
  5. ISO/IEC 23894
    • Provides guidance on risk management for AI systems.
    • Helps organizations identify, assess, and treat AI-related risks, including those affecting compliance with data protection laws.
  6. ISO/IEC 24028
    • Overview of trustworthiness in AI systems.
    • Surveys threats to AI trustworthiness, including bias, and approaches for mitigating them.
  7. ISO/IEC 23053
    • Framework for AI systems using machine learning.
    • Describes the components of ML-based AI systems and their roles across the system life cycle.
  8. ISO/IEC 22989
    • Establishes AI concepts and terminology.
    • Provides the shared vocabulary underpinning the other AI standards, including terms related to trustworthiness.
  9. ISO/IEC 42001
    • AI management system standard.
    • Addresses traceability, transparency, reliability, and ethical considerations in managing AI systems.

3. Risk and Threat Modeling

  1. MITRE ATT&CK
    • A knowledge base for understanding and defending against cyberattacks.
    • Valuable for threat modelling and analyzing adversarial tactics and techniques.
  2. STRIDE
    • A threat modelling methodology to identify potential security threats.
    • Commonly used in early software development stages for proactive security.
  3. FAIR (Factor Analysis of Information Risk)
    • Focuses on quantifying and managing cybersecurity risk.
    • Helps organizations evaluate cyber risks in financial terms.

4. Application and Data Security

  1. CIS Controls
    • Developed by the Center for Internet Security, offering practical, actionable security controls.
    • Highly regarded for its effectiveness in strengthening defences.
  2. ASVS (Application Security Verification Standard)
    • Framework for testing and assessing web application security controls.
    • Widely used for securing web applications.
  3. SOC 2 (Service Organization Control 2)
    • Primarily for SaaS and service providers, relevant for secure handling of customer data.
    • Complements application security, especially in LLM contexts.
  4. PCI DSS
    • Ensures data security for applications dealing with payment information.
    • Crucial for compliance in payment and financial services.

5. OT-Specific Frameworks (if applicable)

  1. ISA/IEC 62443
    • Designed specifically for OT environments, focusing on ICS security.
    • Widely adopted in OT for its standardized approach to securing industrial systems.

6. Software Development and Maturity Models

  1. SAMM (Software Assurance Maturity Model)
    • Supports integrating security into software development.
    • Helps organizations benchmark and improve their software security practices.
  2. BSIMM (Building Security In Maturity Model)
    • Measures and enhances software security initiatives.
    • Ideal for tracking and improving organizational software security practices.
  3. OPENCRE
    • Facilitates alignment of cybersecurity controls across various standards.
    • Acts as a cross-framework bridge, enhancing interoperability.

7. Specialized and Emerging Standards

  1. MITRE ATLAS
    • A knowledge base of adversarial tactics and techniques targeting AI and ML systems, modelled on MITRE ATT&CK.
    • Specific and detailed, though not intended as an all-encompassing cybersecurity management framework.
  2. ENISA
    • The European Union Agency for Cybersecurity (formerly the European Union Agency for Network and Information Security), providing guidance on cybersecurity best practices.
    • Especially relevant for European compliance.
  3. CycloneDX Machine Learning SBOM
    • Extends the CycloneDX bill-of-materials standard to machine learning models and datasets.
    • Enables detailed tracking of software, hardware, models, and services in the supply chain.
  4. COBIT (Control Objectives for Information and Related Technologies)
    • A governance framework aligning IT operations with enterprise objectives.
    • Useful for integrating governance into LLM applications.
  5. TOGAF (The Open Group Architecture Framework)
    • Enterprise architecture framework with security applications.
    • Valuable for complex systems and secure architecture mapping.
  6. NIST AI RMF 1.0/AI 100-1 (AI Risk Management Framework)
    • Provides guidelines for managing risks related to AI systems.
    • Emphasizes reliability, robustness, and trustworthiness of AI implementations.

This mapping keeps the OWASP Top 10 for LLMs aligned with established security and compliance frameworks, facilitating broader adoption across organizations.
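
As a concrete illustration of what this mapping can look like in machine-readable form, the sketch below pairs a single LLM risk with references from a few of the frameworks above. The risk name follows the OWASP Top 10 for LLM Applications; the control identifiers are illustrative choices for this sketch, not the initiative's authoritative mapping.

```python
# Illustrative, machine-readable mapping of one LLM risk to related framework
# references. The identifiers below are examples only, not an official mapping.
from typing import List

FRAMEWORK_MAPPING = {
    "LLM01: Prompt Injection": {
        "NIST CSF 2.0": ["PR.DS", "DE.CM"],  # data security, continuous monitoring
        "ISO/IEC 27001": ["A.8.26"],         # application security requirements
        "MITRE ATLAS": ["AML.T0051"],        # LLM prompt injection technique
        "CIS Controls": ["16"],              # application software security
        "NIST AI RMF 1.0": ["MANAGE 2.3"],   # risk response and mitigation
    },
}


def frameworks_for(risk: str) -> List[str]:
    """Return the frameworks that have at least one mapped reference for a risk."""
    return sorted(FRAMEWORK_MAPPING.get(risk, {}))
```

Keeping the mapping in a structured form like this makes it easy to generate per-framework views and to check coverage as the Top 10 list evolves.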

LLM Data Security Best Practices

This initiative is producing a white paper detailing best practices for securing data in LLM-based systems, focusing on:

  • Risk types
  • Risk mitigation strategies
  • Best practices
  • Secure LLM deployment architectures
  • Governance models for AI security

Given the rapid evolution of this technology, always monitor new tactics, techniques, and procedures (TTPs), frameworks, regulations, and tools.

Additional Resources


For more details or contributions, please reach out to the OWASP Top 10 for LLM GenAI Apps team.