Data Gathering Methodology

OWASP Top 10 for LLM AI Applications Data Collection is Open

We are now supporting version 2 and beginning the work of shaping the 2024 OWASP Top 10 for LLM AI Applications. For this effort, we will adopt the proven data-collection methodologies and templates used in OWASP's flagship Top 10 process.

For comprehensive details, refer to this wiki. We will also use our new website for submitting and storing documents. Your participation and contributions are invaluable, and we look forward to collaborating closely with you on this project.

Contribution Process

There are a few ways that data can be contributed:

  1. Email a CSV/Excel/JSON file with the dataset(s) to [email protected]
  2. Upload a CSV/Excel/JSON file to [hyperlink will be added here]

Contribution Period

We are opening the door for contributions to the Top 10 for LLM Apps 2024 from May 20th to June 30th, 2024, focusing on data from 2020 up to the present day.

Data Structure

We have both CSV and JSON templates to aid in normalizing contributions:

  • CSV Template [https://github.com/emmanuelgjr/top-10-llm-ai-applications-call-for-data/blob/main/CSV-template]
  • JSON Template [https://github.com/emmanuelgjr/top-10-llm-ai-applications-call-for-data/blob/main/JSON-template]

The following data elements are *required or optional, per dataset:

  • Contributor Name (org or anon)
  • Contributor Contact Email
  • Time period (2023, 2022, 2021, 2020)
  • *Number of applications tested
  • *CWEs with the number of applications each was found in, mapped to: MITRE ATT&CK, MITRE ATLAS, STRIDE, CycloneDX SBOM
  • Type of testing (TaH, HaT, Tools)
  • Primary Language (code)
  • Geographic Region (Global, North America, EU, Asia, other)
  • Primary Industry (Multiple, Financial, Industrial, Software, ??)
  • Whether or not data contains retests or the same applications multiple times (T/F)

Contributors with two types of datasets from different sources should submit them as two separate datasets.
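
To illustrate the structure above, the minimal Python sketch below shows a single dataset record and a required-field check. The field names and the required/optional split are assumptions for illustration only; the CSV and JSON templates linked above are authoritative.

```python
# Illustrative only: field names and the required/optional split are
# assumptions; follow the official CSV/JSON templates when contributing.
REQUIRED_FIELDS = {
    "contributor_name", "contributor_contact_email", "time_period",
    "type_of_testing", "primary_language", "geographic_region",
    "primary_industry", "contains_retests",
}

sample_record = {
    "contributor_name": "ExampleOrg",            # or "anonymous"
    "contributor_contact_email": "contact@example.org",
    "time_period": "2023",
    "applications_tested": 125,                  # treated as optional here
    "type_of_testing": "Tools",                  # TaH, HaT, or Tools
    "primary_language": "Python",
    "geographic_region": "Global",
    "primary_industry": "Software",
    "contains_retests": False,                   # T/F: retests / repeated apps
}

missing = REQUIRED_FIELDS - set(sample_record)
if missing:
    raise ValueError(f"Record is missing required fields: {sorted(missing)}")
print("Record contains all required fields.")
```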

Analysis

We will analyze the data in a way that supports and validates the Top 10 entries, incorporating new LLM trend data throughout the 2024 collection period.

Timeline

  • Data Collection: May 20th – June 30th, 2024
  • Analysis: July
  • Draft: Aug/Sept
  • Release: October 1st, 2024

Introduction to the Data Gathering Methodology for the OWASP Top 10 for LLM AI Applications

Welcome to our dedicated GitHub wiki for understanding and advancing the data-gathering methodology pertaining to OWASP's Top 10 for LLM AI Applications. As technology continues to evolve at an unprecedented pace, particularly in the domains of artificial intelligence and deep learning, securing these systems is of paramount importance. This wiki serves as a central repository for methodologies, strategies, and tools associated with understanding and prioritizing vulnerabilities in LLMs based on real-world data.

Why this Wiki?

  1. Centralized Knowledge Base: With the multifaceted nature of LLM vulnerabilities, having a one-stop solution where developers, researchers, and security experts can find and contribute to the most recent and relevant methodologies is invaluable.

  2. Collaborative Environment: GitHub offers an interactive platform where community members can collaborate, providing insights, updates, and refinements to the existing methodology.

  3. Transparency & Open Source Spirit: In line with the ethos of OWASP and the open-source community, this wiki promotes transparency in the data-gathering process, ensuring everyone has access to the best practices in vulnerability assessment.

  4. Addressing the Dynamic Nature of Threats: The field of AI security is nascent but growing rapidly. This wiki will act as a live document, continuously evolving to capture the latest threats and vulnerabilities.

What to Expect?

Throughout this wiki, you'll find:

  • Detailed steps and guidelines for data collection related to OWASP vulnerabilities in LLMs.
  • Tools, scripts, and code snippets to aid the data-gathering process.
  • Expert contributions, reviews, and insights on refining the methodology.
  • A section dedicated to ethical considerations, ensuring data is gathered and used responsibly.
  • Community-driven surveys, discussions, and feedback mechanisms.

Whether you're a seasoned security expert, a researcher in AI, or just someone keen on understanding the landscape of LLM vulnerabilities, this wiki is for you. Dive in, explore, contribute, and let's work together to make our AI systems more secure!

Our Slack channel is #team-llm-datagathering-methodology

[Figure: OWASP DG v2.0]

Step 1: OWASP Top 10 for LLM - Literature Review Repository

This repository hosts a curated literature review focusing on the OWASP Top 10 vulnerabilities in the context of Large Language Models (LLMs). It systematically categorizes research papers to enhance understanding of various aspects of security and application in the field.

The categorized_papers.csv file is a compilation of research articles categorized using a custom Python script. The script categorizes articles based on the following criteria:

  • Research Methods:

    • Case Studies: In-depth analysis of individual or group subjects.
    • Interviews: Qualitative data through structured conversations.
    • Content Analysis: Quantitative and qualitative analysis of document content.
    • Surveys and Questionnaires: Structured data collection from samples.
    • Experiments: Empirical research to validate hypotheses.
    • Statistical Analysis: Quantitative analysis to explore data patterns.
    • Mixed Methods: Combination of qualitative and quantitative research methods.
  • Focus Areas:

    • Risk Assessment: Evaluating potential vulnerabilities or threats.
    • Expert Opinions: Insights from subject-matter experts.
    • Technological Assessments: Analysis of technology applications.
    • Policy and Regulation: Examination of governance and compliance issues.
  • Topics and Themes:

    • LLM Security: Security aspects related to Large Language Models.
    • Industry Applications: Practical applications of LLMs in industry settings.
    • Emerging Threats: Identification of new and evolving security threats.
    • Solutions and Mitigations: Strategies to address vulnerabilities and threats.
  • Geographical Focus:

    • Global: Concerns and studies with worldwide implications.
    • Regional: Research focused on specific areas such as Asia, Europe, the Americas, or Africa.
  • Temporal Focus:

    • Historical Analyses: Lessons from the past and their impact on the present.
    • Current Issues: Contemporary challenges in the field.
    • Future Predictions: Projections and foresight into future trends and concerns.
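
As a rough illustration of how such a categorization can work, the hypothetical Python sketch below tags papers by keyword matching against the criteria above. It is not the script used in the repository; the column names (title, abstract) and keyword lists are assumptions.

```python
# Hypothetical keyword-based categorization sketch; the repository's actual
# script may differ in structure, criteria, and matching logic.
import csv

CATEGORY_KEYWORDS = {
    "Research Methods / Case Studies": ["case study", "case studies"],
    "Research Methods / Surveys and Questionnaires": ["survey", "questionnaire"],
    "Focus Areas / Risk Assessment": ["risk assessment", "threat model"],
    "Topics and Themes / LLM Security": ["prompt injection", "llm security", "jailbreak"],
    "Geographical Focus / Global": ["global", "worldwide"],
}

def categorize(text: str) -> list[str]:
    """Return every category whose keywords appear in the given text."""
    text = text.lower()
    return [cat for cat, words in CATEGORY_KEYWORDS.items()
            if any(word in text for word in words)]

with open("papers.csv", newline="", encoding="utf-8") as src, \
     open("categorized_papers.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=list(reader.fieldnames) + ["categories"])
    writer.writeheader()
    for row in reader:
        row["categories"] = "; ".join(categorize(f"{row['title']} {row['abstract']}"))
        writer.writerow(row)
```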

The dataset aims to provide researchers, practitioners, and enthusiasts with a structured overview of the existing literature, enabling more efficient knowledge discovery and gap analysis in the cybersecurity domain for LLMs.

We welcome collaboration and invite contributions to further improve our repository. For detailed information on how to contribute, please join our Slack channel as mentioned above. Forking the repository is the preferred method for substantial contributions. For additional guidance and best practices, we encourage you to visit our contributing guidelines.

Note: The categorization script and methodology are also included in the repository for transparency and reproducibility.


Access the full dataset and scripts: Literature Review Repository


Step 2: Frameworks for Mapping OWASP Top 10 for Large Language Model Applications

This repository, LLMtop10mapping, provides mappings of the OWASP Top 10 for Large Language Model Applications to various cybersecurity frameworks and standards. These mappings offer guidance on aligning LLM security practices with established cybersecurity guidelines.

Mapped Frameworks and Standards

Explore each framework and standard for detailed insights:

  1. NIST Cybersecurity Framework - A foundational framework for managing cybersecurity risk. Mapping

  2. ISO/IEC 27001: Information Security Management and ISO/IEC 20547-4:2020: Big Data Reference Architecture Security and Privacy - Key international standards for global business compliance and security controls. Mapping

  3. MITRE ATT&CK - Detailed knowledge base for understanding and defending against cyber attacks. Mapping

  4. CIS Controls - Actionable controls developed by the Center for Internet Security. Mapping

  5. CVEs (Common Vulnerabilities and Exposures) and CWEs (Common Weakness Enumeration) - Cataloging vulnerabilities and weaknesses, crucial in vulnerability management. Mapping

  6. FAIR (Factor Analysis of Information Risk) - Framework for risk quantification and management. Mapping

  7. STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, and Elevation of Privilege) - Threat modeling methodology for identifying security threats. Mapping

  8. ENISA (The European Union Agency for Network and Information Security) - Offers a range of cybersecurity advice, especially for European contexts. Mapping

  9. ASVS (Application Security Verification Standard) - A standard for web application security. Mapping

  10. SAMM (Software Assurance Maturity Model) - A model for integrating security into software development. Mapping

  11. MITRE ATLAS - Focused on adversarial behaviors in threat modeling and analysis. Mapping

  12. BSIMM (Building Security In Maturity Model) - Measures and improves software security initiatives. Mapping

  13. OPENCRE (Open Common Requirement Enumeration) - Facilitates understanding and implementation of various cybersecurity controls. Mapping

  14. CycloneDX Machine Learning Software Bill of Materials (SBOM) - A standard that provides advanced supply-chain capabilities for cyber risk reduction and can represent software, hardware, services, and other types of inventory. Mapping

These frameworks and standards provide comprehensive guidelines for enhancing the security of Large Language Model applications, aiding in understanding, preventing, and mitigating potential vulnerabilities.
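
As a sketch of how one of these mappings might be represented programmatically, the snippet below models a single Top 10 entry (LLM01: Prompt Injection) with placeholder framework identifiers. The identifiers are illustrative only and chosen for shape, not correctness; consult the LLMtop10mapping repository for the authoritative mappings.

```python
# Illustrative data structure for a single OWASP Top 10 for LLM entry and
# its framework mappings; identifier values are placeholders, not the
# published mappings.
llm01_mapping = {
    "owasp_llm_id": "LLM01",
    "name": "Prompt Injection",
    "mappings": {
        "NIST CSF": ["PR.DS", "DE.CM"],                     # placeholder category IDs
        "MITRE ATT&CK": ["T1059"],                          # placeholder technique ID
        "STRIDE": ["Tampering", "Elevation of Privilege"],  # placeholder threat types
        "CWE": ["CWE-77"],                                  # placeholder weakness ID
    },
}

for framework, identifiers in llm01_mapping["mappings"].items():
    print(f"{llm01_mapping['owasp_llm_id']} -> {framework}: {', '.join(identifiers)}")
```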

Step 3: Data Validation and Quality Control Methodology

Overview

This phase aims to establish a robust Data Validation and Quality Control (QC) framework for assessing and ensuring the integrity and accuracy of the vulnerability data underlying the OWASP Top 10 for LLM AI Applications. This step is crucial for maintaining the reliability of findings and ensuring that the data accurately reflects real-world vulnerabilities.

Objectives

  • Define Key Performance Indicators (KPIs): Establish both qualitative and quantitative measures that will guide the validation process.
  • Develop Validation Tools: Scripts and suggested automated tools to systematically verify the data against KPIs.
  • Ensure Data Quality: Implement a QC process to identify and rectify any inconsistencies, inaccuracies, or biases in the dataset.
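
As one way to make such KPIs concrete, the sketch below expresses them as measurable targets with a simple pass/fail check. The KPI names, thresholds, and directions are assumptions for illustration, not values defined by this project.

```python
# Hypothetical KPI definitions: each KPI has a target value and a direction
# ("min" = observed value must be at least the target, "max" = at most).
KPIS = {
    "completeness_rate":     {"target": 0.95, "direction": "min"},
    "duplicate_rate":        {"target": 0.02, "direction": "max"},
    "distinct_contributors": {"target": 10,   "direction": "min"},
}

def meets_target(name: str, observed: float) -> bool:
    """Return True when an observed value satisfies its KPI target."""
    kpi = KPIS[name]
    if kpi["direction"] == "min":
        return observed >= kpi["target"]
    return observed <= kpi["target"]

print(meets_target("completeness_rate", 0.97))  # True
print(meets_target("duplicate_rate", 0.05))     # False
```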

What to Expect from the Code

The provided Python scripts serve as a baseline for conducting data validation and quality control. These scripts are designed to be adaptable to various datasets and environments, depending on the specific needs of the research. Below is a brief overview of what to expect:

Validation Scripts

  • Data Consistency Checks: Scripts to verify that the data is consistent across different sources and timeframes.
  • Accuracy Tests: Tools to compare sampled data against trusted benchmarks or manual checks to assess accuracy.
  • Completeness Checks: Automated checks to ensure that the dataset is complete and all expected data points are present.
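
As an illustration of the completeness and consistency checks described above, here is a minimal pandas-based sketch. The file name, column names, and allowed values are assumptions and should be adapted to the actual template.

```python
# Minimal completeness/consistency sketch; column names, allowed values,
# and the file name are assumptions.
import pandas as pd

EXPECTED_COLUMNS = ["contributor_name", "time_period", "type_of_testing",
                    "geographic_region"]
ALLOWED_TESTING_TYPES = {"TaH", "HaT", "Tools"}

df = pd.read_csv("contributed_dataset.csv")

# Completeness: every expected column is present and has no missing values.
missing_cols = [col for col in EXPECTED_COLUMNS if col not in df.columns]
if missing_cols:
    raise ValueError(f"Missing expected columns: {missing_cols}")
incomplete_rows = df[EXPECTED_COLUMNS].isna().any(axis=1).sum()
print(f"Rows with missing required values: {incomplete_rows}")

# Consistency: categorical fields contain only expected values, and the
# time period falls inside the collection window (2020 onwards).
bad_testing = ~df["type_of_testing"].isin(ALLOWED_TESTING_TYPES)
bad_period = ~df["time_period"].astype(int).between(2020, 2024)
print(f"Rows with unexpected testing type: {bad_testing.sum()}")
print(f"Rows outside the 2020-2024 window: {bad_period.sum()}")
```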

Quality Control Tools

  • Automated Anomaly Detection: Scripts to identify outliers or anomalies in the data that may indicate errors or inconsistencies.
  • Bias Detection: Tools to assess the dataset for any potential biases that could affect the research outcomes.
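
The sketch below illustrates a simple z-score rule for flagging anomalies and a rough regional distribution check that can hint at sampling bias. The column names and the three-standard-deviation threshold are assumptions to be tuned per dataset.

```python
# Simple anomaly and bias signals; column names and the z-score threshold
# are assumptions, not project-defined values.
import pandas as pd

df = pd.read_csv("contributed_dataset.csv")

# Anomaly detection: flag rows whose application count is far from the mean.
values = df["applications_tested"].astype(float)
z_scores = (values - values.mean()) / values.std(ddof=0)
outliers = df[z_scores.abs() > 3]
print(f"Flagged {len(outliers)} potential outlier rows for manual review.")

# Rough bias signal: a heavily skewed regional distribution may indicate
# sampling bias worth noting in the analysis.
print(df["geographic_region"].value_counts(normalize=True))
```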

Adapting the Code

It is expected that the provided scripts will need to be adapted for each specific research environment and dataset. This adaptation may involve:

  • Parameter Tuning: Adjusting thresholds, weights, or other parameters within the scripts to better fit the specific data characteristics.
  • Custom Checks: Adding or modifying checks and validations to address unique aspects of the data or research objectives.
  • Integration with Other Tools: Modifying the scripts to work seamlessly with other tools or platforms used in the research process.
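
For example, parameter tuning can be made easier by externalizing thresholds and allowed values into a configuration file, as in the hypothetical sketch below (the file name, keys, and defaults are assumptions).

```python
# Hypothetical configuration loader: merges a local JSON file over defaults
# so checks can be tuned per environment without editing the check logic.
import json

DEFAULT_CONFIG = {
    "zscore_threshold": 3.0,
    "collection_window": [2020, 2024],
    "allowed_testing_types": ["TaH", "HaT", "Tools"],
}

def load_config(path="validation_config.json"):
    """Return defaults merged with any overrides found in the config file."""
    try:
        with open(path, encoding="utf-8") as fh:
            overrides = json.load(fh)
    except FileNotFoundError:
        overrides = {}
    return {**DEFAULT_CONFIG, **overrides}

config = load_config()
print(f"Using z-score threshold {config['zscore_threshold']}")
```
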
Framework-specific data validation code, based on the OWASP Top 10 for LLM AI Applications, is provided for each mapped framework:

  1. NIST Cybersecurity Framework - Data Validation Code based on OWASP Top 10 for LLM AI App

  2. ISO/IEC 27001: Information Security Management and ISO/IEC 20547-4:2020: Big Data Reference Architecture Security and Privacy - Data Validation Code based on OWASP Top 10 for LLM AI App

  3. MITRE ATT&CK - Data Validation Code based on OWASP Top 10 for LLM AI App

  4. CIS Controls - Data Validation Code based on OWASP Top 10 for LLM AI App

  5. CVEs (Common Vulnerabilities and Exposures) and CWEs (Common Weakness Enumeration) - Data Validation Code based on OWASP Top 10 for LLM AI App

  6. FAIR (Factor Analysis of Information Risk) - Data Validation Code based on OWASP Top 10 for LLM AI App

  7. STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, and Elevation of Privilege) - Data Validation Code based on OWASP Top 10 for LLM AI App

  8. ENISA (The European Union Agency for Network and Information Security) - Data Validation Code based on OWASP Top 10 for LLM AI App

  9. ASVS (Application Security Verification Standard) - Data Validation Code based on OWASP Top 10 for LLM AI App

  10. SAMM (Software Assurance Maturity Model) - Data Validation Code based on OWASP Top 10 for LLM AI App

  11. MITRE ATLAS - Data Validation Code based on OWASP Top 10 for LLM AI App

  12. BSIMM (Building Security In Maturity Model) - Data Validation Code based on OWASP Top 10 for LLM AI App

  13. OPENCRE (Open Common Requirement Enumeration) - Data Validation Code based on OWASP Top 10 for LLM AI App

  14. CycloneDX Machine Learning Software Bill of Materials (SBOM) - Data Validation Code based on OWASP Top 10 for LLM AI App
