A comprehensive replication package for browsing and exploring datasets used in Large Language Models for Requirements Engineering research.
This project provides a systematic collection and web-based interface for browsing datasets employed in Large Language Models for Requirements Engineering (LLM4RE) research. The replication package includes the complete methodology for dataset collection and characterization, enabling researchers to reproduce and extend the findings.
- index.html - Main entry point for the web application
- dashboard.html - Interactive analytics dashboard with advanced filtering capabilities
- script.js - JavaScript functionality for dataset browsing and filtering
- styles.css - Cascading Style Sheets for web application styling
- LICENSE - Project licensing information
Contains the systematic literature review artifacts:
- publications.xlsx - Comprehensive database of reviewed publications
- datasets.csv - Extracted dataset information from literature review
- ORKG comparative.csv - Comparative analysis with Open Research Knowledge Graph
Processed and curated dataset information:
- datasets - datasets.csv - Main dataset catalog with comprehensive metadata, including:
  - Dataset identification codes and names
  - Descriptions and references
  - Temporal information (publication years)
  - Licensing and artifact type classifications
  - Requirements Engineering stage and task categorizations
  - Domain-specific classifications
  - Size metrics and language specifications
  - Label information and publication mappings
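As a rough illustration of how a catalog like this might be read programmatically, the sketch below parses CSV text with Python's standard `csv` module. The column names (`code`, `name`, `year`, `license`, `re_stage`) and the two sample rows are hypothetical placeholders, not the actual header or content of the catalog file; check the real header before adapting it.

```python
import csv
import io

# Hypothetical excerpt: column names and rows are placeholders,
# not the real header or content of the catalog file.
SAMPLE_CSV = """code,name,year,license,re_stage
D01,Example dataset A,2019,MIT,Elicitation
D02,Example dataset B,2023,Unknown,Specification
"""

def load_catalog(text):
    """Parse CSV text into a list of dicts, converting `year` to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["year"] = int(row["year"])
        rows.append(row)
    return rows

catalog = load_catalog(SAMPLE_CSV)
for entry in catalog:
    print(entry["code"], entry["name"], entry["year"])
```

To read an actual file instead of the inline sample, the same `csv.DictReader` call can be given a file object opened with `newline=""`.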
Generated visualizations and analytical charts:
- dataset_distribution_stacked.png - Stacked distribution of datasets across categories
- dataset_distribution_stacked_high_res.png - High-resolution version of the distribution chart
- re_stage_task_bubble.png - Bubble chart mapping RE stages to tasks
- year_dataset_line.png - Temporal analysis of dataset publication trends
Analysis and visualization generation tools:
- plot.py - Python script for generating statistical visualizations and analytical charts
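The following is not the code in plot.py, only a minimal sketch of the kind of aggregation that could sit behind a publication-trend line chart: counting datasets per year and filling years with no datasets with zero. The list of years is made up for illustration, and the plotting step itself is omitted.

```python
from collections import Counter

# Hypothetical publication years, one per dataset; not real catalog data.
years = [2018, 2020, 2020, 2023, 2023, 2023]

def datasets_per_year(years):
    """Return (year, count) pairs sorted by year, filling gaps with zero."""
    counts = Counter(years)
    return [(y, counts.get(y, 0)) for y in range(min(years), max(years) + 1)]

trend = datasets_per_year(years)
for year, count in trend:
    print(year, count)
```

Filling the gaps keeps the x-axis continuous, so a year without any dataset shows up as a dip rather than being skipped.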
- 🔍 Interactive Dataset Browser - Comprehensive search and filtering capabilities
- 📊 Analytics Dashboard - Dynamic visualizations and statistical analysis
- 🏷️ Advanced Filtering - Multi-dimensional filtering by license, artifact type, granularity, RE stage, task, domain, language, and year
- 📈 Data Visualization - Statistical charts and trend analysis
- 🔗 Reference Integration - Direct links to original publications and dataset sources
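The web application implements its filtering in script.js; the Python sketch below only illustrates one plausible reading of multi-dimensional filtering, in which a record must match every selected dimension and, within a dimension, any of the accepted values. The field names, the sample records, and these matching semantics are all assumptions made for illustration.

```python
# Hypothetical records; field names and values are placeholders.
DATASETS = [
    {"name": "A", "license": "MIT", "re_stage": "Elicitation", "year": 2019},
    {"name": "B", "license": "Unknown", "re_stage": "Specification", "year": 2023},
    {"name": "C", "license": "MIT", "re_stage": "Specification", "year": 2022},
]

def filter_datasets(records, **criteria):
    """Keep records matching every criterion (AND across dimensions).

    Each criterion maps a field name to a set of accepted values
    (OR within a dimension).
    """
    return [
        r for r in records
        if all(r.get(field) in accepted for field, accepted in criteria.items())
    ]

matches = filter_datasets(DATASETS, license={"MIT"}, re_stage={"Specification"})
print([r["name"] for r in matches])
```

Calling `filter_datasets(DATASETS)` with no criteria returns every record, which mirrors a browser view with no filters selected.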
A publicly accessible version of this dataset collection is available at: https://nlp4se.github.io/LLM4RE-Datasets/
AI-Assisted Development: This replication package was developed with the assistance of AI tools, specifically utilizing Claude (Anthropic) for code generation and implementation support. The methodology and data collection processes remain fully transparent and reproducible.
This project is licensed under the terms specified in the LICENSE file. Please refer to the license file for detailed usage permissions and restrictions.
Primary Contact: Quim Motger
Email: joaquim.motger (at) upc (dot) edu
Affiliation: GESSI Research Group, Universitat Politècnica de Catalunya (UPC)