WasteAnnotator is an automated pipeline designed to extract and annotate components from abandoned GitHub projects. By analyzing the project's dependency graph, the tool identifies components and labels them based on the files they contain. The final output is a structured file detailing the components and their associated files, providing a comprehensive view of the project's architecture.
The WasteAnnotator tool consists of several key modules:
- Finder: Retrieves projects from a repository service (currently GitHub only) based on specified criteria.
- GraphExtractor: Parses the project's dependency graph to identify potential components (currently uses Arcan).
- Annotator: Uses semantic techniques to label and annotate components based on their file contents (currently uses AutoFL).
- CommunityExtractor: Identifies communities within the component structure for further insights (via customizable algorithms from cdlib).
- Exporter: Outputs the processed information in configurable formats (e.g., JSON).
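The way these modules hand data to one another can be sketched as a chain of stages. This is a minimal illustration only: the function names, the shape of the `state` dictionary, and the stage order (taken from the order of the list above) are all assumptions, and each body is a stand-in for the real tool (GitHub queries, Arcan, AutoFL, cdlib).

```python
from functools import reduce
from typing import Callable

# Each stage takes the pipeline state and returns an updated copy.
# The dict-based state is a hypothetical simplification.
Stage = Callable[[dict], dict]


def finder(state: dict) -> dict:
    # Stand-in for project retrieval; the real Finder queries GitHub.
    return {**state, "project": "example/project"}


def graph_extractor(state: dict) -> dict:
    # Stand-in for dependency-graph extraction (Arcan in the real tool).
    edges = [("a.java", "b.java"), ("c.java", "d.java")]
    return {**state, "edges": edges}


def annotator(state: dict) -> dict:
    # Stand-in for file labelling (AutoFL in the real tool).
    files = sorted({f for edge in state["edges"] for f in edge})
    return {**state, "labels": {f: "unlabelled" for f in files}}


def community_extractor(state: dict) -> dict:
    # Stand-in for community detection: each edge becomes its own group.
    communities = [sorted(edge) for edge in state["edges"]]
    return {**state, "communities": communities}


def exporter(state: dict) -> dict:
    # Stand-in for export: keep only the fields a report would contain.
    return {k: state[k] for k in ("project", "communities", "labels")}


def run_pipeline(stages: list[Stage]) -> dict:
    # Thread the state through the stages in order, starting from empty.
    return reduce(lambda state, stage: stage(state), stages, {})


result = run_pipeline(
    [finder, graph_extractor, annotator, community_extractor, exporter]
)
print(result)
```

Keeping every stage behind the same call signature is what would let one implementation be swapped for another without touching the rest of the chain.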
Configurations for each module are defined in YAML files located in the config folder, allowing for easy customization of behavior and parameters.
For each module, new classes can be added by extending the base classes in the directory.
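As a rough sketch of what extending a base class could look like, the example below adds a new export format by subclassing. `BaseExporter`, its `export` method, and both subclasses are hypothetical names chosen for illustration; the project's actual base classes and method signatures may differ.

```python
import json
from abc import ABC, abstractmethod


class BaseExporter(ABC):
    """Hypothetical base class; the project's real base classes may differ."""

    @abstractmethod
    def export(self, components: dict[str, list[str]]) -> str:
        """Serialise a mapping of component name -> file paths."""


class JsonExporter(BaseExporter):
    """Existing-style exporter: emits the mapping as JSON."""

    def export(self, components: dict[str, list[str]]) -> str:
        return json.dumps(components, indent=2, sort_keys=True)


class MarkdownExporter(BaseExporter):
    """A new format added by subclassing, without touching existing code."""

    def export(self, components: dict[str, list[str]]) -> str:
        lines = []
        for name in sorted(components):
            lines.append(f"## {name}")
            lines.extend(f"- {path}" for path in components[name])
        return "\n".join(lines)


components = {"parser": ["src/Lexer.java", "src/Parser.java"]}
print(MarkdownExporter().export(components))
```

Because the abstract method is enforced by `ABC`, a subclass that forgets to implement `export` fails at instantiation rather than midway through a run.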
- Docker v4.25 or higher for containerization.
- Git for repository cloning.
- (Optional) Python 3.10 if running the application outside of Docker.
- (Optional) Gurobi license, if using the Bayan community detection algorithm.
- Clone the Repository
  git clone https://github.com/SasCezar/WasteAnnotator.git
  cd WasteAnnotator
- Set Up Environment Variables
  - (Optional) Create a .env file in the project root to define any necessary environment variables (e.g., GitHub tokens, paths).
- Build and Start Services with Docker
  docker compose up --build
  This will initialize all required services and set up the necessary environment for running the WasteAnnotator pipeline.
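When running outside Docker, the optional .env file has to be read by the application itself. The sketch below shows one way to parse simple `KEY=VALUE` lines with the standard library. The helper names `load_env_file` and `apply_env` are hypothetical, and this README does not say which variable names the pipeline actually expects.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values  # the file is optional, so a missing file is not an error
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of optional quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values: dict[str, str]) -> None:
    """Export parsed values without overriding variables already set."""
    for key, value in values.items():
        os.environ.setdefault(key, value)
```

Calling `apply_env(load_env_file())` at startup would make the values available through `os.environ`; using `setdefault` keeps anything already exported in the shell in control.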
The main entry point for the WasteAnnotator pipeline is src/main.py. The pipeline can be executed either using Docker or directly via Python.
Using Docker:
- Ensure Services are Running
  docker compose up
- Run the Main Pipeline
  - The default service automatically runs main.py within the Docker container, which initiates the component extraction and annotation process.
Using Python directly:
- Install Dependencies with Poetry
  poetry install
- Activate the Poetry Environment
  poetry shell
- Execute the Script
  python src/main.py
Configuration files are located in the config directory, which contains settings for different modules (e.g., annotator, community, exporter). Each YAML file can be customized to alter the behavior of the pipeline components:
- config/main.yaml: The primary configuration file, referencing all module-specific settings.
- Module-specific YAML files: Adjust parameters for finer control, such as finder/github_archived_java.yaml to change GitHub project retrieval criteria.
The pipeline uses Hydra for configuration management, allowing runtime configuration overrides. For example:
python src/main.py finder=custom_finder.yaml graphextractor=arcan.yaml
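To make the override syntax concrete, the sketch below maps each `group=option` argument to the YAML file it would select. It is a simplified illustration of the idea, not Hydra's implementation: `resolve_overrides` is a hypothetical helper, and it assumes each config group is a subdirectory of config, as the finder/github_archived_java.yaml path above suggests.

```python
from pathlib import Path


def resolve_overrides(
    overrides: list[str], config_dir: str = "config"
) -> dict[str, Path]:
    """Map 'group=option' strings to the YAML file each one selects."""
    selected: dict[str, Path] = {}
    for item in overrides:
        group, sep, option = item.partition("=")
        if not sep or not group or not option:
            raise ValueError(f"expected group=option, got {item!r}")
        # Accept the option with or without a .yaml suffix.
        stem = option[:-5] if option.endswith(".yaml") else option
        selected[group] = Path(config_dir) / group / f"{stem}.yaml"
    return selected


paths = resolve_overrides(
    ["finder=custom_finder.yaml", "graphextractor=arcan.yaml"]
)
for group, path in paths.items():
    print(f"{group}: {path.as_posix()}")
```

Under these assumptions, the command above would pick up config/finder/custom_finder.yaml and config/graphextractor/arcan.yaml in place of the defaults referenced from config/main.yaml.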
Contributions are welcome! Please fork the repository and use a feature branch to work on your changes. When ready, submit a pull request for review.
This repository was previously licensed under the MIT License. However, it includes code that is licensed under the GNU General Public License (GPL). As a result, the entire project is now licensed under version 3 of the GPL (GPLv3). All previous and future versions must comply with this license.
Thank you for using and contributing to WasteAnnotator! If you have any questions or need support, please open an issue or contact the maintainers.