Automatic source code file annotation using weak labeling.
AutoFL is a tool designed for automatic annotation of source code files through weak labeling techniques. It provides both an API and a web-based UI for easy analysis of projects across different languages.
To set up the repository along with its UI submodule, clone it using:
git clone --recursive git@github.com:SasCezar/AutoFL.git AutoFL
For advanced features such as semantic-based labeling, download the required models. For example, to use w2v-so, download the model from here and place it in the data/models/w2v-so folder. Alternatively, you can provide a custom path in the configuration files.
To run the tool using Docker, navigate to the project directory (where the docker-compose.yaml file is located) and execute:
docker compose up
To analyze the files of a project, make a POST request to the following endpoint:
curl -X POST -d '{"name": "<PROJECT_NAME>", "remote": "<PROJECT_REMOTE>", "languages": ["<PROGRAMMING_LANGUAGE>"]}' localhost:8000/label/files -H "content-type: application/json"
For instance, to analyze the project at https://github.com/mickleness/pumpernickel, use:
curl -X POST -d '{"name": "pumpernickel", "remote": "https://github.com/mickleness/pumpernickel", "languages": ["java"]}' localhost:8000/label/files -H "content-type: application/json"
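The same request can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names build_label_request and send_label_request are hypothetical, the timeout value is an assumption, and the response is returned as raw text because its schema is not described here.

```python
import json
import urllib.request


def build_label_request(name, remote, languages, base_url="http://localhost:8000"):
    """Build the POST request for /label/files (mirrors the curl call above)."""
    payload = {"name": name, "remote": remote, "languages": list(languages)}
    return urllib.request.Request(
        url=f"{base_url}/label/files",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_label_request(request, timeout=600):
    """Send the request to a running AutoFL instance and return the raw response text.

    The generous timeout is an assumption: analysis time is not documented.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_label_request(
    name="pumpernickel",
    remote="https://github.com/mickleness/pumpernickel",
    languages=["java"],
)
# Uncomment once the Docker services are up:
# print(send_label_request(request))
```

Building the request separately from sending it makes it possible to inspect the payload before any network call is made.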
AutoFL provides a web-based UI accessible locally at http://localhost:8501.
For more details, check the UI repository.
AutoFL uses Hydra to manage configurations. The configuration files can be found in the config folder. The main configuration file, main.yaml, allows you to customize various options:
- local: Choose between local or Docker environments. Docker is the default.
- taxonomy: Set the taxonomy for labeling. Currently supports gitranking. You can add custom taxonomies.
- annotator: Specify the annotators to use. The default is simple, offering good results without dependencies on language models.
- version_strategy: Select the versioning strategy. The default is latest.
- dataloader: Choose the dataloader. The default is postgres.
- writer: Set the writer for storing results. The default is postgres.
Additional configurations can be added by creating new files in the corresponding component folders.
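Hydra builds the final configuration by selecting one option per config group, and options can be swapped with group=option overrides. The sketch below imitates that selection in plain Python to show how the groups listed above relate; it does not use Hydra's API and is not how AutoFL loads its settings. The apply_overrides helper and the my_taxonomy name are hypothetical, and gitranking is assumed to be the default taxonomy because it is the only one mentioned.

```python
# Documented defaults from main.yaml. `local` is omitted because its value
# format is not shown here; `taxonomy` defaulting to gitranking is an assumption.
DEFAULTS = {
    "taxonomy": "gitranking",
    "annotator": "simple",
    "version_strategy": "latest",
    "dataloader": "postgres",
    "writer": "postgres",
}


def apply_overrides(defaults, overrides):
    """Return a copy of `defaults` updated from 'group=option' strings."""
    config = dict(defaults)
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep or not value:
            raise ValueError(f"expected 'group=option', got {item!r}")
        if key not in config:
            raise KeyError(f"unknown config group: {key}")
        config[key] = value
    return config


# "my_taxonomy" is a hypothetical custom taxonomy added in the taxonomy folder.
config = apply_overrides(DEFAULTS, ["taxonomy=my_taxonomy"])
```

Rejecting unknown groups early mirrors the idea that each option must correspond to a file in an existing component folder.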
Supported functionality:
- Annotation (UI/API/Script)
  - File-Level
  - Package-Level
  - Project-Level
- Batch Analysis (Script Only)
- Temporal Analysis (TODO)
- Classification (TODO)
Supported languages:
- Java
- Python (untested)
- C (untested)
- C++ (untested)
- C# (untested)
AutoFL is composed of multiple components, as shown in the architecture diagram below:
To add support for additional languages, a language-specific parser is required. You can use tree-sitter to develop a parser quickly.
The parser needs to be located in the parser/languages folder. It should extend the ParserBase class, which follows this structure:
from abc import ABC
from pathlib import Path


class ParserBase(ABC):
    """
    Abstract class for a programming language parser.
    """

    def __init__(self, library_path: Path | str):
        """
        :param library_path: Path to the tree-sitter languages.so file. The file has to contain the
        language parser. See tree-sitter for more details
        """
        ...
To implement the parsing logic, create a class that handles extracting identifiers. For Python, the parser might look like:
class PythonParser(ParserBase, lang=Extension.python.name):
    """
    Python-specific parser using a generic grammar for multiple versions. Utilizes tree-sitter for AST extraction.
    """

    def __init__(self, library_path: Path | str):
        ...
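The lang= keyword in the class statement suggests that parsers are registered per language when the subclass is defined. How AutoFL does this internally is not shown here; the following is a minimal sketch of one common way to get that behavior, using Python's __init_subclass__ hook. All names in it (SketchParserBase, ToyPythonParser, parser_for, the identifiers method) are hypothetical.

```python
from abc import ABC, abstractmethod


class SketchParserBase(ABC):
    """Hypothetical base class that records subclasses by language name."""

    registry = {}  # maps a language name to its parser class

    def __init_subclass__(cls, lang=None, **kwargs):
        # Called automatically whenever a subclass is defined; `lang` comes
        # from the keyword given in the class statement.
        super().__init_subclass__(**kwargs)
        if lang is not None:
            SketchParserBase.registry[lang] = cls

    @abstractmethod
    def identifiers(self, source):
        """Return the identifiers found in `source`."""


class ToyPythonParser(SketchParserBase, lang="python"):
    def identifiers(self, source):
        # Placeholder logic: whitespace splitting stands in for real parsing.
        return source.split()


def parser_for(lang):
    """Instantiate the parser registered for `lang`."""
    return SketchParserBase.registry[lang]()
```

With this pattern, defining a new subclass is enough to make it discoverable by language name; no separate registration call is needed.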
A custom parser independent of tree-sitter can also be developed. For more details, refer to the implementation of ParserBase.
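As an illustration of identifier extraction without tree-sitter, the sketch below uses Python's built-in ast module to collect names from Python source. It is a standalone function rather than a ParserBase subclass, because the methods ParserBase requires are not shown here; the function name extract_identifiers is hypothetical.

```python
import ast


def extract_identifiers(source):
    """Collect class, function, argument, variable, and attribute names."""
    tree = ast.parse(source)
    identifiers = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            identifiers.append(node.name)  # definition names
        elif isinstance(node, ast.arg):
            identifiers.append(node.arg)  # function parameters
        elif isinstance(node, ast.Name):
            identifiers.append(node.id)  # variable reads and writes
        elif isinstance(node, ast.Attribute):
            identifiers.append(node.attr)  # attribute accesses
    return identifiers


sample = (
    "class Reader:\n"
    "    def load(self, path):\n"
    "        data = open(path)\n"
    "        return data.read()\n"
)
names = extract_identifiers(sample)
```

The list keeps duplicates, so a consumer can count how often each identifier occurs; ast.walk does not guarantee source order, so callers should not rely on the ordering.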
- Dependency Installation: The setup process may take significant time (~10 minutes), and dependency installations might fail due to timeouts. This appears to be a network-related issue, and retrying often resolves it. Future updates will aim to simplify dependencies.
- Indefinite Analysis Loops: In some projects, the analysis may loop indefinitely. This appears to be resolved in the latest version; we will keep monitoring for further occurrences.
AutoFL is also available as a Docker image. You can pull the image from Docker Hub using:
docker pull cezarsas/autofl
Find more details and updates at the Docker Hub page.
This tool is in active development and may not function as expected in some cases. It has been tested primarily on Docker versions 24.0.7 and 25.0.0 for Ubuntu 22.04. Limited testing has been performed on Windows and macOS, where functionality may vary.
If you encounter any issues, please open an issue on GitHub, make a pull request, or contact me at [email protected].
If you find this tool useful, please cite our work:
@article{sas2024multigranular,
title = {Multi-granular Software Annotation using File-level Weak Labelling},
author = {Cezar Sas and Andrea Capiluppi},
journal = {Empirical Software Engineering},
volume = {29},
number = {1},
pages = {12},
year = {2024},
url = {https://doi.org/10.1007/s10664-023-10423-7},
doi = {10.1007/s10664-023-10423-7}
}
Note: The code used in this paper is available at CodeGraphClassification. However, AutoFL provides enhanced features, is more user-friendly, and includes a UI.
@software{sas2023autofl,
author = {Sas, Cezar and Capiluppi, Andrea},
month = oct,
title = {{AutoFL}},
url = {https://github.com/SasCezar/AutoFL},
version = {0.5.0},
year = {2024},
url = {https://doi.org/10.5281/zenodo.10255368},
doi = {10.5281/zenodo.10255368}
}