VMware has ended active development of this project; this repository will no longer be updated.
The ML Conversational Analytic Tool is a proof of concept (POC) machine learning framework to automatically assess pull request comments and reviews for constructive and inclusive communication.
This repo contains experimental code for discussion and collaboration and is not ready for production use.
Constructive and inclusive communication ensures a productive and healthy working environment in open source communities. In open source, communication happens in many forms, including pull requests, whose text-based conversations are crucial to open source collaboration. The ML Conversational Analytic Tool identifies constructive and inclusive pull requests to foster a healthier open source community.
- Python 3.6+
A virtualenv or a similar tool for creating an isolated Python environment is recommended for this project.
- Install `virtualenv`

  ```
  pip install virtualenv
  ```

- Set up ML Conversational Analytic Tool in a virtualenv

  ```
  python -m venv virtualenv-ml-conversational
  ```

- Activate the virtualenv

  ```
  source ./virtualenv-ml-conversational/bin/activate
  ```

- Update pip

  ```
  pip install --upgrade pip
  ```

- Install required python libraries by running the command below

  ```
  pip install -r requirements.txt
  ```
- Run all unit tests

  ```
  python -m unittest discover -s tests
  ```

- Run an individual unit test

  ```
  python -m unittest tests/<file_name>
  ```

- Run the tests by using tox

  ```
  python -m pip install --upgrade tox
  tox
  ```
The libraries used within the project are listed in `requirements.txt`.
`githubDataExtraction.py` extracts raw data from GitHub based on parameters passed in by the user. To run the script successfully, a GitHub access token is required and must be set as an environment variable.

Note: The GitHub API enforces rate limits. Please read more about GitHub API Rate Limits for details before extracting data from a GitHub repo.

```
export GH_TOKEN=<YOUR_TOKEN>
```
Run the script by passing in the `organization`:

```
python ./mcat/githubDataExtraction.py <organization>
```

- `organization` is the name of the repository owner.
- (optional) `--repo` is the name of the repository; all repositories in the organization are extracted if it is not included.
- (optional) `--reactions` is a flag to extract comment and review reactions.
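For example, to extract a single repository including reactions (the organization and repository names here are purely illustrative):

```
python ./mcat/githubDataExtraction.py vmware --repo ml-conversational-analytic-tool --reactions
```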
`github_data.py` prepares your data for annotation use. Run the script by passing in the path to `rawdatafile`:

```
python ./mcat/github_data.py <rawdatafile> --name <output_filename>
```

- `rawdatafile` is the location of the raw data CSV.
- (optional) `--name` is the output filename.
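For example, with hypothetical file names:

```
python ./mcat/github_data.py rawData.csv --name annotation_input
```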
The quality of the data and the model depends heavily on annotation best practices. To annotate the extracted raw data, we recommend using Data Annotator For Machine Learning.
`featureVector.py` creates a feature vector based on the `rawdatafile` and an optional `words` file. Default features include sentiment and code blocks. The `words` file contains words that are important in measuring inclusiveness and constructiveness. This functionality can be used instead of manual annotation.

```
python ./mcat/featureVector.py <rawdatafile> --words <words_filename> --name <output_filename>
```

- (optional) `--words` is the path to the words file.
- (optional) `--name` is the name of the output file.
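An example invocation, again with hypothetical file names:

```
python ./mcat/featureVector.py rawData.csv --words words.txt --name features
```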
After both raw and annotated datasets are available, models can be trained to predict Constructiveness and Inclusiveness.
There are two models available for training:

- `BaseCNN`
- `BaseLSTM`

To train, run the script with the required parameters: the path to `annotated_filename`, `dataset_filename`, `model`, and `outcome`.

```
python ./mcat/run.py <annotated_filename> <dataset_filename> <model> <outcome>
```

- `annotated_filename` is the location of the annotated dataset file.
- `dataset_filename` is the location of the raw data.
- `model` is the type of model and can be 'LSTM' or 'CNN'.
- `outcome` can be 'Constructive', 'Inclusive' or 'Both'.
- (optional) `-save NAME` saves the trained model; an output `NAME` must be specified. The model is saved in the `models/name-outcome` directory.
- (optional) `-save_version VERSION` saves the model using the given `NAME` and `VERSION` when `-save NAME` is specified. The parameter is ignored if `-save NAME` is missing. By default, version `001` is used.
- (optional) `-roleRelevant` indicates that the generated encoding should be a stacked matrix representing user roles in the conversation. If it is not set, a single matrix representing each comment/review without the role is generated.
- (optional) `-pad` indicates that the number of comments/reviews should be padded to a constant value. This argument must be set for `CNN` and must not be set for `LSTM`. An example invocation combining several of these options is shown below.
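For example, to train a CNN on both outcomes and save it (the file and model names here are hypothetical):

```
python ./mcat/run.py annotated.csv rawData.csv CNN Both -save myModel -save_version 002 -pad
```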
Both `BaseCNN` and `BaseLSTM` also have prediction explanation mechanisms that can be accessed through the `.explain(obs)` method in both classes.
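A minimal sketch of where `.explain(obs)` fits in; everything except the `.explain(obs)` call itself is an assumption about the surrounding pipeline, not the project's confirmed API:

```python
# Minimal sketch, not a definitive implementation. Only .explain(obs) is
# documented in this README; the import path, the no-argument constructor,
# and the observation format are all assumptions.
from mcat.baseCNN import BaseCNN  # assumed module layout

model = BaseCNN()                 # assumed constructor
# ... train the model as described above (training API omitted) ...
obs = ...                         # one encoded observation (format assumed)
explanation = model.explain(obs)  # documented explanation entry point
print(explanation)
```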
If you have ideas on how to improve the framework to assess text conversation for constructive and inclusive communication, we welcome your contributions!
Auto-generated API documentation can be found in the `docs/mcat` directory. Run the following command to update the API documentation:

```
PYTHONPATH=./mcat pdoc --html --output-dir docs mcat
```
- Measuring Constructiveness and Inclusivity in Open Source – Part 1
- Measuring Constructiveness and Inclusivity in Open Source – Part 2
- Measuring Constructiveness and Inclusivity in Open Source – Part 3
The ml-conversational-analytic-tool project team welcomes contributions from the community. If you wish to contribute code and you have not signed our contributor license agreement, our bot will update the issue when you open a Pull Request. For any questions about the CLA process, please refer to our FAQ. For more detailed information, refer to CONTRIBUTING.md.
Please remember to read our Code of Conduct and keep it in mind during your collaboration.
Apache License v2.0: see LICENSE for details.