This repository contains the data, data analysis, code, and documentation for the Honours project titled "Generating Authentic Grounded Synthetic Maintenance Work Orders (MWOs)" by Allison Lau (2024). The project aims to generate synthetic MWOs that are grounded in real-world data and authentic to the domain of maintenance engineering. The project is supervised by Prof. Melinda Hodkiewicz, Dr. Caitlin Woods, and Dr. Michael Stewart.
Maintenance Work Orders (MWOs) are technical short texts documenting equipment conditions and failures, often containing confidential data, making real-world datasets scarce for machine learning. To address this, this research generates synthetic MWO sentences using a graph database to query equipment-failure relationships and Large Language Models (GPT-4o mini). The generated data mimics real MWOs by incorporating industry-specific jargon and grammar.
The datasets used in this research are described in the DATASETS section of the repository. The datasets used/analysed in this project are:
To install the required packages for the project, run the following steps:
- Create a virtual environment using
python -m venv venv
- Activate the virtual environment (Windows) using
venv/Scripts/activate
- Install the required packages using
pip install -r requirements.txt
The synthetic data generation pipeline consists of three main steps:
- Equipment-Failure Path Extraction
- MWO Sentence Generation via LLM
- MWO Sentence Humanisation via Rule-based Approach
The code for each step can be found in the respective directories in the repository.
The code for extracting equipment-failure paths from the MaintIE Knowledge Graph can be found in the PathExtraction directory. The following steps are performed:
We analysed the MaintIE gold standard dataset using a Neo4j graph database to develop paths that illustrate relationships between equipment (PhysicalObject) and failure modes (UndesirableEvent). These paths are classified as direct, showing immediate connections, or complex, involving intermediary entities. Additional paths are created by using hierarchical relations between equipment entities. This framework improves our understanding of component relationships within MWOs and informs the synthetic data generation process.
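The distinction between direct and complex paths can be illustrated with a small in-memory graph. This is only a toy sketch in plain Python, not the Cypher queries used in the repository: the node names, the edges, and the helper `classify_paths` are invented for illustration, and only the type names PhysicalObject and UndesirableEvent come from the description above.

```python
from collections import defaultdict

# Toy data (hypothetical): node names and edges are invented for illustration.
NODE_TYPES = {
    "pump": "PhysicalObject",
    "seal": "PhysicalObject",
    "leaking": "UndesirableEvent",
}
EDGES = [
    ("seal", "leaking"),  # equipment connected straight to a failure mode
    ("pump", "seal"),     # equipment connected to an intermediary entity
]


def classify_paths(node_types, edges):
    """Return (direct, complex) paths from PhysicalObject to UndesirableEvent.

    direct:  one hop, equipment -> failure mode
    complex: two hops, equipment -> intermediary -> failure mode
    """
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)

    direct, complex_paths = [], []
    for start, kind in node_types.items():
        if kind != "PhysicalObject":
            continue
        for middle in adjacency[start]:
            if node_types[middle] == "UndesirableEvent":
                direct.append((start, middle))
            for end in adjacency[middle]:
                if node_types[end] == "UndesirableEvent":
                    complex_paths.append((start, middle, end))
    return direct, complex_paths


direct_paths, complex_paths = classify_paths(NODE_TYPES, EDGES)
print("direct:", direct_paths)
print("complex:", complex_paths)
```

In the actual pipeline this matching is done by Neo4j over the MaintIE graph; the sketch only shows what the two path categories mean.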
- Make sure you have Neo4j Desktop installed on your machine.
- Open Neo4j Desktop, create a project and an instance of a graph database.
- Create a .env file in the root directory and add the following environment variables:
NEO4J_URI="bolt://localhost:7687"
NEO4J_USER="neo4j"
NEO4J_PASSWORD="password"
- Replace the NEO4J_URI, NEO4J_USER, and NEO4J_PASSWORD values with your Neo4j instance details.
- Start the database and open the browser.
- Run
python maintie_to_kg.py
to load the MaintIE dataset into Neo4j. Note: loading can take a while.
- Queries for extracting paths are stored in path_queries.py.
- Run the path_matching.ipynb notebook to extract paths from Neo4j.
- The extracted paths are stored in their respective JSON files in the path_patterns directory.
- The path analysis covers the total number of paths, the frequency of equipment, the frequency of undesirable events, and the frequency of the inherent function of PhysicalObjects.
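The kind of frequency analysis listed above could look roughly like the following. This is a minimal sketch under assumptions: the record keys `equipment` and `event` and the helper `summarise_paths` are hypothetical, and the JSON files in path_patterns may use a different schema.

```python
from collections import Counter

# Hypothetical path records; the real JSON schema may differ.
paths = [
    {"equipment": "pump", "event": "leaking"},
    {"equipment": "pump", "event": "seized"},
    {"equipment": "seal", "event": "leaking"},
]


def summarise_paths(records):
    """Count paths and tally how often each equipment and event appears."""
    return {
        "total_paths": len(records),
        "equipment_frequency": Counter(r["equipment"] for r in records),
        "event_frequency": Counter(r["event"] for r in records),
    }


summary = summarise_paths(paths)
print("total paths:", summary["total_paths"])
print("equipment:", summary["equipment_frequency"].most_common())
print("events:", summary["event_frequency"].most_common())
```

To apply this to real data, the records would first be loaded from the JSON files with `json.load` and mapped to whatever keys those files actually contain.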
The code for generating synthetic MWO sentences using LLM can be found in the Generate directory.
- Create a .env file in the root directory and add the following environment variables:
API_KEY="your_openai_api_key"
- Run
python llm_generate.py
to generate synthetic MWO sentences using GPT-4o mini.
- Functions used to generate synthetic MWO sentences: generate_mwo() and generate_diverse_mwo()
- Generated synthetic MWO sentences are stored in the mwo_sentences directory. There is a log file (log.txt) detailing the given equipment + failure mode and the generated sentences. There is also a csv file (order_synthetic.csv) containing just the generated synthetic MWO sentences.
- You can alter the number of path samples by changing the num_samples parameter in the get_samples() function. You can also exclude certain path types by including their path names (json file) in the exclude list in the get_samples() function.
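The effect of the num_samples and exclude settings can be sketched as follows. This is not the repository's get_samples() implementation, whose signature is not shown here: `sample_paths`, its `seed` argument, the file names, and the assumption that num_samples applies per path file are all hypothetical.

```python
import random


def sample_paths(paths_by_file, num_samples, exclude=(), seed=None):
    """Draw up to num_samples paths from each path file not listed in exclude.

    paths_by_file maps a (hypothetical) JSON file name to its list of paths.
    """
    rng = random.Random(seed)  # seeded generator keeps the sketch reproducible
    samples = {}
    for filename, paths in paths_by_file.items():
        if filename in exclude:
            continue  # skip path types the caller does not want
        k = min(num_samples, len(paths))  # never ask for more than exist
        samples[filename] = rng.sample(paths, k)
    return samples


# Invented example data standing in for the contents of path_patterns.
example = {
    "direct.json": ["pump leaking", "seal worn", "hose split"],
    "complex.json": ["engine oil leaking", "gearbox bearing seized"],
}
picked = sample_paths(example, num_samples=2, exclude=["complex.json"], seed=0)
print(picked)
```

Sampling per file keeps every included path type represented; sampling from the pooled paths would be an equally plausible reading of the parameter.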
Note: More documentation details for function implementations can be found in the DOCUMENTATION section of the repository.
The code for humanising synthetic MWO sentences can be found in the Humanise directory.
- Run
python humanise.py
to test humanising synthetic MWO sentences using a rule-based approach.
- Function used to humanise synthetic MWO sentences: humanise_sentence()
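A rule-based humanisation step could look something like the sketch below. The rules here (lower-casing, dropping articles and the trailing full stop, abbreviating a few words) and the abbreviation table are invented for illustration; they are not the rules implemented in humanise_sentence(), and `humanise_sketch` is a hypothetical name.

```python
import re

# Invented abbreviation table; the project's actual rules may differ.
ABBREVIATIONS = {"replace": "repl", "hydraulic": "hyd", "change out": "c/o"}
ARTICLES = re.compile(r"\b(the|a|an)\b\s*", flags=re.IGNORECASE)


def humanise_sketch(sentence):
    """Apply a few terse-writing rules typical of short technical text."""
    text = sentence.lower().rstrip(".")  # lower-case, drop trailing full stop
    text = ARTICLES.sub("", text)        # drop articles
    for full, short in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(full)}\b", short, text)
    return " ".join(text.split())        # normalise whitespace


print(humanise_sketch("Replace the hydraulic pump."))
```

Keeping each rule as a separate substitution makes it easy to switch individual rules on or off when comparing humanised output against real MWOs.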
Note: More documentation details for function implementations can be found in the DOCUMENTATION section of the repository.
The code for evaluating the synthetic MWO sentences can be found in the Evaluation directory. More details on the evaluation can be found in the EVALUATION section of the repository. The following evaluations are performed:
- Paths and Synthetic MWO Analysis
- Turing Test
- Ranking Test (replicated from Bikaun et al. 2022)
The synthetic MWOs generated over the course of the project can be found in different files:
- Generate/mwo_sentences/order_synthetic.txt: synthetic MWO sentences generated, including the inherent function of the equipment, the equipment, and the failure mode
- Generate/mwo_sentences/synthetic.txt: shuffled version of the above
- Generate/mwo_sentences/log.txt: logs of MWO sentences generated, including the path (equipment + failure mode) and the number of sentences generated
- Evaluation/Turing2/synthetic_generate_v2.txt: synthetic MWO sentences generated for the Turing Test evaluation
- Evaluation/Turing2/synthetic_humanise_v2.txt: humanised synthetic MWO sentences for the Turing Test evaluation