To train and evaluate our models, we change the relational format of the MIMIC-III database to a pivoted view which includes key demographic information, vital signs, and laboratory readings. We also create tables of the possible sepsis onset times for each patient. We then output the pivoted data to comma-separated value (CSV) files, which serve as input for model training and evaluation. The scripts in this directory accomplish the following steps, which we refer to as our data extraction pipeline:
- Initialise a PostgreSQL installation with a new database for storing MIMIC-III data
- Populate the database using MIMIC-III data files
- Change the relational format of the database
- Generate CSV files containing IDs of exemplars in training and testing subsets
- Generate CSV files containing data for model training and testing
Before executing any of the steps in the pipeline, you must first complete the following tasks, which are described on the MIMIC-III website:
- Become a credentialed user on PhysioNet. This involves completion of a half-day online training course in human subjects research.
- Sign the data use agreement (DUA). Adherence to the terms of the DUA is paramount.
- Download the MIMIC-III data locally by navigating to the Files section on the PhysioNet MIMIC-III project page.
- You may alternatively download the files using the `wget` utility:

  ```bash
  wget -r -N -c -np --user username --ask-password https://physionet.org/files/mimiciii/1.4/
  ```

Running the data extraction pipeline requires a Bash command line environment. We have successfully tested the pipeline using Bash 3.2 under macOS Catalina 10.15 and Bash 4.4.20 under Ubuntu Linux 18.04.
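Before proceeding, you may wish to verify the integrity of the downloaded files. PhysioNet typically publishes a SHA256SUMS.txt alongside each project's files; treating its presence as an assumption, a check might look like:

```bash
# Verify downloaded files against PhysioNet's published checksums
# (on macOS, substitute: shasum -a 256 -c SHA256SUMS.txt)
cd physionet.org/files/mimiciii/1.4/
sha256sum -c SHA256SUMS.txt
```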
As a prerequisite for running the Jupyter notebooks which form part of the data extraction pipeline, please install the Python dependencies listed in `requirements.txt`. A typical process for installing the package dependencies involves creating a new Python virtual environment and then, inside the environment, executing

```bash
pip install -r requirements.txt
```

Finally, you will also need a running local PostgreSQL installation. As an alternative to installing PostgreSQL manually on your local machine, we provide scripts for automatically deploying PostgreSQL inside a Docker container (and subsequently removing the container). We have successfully tested the latter approach using Docker Desktop 2.2.0.3 under macOS Catalina 10.15.
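For instance, a minimal setup using Python's built-in `venv` module might look like the following (the environment name `mimic-env` is arbitrary):

```bash
# Create and activate a fresh virtual environment (name is arbitrary)
python3 -m venv mimic-env
source mimic-env/bin/activate

# Install the pinned dependencies from this repository
pip install -r requirements.txt
```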
The populated PostgreSQL database requires around 65 GB of storage space.
Depending on whether you prefer to install PostgreSQL on your machine yourself or to use a Docker container, please proceed with the relevant section below.
To install PostgreSQL locally, we recommend using a package manager if possible, e.g. for Ubuntu Linux

```bash
sudo apt update
sudo apt install postgresql postgresql-contrib
sudo service postgresql start
```

or for Mac

```bash
brew install postgres
brew services start postgresql
```

Another possibility is to install from source by cloning the repository at https://github.com/postgres/postgres. If your operating system does not permit you to follow the steps above, you should consider installing from source instead. Steps to install from source:
- Make a new directory at a suitable location, e.g. `mkdir source`
- Clone the repository: `git clone git@github.com:postgres/postgres.git`
- Go to the cloned directory, then configure, build, and install:

  ```bash
  ./configure --prefix=/path_to_installation_directory
  make
  make install
  ```

- Add the line `export PATH=/path_of_directory_of_installation/bin:$PATH` to your `.bashrc` file, then run `source .bashrc`.
- Create the data directory:

  ```bash
  cd path_of_directory_of_installation
  mkdir data
  cd data
  ```

- Initialise the server with `initdb`.
- Go back to the `.bashrc` file, add the line `export PGDATA=path_of_directory_of_installation/data`, and run `source .bashrc`.
- Now you should be able to start the database server with

  ```bash
  path_of_directory_of_installation/bin/pg_ctl -D path_of_directory_of_installation/data/ -l logfile start
  ```
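To confirm that the server is up and accepting connections, you can use the `pg_isready` utility that ships with PostgreSQL (we assume the default port 5432 here):

```bash
# Check that the PostgreSQL server is accepting connections (default port assumed)
pg_isready -h localhost -p 5432
```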
Check that you have `psql` installed with

```bash
psql -V
```

The script `00_define_database_environment.sh` contains relevant environment variables for connecting to PostgreSQL. It should not be necessary to change the value of any of these variables, with the exception of `MIMIC_DATA_PATH`, which you should change to the path containing the MIMIC-III data files you downloaded previously.
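For example, the edited line in the script might look like this (the path shown is illustrative, and we assume the variable is set via `export`):

```bash
# In 00_define_database_environment.sh: point MIMIC_DATA_PATH at your download
export MIMIC_DATA_PATH=/path/to/physionet.org/files/mimiciii/1.4
```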
To initialise the PostgreSQL installation with a new database for storing MIMIC-III data, run the following command (with e.g. bash):

```bash
./10_initialise_mimic_database.sh
```

Note that this command and all subsequent steps in the pipeline should be run from within this directory. If you are asked to provide a password, try authenticating using the default password `postgres`.
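To verify that initialisation succeeded, you can list the databases on the server; the exact database name is determined by the script, so simply look for a newly created entry:

```bash
# List all databases; a new MIMIC-III database should appear in the output
psql -U postgres -h localhost -c '\l'
```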
Next, populate the newly created database by invoking the script

```bash
./20_load_mimic_database.sh
```

Note that the preceding two steps are based on the instructions for installing MIMIC-III manually, available on the MIMIC website.
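As a rough sanity check after loading, you can count rows in one of the core tables. The database and schema names below are assumptions (the standard MIMIC-III build scripts use a `mimiciii` schema), so adjust them to match your setup:

```bash
# Count admissions as a loading sanity check; MIMIC-III v1.4 contains 58,976.
# Database and schema names are assumptions; adjust to your configuration.
psql -U postgres -h localhost -d mimic -c 'SELECT COUNT(*) FROM mimiciii.admissions;'
```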
Next, change the relational format of the database by invoking the script

```bash
./30_sepsis_time.sh
```

This is the key step in extracting the data that we need to define sepsis onset according to the three definitions in our paper. Quite a few scripts in our sql folder have been adapted from versions found in the MIT-LCP mimic-code and sepsis3-mimic repositories, written by Alistair Johnson.
NB: Please allow several hours for `./20_load_mimic_database.sh` and `./30_sepsis_time.sh` to complete.
Note also that our adaptations and database loading scripts are based on the mimic-code repository at commit 5f563bd40fac781eaa3d815b71503a6857ce9599. We include all required scripts as part of this repository; it is therefore not necessary to check out any of the aforementioned repositories separately.
To generate the CSV files containing IDs of exemplars in the training and testing subsets, open the following notebook in Jupyter and execute all cells:

```
40_patient_split.ipynb
```

To avoid issues with memory consumption, we recommend shutting down the notebook after executing 40_patient_split.ipynb.
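If you prefer to run the notebook non-interactively, one option is `jupyter nbconvert`, which can execute a notebook from the command line:

```bash
# Execute the notebook headlessly, saving the executed copy in place
jupyter nbconvert --to notebook --execute --inplace 40_patient_split.ipynb
```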
To generate the CSV files containing data for model training and testing, first execute the script

```bash
./50_make_ids.sh
```

Once the script has completed, open the following notebooks in Jupyter and execute all cells:

```
60_tables_to_csvs_final.ipynb
70_tables_to_csvs_test_set.ipynb
```

The result of executing these notebooks should be that the CSV files for subsequent model training and testing are reproduced and output to the directory ../../data/raw/.
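You can confirm that the outputs were written as expected by listing the target directory:

```bash
# Confirm that the CSV outputs were written
ls -lh ../../data/raw/*.csv
```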
The scripts for executing the analogous pipeline based on Docker are located inside the directory docker/. In common with the scripts described in the preceding section, all scripts must be executed from inside the aforementioned directory. This is why, in the following, we execute scripts inside a subshell, e.g. `(cd docker && ./10_initialise_mimic_database.sh)`. As an alternative, you may simply use `cd docker`. However, please note that the Jupyter notebooks are located inside the parent directory.
The script `./docker/00_define_database_environment.sh` contains relevant environment variables for connecting to PostgreSQL. It should not be necessary to change the values of `MIMIC_POSTGRES_PORT` and `MIMIC_POSTGRES_PASSWORD`. However, you should change `MIMIC_DATA_PATH` to the path containing the MIMIC-III data files you downloaded previously. In addition, you should change `POSTGRESQL_DATA_PATH` to a newly created directory which will be used to store the database.
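Note that the directory assigned to `POSTGRESQL_DATA_PATH` should exist before you initialise the database; a minimal preparation step (path illustrative) is:

```bash
# Create the host directory that will hold the containerised database files;
# set POSTGRESQL_DATA_PATH in docker/00_define_database_environment.sh to match
mkdir -p /path/to/postgres-data
```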
To initialise the PostgreSQL installation with a new database for storing MIMIC-III data, run the following command:

```bash
(cd docker && ./10_initialise_mimic_database.sh)
```

Next, populate the newly created database by invoking

```bash
(cd docker && ./20_load_mimic_database.sh)
```

Next, change the relational format of the database by invoking

```bash
(cd docker && ./30_sepsis_time.sh)
```

NB: Please allow several hours for 20_load_mimic_database.sh and 30_sepsis_time.sh to complete.
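At any point you can check that the database container is still running with `docker ps`; the container's name is set by the deployment script, so we make no assumption about it here:

```bash
# List running containers; the PostgreSQL container should appear in the output
docker ps
```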
To generate the CSV files containing IDs of exemplars in the training and testing subsets, open the following notebook in Jupyter and execute all cells:

```
40_patient_split.ipynb
```

As before, to avoid issues with memory consumption, we recommend shutting down the notebook after executing 40_patient_split.ipynb.
To generate the CSV files containing data for model training and testing, first execute

```bash
(cd docker && ./50_make_ids.sh)
```

Once the script has completed, open the following notebook in Jupyter and execute all cells:

```
60_tables_to_csvs_final.ipynb
```

As described in the preceding section, the result of executing 60_tables_to_csvs_final.ipynb should be that the CSV files for subsequent model training and testing are reproduced and output to the directory ../../data/raw/.
To remove the Docker container, execute

```bash
(cd docker && ./70_remove_postgres_container.sh)
```

You may also wish to clean up by deleting the contents of the directory you assigned to `POSTGRESQL_DATA_PATH`.
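A final cleanup step might look like the following; the path is illustrative, and you should double-check it before running, as the command is destructive:

```bash
# Delete the database files created by the containerised PostgreSQL
# (verify the path matches your POSTGRESQL_DATA_PATH before running!)
rm -rf /path/to/postgres-data
```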