Data Extraction Pipeline

To train and evaluate our models, we will change the relational format of the MIMIC-III database to a pivoted view which includes key demographic information, vital signs, and laboratory readings. We will also create tables for the possible sepsis onset times of each patient. We will subsequently output the pivoted data to comma-separated value (CSV) files, which serve as input for model training and evaluation. The scripts in this directory accomplish the following steps, which we refer to as our data extraction pipeline:

  • Initialise a PostgreSQL installation with a new database for storing MIMIC-III data
  • Populate the database using MIMIC-III data files
  • Change the relational format of the database
  • Generate CSV files containing IDs of exemplars in training and testing subsets
  • Generate CSV files containing data for model training and testing

Preliminaries

Before executing any of the steps in the pipeline, you must first complete the tasks described on the MIMIC-III website: become a credentialed PhysioNet user and sign the data use agreement for the MIMIC-III Clinical Database. You can then download the data files, e.g. with

wget -r -N -c -np --user username --ask-password https://physionet.org/files/mimiciii/1.4/
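PhysioNet downloads typically ship with a SHA256SUMS.txt checksum file (an assumption worth checking against your own copy); a small helper like the following can verify the integrity of the downloaded files:

```shell
# Verify downloaded MIMIC-III files against the checksum list shipped with them.
# Assumes the download directory contains a SHA256SUMS.txt file, as PhysioNet
# distributions typically do; adjust the filename if your copy differs.
verify_mimic_checksums() {
  dir="$1"   # e.g. physionet.org/files/mimiciii/1.4
  if [ ! -f "$dir/SHA256SUMS.txt" ]; then
    echo "no SHA256SUMS.txt in $dir; skipping verification" >&2
    return 1
  fi
  (cd "$dir" && sha256sum -c --quiet SHA256SUMS.txt)
}
```

For example, `verify_mimic_checksums physionet.org/files/mimiciii/1.4` prints nothing on success and reports any file whose checksum does not match.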

Requirements

Running the data extraction pipeline requires a Bash command-line environment. We have successfully tested the pipeline using Bash 3.2 under macOS Catalina 10.15 and Bash 4.4.20 under Ubuntu Linux 18.04.

As a prerequisite for running the Jupyter notebooks that are part of the data extraction pipeline, please install the Python dependencies listed in requirements.txt. A typical approach is to create a new Python virtual environment and then, inside the environment, execute

pip install -r requirements.txt
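Creating and activating such an environment looks like this (the name mimic-env is arbitrary; run the pip command above once the environment is active):

```shell
# Create an isolated Python environment named "mimic-env" (name is arbitrary)
python3 -m venv mimic-env
# Activate it for the current shell session (bash/zsh syntax)
. mimic-env/bin/activate
```

Deactivate later with `deactivate`; the environment is just a directory and can be deleted when no longer needed.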

Finally, you will also need a running local PostgreSQL installation. As an alternative to installing PostgreSQL manually on your local machine, we provide scripts for automatically deploying PostgreSQL inside a Docker container (and subsequently removing the container). We have successfully tested the latter approach using Docker Desktop 2.2.0.3 under macOS Catalina 10.15.

The populated PostgreSQL database requires around 65 GB of storage space.
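Given that footprint, it may be worth checking free space before loading the data; a minimal sketch (the 65 GB threshold follows the figure above, the exact margin you need is an assumption):

```shell
# Warn if the filesystem holding the current directory has less than ~65 GB free
required_kb=$((65 * 1024 * 1024))              # 65 GB expressed in KiB
avail_kb=$(df -Pk . | awk 'NR==2 {print $4}')  # POSIX -P keeps each mount on one line
if [ "$avail_kb" -lt "$required_kb" ]; then
  echo "WARNING: only $((avail_kb / 1024 / 1024)) GB free; at least 65 GB recommended" >&2
fi
```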

Depending on whether you prefer to install PostgreSQL on your machine yourself or to use a Docker container, proceed with the relevant section below.

Local PostgreSQL Installation

Option 1: Install from Package

To install PostgreSQL locally, we recommend using a package manager if possible, e.g. for Ubuntu Linux

sudo apt update
sudo apt install postgresql postgresql-contrib
sudo service postgresql start

or for Mac

brew install postgresql
brew services start postgresql

Option 2: Install from Source

Another possibility is to install from source by cloning the repository at https://github.com/postgres/postgres. If your operating system does not permit installation via a package manager, you should consider this route. Steps to install from source:

  • Make a new directory at a suitable location, e.g. mkdir source
  • Clone the repository: git clone git@github.com:postgres/postgres.git
  • Change into the cloned directory
  • ./configure --prefix=/path_to_installation_directory
  • make
  • make install
  • Add the line export PATH=$PATH:/path_to_installation_directory/bin to your .bashrc file.
  • source ~/.bashrc
  • cd /path_to_installation_directory
  • mkdir data
  • Initialise the data directory with initdb -D /path_to_installation_directory/data
  • Go back to the .bashrc file and add the line export PGDATA=/path_to_installation_directory/data
  • source ~/.bashrc
  • Now you should be able to start the database server with /path_to_installation_directory/bin/pg_ctl -D /path_to_installation_directory/data/ -l logfile start

Test the PostgreSQL Installation

Check that you have psql installed with

psql -V
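Beyond checking the client version, you can confirm that a server is actually accepting connections; pg_isready ships with PostgreSQL (the localhost host is an assumption matching a default local install):

```shell
# Report whether a PostgreSQL server is accepting connections on localhost
if command -v pg_isready >/dev/null 2>&1; then
  pg_isready -h localhost && server_status=up || server_status=down
else
  server_status=unknown   # pg_isready not on PATH; is PostgreSQL installed?
fi
echo "server status: $server_status"
```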

Initialise PostgreSQL installation with a new database for storing MIMIC-III data

The script 00_define_database_environment.sh contains relevant environment variables for connecting to PostgreSQL. It should not be necessary to change the value of any of these variables, with the exception of MIMIC_DATA_PATH, which you should change to the path containing the MIMIC-III data files you downloaded previously.
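For example, if you downloaded the data with the wget command shown earlier, the assignment inside the script might end up looking like this (placeholder path; the exact syntax in 00_define_database_environment.sh may differ):

```shell
# Path to the directory holding the downloaded MIMIC-III data files -- adjust to your setup
MIMIC_DATA_PATH=/path/to/physionet.org/files/mimiciii/1.4
```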

To initialise the PostgreSQL installation with a new database for storing MIMIC-III data, run the following command (with e.g. bash):

./10_initialise_mimic_database.sh

Note that this command and all subsequent steps of the pipeline should be run from within this directory. If you are asked to provide a password, try authenticating with the default password postgres.
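If typing the password repeatedly is inconvenient, the PostgreSQL client libraries honour the PGPASSWORD environment variable; here we assume the default password mentioned above:

```shell
# Make psql-based scripts non-interactive; "postgres" is the assumed default password
export PGPASSWORD=postgres
```

Unset the variable afterwards if you share the shell session, since environment variables are visible to child processes.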

Populate the database using MIMIC-III data files

Next, populate the newly created database by invoking the script

./20_load_mimic_database.sh

Note that the preceding two steps are based on the instructions for installing MIMIC-III manually available on the MIMIC website.

Change the relational format of the database

Next, change the relational format of the database by invoking the script

./30_sepsis_time.sh

This is the key step in extracting the data we need to define sepsis onset according to the three definitions in our paper. Several of the scripts in our sql folder have been adapted from versions found in the MIT-LCP mimic-code and sepsis3-mimic repositories, written by Alistair Johnson.

NB: Please allow several hours for ./20_load_mimic_database.sh and ./30_sepsis_time.sh to complete.

Note also that our adaptations and database loading scripts are based on the mimic-code repository at commit 5f563bd40fac781eaa3d815b71503a6857ce9599. We include all required scripts as part of this repository; it is therefore not necessary to check out either of the aforementioned repositories separately.

Generate CSV files containing IDs of exemplars in training and testing subsets

To generate the CSV files containing IDs of exemplars in training and testing subsets, open the following notebook in Jupyter and execute all cells

40_patient_split.ipynb

To limit memory consumption, we recommend shutting down the notebook after executing 40_patient_split.ipynb.

Generate CSV files containing data for model training and testing

This step requires first executing the script

./50_make_ids.sh

Once the script has completed, open the following notebooks in Jupyter and execute all cells

60_tables_to_csvs_final.ipynb
70_tables_to_csvs_test_set.ipynb
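If you prefer not to run the notebooks interactively, jupyter nbconvert (installed as part of the Jupyter dependencies) can execute them from the command line; this helper is a sketch, not part of the pipeline scripts:

```shell
# Execute a notebook in place, failing if any cell raises an error
run_notebook() {
  jupyter nbconvert --to notebook --execute --inplace "$1"
}
# Usage:
# run_notebook 60_tables_to_csvs_final.ipynb
# run_notebook 70_tables_to_csvs_test_set.ipynb
```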

Executing these notebooks should reproduce the CSV files for subsequent model training and testing and output them to the directory ../../data/raw/.
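A quick way to confirm the step succeeded is to count the CSV files that appeared; since the expected number depends on the extracted cohort, we only check for a non-empty result:

```shell
# Count CSV files produced by the notebooks; ../../data/raw/ is the output directory
csv_count=$(find ../../data/raw -name '*.csv' 2>/dev/null | wc -l)
echo "found $csv_count CSV files in ../../data/raw/"
```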

Deploy PostgreSQL using Docker

The scripts for executing the analogous pipeline based on Docker are located inside the directory docker/. As with the scripts described in the preceding section, all scripts must be executed from inside the aforementioned directory. For this reason, in the following we execute scripts inside a subshell, e.g. (cd docker && ./10_initialise_mimic_database.sh). Alternatively, you may simply use cd docker; note, however, that the Jupyter notebooks are located inside the parent directory.

Initialise PostgreSQL installation with a new database for storing MIMIC-III data

The script ./docker/00_define_database_environment.sh contains relevant environment variables for connecting to PostgreSQL. It should not be necessary to change the values of MIMIC_POSTGRES_PORT and MIMIC_POSTGRES_PASSWORD. However, you should change MIMIC_DATA_PATH to the path containing the MIMIC-III data files you downloaded previously. In addition, you should change POSTGRESQL_DATA_PATH to a newly created directory which will be used to store the database.
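For example (placeholder paths; the exact assignment syntax inside the script may differ):

```shell
# Directory with the downloaded MIMIC-III data files
MIMIC_DATA_PATH=/path/to/physionet.org/files/mimiciii/1.4
# Freshly created, empty directory the container will use for database storage
POSTGRESQL_DATA_PATH=/path/to/postgres-data
```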

To initialise the PostgreSQL installation with a new database for storing MIMIC-III data, run the following command:

(cd docker && ./10_initialise_mimic_database.sh)

Populate the database using MIMIC-III data files

Next, populate the newly created database by invoking

(cd docker && ./20_load_mimic_database.sh)

Change the relational format of the database

Next, change the relational format of the database by invoking

(cd docker && ./30_sepsis_time.sh)

NB: Please allow several hours for 20_load_mimic_database.sh and 30_sepsis_time.sh to complete.

Generate CSV files containing IDs of exemplars in training and testing subsets

To generate the CSV files containing IDs of exemplars in training and testing subsets, open the following notebook in Jupyter and execute all cells

40_patient_split.ipynb

To limit memory consumption, we recommend shutting down the notebook after executing 40_patient_split.ipynb.

Generate CSV files containing data for model training and testing

This step requires first executing

(cd docker && ./50_make_ids.sh)

Once the script has completed, open the following notebook in Jupyter and execute all cells

60_tables_to_csvs_final.ipynb

As described in the preceding section, executing 60_tables_to_csvs_final.ipynb should reproduce the CSV files for subsequent model training and testing and output them to the directory ../../data/raw/.

Remove the Docker container

To remove the Docker container, execute

(cd docker && ./70_remove_postgres_container.sh)

You may also wish to clean up by deleting the contents of the directory you assigned to POSTGRESQL_DATA_PATH.
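Since rm -rf of the wrong path is unrecoverable, a guard against an unset variable is prudent when cleaning up; a minimal sketch (the function name is ours, not part of the pipeline):

```shell
# Remove the on-disk database files created for the container.
# Refuses to run if POSTGRESQL_DATA_PATH is unset or empty, since
# deleting the wrong directory with "rm -rf" is unrecoverable.
cleanup_pgdata() {
  if [ -z "${POSTGRESQL_DATA_PATH:-}" ]; then
    echo "POSTGRESQL_DATA_PATH is not set; refusing to delete anything" >&2
    return 1
  fi
  rm -rf -- "$POSTGRESQL_DATA_PATH"
}
```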