molgenis/molgenis-py-consensus

VKGL Data release

This guide will explain every step in the VKGL Data release process. Please follow it step by step to guarantee consistent output for every export.

Prerequisites

  • Install Molgenis Commander
  • Paste these files in .mcmd/scripts
  • Set import_action in the settings section of mcmd.yaml to add
  • Add the output folder of molgenis-py-consensus to dataset_folders in mcmd.yaml
  • Access/permissions for:
    • Gearshift
    • Nibbler (accounts umcg-vkgl-alissa, umcg-vkgl-lumc, and umcg-vkgl-radboud)
    • ClinVar VKGL lab accounts
    • The downloadserver

Step-by-step guide

  1. Request a backup of https://vkgl.molgeniscloud.org/ to be restored on https://vkgl-test.molgeniscloud.org/, and ask for the test server to be turned on.
  2. SSH to Gearshift, go to /groups/umcg-gcc/tmp01/projects/VKGL and create a directory for this release (yyyymm).
  3. Get the latest tag of data-transform-vkgl (every release, we create a new tag that uses the newest release in the helper scripts):
wget https://github.com/molgenis/data-transform-vkgl/archive/refs/tags/data-release-v*insert versionnumber*.zip
unzip data-release-v*insert versionnumber*.zip
  1. Go to the data-release-pipeline folder of the folder you just unzipped and make sure you have execute permissions in the whole directory.
  2. Adjust the filenames in run.sh to fit the filenames of this export (they are vkgl_original_filename.tsv).
  3. Run run.sh like this:
./run.sh -r yyyymm
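A malformed release name is easy to miss until later steps fail, so a small check before calling run.sh can help. The sketch below is not part of the pipeline: is_release_name is a hypothetical helper, and it assumes the release is named after the current month (set the value by hand if it is not).

```shell
# Hypothetical helper (not part of the pipeline): succeeds only for a yyyymm name.
is_release_name() {
  case "$1" in
    [0-9][0-9][0-9][0-9]0[1-9] | [0-9][0-9][0-9][0-9]1[0-2]) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumption: the release is named after the current month; set it by hand otherwise.
release="$(date +%Y%m)"

if is_release_name "$release"; then
  echo "release name: $release"   # pass this value to ./run.sh -r
else
  echo "not a yyyymm release name: $release" >&2
fi
```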
  1. Create a screen. Then validate the output it generated by running the validation script:
screen
ml PythonPlus
python validator/main.py /folder/created/by/run.sh/
  1. If all checks are successful, continue, else rerun run.sh or pinpoint the issue.
  2. Change directory back. Download the most recent version of the molgenis EMX downloader:
wget https://registry.molgenis.org/repository/maven-releases/org/molgenis/downloader/*insert version here*/downloader-*insert version here*.jar
unzip downloader-*insert newest version here*.zip
  1. Download the current consensus and consensus comments:
java -jar downloader-*insert version here*.jar -f consensus.zip -u https://yourserver/ -a admin vkgl_consensus
  1. Unzip the downloaded zip
unzip consensus.zip
  1. Do the following steps manually:
  • Download the latest version (or the one you need) of molgenis-py-consensus
  • Create an input and output folder for this tool
  • Create a config.txt in the molgenis-py-consensus/config directory. Example:
labs=umcg,umcu,nki,amc,vumc,radboud_mumc,lumc,erasmus
prefix=vkgl_
consensus=consensus
comments=comments
previous=1805,1810,1906,1910,1912,2003,2006,2009,2101,2104,2106,2109
history=consensus_history
input=place/to/store/input/for/molgenis-py-consensus/
output=place/to/store/output/for/molgenis-py-consensus/

Make sure to update previous, input and output; the other values will (almost) always stay the same. You can start from the config of the previous release and append the latest previous export to previous.
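Since only previous, input and output change between releases, the config can also be generated. The sketch below assumes the fixed values from the example above; write_consensus_config is a hypothetical helper, not something molgenis-py-consensus provides.

```shell
# Hypothetical helper: writes a config.txt with the three values that change per release.
# $1 = target file, $2 = comma-separated previous exports, $3 = input dir, $4 = output dir
write_consensus_config() {
  cat > "$1" <<EOF
labs=umcg,umcu,nki,amc,vumc,radboud_mumc,lumc,erasmus
prefix=vkgl_
consensus=consensus
comments=comments
previous=$2
history=consensus_history
input=$3
output=$4
EOF
}

# Usage sketch (not executed here); the paths are placeholders:
#   write_consensus_config molgenis-py-consensus/config/config.txt \
#     "1805,1810,1906" place/to/store/input/ place/to/store/output/
```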

  • Copy the vkgl_vkgl-lab files and the vkgl_ files of radboud and LUMC to the input dir you specified in the config.
  • Grant permissions to run the consensus scripts:
chmod g+x -R /path/to/molgenis-py-consensus
  • Change dir to molgenis-py-consensus
  • Setup and install molgenis-py-consensus
python -m venv env
source env/bin/activate
pip install -e .
  • Run the preprocessor:
python preprocessing/PreProcessor.py
  • Run the history writer:
python preprocessing/HistoryWriter.py
  1. Make sure you first purge PythonPlus before installing the commander:

    module purge PythonPlus
    

    Then install the commander, following this guide. Add the production and test server as hosts to the commander using the mcmd config add host command.

  2. Configure molgenis commander host to test server.

mcmd config set host
  1. Import the consensus history on your test server:
mcmd import vkgl_consensus_history.tsv
  1. Check your molgenis server to make sure the history is uploaded. There should be variants from the previous export available in the table. If it looks okay, change the mcmd host to production and run the same command.
  2. Go to the folder you want to store the history zip. Download the complete consensus history using the EMX downloader:
java -jar downloader.jar -f consensus_history.zip -u https://yourserver/ -a admin vkgl_consensus_history
unzip consensus_history.zip
  1. Place the vkgl_consensus_history.tsv from the zip in your molgenis-py-consensus input directory.
  2. Go to the directory of molgenis-py-consensus, load PythonPlus again and install the consensus script:
ml PythonPlus
python -m venv env
source env/bin/activate
pip install -e .
  1. Make sure you still have a screen running (screen -ls). Then run the consensus script:
python consensus
  1. The script takes over 24 hours to run (and longer with every export), so detach your screen by pressing ctrl+a d. That way a dropped internet connection cannot interrupt the run.

  2. Now you have plenty of time to create the ClinVar files. To do this, get the most recent version (or another version, if you wish) of vkgl-clinvar.

wget https://github.com/molgenis/vkgl-clinvar/releases/download/*vx.y.z*/vkgl-clinvar-writer.jar
  1. Go to ClinVar, select Submit in the dropdown and click Submission portal (this way you will be redirected to the correct overview). Download SUB*submission id*_(100)_submitter_report_B.txt of the previous ClinVar submits for all labs.

  2. Create an output folder for your ClinVar submit files.

  3. Run the script:

java -jar vkgl-clinvar-writer.jar -i /output/of/runs.sh/consensus/consensus.tsv -m /path/to/files/from/previous/clinvar/submit/*export name*_DUPLICATED_identifiers.tsv,/path/to/files/from/previous/clinvar/submit/*export name*_REMOVED_identifiers.tsv,/path/to/files/from/previous/clinvar/submit/*export name*_UNCHANGED_identifiers.tsv,/path/to/files/from/previous/clinvar/submit/*export name*_UPDATED_identifiers.tsv -c amc=/path/to/SUB*submission id*_(100)_submitter_report_B.txt,lumc=/path/to/SUB*submission id*_(100)_submitter_report_B.txt,nki=/path/to/SUB*submission id*_(100)_submitter_report_B.txt,umcg=/path/to/SUB*submission id*_(100)_submitter_report_B.txt,radboud_mumc=/path/to/SUB*submission id*_(100)_submitter_report_B.txt,umcu=/path/to/SUB*submission id*_(100)_submitter_report_B.txt,vumc=/path/to/SUB*submission id*_(100)_submitter_report_B.txt,erasmus=/path/to/SUB*submission id*_(100)_submitter_report_B.txt -o /output/folder/for/clinvar/data -r mon_yyyy -f
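The -m and -c arguments are long comma-separated lists, and the report file names contain parentheses, which a shell rejects unless they are quoted. One way to keep the command manageable is to list the entries in text files and join them. This is only a sketch: join_with_commas is a hypothetical helper, and mappings.txt and reports.txt are assumed files that you create yourself.

```shell
# Hypothetical helper: joins the non-empty lines of file "$1" with commas.
join_with_commas() {
  grep -v '^[[:space:]]*$' "$1" | paste -sd, -
}

# Usage sketch (not executed here). mappings.txt is assumed to hold one
# identifiers file per line, reports.txt one lab=/path/to/report per line.
# The double quotes keep the parentheses in the report names from being
# interpreted by the shell:
#
#   java -jar vkgl-clinvar-writer.jar \
#     -i /output/of/runs.sh/consensus/consensus.tsv \
#     -m "$(join_with_commas mappings.txt)" \
#     -c "$(join_with_commas reports.txt)" \
#     -o /output/folder/for/clinvar/data -r mon_yyyy -f
```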
  1. Check if your consensus script is still running:
screen -r

If everything is okay, press ctrl+a d again. Now all we need to do is wait until the script is done. After it's done, type deactivate to deactivate the virtual environment.

  1. Make sure you purge the PythonPlus module again and load Python to run the commander.

  2. Change the mcmd host to the test server again. Test if the output works by uploading it to the test server:

mcmd run vkgl_cleanup_consensus
mcmd run vkgl_cleanup_labs
mcmd delete --data vkgl_comments -f
mcmd import vkgl_comments.csv
mcmd run vkgl_import_labs
mcmd import vkgl_consensus_comments.csv
mcmd import vkgl_consensus.csv
mcmd delete --data vkgl_public_consensus -f
mcmd import vkgl_public_consensus.csv

If you get a 504 error during this process, you can restart an mcmd script from a certain line using the --from-line option. Keep retrying until you don't get the 504 anymore. Trust me, it will work.
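For single commands, the retrying can be wrapped in a small loop. The retry function below is a hypothetical helper, not an mcmd feature. It also assumes that re-running the failed command is safe, which you should check for the import in question.

```shell
# Hypothetical helper: re-runs a command until it succeeds or the attempts run out.
# $1 = maximum number of attempts, $2 = seconds to wait between attempts, rest = command
retry() {
  max="$1"; delay="$2"; shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage sketch (not executed here): try the import up to 5 times, one minute apart.
#   retry 5 60 mcmd import vkgl_consensus.csv
```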

  1. Time to upload to production. Set a message on the homepage by editing the home row in sys_StaticContent (don't do it via the home page, it will mess up everything!):
<div class="alert alert-warning" role="alert">
    We are currently working on the new VKGL data release, which means some data might be missing or incorrect. As soon as
    this message is no longer on our homepage, the release is updated and safe to use. We thank you for your understanding
    and patience.
</div>
  1. Set the mcmd host to production and run the following commands on production:
mcmd run vkgl_cleanup_consensus
mcmd run vkgl_cleanup_labs
mcmd delete --data vkgl_comments -f
mcmd import vkgl_comments.csv
mcmd run vkgl_import_labs
mcmd import vkgl_consensus_comments.csv
mcmd import vkgl_consensus.csv
mcmd delete --data vkgl_public_consensus -f
mcmd import vkgl_public_consensus.csv
  1. Update the counts page by editing the news row in the sys_StaticContent table and paste the counts.html that was produced by molgenis-py-consensus.

  2. Upload the public consensus to the downloadserver and update the downloads page in the static content table.

  3. Update the name of the export in the menu.

  4. Update the name of the export on the homepage and remove the message.

  5. Email the contact persons for acceptance testing.

  6. Once accepted, do the ClinVar submissions. Need help?

  7. Send the email notifying everyone that the VKGL release is done and the ClinVar submission is in progress.

  8. Create a raw folder in the LUMC directory on nibbler (use FileZilla). Place the contents of the preprocessed folder of run.sh in there.

  9. Create a consensus folder in the Radboud directory on nibbler. Place the vkgl_consensus file as generated by molgenis-py-consensus in there.

  10. Create a new directory for the next export yyyymm on all 3 accounts on nibbler.

  11. Send the error files to all labs.

  12. Make a vip directory in your VKGL release directory.

  13. Go to data-transform-vkgl/dir/data-release-pipeline/utils and generate the gene IDs for the consensus files:

./vkgl_consensus_add_gene_id.sh -i /path/to/molgenis/output/vkgl_consensus.tsv -o /path/to/vipdir/molgenis-consensus-output/vkgl_consensus.tsv -g ../datetimeoftransformeddata/downloads/hgnc_genes_20210920.tsv
./vkgl_consensus_add_gene_id.sh -i /path/to/molgenis/output/vkgl_public_consensus.tsv -o /path/to/vipdir/molgenis-consensus-output/vkgl_public_consensus.tsv -g ../datetimeoftransformeddata/downloads/hgnc_genes_20210920.tsv
  1. Get the tsv-vcf-converter in the utils folder of data-transform:
wget https://github.com/molgenis/tsv-vcf-converter/releases/download/vx.y.z/tsv-vcf-converter.jar
  1. Do the liftover for the vip data:
./liftover_vkgl_consensus.sh /path/to/vipdir/vkgl_public_consensus.tsv /path/to/vipdir/vkgl_public_consensus_hg38.tsv
./liftover_vkgl_consensus.sh /path/to/vipdir/vkgl_consensus.tsv /path/to/vipdir/vkgl_consensus_hg38.tsv
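Both liftover calls follow the same naming pattern: the output is the input name with _hg38 inserted before the .tsv extension. A minimal sketch of that pattern is shown below. hg38_name is a hypothetical helper and vipdir an assumed variable; neither is part of data-transform-vkgl.

```shell
# Hypothetical helper: derives the hg38 output name from a .tsv input name.
hg38_name() {
  printf '%s\n' "${1%.tsv}_hg38.tsv"
}

# Usage sketch (not executed here); vipdir is an assumed variable:
#   for f in vkgl_public_consensus.tsv vkgl_consensus.tsv; do
#     ./liftover_vkgl_consensus.sh "$vipdir/$f" "$(hg38_name "$vipdir/$f")"
#   done
```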
  1. Ask the VIP team to perform the following step: deploy the data of the vip directory (with _mmyyyy as postfix) on the clusters:

    Fender

    Public only, GRCh37 + GRCh38

    /apps/data/VKGL/GRCh37/
    /apps/data/VKGL/GRCh38/
    

    Gearshift

    public + private + artefacts, GRCh37 + GRCh38

    /apps/data/VKGL/GRCh37/
    /apps/data/VKGL/GRCh38/
    

    zf-ds

    public + private + artefacts, GRCh37 only

    /apps/data/VKGL/GRCh37/
    
  2. Persist the data on the Gearshift cluster. Create a new folder with a name like yyyymm in the /groups/umcg-gcc/prm03/projects/VKGL/ folder.

  3. In this directory make the following folders:

  • clinvar
  • molgenis
  • raw
  • data-transform
  • vip
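The folders above, together with the subfolders that the next steps ask for (input and output under molgenis, generated and reports under clinvar), can be created in one go. The sketch below uses a hypothetical helper, make_release_layout, that takes the release directory as its only argument.

```shell
# Hypothetical helper: creates the archive layout for one release.
# $1 = release directory, for example /groups/umcg-gcc/prm03/projects/VKGL/yyyymm
make_release_layout() {
  mkdir -p "$1/clinvar/generated" "$1/clinvar/reports" \
           "$1/molgenis/input" "$1/molgenis/output" \
           "$1/raw" "$1/data-transform" "$1/vip"
}

# Usage sketch (not executed here):
#   make_release_layout /groups/umcg-gcc/prm03/projects/VKGL/yyyymm
```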
  1. Fill the raw folder by copying all raw data you got from Radboud/MUMC, LUMC and all Alissa data.
  2. Fill the data-transform folder by copying all data from data-transform-vkgl/data-release-pipeline/dateofyourdata into it using cp -r.
  3. In the molgenis folder create an input and an output folder.
  4. In the input directory place all content of the input folder of molgenis-py-consensus.
  5. In the output directory place all content of the output folder of molgenis-py-consensus.
  6. Fill the vip folder with all data in the vip folder on tmp.
  7. Go to the clinvar folder and create a generated folder and a reports folder in there.
  8. Copy all content of the folder with the output of the vkgl-clinvar-writer to the generated folder.
  9. When ClinVar is done evaluating, go to the submission portal and download all reports and place them in the reports folder. Prefix them with the correct lab name.
  10. Create a versions.txt in /groups/umcg-gcc/prm03/projects/VKGL/yyyymm/ with the following content:
    DATA:
    Alissa: yyyymmdd
    Radboud/MUMC: yyyy-mm-dd
    LUMC: yyyy-mm-dd
    HGNC genes file: yyyymmdd
    
    Scripts:
    data-transform-vkgl and run.sh: data-release-vx.y.z
    tsv-vcf-converter: vx.y.z
    vkgl-clinvar: vx.y.z
    molgenis-py-consensus: x.y.z
    molgenis-tools-emx-downloader: x.y.z
    molgenis-tools-commander: vx.y.z
    
    Fill in the versions used for this export.
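A quick check can catch template placeholders that were left in versions.txt. This is a sketch under the assumption that the template above is used unchanged; has_unfilled_placeholders is a hypothetical helper.

```shell
# Hypothetical helper: succeeds if file "$1" still contains template placeholders
# such as x.y.z, yyyymmdd or yyyy-mm-dd.
has_unfilled_placeholders() {
  grep -Eq 'x\.y\.z|yyyy' "$1"
}

# Usage sketch (not executed here):
#   if has_unfilled_placeholders versions.txt; then
#     echo "versions.txt still contains placeholders" >&2
#   fi
```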