Skip to content
This repository was archived by the owner on Jun 21, 2023. It is now read-only.

Add a GISTIC module; start looking at running on specific histologies #535

Merged
merged 32 commits into from
Feb 14, 2020

Conversation

jaclyn-taroni
Copy link
Member

@jaclyn-taroni jaclyn-taroni commented Feb 12, 2020

Purpose/implementation Section

Now that GISTIC is installed on the Docker container (#531), we should run it! This draft pull request is to make sure this still builds and runs in CI using the consensus SEG file. I am including filtering to specific histologies above a certain sample size using the array list file functionality of GISTIC.

I'm also removing large files that I should have removed in #531 to reduce the size of the GISTIC install layer ~1GB (wow!)

What GitHub issue does your pull request address?

Related to #529

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Any bash pointers?

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

set -o pipefail

# Configure environmental variables for MCR
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mcr/v83/runtime/glnxa64:/opt/mcr/v83/bin/glnxa64:/opt/mcr/v83/sys/os/glnxa64:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mcr/v83/runtime/glnxa64:/opt/mcr/v83/bin/glnxa64:/opt/mcr/v83/sys/os/glnxa64:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mcr/v83/runtime/glnxa64:/opt/mcr/v83/bin/glnxa64:/opt/mcr/v83/sys/os/glnxa64

doubt you need this last colon since you're not going to append anything after this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I realize that one was from me back in the previous PR 😁 )

@jharenza
Copy link
Collaborator

@jaclyn-taroni looks like you have everything set!

@jaclyn-taroni
Copy link
Member Author

Hmm something going on with the paths for the reference genome file and yet it doesn’t fail. Wonder if it’s a warning instead of an error...

@jaclyn-taroni
Copy link
Member Author

@jharenza do you have a rule of thumb for the required sample size for GISTIC?

@jaclyn-taroni
Copy link
Member Author

When I run things locally using anything less than the entire cohort I get an error on the broad analysis step:

Focal GISTIC completed without error
Running broad analysis...
Reconstructing genome: amp
Reconstructing genome: aod
Reconstructing genome: del
Reconstructing genome: doa
Calculating median of arm values...
arm 1: 1p 11910 markers
arm 2: 1q 10078 markers
arm 3: 2p 9007 markers
arm 4: 2q 14373 markers
arm 5: 3p 8744 markers
arm 6: 3q 9943 markers
arm 7: 4p 4730 markers
arm 8: 4q 13741 markers
arm 9: 5p 4508 markers
arm 10: 5q 12758 markers
arm 11: 6p 5339 markers
arm 12: 6q 10788 markers
arm 13: 7p 5753 markers
arm 14: 7q 9655 markers
arm 15: 8p 4113 markers
arm 16: 8q 9747 markers
arm 17: 9p 3856 markers
arm 18: 9q 7202 markers
arm 19: 10p 3752 markers
arm 20: 10q 9050 markers
arm 21: 11p 5014 markers
arm 22: 11q 7816 markers
arm 23: 12p 3255 markers
arm 24: 12q 9501 markers
arm 25: 13p 551 markers
arm 26: 13q 9501 markers
arm 27: 14p 441 markers
arm 28: 14q 8545 markers
arm 29: 15p 551 markers
arm 30: 15q 7739 markers
arm 31: 16p 3138 markers
arm 32: 16q 4252 markers
arm 33: 17p 2183 markers
arm 34: 17q 5463 markers
arm 35: 18p 1414 markers
arm 36: 18q 5839 markers
arm 37: 19p 1943 markers
arm 38: 19q 2632 markers
arm 39: 20p 2526 markers
arm 40: 20q 3372 markers
arm 41: 21p 390 markers
arm 42: 21q 3241 markers
arm 43: 22p 511 markers
arm 44: 22q 3182 markers
     1

     2

     3

     4

     5

     6

     7

     8

     9

    10

    11

    12

    13

    14

    15

    16

    17

    18

    19

    20

    21

    22

    23

    24

    25

    26

    27

    28

    29

    30

    31

    32

    33

    34

    35

    36

    37

    38

    39

    40

    41

    42

    43

    44

Error using line
Vectors must be the same lengths.

Error in gistic_broad_analysis (line 209)



Error in run_gistic20 (line 130)



Error in run_gistic2_from_seg (line 249)



Error in gp_gistic2_from_seg (line 97)



MATLAB:samelen
Warning: Objects of specgraph.scattergroup class exist - not clearing this class or any of its superclasses
Warning: Objects of scribe.legendinfo class exist - not clearing this class or any of its superclasses
Warning: Objects of scribe.legendinfochild class exist - not clearing this class or any of its superclasses
Warning: Objects of scribe.legend class exist - not clearing this class or any of its superclasses
Warning: Objects of graphics.panbehavior class exist - not clearing this class or any of its superclasses
Warning: Objects of graphics.zoombehavior class exist - not clearing this class or any of its superclasses
Warning: Objects of graphics.rotate3dbehavior class exist - not clearing this class or any of its superclasses
Warning: Objects of graphics.datacursorbehavior class exist - not clearing this class or any of its superclasses
Warning: Objects of graphics.ploteditbehavior class exist - not clearing this class or any of its superclasses

I wanted to push what I have so I can get feedback on the structure.

@jharenza
Copy link
Collaborator

hmm @jaclyn-taroni - I don't remember if I have seen that before. Do you get output?

It has been a while, but I think when I used to set up the array file to include certain histologies, I would notice that some would have all of the outputs, some did not. When I ran it on a cell line cohort, an N of 39 was too small and I had to essentially "trick" it by duplicating the data 2-3x and it ran.

@jaclyn-taroni
Copy link
Member Author

@jharenza I do get output but not all the output --

analyses/run-gistic/results/pbta-cnv-consensus-lgat-gistic/
├── D.cap1.5.mat
├── all_lesions.conf_90.txt
├── amp_genes.conf_90.txt
├── amp_qplot.pdf
├── amp_qplot.png
├── del_genes.conf_90.txt
├── del_qplot.pdf
├── del_qplot.png
├── focal_dat.0.98.mat
├── freqarms_vs_ngenes.pdf
├── gistic_inputs.mat
├── peak_regs.mat
├── perm_ads.mat
├── raw_copy_number.pdf
├── raw_copy_number.png
├── regions_track.conf_90.bed
├── sample_seg_counts.txt
├── scores.0.98.mat
├── scores.gistic
└── wide_peak_regs.mat

@jashapiro also pointed out that I should use the arrayfile argument instead, so I will try that

@jaclyn-taroni
Copy link
Member Author

Moved to using array list files with f26ca3d but I'm still getting the error for LGAT. I committed and added the array list file I used for that.

@jaclyn-taroni
Copy link
Member Author

I ran GISTIC on the consensus SEG file included in the data download overnight and am happy to report all the text files produced have the same checksums as the GISTIC output in the data download. The PDF checksums change which I expected; a quick visual inspection of a few of them look the same between runs. I don't know much about how .mat files are generated but those changed as well (could be a time stamp issue?).

docker_consensus_gistic_md5sum.txt
data_download_consensus_gistic_md5sum.txt

@cgreene
Copy link
Collaborator

cgreene commented Feb 13, 2020

@jharenza : If I understand this correctly you're just putting in the same samples multiple times:

When I ran it on a cell line cohort, an N of 39 was too small and I had to essentially "trick" it by duplicating the data 2-3x and it ran.

It seems like that would break any assumptions around independence and potentially drastically increase the false positive rates for any calls. Is that something that's widely accepted in practice and is there any literature to support doing that?

@jharenza
Copy link
Collaborator

jharenza commented Feb 13, 2020

When I did this, it was because our cell line cohort was too small, and GISTIC would not run. I think it was trial and error by @gonzolgarcia and/or he may have seen it in the GISTIC forum, but when I did that, it was just to get the all_data_by_genes file for gene level CN, not to get recurrently altered CNVs, and I wouldn't recommend to do for that because you are artificially increasing sample size and there would be biases based on what your samples harbor. To be honest, I am not sure if I have used those results in the past because we moved to using the seg file LRR and thresholded for CN (for array data) based on certain amp/dels we knew of in the sample set (eg for PPTC, I thresholded amps using MYCN and dels using CDKN2A/B).

@jaclyn-taroni jaclyn-taroni marked this pull request as ready for review February 13, 2020 15:16
@jaclyn-taroni jaclyn-taroni changed the title WIP: add a GISTIC module Add a GISTIC module; start looking at running on specific histologies Feb 13, 2020
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, except for the mysterious errors.

I also have some comments on the choices of arguments, but those are more for later reference than for this particular PR.

Warning: Objects of graphics.ploteditbehavior class exist - not clearing this class or any of its superclasses
```

We have not gotten to the bottom of this as of yet.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best last line possible.

Comment on lines 7 to 10
script_directory="$(perl -e 'use File::Basename;
use Cwd "abs_path";
print dirname(abs_path(@ARGV[0]));' -- "$0")"
cd "$script_directory" || exit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an alternative to this that fits on one line.... I don't think it has flaws, except maybe not working in other shells, which is not relevant here.

cd "$(dirname "${BASH_SOURCE[0]}")"

Comment on lines 18 to 19
set -e
set -o pipefail
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not everywhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want this to fail in CI if the install is broken (which is all we are testing) but continue to run when the first GISTIC run that uses an array list file.

Comment on lines +20 to +21
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mcr/v83/runtime/glnxa64:/opt/mcr/v83/bin/glnxa64:/opt/mcr/v83/sys/os/glnxa64
export XAPPLRESDIR=/opt/mcr/v83/X11/app-defaults
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this happens in run-gistic-openpbta.sh, which is called via this script, is there a reason not to just do it for both versions of the script outside this conditional?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate, but related, should these variables be unset/reset at the end of this script to avoid potential downstream errors?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From @jashapiro in person unset <name of variable>

-twoside 1 \
-brlen 0.98 \
-conf 0.90 \
-armpeel 1 \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how I feel about this option; I can certainly imagine it is useful for some cases, but is it always needed?

Flag set to enable arm-level peel-off of events during peak definition. The arm-level peel-off enhancement to the arbitrated peel-off method assigns all events in the same chromosome arm of the same sample to a single peak. It is useful when peaks are split by noise or chromothripsis. Allowed values= {1,0}. (DEFAULT=0, use normal arbitrated peel-off)

-smallmem 1 \
-broad 1 \
-twoside 1 \
-brlen 0.98 \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the default value: 98% of a chromosome arm distinguishes broad from focal (Though I am not sure what that means, tbh)

-conf 0.90 \
-armpeel 1 \
-savegene 1 \
-gcm extreme \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that I love extreme for this.

Method for reducing marker-level copy number data to the gene-level copy number data in the gene tables. Markers contained in the gene are used when available, otherwise the flanking marker or markers are used. Allowed values are mean, median, min, max or extreme. The extreme method chooses whichever of min or max is furthest from diploid.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

median ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was for future testing, not immediate implementation, but I would look at both mean and median.

@jaclyn-taroni
Copy link
Member Author

@jashapiro just pushed some updates. I am running the step on the entire cohort locally to make sure everything is a-okay, but wanted to make sure these changes are what you meant!

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. While I don't love the setting env variables twice, but I see why you would want that for independent runs of run-gistic-openpbta.sh Does the double unset cause a problem? I feel like it might throw an error (though in my quick test it doesn't).

@jaclyn-taroni
Copy link
Member Author

Do you think I should move the setting and unsetting of analyses/run-gistic/run-gistic-module.sh into the logic for CI? This was what I was doing earlier but my interpretation of this comment #535 (comment) was to move it out.

@jaclyn-taroni jaclyn-taroni merged commit 22066ad into AlexsLemonade:master Feb 14, 2020
@jaclyn-taroni jaclyn-taroni deleted the add-a-gistic-module branch February 14, 2020 17:50
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants