Add a GISTIC module; start looking at running on specific histologies #535

jaclyn-taroni · 2020-02-12T18:17:03Z

Purpose/implementation Section

Now that GISTIC is installed on the Docker container (#531), we should run it! This draft pull request is to make sure this still builds and runs in CI using the consensus SEG file. I am including filtering to specific histologies above a certain sample size using the array list file functionality of GISTIC.

I'm also removing large files that I should have removed in #531 to reduce the size of the GISTIC install layer ~1GB (wow!)

What GitHub issue does your pull request address?

Related to #529

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Any bash pointers?

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

cgreene · 2020-02-12T18:18:11Z

analyses/run-gistic/gistic-example.sh

+set -o pipefail
+
+# Configure environmental variables for MCR
+export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mcr/v83/runtime/glnxa64:/opt/mcr/v83/bin/glnxa64:/opt/mcr/v83/sys/os/glnxa64:


Suggested change

-export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mcr/v83/runtime/glnxa64:/opt/mcr/v83/bin/glnxa64:/opt/mcr/v83/sys/os/glnxa64:

+export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mcr/v83/runtime/glnxa64:/opt/mcr/v83/bin/glnxa64:/opt/mcr/v83/sys/os/glnxa64

doubt you need this last colon since you're not going to append anything after this

(I realize that one was from me back in the previous PR 😁 )
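
For context on why the colon is worth dropping: the dynamic linker treats an empty entry in LD_LIBRARY_PATH as the current working directory, so a trailing colon (or a leading one, when the variable starts out unset) quietly adds the current directory to the library search path. Below is a minimal sketch of one way to append without leaving empty entries. The helper name `append_ld_path` is hypothetical and is not part of this PR.

```bash
#!/bin/bash
# Hypothetical helper: append a directory to a colon-separated path
# without leaving an empty entry when the existing value is unset or empty.
append_ld_path() {
  local current="$1"
  local new_dir="$2"
  # ${current:+$current:} expands to "$current:" only if current is non-empty
  printf '%s\n' "${current:+$current:}$new_dir"
}

mcr_root=/opt/mcr/v83
LD_LIBRARY_PATH="$(append_ld_path "${LD_LIBRARY_PATH:-}" "$mcr_root/runtime/glnxa64")"
LD_LIBRARY_PATH="$(append_ld_path "$LD_LIBRARY_PATH" "$mcr_root/bin/glnxa64")"
LD_LIBRARY_PATH="$(append_ld_path "$LD_LIBRARY_PATH" "$mcr_root/sys/os/glnxa64")"
export LD_LIBRARY_PATH
```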

jharenza · 2020-02-12T18:39:04Z

@jaclyn-taroni looks like you have everything set!

jaclyn-taroni · 2020-02-12T18:46:06Z

Hmm something going on with the paths for the reference genome file and yet it doesn’t fail. Wonder if it’s a warning instead of an error...

In CI, the repository will be in the rocker-build directory

So test using the example included with GISTIC install

jaclyn-taroni · 2020-02-12T20:45:44Z

@jharenza do you have a rule of thumb for the required sample size for GISTIC?

jaclyn-taroni · 2020-02-12T21:54:12Z

When I run things locally using anything less than the entire cohort I get an error on the broad analysis step:

Focal GISTIC completed without error
Running broad analysis...
Reconstructing genome: amp
Reconstructing genome: aod
Reconstructing genome: del
Reconstructing genome: doa
Calculating median of arm values...
arm 1: 1p 11910 markers
arm 2: 1q 10078 markers
arm 3: 2p 9007 markers
arm 4: 2q 14373 markers
arm 5: 3p 8744 markers
arm 6: 3q 9943 markers
arm 7: 4p 4730 markers
arm 8: 4q 13741 markers
arm 9: 5p 4508 markers
arm 10: 5q 12758 markers
arm 11: 6p 5339 markers
arm 12: 6q 10788 markers
arm 13: 7p 5753 markers
arm 14: 7q 9655 markers
arm 15: 8p 4113 markers
arm 16: 8q 9747 markers
arm 17: 9p 3856 markers
arm 18: 9q 7202 markers
arm 19: 10p 3752 markers
arm 20: 10q 9050 markers
arm 21: 11p 5014 markers
arm 22: 11q 7816 markers
arm 23: 12p 3255 markers
arm 24: 12q 9501 markers
arm 25: 13p 551 markers
arm 26: 13q 9501 markers
arm 27: 14p 441 markers
arm 28: 14q 8545 markers
arm 29: 15p 551 markers
arm 30: 15q 7739 markers
arm 31: 16p 3138 markers
arm 32: 16q 4252 markers
arm 33: 17p 2183 markers
arm 34: 17q 5463 markers
arm 35: 18p 1414 markers
arm 36: 18q 5839 markers
arm 37: 19p 1943 markers
arm 38: 19q 2632 markers
arm 39: 20p 2526 markers
arm 40: 20q 3372 markers
arm 41: 21p 390 markers
arm 42: 21q 3241 markers
arm 43: 22p 511 markers
arm 44: 22q 3182 markers
     1

     2

     3

     4

     5

     6

     7

     8

     9

    10

    11

    12

    13

    14

    15

    16

    17

    18

    19

    20

    21

    22

    23

    24

    25

    26

    27

    28

    29

    30

    31

    32

    33

    34

    35

    36

    37

    38

    39

    40

    41

    42

    43

    44

Error using line
Vectors must be the same lengths.

Error in gistic_broad_analysis (line 209)



Error in run_gistic20 (line 130)



Error in run_gistic2_from_seg (line 249)



Error in gp_gistic2_from_seg (line 97)



MATLAB:samelen
Warning: Objects of specgraph.scattergroup class exist - not clearing this class or any of its superclasses
Warning: Objects of scribe.legendinfo class exist - not clearing this class or any of its superclasses
Warning: Objects of scribe.legendinfochild class exist - not clearing this class or any of its superclasses
Warning: Objects of scribe.legend class exist - not clearing this class or any of its superclasses
Warning: Objects of graphics.panbehavior class exist - not clearing this class or any of its superclasses
Warning: Objects of graphics.zoombehavior class exist - not clearing this class or any of its superclasses
Warning: Objects of graphics.rotate3dbehavior class exist - not clearing this class or any of its superclasses
Warning: Objects of graphics.datacursorbehavior class exist - not clearing this class or any of its superclasses
Warning: Objects of graphics.ploteditbehavior class exist - not clearing this class or any of its superclasses

I wanted to push what I have so I can get feedback on the structure.

jharenza · 2020-02-12T22:14:13Z

hmm @jaclyn-taroni - I don't remember if I have seen that before. Do you get output?

It has been a while, but I think when I used to set up the array file to include certain histologies, I would notice that some would have all of the outputs, some did not. When I ran it on a cell line cohort, an N of 39 was too small and I had to essentially "trick" it by duplicating the data 2-3x and it ran.

jaclyn-taroni · 2020-02-12T22:17:14Z

@jharenza I do get output but not all the output --

analyses/run-gistic/results/pbta-cnv-consensus-lgat-gistic/
├── D.cap1.5.mat
├── all_lesions.conf_90.txt
├── amp_genes.conf_90.txt
├── amp_qplot.pdf
├── amp_qplot.png
├── del_genes.conf_90.txt
├── del_qplot.pdf
├── del_qplot.png
├── focal_dat.0.98.mat
├── freqarms_vs_ngenes.pdf
├── gistic_inputs.mat
├── peak_regs.mat
├── perm_ads.mat
├── raw_copy_number.pdf
├── raw_copy_number.png
├── regions_track.conf_90.bed
├── sample_seg_counts.txt
├── scores.0.98.mat
├── scores.gistic
└── wide_peak_regs.mat

@jashapiro also pointed out that I should use the arrayfile argument instead, so I will try that
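
Since GISTIC can stop partway through and still leave a directory like the one listed above, one possible way to catch an incomplete run is to check for the files you expect afterwards. This is only a sketch: `missing_outputs` is a hypothetical helper, and the caller decides which file names count as required.

```bash
#!/bin/bash
# Hypothetical helper: report which expected files are absent (or empty)
# in a results directory. Prints one file name per line and returns 1
# if anything is missing, 0 otherwise.
missing_outputs() {
  local results_dir="$1"
  shift
  local status=0
  local expected
  for expected in "$@"; do
    # -s is true only when the file exists and has a size greater than zero
    if [[ ! -s "$results_dir/$expected" ]]; then
      printf '%s\n' "$expected"
      status=1
    fi
  done
  return "$status"
}
```

For example, `missing_outputs analyses/run-gistic/results/pbta-cnv-consensus-lgat-gistic scores.gistic all_lesions.conf_90.txt` would print nothing for the listing above, because both files are present.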

jaclyn-taroni · 2020-02-12T23:58:09Z

Moved to using array list files with f26ca3d but I'm still getting the error for LGAT. I committed and added the array list file I used for that.
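
For readers unfamiliar with array list files: as I understand the GISTIC documentation, it is a one-column file of sample IDs with an optional `array` header. The sketch below shows one way such a file could be built for a single histology. It is not the code in f26ca3d; the function names and the input layout (tab-separated, sample ID in column 1, histology label in column 2) are assumptions made for illustration.

```bash
#!/bin/bash
# Write an array list file (an "array" header, then one sample ID per line)
# for one histology. ASSUMED input: a TSV with a header row, the sample ID
# in column 1, and the histology label in column 2.
make_array_list() {
  local metadata_tsv="$1"
  local histology="$2"
  local output_file="$3"
  awk -F '\t' -v hist="$histology" '
    BEGIN { print "array" }
    NR > 1 && $2 == hist { print $1 }
  ' "$metadata_tsv" > "$output_file"
}

# Count the samples in an array list file, excluding the header line.
count_array_list() {
  awk 'NR > 1' "$1" | wc -l | tr -d ' '
}
```

A caller could compare the result of `count_array_list` against a minimum sample size before deciding whether to run GISTIC on that histology.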

jaclyn-taroni · 2020-02-13T11:36:28Z

I ran GISTIC overnight on the consensus SEG file included in the data download, and I am happy to report that all the text files produced have the same checksums as the GISTIC output in the data download. The PDF checksums changed, which I expected; on a quick visual inspection, a few of them look the same between runs. I don't know much about how .mat files are generated, but those changed as well (could be a timestamp issue?).

docker_consensus_gistic_md5sum.txt
data_download_consensus_gistic_md5sum.txt
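
A comparison like this can be scripted roughly as follows. This is a sketch, not the commands actually used to produce the two files above; it assumes GNU `md5sum` is available, and both function names are hypothetical.

```bash
#!/bin/bash
# Print "checksum  ./relative/path" for every file under a directory,
# sorted by path so two listings line up for comparison.
checksum_dir() {
  local dir="$1"
  (cd "$dir" && find . -type f -exec md5sum {} + | sort -k 2)
}

# Print the paths whose lines differ between two checksum listings.
# Paths containing whitespace are not handled by this sketch.
changed_files() {
  # diff exits non-zero when the files differ, which is expected here
  { diff "$1" "$2" || true; } | awk '/^[<>]/ { print $3 }' | sort -u
}
```

Running `checksum_dir` on each results directory, redirecting each listing to a file, and passing both files to `changed_files` would list only the outputs that changed between runs (here, the PDF and .mat files).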

Also uncomment out entire cohort step

Still want to set -e set -o pipefail in CI

cgreene · 2020-02-13T13:16:33Z

@jharenza : If I understand this correctly you're just putting in the same samples multiple times:

When I ran it on a cell line cohort, an N of 39 was too small and I had to essentially "trick" it by duplicating the data 2-3x and it ran.

It seems like that would break any assumptions around independence and potentially drastically increase the false positive rates for any calls. Is that something that's widely accepted in practice and is there any literature to support doing that?

jharenza · 2020-02-13T13:27:41Z

When I did this, it was because our cell line cohort was too small and GISTIC would not run. I think it was trial and error by @gonzolgarcia, and/or he may have seen it in the GISTIC forum. When I did that, it was just to get the all_data_by_genes file for gene-level CN, not to get recurrently altered CNVs. I wouldn't recommend doing it for that purpose, because you are artificially increasing the sample size and there would be biases based on what your samples harbor. To be honest, I am not sure whether I have used those results in the past, because we moved to using the seg file LRR and thresholding for CN (for array data) based on certain amps/dels we knew of in the sample set (e.g., for PPTC, I thresholded amps using MYCN and dels using CDKN2A/B).

jashapiro

This looks good, except for the mysterious errors.

I also have some comments on the choices of arguments, but those are more for later reference than for this particular PR.

jashapiro · 2020-02-13T15:49:52Z

analyses/run-gistic/README.md

+Warning: Objects of graphics.ploteditbehavior class exist - not clearing this class or any of its superclasses
+```
+
+We have not gotten to the bottom of this as of yet.


Best last line possible.

jashapiro · 2020-02-13T15:55:55Z

analyses/run-gistic/run-gistic-module.sh

+script_directory="$(perl -e 'use File::Basename;
+  use Cwd "abs_path";
+  print dirname(abs_path(@ARGV[0]));' -- "$0")"
+cd "$script_directory" || exit


I have an alternative to this that fits on one line.... I don't think it has flaws, except maybe not working in other shells, which is not relevant here.

cd "$(dirname "${BASH_SOURCE[0]}")"
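
A slightly expanded sketch of the same idea, in case an absolute path is wanted for later use. One difference from the perl version is worth noting: `abs_path` resolves symlinks, while this approach does not. `script_dir` is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: print the absolute directory of a script path.
# Symlinks are NOT resolved, unlike perl's Cwd::abs_path.
script_dir() {
  # "$1" is the path the script was invoked with, e.g. "${BASH_SOURCE[0]}"
  (cd "$(dirname "$1")" && pwd)
}

# Change to the directory of the current script so that relative paths
# resolve the same way wherever the script is called from.
cd "$(script_dir "${BASH_SOURCE[0]}")" || exit
```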

jashapiro · 2020-02-13T15:57:41Z

analyses/run-gistic/run-gistic-module.sh

+  set -e
+  set -o pipefail


Why not everywhere?

I want this to fail in CI if the install is broken (which is all we are testing) but continue to run when the first GISTIC run that uses an array list file fails.
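
The behavior described here can be sketched as a small function that turns on strict mode only when a CI flag is set. The function name and the idea of passing the flag as an argument are assumptions for illustration; the module itself may detect CI differently.

```bash
#!/bin/bash
# Hypothetical helper: enable fail-fast behavior only in CI.
# "$1" is 1 when running in CI and 0 (or absent) otherwise.
strict_mode_if_ci() {
  if [[ "${1:-0}" -gt 0 ]]; then
    # Shell options set inside a function apply to the whole script
    set -e
    set -o pipefail
  fi
}
```

With `strict_mode_if_ci 1`, the first failing command ends the script, which is what CI needs to detect a broken install. With `strict_mode_if_ci 0`, a failing GISTIC run does not stop the steps that follow.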

jashapiro · 2020-02-13T16:13:55Z

analyses/run-gistic/run-gistic-module.sh

+  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mcr/v83/runtime/glnxa64:/opt/mcr/v83/bin/glnxa64:/opt/mcr/v83/sys/os/glnxa64
+  export XAPPLRESDIR=/opt/mcr/v83/X11/app-defaults


Since this happens in run-gistic-openpbta.sh, which is called via this script, is there a reason not to just do it for both versions of the script outside this conditional?

Separate, but related, should these variables be unset/reset at the end of this script to avoid potential downstream errors?

From @jashapiro in person: unset <name of variable>
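
An alternative to an explicit unset, sketched here under the assumption that the variables are only needed while GISTIC runs, is to scope the exports to a subshell so the calling environment is never modified. This is a different technique from the one adopted in the PR, and `with_mcr_env` is a hypothetical name.

```bash
#!/bin/bash
# Hypothetical helper: run a command with the MCR variables set.
# The parentheses start a subshell, so the exports disappear when it exits
# and nothing needs to be unset afterwards.
with_mcr_env() {
  (
    mcr_root=/opt/mcr/v83
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$mcr_root/runtime/glnxa64:$mcr_root/bin/glnxa64:$mcr_root/sys/os/glnxa64"
    export XAPPLRESDIR="$mcr_root/X11/app-defaults"
    "$@"
  )
}
```

Usage would look like `with_mcr_env bash scripts/run-gistic-openpbta.sh`, with the path adjusted to wherever the script is called from.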

jashapiro · 2020-02-13T16:40:53Z

analyses/run-gistic/scripts/run-gistic-openpbta.sh

+  -twoside 1 \
+  -brlen 0.98 \
+  -conf 0.90 \
+  -armpeel 1 \


I am not sure how I feel about this option; I can certainly imagine it is useful for some cases, but is it always needed?

Flag set to enable arm-level peel-off of events during peak definition. The arm-level peel-off enhancement to the arbitrated peel-off method assigns all events in the same chromosome arm of the same sample to a single peak. It is useful when peaks are split by noise or chromothripsis. Allowed values= {1,0}. (DEFAULT=0, use normal arbitrated peel-off)

jashapiro · 2020-02-13T16:42:14Z

analyses/run-gistic/scripts/run-gistic-openpbta.sh

+  -smallmem 1 \
+  -broad 1 \
+  -twoside 1 \
+  -brlen 0.98 \


This is the default value: 98% of a chromosome arm distinguishes broad from focal (Though I am not sure what that means, tbh)

jashapiro · 2020-02-13T16:49:40Z

analyses/run-gistic/scripts/run-gistic-openpbta.sh

+  -conf 0.90 \
+  -armpeel 1 \
+  -savegene 1 \
+  -gcm extreme \


I don't know that I love extreme for this.

Method for reducing marker-level copy number data to the gene-level copy number data in the gene tables. Markers contained in the gene are used when available, otherwise the flanking marker or markers are used. Allowed values are mean, median, min, max or extreme. The extreme method chooses whichever of min or max is furthest from diploid.

This comment was for future testing, not immediate implementation, but I would look at both mean and median.
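
For that future testing, one low-effort setup is to keep every other option fixed and vary only `-gcm`. The sketch below only prints the option lists and does not call GISTIC. The flags and values are the ones shown in the diff snippets in this review; the function name is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print the option list for one run, varying only -gcm.
# All other flags and values are copied from the diff in this review.
gistic_options() {
  local gcm_method="$1"
  printf '%s ' \
    -smallmem 1 -broad 1 -twoside 1 -brlen 0.98 \
    -conf 0.90 -armpeel 1 -savegene 1 -gcm "$gcm_method"
  printf '\n'
}

# One line of options per gene-level reduction method to compare
for method in mean median extreme; do
  gistic_options "$method"
done
```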

jaclyn-taroni · 2020-02-14T15:33:34Z

@jashapiro just pushed some updates. I am running the step on the entire cohort locally to make sure everything is a-okay, but wanted to make sure these changes are what you meant!

jashapiro

Looks good. While I don't love setting the env variables twice, I see why you would want that for independent runs of run-gistic-openpbta.sh. Does the double unset cause a problem? I feel like it might throw an error (though in my quick test it doesn't).
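
On the double unset question: in bash (and per POSIX), unsetting a variable that is not set is not an error, so the second call returns success unless the variable is readonly. A quick check, using a throwaway variable name:

```bash
#!/bin/bash
# demo_var is a throwaway variable used only for this demonstration
demo_var="value"
unset demo_var
unset demo_var   # already unset; still succeeds
echo "second unset exit status: $?"   # prints 0
```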

jaclyn-taroni · 2020-02-14T15:48:15Z

Do you think I should move the setting and unsetting in analyses/run-gistic/run-gistic-module.sh into the logic for CI? This is what I was doing earlier, but my interpretation of this comment #535 (comment) was to move it out.

jaclyn-taroni added 3 commits February 12, 2020 12:43

Remove the large GISTIC-related files we do not need

7196a01

Update the GISTIC example to use consensus

61c4378

Update directory in CI

96e4f2d

cgreene reviewed Feb 12, 2020

We don't need this colon

fa3145d

jaclyn-taroni mentioned this pull request Feb 12, 2020

Add MATLAB Runtime and the relevant MATLAB programs to project Docker container #529

Closed

jaclyn-taroni added 9 commits February 12, 2020 14:02

Try relative path for executable

c4c0538

Absolute path for reference genome file

064350e

In CI, the repository will be in the rocker-build directory

This module is not like the others

25fc94f

The subset files will probably always be too small

63a4bfa

So test using the example included with GISTIC install

Forgot to update CI config

60cb975

Fix logic

b162dd6

Fix style

727749e

Move script for running gistic

842eb7f

Document the default option

dec2a70

jaclyn-taroni added 3 commits February 12, 2020 16:50

Add Rscript for subsetting SEG files

882b5a5

Rearrange to run multiple histologies

fa818fa

Uncomment the entire cohort step

62317a4

jaclyn-taroni added 3 commits February 12, 2020 16:56

Fix style

4018611

More indentation woes

ec25f08

MCR environmental variables for CI

f147494

Create array list files instead

f26ca3d

jaclyn-taroni added 4 commits February 13, 2020 06:43

Add compressed consensus entire cohort results

52b0671

Also uncomment out entire cohort step

Script the compression step

efc9075

Formatting

5f7c6a6

For debugging, allow to run to get some of the files

36b9e19

Still want to set -e set -o pipefail in CI

jaclyn-taroni added 2 commits February 13, 2020 08:52

Include incomplete runs for histologies with n > 100

94d3cc1

Update documentation

50041a6

jaclyn-taroni marked this pull request as ready for review February 13, 2020 15:16

jaclyn-taroni changed the title ~~WIP: add a GISTIC module~~ Add a GISTIC module; start looking at running on specific histologies Feb 13, 2020

Add run-gistic to modules at a glance

d98db2b

jaclyn-taroni requested a review from jashapiro February 13, 2020 15:24

Move GISTIC down and comment that it's not testing the module code

4337254

jashapiro approved these changes Feb 13, 2020

jaclyn-taroni and others added 2 commits February 13, 2020 13:20

Update analyses/run-gistic/run-gistic-module.sh

6a37cca

Co-Authored-By: jashapiro

@jashapiro suggestions

568c414

jashapiro approved these changes Feb 14, 2020

jaclyn-taroni added 2 commits February 14, 2020 10:57

Move env variable setting and unsetting into CI logic

df80297

End of line

7226cdb

jaclyn-taroni merged commit 22066ad into AlexsLemonade:master Feb 14, 2020

jaclyn-taroni deleted the add-a-gistic-module branch February 14, 2020 17:50
