-
Notifications
You must be signed in to change notification settings - Fork 83
Add a GISTIC module; start looking at running on specific histologies #535
Add a GISTIC module; start looking at running on specific histologies #535
Conversation
set -o pipefail | ||
|
||
# Configure environmental variables for MCR | ||
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mcr/v83/runtime/glnxa64:/opt/mcr/v83/bin/glnxa64:/opt/mcr/v83/sys/os/glnxa64: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mcr/v83/runtime/glnxa64:/opt/mcr/v83/bin/glnxa64:/opt/mcr/v83/sys/os/glnxa64: | |
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mcr/v83/runtime/glnxa64:/opt/mcr/v83/bin/glnxa64:/opt/mcr/v83/sys/os/glnxa64 |
doubt you need this last colon since you're not going to append anything after this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I realize that one was from me back in the previous PR 😁 )
@jaclyn-taroni looks like you have everything set! |
Hmm something going on with the paths for the reference genome file and yet it doesn’t fail. Wonder if it’s a warning instead of an error... |
In CI, the repository will be in the rocker-build directory
So test using the example included with GISTIC install
@jharenza do you have a rule of thumb for the required sample size for GISTIC? |
When I run things locally using anything less than the entire cohort I get an error on the broad analysis step:
I wanted to push what I have so I can get feedback on the structure. |
hmm @jaclyn-taroni - I don't remember if I have seen that before. Do you get output? It has been a while, but I think when I used to set up the array file to include certain histologies, I would notice that some would have all of the outputs, some did not. When I ran it on a cell line cohort, an N of 39 was too small and I had to essentially "trick" it by duplicating the data 2-3x and it ran. |
@jharenza I do get output but not all the output --
@jashapiro also pointed out that I should use the arrayfile argument instead, so I will try that |
Moved to using array list files with f26ca3d but I'm still getting the error for LGAT. I committed and added the array list file I used for that. |
I ran GISTIC on the consensus SEG file included in the data download overnight and am happy to report all the text files produced have the same checksums as the GISTIC output in the data download. The PDF checksums change which I expected; a quick visual inspection of a few of them look the same between runs. I don't know much about how docker_consensus_gistic_md5sum.txt |
Also uncomment out entire cohort step
Still want to set -e set -o pipefail in CI
@jharenza : If I understand this correctly you're just putting in the same samples multiple times:
It seems like that would break any assumptions around independence and potentially drastically increase the false positive rates for any calls. Is that something that's widely accepted in practice and is there any literature to support doing that? |
When I did this, it was because our cell line cohort was too small, and GISTIC would not run. I think it was trial and error by @gonzolgarcia and/or he may have seen it in the GISTIC forum, but when I did that, it was just to get the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, except for the mysterious errors.
I also have some comments on the choices of arguments, but those are more for later reference than for this particular PR.
Warning: Objects of graphics.ploteditbehavior class exist - not clearing this class or any of its superclasses | ||
``` | ||
|
||
We have not gotten to the bottom of this as of yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best last line possible.
script_directory="$(perl -e 'use File::Basename; | ||
use Cwd "abs_path"; | ||
print dirname(abs_path(@ARGV[0]));' -- "$0")" | ||
cd "$script_directory" || exit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an alternative to this that fits on one line.... I don't think it has flaws, except maybe not working in other shells, which is not relevant here.
cd "$(dirname "${BASH_SOURCE[0]}")"
set -e | ||
set -o pipefail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not everywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want this to fail in CI if the install is broken (which is all we are testing) but continue to run when the first GISTIC run that uses an array list file.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mcr/v83/runtime/glnxa64:/opt/mcr/v83/bin/glnxa64:/opt/mcr/v83/sys/os/glnxa64 | ||
export XAPPLRESDIR=/opt/mcr/v83/X11/app-defaults |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this happens in run-gistic-openpbta.sh
, which is called via this script, is there a reason not to just do it for both versions of the script outside this conditional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate, but related, should these variables be unset/reset at the end of this script to avoid potential downstream errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From @jashapiro in person unset <name of variable>
-twoside 1 \ | ||
-brlen 0.98 \ | ||
-conf 0.90 \ | ||
-armpeel 1 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how I feel about this option; I can certainly imagine it is useful for some cases, but is it always needed?
Flag set to enable arm-level peel-off of events during peak definition. The arm-level peel-off enhancement to the arbitrated peel-off method assigns all events in the same chromosome arm of the same sample to a single peak. It is useful when peaks are split by noise or chromothripsis. Allowed values= {1,0}. (DEFAULT=0, use normal arbitrated peel-off)
-smallmem 1 \ | ||
-broad 1 \ | ||
-twoside 1 \ | ||
-brlen 0.98 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the default value: 98% of a chromosome arm distinguishes broad from focal (Though I am not sure what that means, tbh)
-conf 0.90 \ | ||
-armpeel 1 \ | ||
-savegene 1 \ | ||
-gcm extreme \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know that I love extreme
for this.
Method for reducing marker-level copy number data to the gene-level copy number data in the gene tables. Markers contained in the gene are used when available, otherwise the flanking marker or markers are used. Allowed values are mean, median, min, max or extreme. The extreme method chooses whichever of min or max is furthest from diploid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
median
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment was for future testing, not immediate implementation, but I would look at both mean and median.
Co-Authored-By: jashapiro <[email protected]>
@jashapiro just pushed some updates. I am running the step on the entire cohort locally to make sure everything is a-okay, but wanted to make sure these changes are what you meant! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. While I don't love the setting env variables twice, but I see why you would want that for independent runs of run-gistic-openpbta.sh
Does the double unset
cause a problem? I feel like it might throw an error (though in my quick test it doesn't).
Do you think I should move the setting and unsetting of |
Purpose/implementation Section
Now that GISTIC is installed on the Docker container (#531), we should run it! This draft pull request is to make sure this still builds and runs in CI using the consensus SEG file. I am including filtering to specific histologies above a certain sample size using the array list file functionality of GISTIC.
I'm also removing large files that I should have removed in #531 to reduce the size of the GISTIC install layer ~1GB (wow!)
What GitHub issue does your pull request address?
Related to #529
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Any bash pointers?
Reproducibility Checklist
Documentation Checklist
README
and it is up to date.analyses/README.md
and the entry is up to date.