Gene task failing when importing GTEx tpm file #914
Comments
@mattsolo1 FYI I am currently investigating this |
Btw just noticed hail 0.2.96 released today... could be worth trying again after an update |
OK nice I'll try that before I go any further on this, and I suppose we'd
want to keep up to date on Hail in any case.
|
No, same error with 0.2.96. On a related note, I had thought the hail
version might be specified in `data-pipeline/requirements.txt` but it's
not, is that something we want or is there a reason to leave it out?
|
Yeah, makes sense to include the hail version. |
(note to self so I remember this come Monday morning)
what this step fundamentally is doing is taking tabular data represented in
a `Table` and building a `MatrixTable` with all the same data, just
organized differently. this export/re-import dance is a roundabout way of
doing that, with the advantage that it leverages a lot of pre-existing
code. however, we run into the problem we see here.
I'm thinking there may be a way to load this data into the `MatrixTable` we
ultimately want more directly. It might work to read everything into a
`Table` as we do now but then convert it into the `MatrixTable` without
using the export/import machinery, rather explicitly modelling the
transformation using steps that can be parallelized. Or maybe we don't
need the intermediate `Table` and can build our `MatrixTable` right from
the TSV file. In any case, I'll see if and how these could be done in Hail.
I am also concerned that, if we're running into this problem here, there
might be similar problems in other parts of the pipeline where this
`MatrixTable` or associated entities hit a resource limitation. Hard to say
without being able to get past this point in the pipeline, but it's
something that should not surprise us if it does happen.
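
A minimal sketch of the reshape described above, in plain Python rather than Hail, just to make the target layout concrete: each cell of the wide TSV becomes one (row key, sample, value) entry. The function name `wide_to_entries` and the assumption that the first two header columns are `transcript_id` and `gene_id` are illustrative and not taken from the pipeline code. Hail's `hl.import_matrix_table` may cover the "right from the TSV file" option, though whether it avoids the limit hit here is not established in this thread.

```python
import csv
import io


def wide_to_entries(lines, row_fields=("transcript_id", "gene_id"), delimiter="\t"):
    """Yield one (row_key, sample_id, value) entry per cell of a wide TSV.

    `row_fields` names the leading columns that identify a row; every
    remaining header column is treated as a sample (an assumption about
    the file layout, not something confirmed by the pipeline code).
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    n_row_fields = len(row_fields)
    if tuple(header[:n_row_fields]) != tuple(row_fields):
        raise ValueError(f"unexpected leading columns: {header[:n_row_fields]}")
    sample_ids = header[n_row_fields:]
    for record in reader:
        row_key = tuple(record[:n_row_fields])
        # Rows are independent of each other, so this loop is the part
        # that could be split across workers.
        for sample_id, value in zip(sample_ids, record[n_row_fields:]):
            yield row_key, sample_id, float(value)


# Toy input standing in for the real file (made-up IDs and values).
toy_tsv = "transcript_id\tgene_id\tS1\tS2\nT1\tG1\t1.5\t0\nT2\tG1\t2\t3.25\n"
toy_entries = list(wide_to_entries(io.StringIO(toy_tsv)))
```

Because the generator streams one row at a time, memory use does not grow with the number of rows, only with the width of a single row.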
|
it looks like the crux of the problem is the number of columns: when I load the TSV and then throw away all but a handful of columns, this step succeeds and the pipeline continues |
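
For anyone wanting to reproduce the narrowing experiment outside Hail, here is a rough sketch using only the standard library. It is not the code that was run for the comment above; `keep_columns` is a hypothetical helper, and the column names in the toy input are made up.

```python
import csv
import io


def keep_columns(lines, keep, delimiter="\t"):
    """Yield rows of a delimited file restricted to the named columns.

    The first yielded row is the reduced header. Raises ValueError if a
    requested column is not present in the header.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    indices = [header.index(name) for name in keep]
    yield list(keep)
    for record in reader:
        yield [record[i] for i in indices]


# Toy wide input: two row fields plus three sample columns.
wide = "transcript_id\tgene_id\tS1\tS2\tS3\nT1\tG1\t1\t2\t3\n"
narrow_rows = list(keep_columns(io.StringIO(wide), ["transcript_id", "gene_id", "S1"]))
```

Writing `narrow_rows` back out as a TSV gives a narrow file that can be fed to the failing step to confirm that width, not content, is what triggers the error.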
Perhaps you could modify the gene pipeline task to use the output table from this step ( Since the input GTEx file doesn't seem to change often, it may not be worth putting in the effort of rewriting this pipeline step right now, since it's possible that future releases of GTEx expression results may have a different format anyway. And in the meantime perhaps file a Hail bug report? |
Per hail-is/hail#11972, this is ultimately due to a known Hail bug with wide |
That makes sense. In that case should we close the issue on their repo? |
@mattsolo1 yeah that makes sense, will do |
This issue is effectively closed by #1178 and #1269, in that an intermediate Hail table is hard-coded.
However, the genes pipeline is no longer reproducible because of the reference to our private data pipeline bucket. @rileyhgrant @ch-kr could this Hail table be moved to the gnomAD public bucket (the same one we host the downloads page from)? |
that makes sense to me, I think the team and community would benefit from using this resource. one question: how often does this HT get updated? if we regularly replace it, then copying this into the public bucket might not be a good idea (we can't delete old data) |
The GTEx and pext hailtables referenced here never get updated; they're each a specific release of those resources, and our pipeline just reshapes the data. If we want newer versions of these we'll need to update the hailtables, but these two specific hailtables of the specific versions will never be updated. |
thanks for the context! copying sounds good to me |
Running the `genes` task fails:
`./deployctl data-pipeline run --cluster <cluster> genes`
Stack trace:
At this stage of the pipeline, we're importing and exporting the GTEx transcript TPM file (https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz)
gnomad-browser/data-pipeline/src/data_pipeline/data_types/gtex_tissue_expression.py
Lines 10 to 14 in 77f3872
I ran this pipeline task on the same file several months ago with Hail 0.2.81, so something must have changed since then (now 0.2.95). Perhaps there are too many columns for Hail to import now; there appears to be one column for every GTEx sample (e.g., `GTEX-1117F-2826-SM-5GZXL: str`, `GTEX-1117F-2926-SM-5GZYI: str`, etc.).
Possible solutions
As a near-term workaround, we could adapt the pipeline to use one of the previous successful exports of the table from this step, `gtex_v7_tissue_expression.ht`, since it doesn't look like the source file has been updated anyway.
In parallel, file a bug report with Hail.
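
The "too many columns" guess can be checked without Hail by counting the fields in the file's header. The sketch below assumes the header is the first line of a gzipped, tab-delimited file; `count_columns` is a hypothetical helper and the path in the usage note is a placeholder for a local copy of the download.

```python
import gzip


def count_columns(path, delimiter="\t"):
    """Return the number of delimited fields on the first line of a gzipped text file."""
    with gzip.open(path, "rt") as handle:
        header = handle.readline().rstrip("\r\n")
    return len(header.split(delimiter))
```

For example, `count_columns("GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz")` on a local copy would give the width to quote in a bug report. Only the first line is decompressed, so this stays fast even for a large file.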