
Gene task failing when importing GTEx tpm file #914

Closed
mattsolo1 opened this issue Jun 23, 2022 · 15 comments · Fixed by #1315
@mattsolo1
Contributor

mattsolo1 commented Jun 23, 2022

Running the genes task fails

./deployctl data-pipeline run --cluster <cluster> genes

Stack trace:

2022-06-22 17:06:59,249 - gnomad_data_pipeline - INFO - Running prepare_gtex_v7_expression_data (Output does not exist)
2022-06-22 17:06:59 Hail: WARN: file 'gs://gnomadev-data-pipeline-output/2022-06-22/2/external_sources/gtex/v7/GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz' is 1.8G
  It will be loaded serially (on one core) due to usage of the 'force' argument.
  If it is actually block-gzipped, either rename to .bgz or use the 'force_bgz'
  argument.
2022-06-22 17:07:08 Hail: INFO: Loading <StructExpression of type struct{transcript_id: str, gene_id: str, `GTEX-1117F-0226-SM-5GZZ7`: str, `GTEX-1117F-0426-SM-5EGHI`: str, ...etc, fields. Counts by type:
  str: 11690
2022-06-22 17:07:11 Hail: WARN: file 'gs://gnomadev-data-pipeline-output/2022-06-22/2/external_sources/gtex/v7/GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz' is 1.8G
  It will be loaded serially (on one core) due to usage of the 'force' argument.
  If it is actually block-gzipped, either rename to .bgz or use the 'force_bgz'
  argument.
Traceback (most recent call last):
  File "/tmp/645b5598df4e41dd9a0e24f748f6bb15/genes.py", line 325, in <module>
    run_pipeline(pipeline)
  File "/tmp/645b5598df4e41dd9a0e24f748f6bb15/pyfiles_zdd6fpyt.zip/data_pipeline/pipeline.py", line 197, in run_pipeline
  File "/tmp/645b5598df4e41dd9a0e24f748f6bb15/pyfiles_zdd6fpyt.zip/data_pipeline/pipeline.py", line 164, in run
  File "/tmp/645b5598df4e41dd9a0e24f748f6bb15/pyfiles_zdd6fpyt.zip/data_pipeline/pipeline.py", line 129, in run
  File "/tmp/645b5598df4e41dd9a0e24f748f6bb15/pyfiles_zdd6fpyt.zip/data_pipeline/data_types/gtex_tissue_expression.py", line 14, in prepare_gtex_expression_data
  File "<decorator-gen-1010>", line 2, in export
  File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/table.py", line 1098, in export
    Env.backend().execute(
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 104, in execute
    self._handle_fatal_error_from_backend(e, ir)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/backend.py", line 181, in _handle_fatal_error_from_backend
    raise err
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 31, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: MethodTooLargeException: Method too large: __C19580collect_distributed_array.__m19633split_InsertFields ()V

Java stack trace:
is.hail.relocated.org.objectweb.asm.MethodTooLargeException: Method too large: __C19580collect_distributed_array.__m19633split_InsertFields ()V
	at is.hail.relocated.org.objectweb.asm.MethodWriter.computeMethodInfoSize(MethodWriter.java:2087)
	at is.hail.relocated.org.objectweb.asm.ClassWriter.toByteArray(ClassWriter.java:489)
	at is.hail.lir.Emit$.apply(Emit.scala:217)
	at is.hail.lir.Classx.$anonfun$asBytes$4(X.scala:108)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
	at scala.collection.AbstractIterator.to(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
	at is.hail.lir.Classx.asBytes(X.scala:121)
	at is.hail.asm4s.ClassBuilder.classBytes(ClassBuilder.scala:357)
	at is.hail.asm4s.ModuleBuilder.$anonfun$classesBytes$1(ClassBuilder.scala:151)
	at is.hail.asm4s.ModuleBuilder.$anonfun$classesBytes$1$adapted(ClassBuilder.scala:151)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
	at scala.collection.AbstractIterator.to(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
	at is.hail.asm4s.ModuleBuilder.classesBytes(ClassBuilder.scala:152)
	at is.hail.expr.ir.EmitClassBuilder.resultWithIndex(EmitClassBuilder.scala:708)
	at is.hail.expr.ir.WrappedEmitClassBuilder.resultWithIndex(EmitClassBuilder.scala:170)
	at is.hail.expr.ir.WrappedEmitClassBuilder.resultWithIndex$(EmitClassBuilder.scala:170)
	at is.hail.expr.ir.EmitFunctionBuilder.resultWithIndex(EmitClassBuilder.scala:1115)
	at is.hail.expr.ir.Emit.$anonfun$emitI$225(Emit.scala:2337)
	at is.hail.expr.ir.IEmitCodeGen.map(Emit.scala:334)
	at is.hail.expr.ir.Emit.emitI(Emit.scala:2278)
	at is.hail.expr.ir.Emit.$anonfun$emitSplitMethod$1(Emit.scala:575)
	at is.hail.expr.ir.Emit.$anonfun$emitSplitMethod$1$adapted(Emit.scala:573)
	at is.hail.expr.ir.EmitCodeBuilder$.scoped(EmitCodeBuilder.scala:18)
	at is.hail.expr.ir.EmitCodeBuilder$.scopedVoid(EmitCodeBuilder.scala:28)
	at is.hail.expr.ir.EmitMethodBuilder.voidWithBuilder(EmitClassBuilder.scala:1048)
	at is.hail.expr.ir.Emit.emitSplitMethod(Emit.scala:573)
	at is.hail.expr.ir.Emit.emitInSeparateMethod(Emit.scala:590)
	at is.hail.expr.ir.Emit.emitI(Emit.scala:777)
	at is.hail.expr.ir.Emit.emitI$1(Emit.scala:614)
	at is.hail.expr.ir.Emit.$anonfun$emitVoid$26(Emit.scala:732)
	at is.hail.expr.ir.TableTextFinalizer.writeMetadata(TableWriter.scala:507)
	at is.hail.expr.ir.Emit.emitVoid(Emit.scala:732)
	at is.hail.expr.ir.Emit.emitVoid$1(Emit.scala:611)
	at is.hail.expr.ir.Emit.$anonfun$emitVoid$5(Emit.scala:628)
	at is.hail.expr.ir.Emit.$anonfun$emitVoid$5$adapted(Emit.scala:628)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at is.hail.expr.ir.Emit.$anonfun$emitVoid$4(Emit.scala:628)
	at is.hail.expr.ir.Emit.$anonfun$emitVoid$4$adapted(Emit.scala:627)
	at is.hail.expr.ir.EmitCodeBuilder$.scoped(EmitCodeBuilder.scala:18)
	at is.hail.expr.ir.EmitCodeBuilder$.scopedVoid(EmitCodeBuilder.scala:28)
	at is.hail.expr.ir.EmitMethodBuilder.voidWithBuilder(EmitClassBuilder.scala:1048)
	at is.hail.expr.ir.Emit.$anonfun$emitVoid$3(Emit.scala:627)
	at is.hail.expr.ir.Emit.$anonfun$emitVoid$3$adapted(Emit.scala:625)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at is.hail.expr.ir.Emit.emitVoid(Emit.scala:625)
	at is.hail.expr.ir.Emit$.$anonfun$apply$3(Emit.scala:70)
	at is.hail.expr.ir.Emit$.$anonfun$apply$3$adapted(Emit.scala:68)
	at is.hail.expr.ir.EmitCodeBuilder$.scoped(EmitCodeBuilder.scala:18)
	at is.hail.expr.ir.EmitCodeBuilder$.scopedVoid(EmitCodeBuilder.scala:28)
	at is.hail.expr.ir.EmitMethodBuilder.voidWithBuilder(EmitClassBuilder.scala:1048)
	at is.hail.expr.ir.Emit$.apply(Emit.scala:68)
	at is.hail.expr.ir.Compile$.apply(Compile.scala:78)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$1(CompileAndEvaluate.scala:50)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:50)
	at is.hail.expr.ir.CompileAndEvaluate$.evalToIR(CompileAndEvaluate.scala:30)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:30)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:15)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:13)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:416)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:452)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:69)
	at is.hail.utils.package$.using(package.scala:640)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:69)
	at is.hail.utils.package$.using(package.scala:640)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:58)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:310)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:449)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:448)
	at sun.reflect.GeneratedMethodAccessor113.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.95-513139587f57
Error summary: MethodTooLargeException: Method too large: __C19580collect_distributed_array.__m19633split_InsertFields ()V

At this stage of the pipeline, we're importing and exporting the GTEx transcript TPM file (https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz)

import hail as hl

def prepare_gtex_expression_data(transcript_tpms_path, sample_annotations_path, tmp_path):
    # Recompress the TPMs file with block gzip so that import_matrix_table will read the file
    ds = hl.import_table(transcript_tpms_path, force=True)
    tmp_transcript_tpms_path = tmp_path + "/" + transcript_tpms_path.split("/")[-1].replace(".gz", ".bgz")
    ds.export(tmp_transcript_tpms_path)
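For reference, the path rename in the snippet above (swapping the `.gz` suffix for `.bgz`, the extension Hail recognizes as block-gzipped) can be sketched in isolation. The bucket paths below are hypothetical:

```python
# Sketch of the temp-path derivation used in prepare_gtex_expression_data:
# take the source file's basename and swap the .gz suffix for .bgz.
def derive_bgz_path(transcript_tpms_path: str, tmp_path: str) -> str:
    basename = transcript_tpms_path.split("/")[-1]
    return tmp_path + "/" + basename.replace(".gz", ".bgz")

src = "gs://example-bucket/GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz"
print(derive_bgz_path(src, "gs://example-bucket/tmp"))
# gs://example-bucket/tmp/GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.bgz
```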

I ran this pipeline task on the same file several months ago with Hail 0.2.81, so something must have changed between then and now (0.2.95). Perhaps there are now too many columns for Hail to import; there appears to be one column for every GTEx sample (e.g., `GTEX-1117F-2826-SM-5GZXL: str`, `GTEX-1117F-2926-SM-5GZYI: str`, etc.).

Possible solutions

As a near-term workaround, we could adapt the pipeline to use one of the previous successful exports of the table from this step (gtex_v7_tissue_expression.ht), since it doesn't look like the source file has been updated anyway.

In parallel, file a bug report with Hail.

@phildarnowsky-broad
Contributor

@mattsolo1 FYI I am currently investigating this

@mattsolo1
Contributor Author

Btw, just noticed Hail 0.2.96 was released today... could be worth trying again after an update

@phildarnowsky-broad
Contributor

phildarnowsky-broad commented Jun 24, 2022 via email

@phildarnowsky-broad
Contributor

phildarnowsky-broad commented Jun 24, 2022 via email

@mattsolo1
Contributor Author

Yeah, makes sense to include the Hail version.

@phildarnowsky-broad
Contributor

phildarnowsky-broad commented Jun 25, 2022 via email

@phildarnowsky-broad
Contributor

It looks like the crux of the problem is the number of columns: when I load the TSV and then throw away all but a handful of columns, this step succeeds and the pipeline continues.
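To illustrate that experiment, a hedged sketch (not the actual code used): the sample column name is taken from the log output above, and `transcript_tpms_path` / `tmp_narrow_tpms_path` are hypothetical variables. Running this requires a Hail/Spark environment.

```python
import hail as hl

# Load the full TSV, then keep only the two ID columns and a single sample
# column. With the other ~11,000 sample columns dropped, Hail no longer
# generates a JVM method that exceeds the 64 KB bytecode limit.
ds = hl.import_table(transcript_tpms_path, force=True)
ds = ds.select("transcript_id", "gene_id", "GTEX-1117F-0226-SM-5GZZ7")
ds.export(tmp_narrow_tpms_path)  # hypothetical output path
```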

@mattsolo1
Contributor Author

mattsolo1 commented Jun 27, 2022

Perhaps you could modify the gene pipeline task to use the output table from this step (gtex_v7_tissue_expression.ht) as an input?

Since the input GTEx file doesn't seem to change often, it may not be worth the effort of rewriting this pipeline step right now, since future releases of GTEx expression results may have a different format anyway.

And in the meantime perhaps file a hail bug report?

@phildarnowsky-broad
Contributor

Per hail-is/hail#11972, this is ultimately due to a known Hail bug with wide Tables that, for strategic reasons, will probably not be fixed. Hail team suggested rewriting this task using import_matrix_table.
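A rough sketch of what the suggested `import_matrix_table` rewrite might look like; the row-field names and entry type are assumptions inferred from the file header shown in the log above, not a tested implementation:

```python
import hail as hl

# import_matrix_table stores the ~11,000 per-sample TPM values as entries of
# a single matrix field instead of ~11,000 separate row fields, so Hail's
# code generator never emits one InsertFields op per sample column.
mt = hl.import_matrix_table(
    tmp_transcript_tpms_path,  # block-gzipped copy produced by the earlier step
    row_fields={"transcript_id": hl.tstr, "gene_id": hl.tstr},
    row_key="transcript_id",
    entry_type=hl.tfloat64,
)
```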

@mattsolo1
Contributor Author

That makes sense. In that case should we close the issue on their repo?

@phildarnowsky-broad
Contributor

@mattsolo1 yeah that makes sense, will do

@mattsolo1 mattsolo1 removed their assignment Aug 8, 2022
@mattsolo1
Contributor Author

mattsolo1 commented Nov 13, 2023

This issue is effectively closed by #1178 and #1269 in that an intermediate hail table is hard-coded.

# This table can no longer be generated with current versions of Hail

However, the genes pipeline is no longer reproducible, since it references our private data pipeline bucket.

@rileyhgrant @ch-kr could this hail table be moved to the gnomAD public bucket (the same one we host the downloads page from)?

@ch-kr
Contributor

ch-kr commented Nov 13, 2023

that makes sense to me, I think the team and community would benefit from using this resource. one question: how often does this HT get updated? if we regularly replace it, then copying this into the public bucket might not be a good idea (we can't delete old data)

@rileyhgrant
Contributor

> that makes sense to me, I think the team and community would benefit from using this resource. one question: how often does this HT get updated? if we regularly replace it, then copying this into the public bucket might not be a good idea (we can't delete old data)

The GTEx and pext hailtables referenced here never get updated; they're each a specific release of those resources, and our pipeline just reshapes the data.

If we want newer versions of these, we'll need to update the hailtables, but these two specific hailtables for these specific versions will never be updated.

@ch-kr
Contributor

ch-kr commented Nov 13, 2023

thanks for the context! copying sounds good to me
