Generate de novo release #654

KoalaQin · 2025-01-29T17:47:08Z

This PR cleans up some redundant dense MT functions and arguments now that we have PR #651. Julia created the dense MT in job 0a3e98e75ba14b2f8341012515d11f8b before that PR was merged.

This is waiting on PR #760 in gnomad_methods.

Test run on chr20: b0729d42ea77466c8cc1fe9697e73af1

ch-kr

some questions and minor suggestions

gnomad_qc/v4/resources/sample_qc.py

gnomad_qc/v4/create_release/create_de_novo_release.py

ch-kr

just a couple more minor comments, then waiting on the other PR

ch-kr · 2025-02-03T21:03:39Z

gnomad_qc/v4/create_release/create_de_novo_release.py

@@ -135,34 +135,50 @@ def get_releasable_de_novo_calls_ht(
    )
    mt = annotate_adj(mt)

-    # Approximate the AD and PL fields when missing.
+    # Many of our larger datasets have the PL and AD fields for homref genotypes


Suggested change

# Many of our larger datasets have the PL and AD fields for homref genotypes

can you check if this is true for v3 as well? if yes, it's best to say v3 and v4 have the PL and ...

Yes, they are also NA in v3 VDS. FYI, I used this code to check:

import hail as hl
from gnomad.sample_qc.relatedness import filter_to_trios
from gnomad_qc.v3.resources.basics import get_gnomad_v3_vds

# Load the split v3 VDS for one interval and keep only trio samples.
vds = get_gnomad_v3_vds(
    split=True,
    filter_intervals=['chr11:113409605-113475691'],
)
trios = hl.import_fam("gs://gnomad-qin/g1k.trios.fam", delimiter='\t')
vds = filter_to_trios(vds, trios)

# Densify and inspect the entry fields (PL and AD for homref genotypes).
mt = hl.vds.to_dense_mt(vds)
mt.entries().show(20)

thank you for looking! this is good to know

gnomad_qc/v4/create_release/create_de_novo_release.py

ch-kr

a few minor suggestions and questions

ch-kr · 2025-02-21T16:15:12Z

gnomad_qc/v4/resources/sample_qc.py

+
+
+def trio_denovo_ht(
+    releasable: bool = True,


does a non-releasable version of this HT exist?

No, will remove.

ch-kr · 2025-02-21T16:16:46Z

gnomad_qc/v4/resources/release.py

+        ".all_confidences.filtered"
+        if by_confidence == "all"
+        else ".high_confidence.filtered"
+    )
    postfix = f".{datetime.today().strftime('%Y-%m-%d')}" if test else ""


basic question, is this how we've started naming test files?

I don't think so, Julia just added this to have different versions to inspect. I could remove it.

ch-kr · 2025-02-21T18:03:04Z

gnomad_qc/v4/create_release/create_de_novo_release.py


-    return selects
+    ht = ht.filter(ht.de_novo_call_info.is_de_novo).naive_coalesce(1000)


should this 1000 be an argument rather than hard coded? not sure how strict we've been about this in gnomad_qc

Yeah, I think I will make it an argument that defaults to 1000 and use 1/10 of that for the test.
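
A minimal sketch of the idea in this thread, written in plain Python so it runs without Hail: the partition count becomes a parameter that defaults to 1000, and test runs use a tenth of it. The helper name `get_n_partitions` and its parameters are hypothetical, not the names used in the PR.

```python
def get_n_partitions(n_partitions: int = 1000, test: bool = False) -> int:
    """Return the partition count to hand to naive_coalesce (hypothetical helper)."""
    if n_partitions < 1:
        raise ValueError("n_partitions must be a positive integer")
    # Test runs use 1/10 of the requested partitions, but never fewer than one.
    return max(1, n_partitions // 10) if test else n_partitions


# Assumed usage with a Hail Table `ht` (not executed here):
# ht = ht.naive_coalesce(get_n_partitions(test=test))
```

Keeping the default at 1000 preserves the current behavior for full runs while letting callers override it.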

gnomad_qc/v4/create_release/create_de_novo_release.py

ch-kr · 2025-02-21T18:07:58Z

gnomad_qc/v4/create_release/create_de_novo_release.py

+    ht = process_consequences(ht, has_polyphen=False)
+
+    ht = ht.annotate(
+        alt_is_star=ht.alleles[1] == "*",


rather than alt_is_star, the annotation added earlier (mixed_site) might be more valuable. users can pretty easily check for themselves whether the alt is a star allele

I don't think they are the same thing:

Maybe we could keep both? I need to use alt_is_star to filter.

yep, they're different annotations -- I suggested keeping mixed_site in case you were trying to retain information about the locus for our users.

for alt_is_star specifically, it feels cleaner to filter directly using the logic here:

ht = ht.filter(ht.alleles[1] != "*").checkpoint(new_temp_file("denovo_no_star", "ht"))

as opposed to keeping this annotation. the reason I suggest this is because the TSV will still have the alt_is_star annotation despite it always being False, and users that are reading in the HT can easily filter on this logic themselves as well

Changed this, and I also moved the all-confidence table after it.
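
One thing to watch when writing the star-allele filter: in Python, unary `~` binds more tightly than `==`, so `~x == "*"` is parsed as `(~x) == "*"` and tries to invert the string itself. The standard-library sketch below only demonstrates the parsing; the name `alt` stands in for `ht.alleles[1]`, and the helper names are made up for this example. In Hail the filter condition can be written as `ht.alleles[1] != "*"` or `~(ht.alleles[1] == "*")`.

```python
import ast


def top_level_node(expr: str) -> str:
    """Return the class name of the outermost AST node of an expression."""
    return type(ast.parse(expr, mode="eval").body).__name__


def left_operand_node(expr: str) -> str:
    """Return the class name of the left operand of a comparison expression."""
    body = ast.parse(expr, mode="eval").body
    if not isinstance(body, ast.Compare):
        raise ValueError(f"not a comparison: {expr!r}")
    return type(body.left).__name__


# `~alt == "*"` is a comparison whose left side is the inverted operand ...
print(top_level_node('~alt == "*"'))     # Compare
print(left_operand_node('~alt == "*"'))  # UnaryOp
# ... whereas the parenthesized form inverts the result of the comparison.
print(top_level_node('~(alt == "*")'))   # UnaryOp
```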

ch-kr · 2025-02-21T18:10:06Z

gnomad_qc/v4/create_release/create_de_novo_release.py

+        gene_id=ht.vep.worst_csq_for_variant.gene_id,
+        transcript_id=ht.vep.worst_csq_for_variant.transcript_id,


now that I see this again, I wonder if these should be named worst_csq_gene_id and worst_csq_transcript_id. what do you think?

Yeah, I could change that.

gnomad_qc/v4/create_release/create_de_novo_release.py

ch-kr

a few more comments

gnomad_qc/v4/create_release/create_de_novo_release.py

ch-kr · 2025-02-21T19:42:33Z

gnomad_qc/v4/create_release/create_de_novo_release.py

ch-kr

a couple minor comments. this is close, but we need to finalize how we're defining medium confidence DNMs

gnomad_qc/v4/create_release/create_de_novo_release.py

ch-kr · 2025-03-07T16:43:29Z

gnomad_qc/v4/create_release/create_de_novo_release.py

-    # the fields are present in the HT.
-    final_fields = get_final_ht_fields(ht, schema=FINALIZED_SCHEMA)
-    return ht.select(*final_fields["rows"]).select_globals(*final_fields["globals"])
+    return ht_all_conf, ht


with my other comment above: we should filter the release HT to only high and medium confidence de novos, and it sounds like there is some remaining discussion on how we define medium confidence

Do we still release two versions: one with everything and one with only high-quality calls?

I saw this on slack before I saw this comment, but adding here for future reference: my understanding based on the summaries I heard of the ATGU lab meeting presentation was that we wanted to release at most high and medium (with some filtering) confidence de novos. this means we should have two release files, a HT with high + medium DNMs, and a TSV with high confidence DNMs only
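
A rough sketch of the split described above, in plain Python rather than Hail: one output keeps high confidence plus filtered medium confidence calls, and the other keeps high confidence calls only. The record layout, the lowercase confidence labels, the function name, and the 0.9 cutoff on `p_de_novo` are all assumptions for illustration; at this point in the thread the medium confidence definition is still open.

```python
from typing import Any, Dict, List, Tuple

Call = Dict[str, Any]


def split_release_calls(
    calls: List[Call], medium_p_cutoff: float = 0.9
) -> Tuple[List[Call], List[Call]]:
    """Return (ht_calls, tsv_calls) from a list of de novo call records.

    ht_calls: high confidence calls plus medium confidence calls that pass
    the (assumed) p_de_novo cutoff. tsv_calls: high confidence calls only.
    """
    ht_calls: List[Call] = []
    tsv_calls: List[Call] = []
    for call in calls:
        if call["confidence"] == "high":
            ht_calls.append(call)
            tsv_calls.append(call)
        elif call["confidence"] == "medium" and call["p_de_novo"] >= medium_p_cutoff:
            ht_calls.append(call)
        # Low confidence and unfiltered medium confidence calls are not released.
    return ht_calls, tsv_calls
```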

ch-kr

a few documentation-related requests

gnomad_qc/v4/create_release/create_de_novo_release.py

ch-kr · 2025-03-12T20:37:38Z

gnomad_qc/v4/create_release/create_de_novo_release.py

-        for data_type in ["exomes", "genomes", "joint"]
-    }
+    This step get two sets of de novo calls:
+       - de novos at all confidence levels


this should probably state high quality given your naming below (ht_all_hq)

can you also define what you mean by high quality and what is included in the filtered version of the table in this docstring?

ch-kr · 2025-03-12T20:42:47Z

gnomad_qc/v4/create_release/create_de_novo_release.py

+            "de_novo_AC.AC_high_conf": "de_novo_AC_high_conf",
+            "de_novo_AC.AC_medium_conf": "de_novo_AC_medium_conf",
+            "de_novo_AC.AC_medium_conf_P_0_9": "de_novo_AC_medium_conf_P_0_9",
+            "de_novo_AC.AC_low_conf": "de_novo_AC_low_conf",


this TSV should contain at most the high confidence and medium (p > whatever threshold is decided) de novos, so you should drop the medium and low confidence fields rather than renaming them
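
To make the suggestion concrete, here is a small plain-Python sketch of dropping the medium and low confidence count fields from a flattened row before export instead of renaming them. The field names are copied from the diff above; the helper name `prepare_tsv_row` and the sample row are made up. Dropping `AC_medium_conf_P_0_9` as well is an assumption based on the TSV being high confidence only, so the dropped set is left as a parameter.

```python
from typing import Any, Dict, FrozenSet

# Only the high confidence count is renamed for the TSV.
RENAMES: Dict[str, str] = {"de_novo_AC.AC_high_conf": "de_novo_AC_high_conf"}

# Fields removed from the TSV instead of being renamed (assumed set).
DEFAULT_DROPS: FrozenSet[str] = frozenset(
    {
        "de_novo_AC.AC_medium_conf",
        "de_novo_AC.AC_medium_conf_P_0_9",
        "de_novo_AC.AC_low_conf",
    }
)


def prepare_tsv_row(
    row: Dict[str, Any], drops: FrozenSet[str] = DEFAULT_DROPS
) -> Dict[str, Any]:
    """Drop unwanted fields, then rename the remaining ones where a rename exists."""
    return {RENAMES.get(key, key): value for key, value in row.items() if key not in drops}
```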

gnomad_qc/v4/create_release/create_de_novo_release.py

KoalaQin · 2025-03-17T19:24:52Z

back to you! Konrad gave us the green light!

KoalaQin · 2025-03-17T20:09:56Z

New test: d1696a90203441629959ef1b05e3906a

ch-kr

so close! a couple more questions

gnomad_qc/v4/create_release/create_de_novo_release.py

ch-kr · 2025-03-17T20:37:48Z

gnomad_qc/v4/resources/release.py

@@ -396,22 +396,25 @@ def release_all_sites_an(
    )


-def release_de_novo(test: bool = False) -> VersionedTableResource:
+def release_de_novo(


this function is used to also grab the TSV path:

ht.export(
    release_de_novo(test=test).path.replace(".ht", ".tsv.bgz"),
    header=True,
    delimiter="\t",
)

should this function include the option to return a TSV path? or is the current setup preferred by the team (fine either way, just wondering)

Julia wrote this. I find it's better this way, because the TSV might be an optional input or output. I'm using the same logic for my downloaded AoU TSV.
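
For illustration, the path derivation used in the export call could be wrapped in a small helper. This is a sketch, not part of the PR: `tsv_path_from_ht_path` is a hypothetical name and the bucket path is made up. It swaps only a trailing `.ht`, because `str.replace` would also rewrite any earlier `.ht` in the path.

```python
def tsv_path_from_ht_path(ht_path: str) -> str:
    """Derive a bgzipped TSV path from a Hail Table path ending in '.ht'."""
    suffix = ".ht"
    if not ht_path.endswith(suffix):
        raise ValueError(f"expected a path ending in {suffix!r}, got {ht_path!r}")
    # Replace only the trailing suffix, leaving the rest of the path untouched.
    return ht_path[: -len(suffix)] + ".tsv.bgz"


print(tsv_path_from_ht_path("gs://example-bucket/de_novo.ht"))
# gs://example-bucket/de_novo.tsv.bgz
```

This keeps the resource function returning only the HT, which matches the preference stated above, while making the TSV naming rule explicit in one place.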

gnomad_qc/v4/create_release/create_de_novo_release.py

ch-kr · 2025-03-17T20:42:55Z

gnomad_qc/v4/create_release/create_de_novo_release.py

+            ),
+            p_de_novo_stats=hl.agg.stats(ht.de_novo_call_info.p_de_novo),
+            # The mixed_site info should stay the same for each variant.
+            mixed_site=hl.agg.take(ht.mixed_site, 1)[0],


is this getting included in the release? it isn't included in the proposed schema https://atgu.slack.com/archives/CRA2TKTV0/p1740149020877199 (unless I missed it?)

No, the proposed schema is outdated. You suggested including this, no?

ah yes because you filtered * alleles so including that annotation didn't make sense to me

I don't have strong feelings about this annotation, but if you include it, I think you'll need to describe what it means in a README included with the downloads

ch-kr

one comment outside of this PR and LGTM!

ch-kr · 2025-03-19T13:57:29Z

gnomad_qc/v4/create_release/create_de_novo_release.py

+            ),
+            p_de_novo_stats=hl.agg.stats(ht.de_novo_call_info.p_de_novo),
+            # The mixed_site info should stay the same for each variant.
+            mixed_site=hl.agg.take(ht.mixed_site, 1)[0],


I don't have strong feelings about this annotation, but if you include it, I think you'll need to describe what it means in a README included with the downloads

KoalaQin · 2025-03-19T14:13:51Z

A README file for all the fields? Yeah, I could do that.

KoalaQin added 3 commits January 28, 2025 16:02

Generate de novo calls with new function

7812365

Add Julia's changes for dense MT and clean up

4b1355a

revert a comment

277e2fc

KoalaQin assigned KoalaQin and ch-kr Jan 29, 2025

KoalaQin requested a review from ch-kr January 29, 2025 17:54

ch-kr requested changes Jan 29, 2025

KoalaQin and others added 5 commits January 30, 2025 17:13

Apply suggestions from code review

d3a6d3d

Co-authored-by: Katherine Chao <[email protected]>

Address review comments

da2af71

Merge branch 'qh/denovo' of https://github.com/broadinstitute/gnomad_qc…

e157ed3

… into qh/denovo

Change to use new functions from gnomad_methods

dba9f98

import new function

73fd7da

KoalaQin requested a review from ch-kr February 3, 2025 15:48

Select entries and transmute colnames

7c8bc8b

ch-kr reviewed Feb 3, 2025

KoalaQin added 6 commits February 6, 2025 14:10

Change the filter step

d6fbcb4

Modify comment about missing AD

0966825

Reflect gnomad_methods changes

af47133

Add aggregate and annnotate function

db83d47

Clean up the aggregate function

79b5ae2

Clean up the functions

aa66a8a

KoalaQin requested a review from ch-kr February 21, 2025 16:04

Change docstring

f3f67f3

KoalaQin changed the title ~~Generate de novo calls~~ Generate de novo release Feb 21, 2025

ch-kr requested changes Feb 21, 2025

KoalaQin added 4 commits February 21, 2025 14:04

Address review comments

d3a15dc

remove postfix

d793b96

Correct errors

bb179fd

Add mixed_site

fb279ed

KoalaQin requested a review from ch-kr February 21, 2025 19:21

Put coding_sequence_variant back

8f5c4ff

ch-kr reviewed Feb 21, 2025

KoalaQin added 2 commits February 24, 2025 09:00

Address review comments

def7de5

Fix bad operand for string

dc07534

KoalaQin requested a review from ch-kr March 5, 2025 20:15

ch-kr reviewed Mar 7, 2025

Change to add medium P>=0.9 to the high-quality

9257250

KoalaQin requested a review from ch-kr March 12, 2025 20:14

Change to high-quality

3b4d7b7

ch-kr requested changes Mar 12, 2025

KoalaQin added 4 commits March 12, 2025 17:05

Address review comments

c18a209

change to agg.take

75e6d4e

add indentation

bd61a27

Adjust docstring

7058d59

KoalaQin requested a review from ch-kr March 17, 2025 19:24

Only getting high-quality

6bd5796

fix typo

5e5662e

ch-kr reviewed Mar 17, 2025

KoalaQin and others added 2 commits March 17, 2025 17:22

Update gnomad_qc/v4/create_release/create_de_novo_release.py

302f53e

Co-authored-by: Katherine Chao <[email protected]>

Apply suggestions from code review

8de2196

Co-authored-by: Katherine Chao <[email protected]>

KoalaQin requested a review from ch-kr March 17, 2025 21:39

ch-kr approved these changes Mar 19, 2025

KoalaQin merged commit e507a3e into main Mar 19, 2025
4 checks passed

KoalaQin deleted the qh/denovo branch March 19, 2025 18:32
