Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate de novo release #654

Merged
merged 33 commits into from
Mar 19, 2025
Merged

Generate de novo release #654

merged 33 commits into from
Mar 19, 2025

Conversation

KoalaQin
Copy link
Contributor

@KoalaQin KoalaQin commented Jan 29, 2025

This cleaned up some redundant dense mt functions and arguments since we had PR #651. Julia created the dense MT in job 0a3e98e75ba14b2f8341012515d11f8b before the PR was merged.

This is waiting on PR #760 in gnomad_methods.

Test run on chr20: b0729d42ea77466c8cc1fe9697e73af1

Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some questions and minor suggestions

@KoalaQin KoalaQin requested a review from ch-kr February 3, 2025 15:48
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple more minor comments, then waiting on the other PR

@@ -135,34 +135,50 @@ def get_releasable_de_novo_calls_ht(
)
mt = annotate_adj(mt)

# Approximate the AD and PL fields when missing.
# Many of our larger datasets have the PL and AD fields for homref genotypes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Many of our larger datasets have the PL and AD fields for homref genotypes
# Many of our larger datasets have the PL and AD fields for homref genotypes

can you check if this is true for v3 as well? if yes, it's best to say v3 and v4 have the PL and ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they are also NA in v3 VDS. FYI, I used this code to check:

from gnomad_qc.v3.resources.basics import get_gnomad_v3_vds
from gnomad.sample_qc.relatedness import filter_to_trios

vds = get_gnomad_v3_vds(split=True, 
                       filter_intervals=['chr11:113409605-113475691'],
                       )

trios = hl.import_fam("gs://gnomad-qin/g1k.trios.fam", delimiter='\t')
vds = filter_to_trios(vds, trios)

mt = hl.vds.to_dense_mt(vds)
mt.entries().show(20)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for looking! this is good to know

@KoalaQin KoalaQin requested a review from ch-kr February 21, 2025 16:04
@KoalaQin KoalaQin changed the title Generate de novo calls Generate de novo release Feb 21, 2025
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few minor suggestions and questions



def trio_denovo_ht(
releasable: bool = True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does a non-releasable version of this HT exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, will remove.

".all_confidences.filtered"
if by_confidence == "all"
else ".high_confidence.filtered"
)
postfix = f".{datetime.today().strftime('%Y-%m-%d')}" if test else ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basic question, is this how we've started naming test files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, Julia just added this to have different versions to inspect. I could remove it.


return selects
ht = ht.filter(ht.de_novo_call_info.is_de_novo).naive_coalesce(1000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this 1000 be an argument rather than hard coded? not sure how strict we've been about this in gnomad_qc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think I will make it an argument default to 1000 and use 1/10 for test.

ht = process_consequences(ht, has_polyphen=False)

ht = ht.annotate(
alt_is_star=ht.alleles[1] == "*",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than alt_is_star, the annotation added earlier (mixed_site) might be more valuable. users can pretty easily check for themselves whether the alt is a star allele

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think they are the same thing:
Screenshot 2025-02-21 at 1 34 10 PM
Maybe we could keep both? I need to use alt_is_star to filter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, they're different annotations -- I suggested keeping mixed_site in case you were trying to retain information about the locus for our users.

for alt_is_star specifically, it feels cleaner to filter directly using the logic here:

ht = ht.filter(~ht.alleles[1] == "*").checkpoint(new_temp_file("denovo_no_star", "ht"))

as opposed to keeping this annotation. the reason I suggest this is because the TSV will still have the alt_is_star annotation despite it always being False, and users that are reading in the HT can easily filter on this logic themselves as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change this, and I also moved the all confidence table after it.

Comment on lines 180 to 181
gene_id=ht.vep.worst_csq_for_variant.gene_id,
transcript_id=ht.vep.worst_csq_for_variant.transcript_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that I see this again, I wonder if these should be named worst_csq_gene_id and worst_csq_transcript_id. what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I could change that.

@KoalaQin KoalaQin requested a review from ch-kr February 21, 2025 19:21
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few more comments

ht = process_consequences(ht, has_polyphen=False)

ht = ht.annotate(
alt_is_star=ht.alleles[1] == "*",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, they're different annotations -- I suggested keeping mixed_site in case you were trying to retain information about the locus for our users.

for alt_is_star specifically, it feels cleaner to filter directly using the logic here:

ht = ht.filter(~ht.alleles[1] == "*").checkpoint(new_temp_file("denovo_no_star", "ht"))

as opposed to keeping this annotation. the reason I suggest this is because the TSV will still have the alt_is_star annotation despite it always being False, and users that are reading in the HT can easily filter on this logic themselves as well

@KoalaQin KoalaQin requested a review from ch-kr March 5, 2025 20:15
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple minor comments. this is close, but we need to finalize how we're defining medium confidence DNMs

# the fields are present in the HT.
final_fields = get_final_ht_fields(ht, schema=FINALIZED_SCHEMA)
return ht.select(*final_fields["rows"]).select_globals(*final_fields["globals"])
return ht_all_conf, ht
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with my other comment above: we should filter the release HT to only high and medium confidence de novos, and it sounds like there is some remaining discussion on how we define medium confidence

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still release 2? One with everything and one with high quality?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw this on slack before I saw this comment, but adding here for future reference: my understanding based on the summaries I heard of the ATGU lab meeting presentation was that we wanted to release at most high and medium (with some filtering) confidence de novos. this means we should have two release files, a HT with high + medium DNMs, and a TSV with high confidence DNMs only

@KoalaQin KoalaQin requested a review from ch-kr March 12, 2025 20:14
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few documentation-related requests

for data_type in ["exomes", "genomes", "joint"]
}
This step get two sets of de novo calls:
- de novos at all confidence levels
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should probably state high quality given your naming below (ht_all_hq)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also define what you mean by high quality and what is included in the filtered version of the table in this docstring?

Comment on lines 267 to 270
"de_novo_AC.AC_high_conf": "de_novo_AC_high_conf",
"de_novo_AC.AC_medium_conf": "de_novo_AC_medium_conf",
"de_novo_AC.AC_medium_conf_P_0_9": "de_novo_AC_medium_conf_P_0_9",
"de_novo_AC.AC_low_conf": "de_novo_AC_low_conf",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this TSV should contain at most the high confidence and medium (p > whatever threshold is decided) de novos, so you should drop the medium and low confidence fields rather than renaming them

@KoalaQin KoalaQin requested a review from ch-kr March 17, 2025 19:24
@KoalaQin
Copy link
Contributor Author

back to you! Konrad gave us green light!

@KoalaQin
Copy link
Contributor Author

New test: d1696a90203441629959ef1b05e3906a

Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so close! a couple more questions

@@ -396,22 +396,25 @@ def release_all_sites_an(
)


def release_de_novo(test: bool = False) -> VersionedTableResource:
def release_de_novo(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function is used to also grab the TSV path:

        ht.export(
            release_de_novo(test=test).path.replace(".ht", ".tsv.bgz"),
            header=True,
            delimiter="\t",
        )

should this function include the option to return a TSV path? or is the current setup preferred by the team (fine either way, just wondering)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Julia wrote this. I found it's better this way, because the tsv might be an optional input or output. I'm using the same logic for my downloaded AoU tsv.

),
p_de_novo_stats=hl.agg.stats(ht.de_novo_call_info.p_de_novo),
# The mixed_site info should stay the same for each variant.
mixed_site=hl.agg.take(ht.mixed_site, 1)[0],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this getting included in the release? it isn't included in the proposed schema https://atgu.slack.com/archives/CRA2TKTV0/p1740149020877199 (unless I missed it?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the proposed schema was old. You suggested to include this, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes because you filtered * alleles so including that annotation didn't make sense to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong feelings about this annotation, but if you include it, I think you'll need to describe what it means in a README included with the downloads

@KoalaQin KoalaQin requested a review from ch-kr March 17, 2025 21:39
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one comment outside of this PR and LGTM!

),
p_de_novo_stats=hl.agg.stats(ht.de_novo_call_info.p_de_novo),
# The mixed_site info should stay the same for each variant.
mixed_site=hl.agg.take(ht.mixed_site, 1)[0],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong feelings about this annotation, but if you include it, I think you'll need to describe what it means in a README included with the downloads

@KoalaQin
Copy link
Contributor Author

A README file for all the fields? Yeah, I could do that.

@KoalaQin KoalaQin merged commit e507a3e into main Mar 19, 2025
4 checks passed
@KoalaQin KoalaQin deleted the qh/denovo branch March 19, 2025 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants