-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate de novo release #654
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some questions and minor suggestions
Co-authored-by: Katherine Chao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a couple more minor comments, then waiting on the other PR
@@ -135,34 +135,50 @@ def get_releasable_de_novo_calls_ht( | |||
) | |||
mt = annotate_adj(mt) | |||
|
|||
# Approximate the AD and PL fields when missing. | |||
# Many of our larger datasets have the PL and AD fields for homref genotypes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Many of our larger datasets have the PL and AD fields for homref genotypes | |
# Many of our larger datasets have the PL and AD fields for homref genotypes |
can you check if this is true for v3 as well? if yes, it's best to say v3 and v4 have the PL and ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they are also NA in v3 VDS. FYI, I used this code to check:
from gnomad_qc.v3.resources.basics import get_gnomad_v3_vds
from gnomad.sample_qc.relatedness import filter_to_trios
vds = get_gnomad_v3_vds(split=True,
filter_intervals=['chr11:113409605-113475691'],
)
trios = hl.import_fam("gs://gnomad-qin/g1k.trios.fam", delimiter='\t')
vds = filter_to_trios(vds, trios)
mt = hl.vds.to_dense_mt(vds)
mt.entries().show(20)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for looking! this is good to know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few minor suggestions and questions
gnomad_qc/v4/resources/sample_qc.py
Outdated
|
||
|
||
def trio_denovo_ht( | ||
releasable: bool = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does a non-releasable version of this HT exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, will remove.
gnomad_qc/v4/resources/release.py
Outdated
".all_confidences.filtered" | ||
if by_confidence == "all" | ||
else ".high_confidence.filtered" | ||
) | ||
postfix = f".{datetime.today().strftime('%Y-%m-%d')}" if test else "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
basic question, is this how we've started naming test files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, Julia just added this to have different versions to inspect. I could remove it.
|
||
return selects | ||
ht = ht.filter(ht.de_novo_call_info.is_de_novo).naive_coalesce(1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this 1000 be an argument rather than hard coded? not sure how strict we've been about this in gnomad_qc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think I will make it an argument default to 1000 and use 1/10 for test.
ht = process_consequences(ht, has_polyphen=False) | ||
|
||
ht = ht.annotate( | ||
alt_is_star=ht.alleles[1] == "*", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than alt_is_star
, the annotation added earlier (mixed_site
) might be more valuable. users can pretty easily check for themselves whether the alt is a star allele
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, they're different annotations -- I suggested keeping mixed_site
in case you were trying to retain information about the locus for our users.
for alt_is_star
specifically, it feels cleaner to filter directly using the logic here:
ht = ht.filter(~ht.alleles[1] == "*").checkpoint(new_temp_file("denovo_no_star", "ht"))
as opposed to keeping this annotation. the reason I suggest this is because the TSV will still have the alt_is_star
annotation despite it always being False, and users that are reading in the HT can easily filter on this logic themselves as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change this, and I also moved the all confidence table after it.
gene_id=ht.vep.worst_csq_for_variant.gene_id, | ||
transcript_id=ht.vep.worst_csq_for_variant.transcript_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now that I see this again, I wonder if these should be named worst_csq_gene_id
and worst_csq_transcript_id
. what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I could change that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few more comments
ht = process_consequences(ht, has_polyphen=False) | ||
|
||
ht = ht.annotate( | ||
alt_is_star=ht.alleles[1] == "*", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, they're different annotations -- I suggested keeping mixed_site
in case you were trying to retain information about the locus for our users.
for alt_is_star
specifically, it feels cleaner to filter directly using the logic here:
ht = ht.filter(~ht.alleles[1] == "*").checkpoint(new_temp_file("denovo_no_star", "ht"))
as opposed to keeping this annotation. the reason I suggest this is because the TSV will still have the alt_is_star
annotation despite it always being False, and users that are reading in the HT can easily filter on this logic themselves as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a couple minor comments. this is close, but we need to finalize how we're defining medium confidence DNMs
# the fields are present in the HT. | ||
final_fields = get_final_ht_fields(ht, schema=FINALIZED_SCHEMA) | ||
return ht.select(*final_fields["rows"]).select_globals(*final_fields["globals"]) | ||
return ht_all_conf, ht |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with my other comment above: we should filter the release HT to only high and medium confidence de novos, and it sounds like there is some remaining discussion on how we define medium confidence
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still release 2? One with everything and one with high quality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw this on slack before I saw this comment, but adding here for future reference: my understanding based on the summaries I heard of the ATGU lab meeting presentation was that we wanted to release at most high and medium (with some filtering) confidence de novos. this means we should have two release files, a HT with high + medium DNMs, and a TSV with high confidence DNMs only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few documentation-related requests
for data_type in ["exomes", "genomes", "joint"] | ||
} | ||
This step get two sets of de novo calls: | ||
- de novos at all confidence levels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should probably state high quality given your naming below (ht_all_hq
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also define what you mean by high quality and what is included in the filtered version of the table in this docstring?
"de_novo_AC.AC_high_conf": "de_novo_AC_high_conf", | ||
"de_novo_AC.AC_medium_conf": "de_novo_AC_medium_conf", | ||
"de_novo_AC.AC_medium_conf_P_0_9": "de_novo_AC_medium_conf_P_0_9", | ||
"de_novo_AC.AC_low_conf": "de_novo_AC_low_conf", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this TSV should contain at most the high confidence and medium (p > whatever threshold is decided) de novos, so you should drop the medium and low confidence fields rather than renaming them
back to you! Konrad gave us green light! |
New test: d1696a90203441629959ef1b05e3906a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so close! a couple more questions
@@ -396,22 +396,25 @@ def release_all_sites_an( | |||
) | |||
|
|||
|
|||
def release_de_novo(test: bool = False) -> VersionedTableResource: | |||
def release_de_novo( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function is used to also grab the TSV path:
ht.export(
release_de_novo(test=test).path.replace(".ht", ".tsv.bgz"),
header=True,
delimiter="\t",
)
should this function include the option to return a TSV path? or is the current setup preferred by the team (fine either way, just wondering)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Julia wrote this. I found it's better this way, because the tsv might be an optional input or output. I'm using the same logic for my downloaded AoU tsv.
), | ||
p_de_novo_stats=hl.agg.stats(ht.de_novo_call_info.p_de_novo), | ||
# The mixed_site info should stay the same for each variant. | ||
mixed_site=hl.agg.take(ht.mixed_site, 1)[0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this getting included in the release? it isn't included in the proposed schema https://atgu.slack.com/archives/CRA2TKTV0/p1740149020877199 (unless I missed it?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the proposed schema was old. You suggested to include this, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yes because you filtered *
alleles so including that annotation didn't make sense to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have strong feelings about this annotation, but if you include it, I think you'll need to describe what it means in a README included with the downloads
Co-authored-by: Katherine Chao <[email protected]>
Co-authored-by: Katherine Chao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one comment outside of this PR and LGTM!
), | ||
p_de_novo_stats=hl.agg.stats(ht.de_novo_call_info.p_de_novo), | ||
# The mixed_site info should stay the same for each variant. | ||
mixed_site=hl.agg.take(ht.mixed_site, 1)[0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have strong feelings about this annotation, but if you include it, I think you'll need to describe what it means in a README included with the downloads
A README file for all the fields? Yeah, I could do that. |
This cleaned up some redundant dense mt functions and arguments since we had PR #651. Julia created the dense MT in job 0a3e98e75ba14b2f8341012515d11f8b before the PR was merged.
This is waiting on PR #760 in gnomad_methods.
Test run on chr20: b0729d42ea77466c8cc1fe9697e73af1