Transcript Versions - Gaps and replace alignments with newer annotations #494

davmlaw · 2021-09-23T00:51:31Z

The alignments can change for the same transcript version, so we should update data to the latest (if using --replace)

https://www.ncbi.nlm.nih.gov/nuccore/NM_004656.4 is 3600bp long, and 1 base short summing up JSON exon lengths

Emma noticed that GCF_000001405.39_GRCh38.p13_genomic.109.20210514.gff.gz the exon lengths sum to 3600 - ie different than our JSON data.

Here's the genePred data (0-based like the JSON) and yes it looks like the 1st (in genomic order) exon start is 52435023 while the JSON is 52435024

$ zgrep NM_004656 GCF_000001405.25_GRCh37.p13_genomic.105.20201022.genePred
NM_004656.4     NC_000003.11    -       52435023        52444024        52436303        52443894        17      52435023,52436617,52436794,52437153,52437431,52438468,52439125,52439780,52440268,52440844,52441189,52441414,52441973,52442489,52443569,52443729,52443857,  52436437,52436690,52436887,52437314,52437910,52438602,52439310,52439928,52440392,52440923,52441332,52441476,52442093,52442622,52443624,52443759,52444024,  0       BAP1    cmpl    cmpl    1,0,0,1,2,0,1,0,2,1,2,0,0,2,1,1,0,

The data was originally inserted via these files which do have the 52435024 coordinate:

ucsc_hg19.20181211.genePred:NM_004656.4 chr3    -       52435024        52444024        52436303        52443894        17      52435024,52436617,52436794,52437153,52437431,52438468,52439125,52439780,52440268,52440844,52441189,52441414,52441973,52442489,52443569,52443729,52443857,  52436437,52436690,52436887,52437314,52437910,52438602,52439310,52439928,52440392,52440923,52441332,52441476,52442093,52442622,52443624,52443759,52444024,  0       BAP1
ucsc_hg19.20190821.genePred:NM_004656.4 chr3    -       52435024        52444024        52436303        52443894        17      52435024,52436617,52436794,52437153,52437431,52438468,52439125,52439780,52440268,52440844,52441189,52441414,52441973,52442489,52443569,52443729,52443857,  52436437,52436690,52436887,52437314,52437910,52438602,52439310,52439928,52440392,52440923,52441332,52441476,52442093,52442622,52443624,52443759,52444024,  0       BAP1

The current process is:

Go through all the GTF/GFF (and genePred) files from oldest to newest, and insert new transcript version records that we don't have
If the old record is missing data, or aligned to a patch, then replace data with the new one

By default we do NOT replace the transcript data with the new stuff, my thoughts were that if the transcript version wasn't bumped, then that wouldn't change.

Looks like this was a bad assumption as for whatever reason, the genome being patched or UCSC having bad alignments, the alignments for the same transcript version can indeed change. Will move this to a new issue

The text was updated successfully, but these errors were encountered:

davmlaw · 2021-09-27T04:34:45Z

Need to remember to re-match any classifications against transcripts that change because of this issue.

Since we haven't upgraded any prod systems yet might be better to modify the previous rematch classifications management command, delete the manual migration and then do a new one

Should only call the API made classifications, James says you can best get this via:

Your first classification modification will have a source of “variantgrid”
If you create via API, your first will have a source of “api”

davmlaw · 2021-09-27T08:42:36Z

I think I'll just handle the gaps in PyHGVS

Remove alignment_gap boolean - this is too course and uninformative - we should store info in the transcript version data and decide then whether it can be used
It looks like cDNA_match line up with exons, which makes it pretty easy actually, just need to add a separate list in parallel with exons that stores the Gap=M185 I3 M250 attribute

https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

Then just need to handle that shift in the exons - probably just write that into PyHGVS - we're already made modifications etc

Previous discussion:

3132 HGVS - mRNA/genome gapped alignments
480 UTA transcripts with gaps

counsyl/hgvs#60

…nnotations - use HTTP as getting corruption via ftp to NCBI

davmlaw · 2021-09-29T07:52:23Z

I found that "partial" was used to describe partial matches, so added that to data

import gzip
import json
from snpdb.models import GenomeBuild
from genes.models import TranscriptVersionSequenceInfo, TranscriptVersion

vg_tv_lengths = {tv.accession: tv.length for tv in TranscriptVersion.objects.filter(genome_build=GenomeBuild.grch37(), transcript__annotation_consortium='R')}
expected_lengths = {}
for tvsi in TranscriptVersionSequenceInfo.objects.all():
    expected_lengths[tvsi.accession] = tvsi.length

# filename = "/media/dlawrence/SpinningIron/reference/variantgrid_setup_data/gene_annotation/refseq/GRCh37/GCF_000001405.25_GRCh37.p13_genomic.105.20201022.gff.gz.json.gz"
filename = "/data/annotation/variantgrid_setup_data/gene_annotation/refseq/GRCh37/GCF_000001405.25_GRCh37.p13_genomic.105.20201022.gff.gz.json.gz"
with gzip.open(filename) as f:
    data = json.load(f)

vg_right_gff_right = 0
vg_right_gff_wrong = 0
vg_wrong_gff_right = 0
vg_wrong_gff_wrong = 0


gff_wrong_has_cdna_match = 0
gff_wrong_has_partial = 0
gff_wrong_no_idea = 0


def pyreference_transcript_length(transcript):
    return sum([d["stop"] - d["start"] for d in transcript["features_by_type"]["exon"]])

for transcript_accession, transcript in data["transcripts_by_id"].items():
    if expected_length := expected_lengths.get(transcript_accession):
        length = pyreference_transcript_length(transcript)
        vg_length = vg_tv[transcript_accession].length
        vg_gap = vg_tv[transcript_accession].alignment_gap
        vg_right = vg_length == expected_length
        gff_right = length == expected_length
        cdna_match = "cDNA_match" in transcript["features_by_type"]
        partial = "partial" in transcript
        
        if gff_right:
            if vg_right:
                vg_right_gff_right += 1
            else:
                vg_wrong_gff_right += 1
        else:
            if vg_right:
                vg_right_gff_wrong += 1
                print(f"{transcript_accession}: seq: {expected_length}, VG {vg_length} ({vg_gap=}), GFF {length} {cdna_match=}/{partial=}")
            else:
                vg_wrong_gff_wrong += 1
                
            if cdna_match:
                gff_wrong_has_cdna_match += 1
            elif partial:
                gff_wrong_has_partial += 1
            else:
                gff_wrong_no_idea += 1

print(f"{vg_right_gff_right=} / {vg_right_gff_wrong=} / {vg_wrong_gff_right=} / {vg_wrong_gff_wrong}")
print(f"{gff_wrong_has_cdna_match=} / {gff_wrong_has_partial=} / {gff_wrong_no_idea=}")

Output:

vg_right_gff_right=68118 / vg_right_gff_wrong=15 / vg_wrong_gff_right=3302 / 5911
gff_wrong_has_cdna_match=1067 / gff_wrong_has_partial=6 / gff_wrong_no_idea=4853

# The ones where VG length was ok (though had gap) and GFF wrong:

NM_001351372.1: seq: 5668, VG 5668 (vg_gap=True), GFF 5672 cdna_match=True/partial=False
NM_001366712.1: seq: 5401, VG 5401 (vg_gap=True), GFF 5407 cdna_match=True/partial=False
NR_029388.2: seq: 1448, VG 1448 (vg_gap=True), GFF 808 cdna_match=True/partial=True
NR_036455.1: seq: 636, VG 636 (vg_gap=True), GFF 381 cdna_match=True/partial=True
NR_073509.1: seq: 2143, VG 2143 (vg_gap=True), GFF 2123 cdna_match=True/partial=True
NR_110795.1: seq: 826, VG 826 (vg_gap=True), GFF 317 cdna_match=True/partial=True
NR_111945.1: seq: 2214, VG 2214 (vg_gap=True), GFF 2215 cdna_match=True/partial=False
NR_111951.1: seq: 1178, VG 1178 (vg_gap=True), GFF 1172 cdna_match=True/partial=False
NR_111953.1: seq: 1015, VG 1015 (vg_gap=True), GFF 1009 cdna_match=True/partial=False
NR_121210.1: seq: 1185, VG 1185 (vg_gap=True), GFF 1190 cdna_match=True/partial=False
NR_146657.1: seq: 4868, VG 4868 (vg_gap=True), GFF 4494 cdna_match=True/partial=True
NR_148945.1: seq: 1727, VG 1727 (vg_gap=True), GFF 1731 cdna_match=True/partial=False
NR_160518.1: seq: 4319, VG 4319 (vg_gap=True), GFF 4167 cdna_match=True/partial=True
NR_160519.1: seq: 4408, VG 4408 (vg_gap=True), GFF 4256 cdna_match=True/partial=True
NR_160941.1: seq: 1961, VG 1961 (vg_gap=False), GFF 800 cdna_match=False/partial=True

So, GFF better, plus we get those cDNA matches etc we can use

davmlaw · 2021-09-30T13:55:10Z

Working in branch "gene_import"

# Download and create data
~/localwork/variantgrid/genes/scripts/refseq/refseq_gene_annotation_grch37.sh
~/localwork/variantgrid/genes/scripts/refseq/refseq_gene_annotation_grch38.sh

~/localwork/variantgrid/genes/scripts/refseq/ensembl_gene_annotation_grch37.sh
~/localwork/variantgrid/genes/scripts/refseq/ensembl_gene_annotation_grch38.sh

TODO:

We're now inserting - NG_/XR- there's an issue about this - now solved?
Work out what’s going on in lengths not matching and no partial etc - any other way to tell?
PyReference pip release? Also say in README about supported GFFs?
PyReference 5PUTR - if already have five_prime_UTR as some GFFs do - then rename instead of calculate it? Also coding_start / coding_stop etc.

…, include gene versions

davmlaw · 2021-10-01T14:40:13Z

Checked in start of gap handling code to https://github.com/SACGF/hgvs

Using test data that worked with VEP:

2:73385942 A>T: VEP: NM_015120.4:c.74A>T - OLD pyhgvs: NM_015120.4(ALMS1):c.74A>T
2:73385943 A>T: VEP: NM_015120.4:c.78A>T - OLD pyhgvs: NM_015120.4(ALMS1):c.75A>T
2:73385944 G>C: VEP: NM_015120.4:c.79G>C - OLD pyhgvs: NM_015120.4(ALMS1):c.76G>C

from genes.models import GeneVersion, TranscriptVersion
from genes.hgvs import HGVSMatcher, GenomeBuild
transcript_version = TranscriptVersion.get("NM_015120.4", GenomeBuild.grch38(), 'R')
matcher = HGVSMatcher(GenomeBuild.grch38())

matcher._pyhgvs_get_variant_tuple("NM_015120.4(ALMS1):c.74A>T", transcript_version)
Out[5]: ('2', 73385942, 'A', 'T')

matcher._pyhgvs_get_variant_tuple("NM_015120.4(ALMS1):c.75A>T", transcript_version)
matcher._pyhgvs_get_variant_tuple("NM_015120.4(ALMS1):c.76A>T", transcript_version)
matcher._pyhgvs_get_variant_tuple("NM_015120.4(ALMS1):c.77A>T", transcript_version)
# These 3 die with ValueError: Coordinate inside insertion (I3) - no mapping possible!

matcher._pyhgvs_get_variant_tuple("NM_015120.4(ALMS1):c.78A>T", transcript_version)
Out[9]: ('2', 73385943, 'A', 'T')

Still TODO:

Need to handle the other way - ie genomic_to_cdna_coord
Need to test on -'ve strand, not sure I got those right way around.

davmlaw · 2021-10-15T04:29:24Z

Install instructions

export gene_annotation_dir=/tmp/gene_annotation
mkdir -p ${gene_annotation_dir}
cd ${gene_annotation_dir}
wget https://variantgrid.com/static/gene_annotation/pyhgvs_transcripts_ensembl_grch37.json.gz  https://variantgrid.com/static/gene_annotation/pyhgvs_transcripts_ensembl_grch38.json.gz  https://variantgrid.com/static/gene_annotation/pyhgvs_transcripts_refseq_grch37.json.gz  https://variantgrid.com/static/gene_annotation/pyhgvs_transcripts_refseq_grch38.json.gz

python3 manage.py import_gene_annotation --annotation-consortium=Ensembl --genome-build=GRCh37 --json-file ${gene_annotation_dir}/pyhgvs_transcripts_ensembl_grch37.json.gz  
python3 manage.py import_gene_annotation --annotation-consortium=Ensembl --genome-build=GRCh38 --json-file ${gene_annotation_dir}/pyhgvs_transcripts_ensembl_grch38.json.gz

python3 manage.py import_gene_annotation --annotation-consortium=RefSeq --genome-build=GRCh37 --json-file ${gene_annotation_dir}/pyhgvs_transcripts_refseq_grch37.json.gz
python3 manage.py import_gene_annotation --annotation-consortium=RefSeq --genome-build=GRCh38 --json-file ${gene_annotation_dir}/pyhgvs_transcripts_refseq_grch38.json.gz

…ence diffs on transcript version page

davmlaw · 2021-10-20T02:58:11Z

Just testing those I had transcript sequences for (so didn't need to use API request to get more)

162510 total transcripts
  178 had alignment gap
    131 - can handle gaps via cDNA match
     43 - due to partial alignment
      4 - for unknown reasons: NR_110481.1, NM_001242902.1, NR_110480.1, NM_001242903.1

 4879 (3%) had length be different than transcript
    4879 - (100% of length diff) - were due to polyA tails

So polyA explains 99.93% of the length differences (not due to cDNA matches) and the transcripts look like they're polyA, here's the end sequence:

NM_001242903.1 (AKAP1/GRCh38)
AAAAAAAAAAAAAAAAAAAAAAAAAAAACCACTT
NR_110480.1 (LOC101927079/GRCh38)
CAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
NR_110481.1 (LOC101927079/GRCh38)
CAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
NM_001242902.1 (AKAP1/GRCh38)
AAAAAAAAAAAAAAAAAAAAAAAAAAAACCACTT

EmmaTudini · 2021-11-07T09:03:43Z

Testing covered by https://app.zenhub.com/workspaces/everything-space-5bb3158c4b5806bc2beae448/issues/sacgf/shariant-admin/124. @davmlaw Does anything else need to be tested?

TheMadBug · 2021-11-12T01:21:31Z

Think this has been well tested by fire effectively.

davmlaw added the Annotation label Sep 23, 2021

davmlaw mentioned this issue Sep 23, 2021

UTA transcripts with gaps #480

Closed

davmlaw changed the title ~~Transcript Version alignments need to be replaced with newer annotations~~ Transcript Versions - Gaps and replace alignments with newer annotations Sep 27, 2021

davmlaw added a commit that referenced this issue Sep 27, 2021

#494 - Replace TranscriptVersion data if different

468c8a2

davmlaw added a commit that referenced this issue Sep 29, 2021

#494 - Transcript Versions - Gaps and replace alignments with newer a…

a0b599d

…nnotations - use HTTP as getting corruption via ftp to NCBI

davmlaw added a commit that referenced this issue Sep 29, 2021

#494 - Start writing new importer, test conversion code

2156cd4

davmlaw added a commit that referenced this issue Sep 30, 2021

#494 - Remove alignment gap, new fields, testing

316a4f4

davmlaw added a commit that referenced this issue Sep 30, 2021

#494 - Remove alignment gap, new fields, testing

beb7131

davmlaw added a commit that referenced this issue Sep 30, 2021

#494 - Loader

b98eb41

davmlaw added a commit that referenced this issue Sep 30, 2021

#494 - Insert data

c66fd3e

davmlaw added a commit that referenced this issue Sep 30, 2021

#494 - Insert data fixes

7a42077

davmlaw added a commit that referenced this issue Oct 1, 2021

#494 - Insert data fixes - case insensitive genes, biotype formatting…

30c4174

…, include gene versions

davmlaw added a commit that referenced this issue Oct 1, 2021

#494 - Insert data - iterator fixups

4d8362b

davmlaw added a commit that referenced this issue Oct 1, 2021

#494 - using gene version in key, don't use old GTFs w/o versions

d014dbe

davmlaw added a commit that referenced this issue Oct 11, 2021

#494 - Store more cDNA_match data

c5531cb

davmlaw added a commit that referenced this issue Oct 18, 2021

#494 - Rematch management command

30a883d

davmlaw added a commit that referenced this issue Oct 18, 2021

#494 - Rematch management command

7bc91cf

davmlaw added a commit that referenced this issue Oct 18, 2021

#494 - Manual migration for management command

5c874b1

davmlaw mentioned this issue Oct 18, 2021

New gene annotation releases #459

Closed

davmlaw added a commit that referenced this issue Oct 18, 2021

#494 - Revalidate before re-match

0f2961b

davmlaw added a commit that referenced this issue Oct 20, 2021

#494 - Allow for polyA tail. Show polyA tail length, transcript/refer…

a1bb0b1

…ence diffs on transcript version page

davmlaw added a commit that referenced this issue Oct 20, 2021

#494 - Gene Annotation release

ebc04ea

TheMadBug assigned davmlaw Oct 21, 2021

TheMadBug closed this as completed Nov 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transcript Versions - Gaps and replace alignments with newer annotations #494

Transcript Versions - Gaps and replace alignments with newer annotations #494

davmlaw commented Sep 23, 2021

davmlaw commented Sep 27, 2021 •

edited

Loading

davmlaw commented Sep 27, 2021 •

edited

Loading

davmlaw commented Sep 29, 2021 •

edited

Loading

davmlaw commented Sep 30, 2021 •

edited

Loading

davmlaw commented Oct 1, 2021 •

edited

Loading

davmlaw commented Oct 15, 2021 •

edited

Loading

davmlaw commented Oct 20, 2021 •

edited

Loading

EmmaTudini commented Nov 7, 2021

TheMadBug commented Nov 12, 2021

Transcript Versions - Gaps and replace alignments with newer annotations #494

Transcript Versions - Gaps and replace alignments with newer annotations #494

Comments

davmlaw commented Sep 23, 2021

davmlaw commented Sep 27, 2021 • edited Loading

davmlaw commented Sep 27, 2021 • edited Loading

davmlaw commented Sep 29, 2021 • edited Loading

davmlaw commented Sep 30, 2021 • edited Loading

davmlaw commented Oct 1, 2021 • edited Loading

davmlaw commented Oct 15, 2021 • edited Loading

davmlaw commented Oct 20, 2021 • edited Loading

EmmaTudini commented Nov 7, 2021

TheMadBug commented Nov 12, 2021

davmlaw commented Sep 27, 2021 •

edited

Loading

davmlaw commented Sep 27, 2021 •

edited

Loading

davmlaw commented Sep 29, 2021 •

edited

Loading

davmlaw commented Sep 30, 2021 •

edited

Loading

davmlaw commented Oct 1, 2021 •

edited

Loading

davmlaw commented Oct 15, 2021 •

edited

Loading

davmlaw commented Oct 20, 2021 •

edited

Loading