Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transcript Versions - Gaps and replace alignments with newer annotations #494

Closed
davmlaw opened this issue Sep 23, 2021 · 9 comments
Closed
Assignees

Comments

@davmlaw
Copy link
Contributor

davmlaw commented Sep 23, 2021

The alignments can change for the same transcript version, so we should update data to the latest (if using --replace)

https://www.ncbi.nlm.nih.gov/nuccore/NM_004656.4 is 3600bp long, and 1 base short summing up JSON exon lengths

Emma noticed that GCF_000001405.39_GRCh38.p13_genomic.109.20210514.gff.gz the exon lengths sum to 3600 - ie different than our JSON data.

Here's the genePred data (0-based like the JSON) and yes it looks like the 1st (in genomic order) exon start is 52435023 while the JSON is 52435024

$ zgrep NM_004656 GCF_000001405.25_GRCh37.p13_genomic.105.20201022.genePred
NM_004656.4     NC_000003.11    -       52435023        52444024        52436303        52443894        17      52435023,52436617,52436794,52437153,52437431,52438468,52439125,52439780,52440268,52440844,52441189,52441414,52441973,52442489,52443569,52443729,52443857,  52436437,52436690,52436887,52437314,52437910,52438602,52439310,52439928,52440392,52440923,52441332,52441476,52442093,52442622,52443624,52443759,52444024,  0       BAP1    cmpl    cmpl    1,0,0,1,2,0,1,0,2,1,2,0,0,2,1,1,0,

The data was originally inserted via these files which do have the 52435024 coordinate:

ucsc_hg19.20181211.genePred:NM_004656.4 chr3    -       52435024        52444024        52436303        52443894        17      52435024,52436617,52436794,52437153,52437431,52438468,52439125,52439780,52440268,52440844,52441189,52441414,52441973,52442489,52443569,52443729,52443857,  52436437,52436690,52436887,52437314,52437910,52438602,52439310,52439928,52440392,52440923,52441332,52441476,52442093,52442622,52443624,52443759,52444024,  0       BAP1
ucsc_hg19.20190821.genePred:NM_004656.4 chr3    -       52435024        52444024        52436303        52443894        17      52435024,52436617,52436794,52437153,52437431,52438468,52439125,52439780,52440268,52440844,52441189,52441414,52441973,52442489,52443569,52443729,52443857,  52436437,52436690,52436887,52437314,52437910,52438602,52439310,52439928,52440392,52440923,52441332,52441476,52442093,52442622,52443624,52443759,52444024,  0       BAP1

The current process is:

  • Go through all the GTF/GFF (and genePred) files from oldest to newest, and insert new transcript version records that we don't have
  • If the old record is missing data, or aligned to a patch, then replace data with the new one

By default we do NOT replace the transcript data with the new stuff, my thoughts were that if the transcript version wasn't bumped, then that wouldn't change.

Looks like this was a bad assumption as for whatever reason, the genome being patched or UCSC having bad alignments, the alignments for the same transcript version can indeed change. Will move this to a new issue

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 27, 2021

Need to remember to re-match any classifications against transcripts that change because of this issue.

Since we haven't upgraded any prod systems yet might be better to modify the previous rematch classifications management command, delete the manual migration and then do a new one

Should only call the API made classifications, James says you can best get this via:

  • Your first classification modification will have a source of “variantgrid”
  • If you create via API, your first will have a source of “api”

@davmlaw davmlaw changed the title Transcript Version alignments need to be replaced with newer annotations Transcript Versions - Gaps and replace alignments with newer annotations Sep 27, 2021
@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 27, 2021

I think I'll just handle the gaps in PyHGVS

  • Remove alignment_gap boolean - this is too course and uninformative - we should store info in the transcript version data and decide then whether it can be used
  • It looks like cDNA_match line up with exons, which makes it pretty easy actually, just need to add a separate list in parallel with exons that stores the Gap=M185 I3 M250 attribute

https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

Then just need to handle that shift in the exons - probably just write that into PyHGVS - we're already made modifications etc

Previous discussion:

3132 HGVS - mRNA/genome gapped alignments
480 UTA transcripts with gaps

counsyl/hgvs#60

davmlaw added a commit that referenced this issue Sep 29, 2021
…nnotations - use HTTP as getting corruption via ftp to NCBI
@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 29, 2021

I found that "partial" was used to describe partial matches, so added that to data

import gzip
import json
from snpdb.models import GenomeBuild
from genes.models import TranscriptVersionSequenceInfo, TranscriptVersion

vg_tv_lengths = {tv.accession: tv.length for tv in TranscriptVersion.objects.filter(genome_build=GenomeBuild.grch37(), transcript__annotation_consortium='R')}
expected_lengths = {}
for tvsi in TranscriptVersionSequenceInfo.objects.all():
    expected_lengths[tvsi.accession] = tvsi.length

# filename = "/media/dlawrence/SpinningIron/reference/variantgrid_setup_data/gene_annotation/refseq/GRCh37/GCF_000001405.25_GRCh37.p13_genomic.105.20201022.gff.gz.json.gz"
filename = "/data/annotation/variantgrid_setup_data/gene_annotation/refseq/GRCh37/GCF_000001405.25_GRCh37.p13_genomic.105.20201022.gff.gz.json.gz"
with gzip.open(filename) as f:
    data = json.load(f)

vg_right_gff_right = 0
vg_right_gff_wrong = 0
vg_wrong_gff_right = 0
vg_wrong_gff_wrong = 0


gff_wrong_has_cdna_match = 0
gff_wrong_has_partial = 0
gff_wrong_no_idea = 0


def pyreference_transcript_length(transcript):
    return sum([d["stop"] - d["start"] for d in transcript["features_by_type"]["exon"]])

for transcript_accession, transcript in data["transcripts_by_id"].items():
    if expected_length := expected_lengths.get(transcript_accession):
        length = pyreference_transcript_length(transcript)
        vg_length = vg_tv[transcript_accession].length
        vg_gap = vg_tv[transcript_accession].alignment_gap
        vg_right = vg_length == expected_length
        gff_right = length == expected_length
        cdna_match = "cDNA_match" in transcript["features_by_type"]
        partial = "partial" in transcript
        
        if gff_right:
            if vg_right:
                vg_right_gff_right += 1
            else:
                vg_wrong_gff_right += 1
        else:
            if vg_right:
                vg_right_gff_wrong += 1
                print(f"{transcript_accession}: seq: {expected_length}, VG {vg_length} ({vg_gap=}), GFF {length} {cdna_match=}/{partial=}")
            else:
                vg_wrong_gff_wrong += 1
                
            if cdna_match:
                gff_wrong_has_cdna_match += 1
            elif partial:
                gff_wrong_has_partial += 1
            else:
                gff_wrong_no_idea += 1

print(f"{vg_right_gff_right=} / {vg_right_gff_wrong=} / {vg_wrong_gff_right=} / {vg_wrong_gff_wrong}")
print(f"{gff_wrong_has_cdna_match=} / {gff_wrong_has_partial=} / {gff_wrong_no_idea=}")

Output:

vg_right_gff_right=68118 / vg_right_gff_wrong=15 / vg_wrong_gff_right=3302 / 5911
gff_wrong_has_cdna_match=1067 / gff_wrong_has_partial=6 / gff_wrong_no_idea=4853
# The ones where VG length was ok (though had gap) and GFF wrong:

NM_001351372.1: seq: 5668, VG 5668 (vg_gap=True), GFF 5672 cdna_match=True/partial=False
NM_001366712.1: seq: 5401, VG 5401 (vg_gap=True), GFF 5407 cdna_match=True/partial=False
NR_029388.2: seq: 1448, VG 1448 (vg_gap=True), GFF 808 cdna_match=True/partial=True
NR_036455.1: seq: 636, VG 636 (vg_gap=True), GFF 381 cdna_match=True/partial=True
NR_073509.1: seq: 2143, VG 2143 (vg_gap=True), GFF 2123 cdna_match=True/partial=True
NR_110795.1: seq: 826, VG 826 (vg_gap=True), GFF 317 cdna_match=True/partial=True
NR_111945.1: seq: 2214, VG 2214 (vg_gap=True), GFF 2215 cdna_match=True/partial=False
NR_111951.1: seq: 1178, VG 1178 (vg_gap=True), GFF 1172 cdna_match=True/partial=False
NR_111953.1: seq: 1015, VG 1015 (vg_gap=True), GFF 1009 cdna_match=True/partial=False
NR_121210.1: seq: 1185, VG 1185 (vg_gap=True), GFF 1190 cdna_match=True/partial=False
NR_146657.1: seq: 4868, VG 4868 (vg_gap=True), GFF 4494 cdna_match=True/partial=True
NR_148945.1: seq: 1727, VG 1727 (vg_gap=True), GFF 1731 cdna_match=True/partial=False
NR_160518.1: seq: 4319, VG 4319 (vg_gap=True), GFF 4167 cdna_match=True/partial=True
NR_160519.1: seq: 4408, VG 4408 (vg_gap=True), GFF 4256 cdna_match=True/partial=True
NR_160941.1: seq: 1961, VG 1961 (vg_gap=False), GFF 800 cdna_match=False/partial=True

So, GFF better, plus we get those cDNA matches etc we can use

davmlaw added a commit that referenced this issue Sep 30, 2021
davmlaw added a commit that referenced this issue Sep 30, 2021
@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 30, 2021

Working in branch "gene_import"

# Download and create data
~/localwork/variantgrid/genes/scripts/refseq/refseq_gene_annotation_grch37.sh
~/localwork/variantgrid/genes/scripts/refseq/refseq_gene_annotation_grch38.sh

~/localwork/variantgrid/genes/scripts/refseq/ensembl_gene_annotation_grch37.sh
~/localwork/variantgrid/genes/scripts/refseq/ensembl_gene_annotation_grch38.sh

TODO:

  • We're now inserting - NG_/XR- there's an issue about this - now solved?
  • Work out what’s going on in lengths not matching and no partial etc - any other way to tell?
  • PyReference pip release? Also say in README about supported GFFs?
  • PyReference 5PUTR - if already have five_prime_UTR as some GFFs do - then rename instead of calculate it? Also coding_start / coding_stop etc.

davmlaw added a commit that referenced this issue Sep 30, 2021
davmlaw added a commit that referenced this issue Oct 1, 2021
davmlaw added a commit that referenced this issue Oct 1, 2021
@davmlaw
Copy link
Contributor Author

davmlaw commented Oct 1, 2021

Checked in start of gap handling code to https://github.com/SACGF/hgvs

Using test data that worked with VEP:

2:73385942 A>T: VEP: NM_015120.4:c.74A>T - OLD pyhgvs: NM_015120.4(ALMS1):c.74A>T
2:73385943 A>T: VEP: NM_015120.4:c.78A>T - OLD pyhgvs: NM_015120.4(ALMS1):c.75A>T
2:73385944 G>C: VEP: NM_015120.4:c.79G>C - OLD pyhgvs: NM_015120.4(ALMS1):c.76G>C

from genes.models import GeneVersion, TranscriptVersion
from genes.hgvs import HGVSMatcher, GenomeBuild
transcript_version = TranscriptVersion.get("NM_015120.4", GenomeBuild.grch38(), 'R')
matcher = HGVSMatcher(GenomeBuild.grch38())

matcher._pyhgvs_get_variant_tuple("NM_015120.4(ALMS1):c.74A>T", transcript_version)
Out[5]: ('2', 73385942, 'A', 'T')

matcher._pyhgvs_get_variant_tuple("NM_015120.4(ALMS1):c.75A>T", transcript_version)
matcher._pyhgvs_get_variant_tuple("NM_015120.4(ALMS1):c.76A>T", transcript_version)
matcher._pyhgvs_get_variant_tuple("NM_015120.4(ALMS1):c.77A>T", transcript_version)
# These 3 die with ValueError: Coordinate inside insertion (I3) - no mapping possible!

matcher._pyhgvs_get_variant_tuple("NM_015120.4(ALMS1):c.78A>T", transcript_version)
Out[9]: ('2', 73385943, 'A', 'T')

Still TODO:

  • Need to handle the other way - ie genomic_to_cdna_coord
  • Need to test on -'ve strand, not sure I got those right way around.

davmlaw added a commit that referenced this issue Oct 11, 2021
@davmlaw
Copy link
Contributor Author

davmlaw commented Oct 15, 2021

Install instructions

export gene_annotation_dir=/tmp/gene_annotation
mkdir -p ${gene_annotation_dir}
cd ${gene_annotation_dir}
wget https://variantgrid.com/static/gene_annotation/pyhgvs_transcripts_ensembl_grch37.json.gz  https://variantgrid.com/static/gene_annotation/pyhgvs_transcripts_ensembl_grch38.json.gz  https://variantgrid.com/static/gene_annotation/pyhgvs_transcripts_refseq_grch37.json.gz  https://variantgrid.com/static/gene_annotation/pyhgvs_transcripts_refseq_grch38.json.gz
python3 manage.py import_gene_annotation --annotation-consortium=Ensembl --genome-build=GRCh37 --json-file ${gene_annotation_dir}/pyhgvs_transcripts_ensembl_grch37.json.gz  
python3 manage.py import_gene_annotation --annotation-consortium=Ensembl --genome-build=GRCh38 --json-file ${gene_annotation_dir}/pyhgvs_transcripts_ensembl_grch38.json.gz

python3 manage.py import_gene_annotation --annotation-consortium=RefSeq --genome-build=GRCh37 --json-file ${gene_annotation_dir}/pyhgvs_transcripts_refseq_grch37.json.gz
python3 manage.py import_gene_annotation --annotation-consortium=RefSeq --genome-build=GRCh38 --json-file ${gene_annotation_dir}/pyhgvs_transcripts_refseq_grch38.json.gz

davmlaw added a commit that referenced this issue Oct 18, 2021
davmlaw added a commit that referenced this issue Oct 18, 2021
davmlaw added a commit that referenced this issue Oct 18, 2021
davmlaw added a commit that referenced this issue Oct 20, 2021
@davmlaw
Copy link
Contributor Author

davmlaw commented Oct 20, 2021

Just testing those I had transcript sequences for (so didn't need to use API request to get more)

162510 total transcripts
  178 had alignment gap
    131 - can handle gaps via cDNA match
     43 - due to partial alignment
      4 - for unknown reasons: NR_110481.1, NM_001242902.1, NR_110480.1, NM_001242903.1

 4879 (3%) had length be different than transcript
    4879 - (100% of length diff) - were due to polyA tails

So polyA explains 99.93% of the length differences (not due to cDNA matches) and the transcripts look like they're polyA, here's the end sequence:

NM_001242903.1 (AKAP1/GRCh38)
AAAAAAAAAAAAAAAAAAAAAAAAAAAACCACTT
NR_110480.1 (LOC101927079/GRCh38)
CAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
NR_110481.1 (LOC101927079/GRCh38)
CAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
NM_001242902.1 (AKAP1/GRCh38)
AAAAAAAAAAAAAAAAAAAAAAAAAAAACCACTT

@EmmaTudini
Copy link
Contributor

@TheMadBug
Copy link
Member

Think this has been well tested by fire effectively.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants