Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTA transcripts with gaps #480

Closed
davmlaw opened this issue Sep 13, 2021 · 44 comments
Closed

UTA transcripts with gaps #480

davmlaw opened this issue Sep 13, 2021 · 44 comments
Assignees
Labels

Comments

@davmlaw
Copy link
Contributor

davmlaw commented Sep 13, 2021

I went through the RefSeq GFFs and searched for any that had an alignment gap. I didn't do this for UTA transcripts.

An example is NM_001271466.3 for GRCh37 is from UTA, and has different length from GRCh38 - which doesn't have a gap. It's likely that the UTA version has a gap but just wasn't flagged.

If we can query UTA to look for affected transcripts (or they had a forum post with affected ones I think) then set them via a DB migration

@davmlaw davmlaw added the HGVS label Sep 13, 2021
@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 13, 2021

So while I can't find a way to flag it via UTA data, you can look up the length here:

https://www.ncbi.nlm.nih.gov/nuccore/NM_001271466.3

And if it doesn't match transcript_version.length then there must be a gap, so maybe do that as well. There are 15k UTA transcripts though. Could perhaps compare against 38 ones if RefSeq/no gap and then if not equals then contact the server.

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 13, 2021

Shariant test will go down so need to move onto laptop:

# These are from RefSeq and not marked as alignment_gap

other_lengths_37 = {}

for tv in TranscriptVersion.objects.filter(genome_build=GenomeBuild.grch37(), transcript__annotation_consortium='R').exclude(import_source__in=gai_qs).exclude(alignment_gap=True):
    other_lengths_37[tv.accession] = tv.length

other_lengths_38 = {}

for tv in TranscriptVersion.objects.filter(genome_build=GenomeBuild.grch38(), transcript__annotation_consortium='R').exclude(import_source__in=gai_qs).exclude(alignment_gap=True):
    other_lengths_38[tv.accession] = tv.length


need_to_examine_37 = []
need_to_examine_38 = []
for tv in tv_qs.filter(genome_build=GenomeBuild.grch37()):
    if other_length := other_lengths_38.get(tv.accession):
        if other_length == tv.length:
            continue
    need_to_examine_37.append(tv)

for tv in tv_qs.filter(genome_build=GenomeBuild.grch38()):
    if other_length := other_lengths_37.get(tv.accession):
        if other_length == tv.length:
            continue
    need_to_examine_38.append(tv)

transcripts = list(sorted(set(need_to_examine_37 | need_to_examine_38)))

refseq_data = {}
for i, accession in enumerate(transcripts):
    if i % 100 == 0:
        print(i)
    refseq_data[accession] = Entrez.efetch(db='nuccore', id=accession, rettype='gb', retmode='text')


for k, v in refseq_data.items():
    line = v.readline().strip()
    columns = line.split()
    if columns[3] == "bp":
        refseq_lengths[k] = int(columns[2])
    else:
        print(f"{k}: couldn't parse '{line}'")



this gives 37:

'NM_001330124.2, NM_001353208.1, NM_001287060.1, NM_001122964.2, NM_001322178.1, NM_001355014.1, NM_001355012.1, NM_001330239.1, NM_001362662.1, NM_001364143.1, NM_001291588.1, NM_001353797.1, NM_001330691.1, NM_001363686.1, NM_001354871.1, NM_001362793.1, NM_001302552.1, NM_001300813.1, NM_001303256.1, NM_001290058.1, NM_001322841.1, NM_001322840.1, NM_001349989.1, NM_130444.2, NM_001363068.1, NM_001363722.1, NM_001013742.3, NM_001102600.2, NM_001102599.2, NM_001102597.2, NM_001204204.2, NM_012429.4, NM_001794.4, NM_001252339.2, NM_001256540.2, NM_016391.7, NM_001281972.1, NM_001281971.1, NM_001301016.1, NM_001301775.1, NM_001301773.1, NM_001301774.1, NM_001301769.1, NM_001301771.1, NM_001305006.1, NM_001079808.2, NM_007159.3, NM_001305007.1, NM_001305005.1, NM_001330605.1, NM_080759.5, NM_006241.7, NM_006322.5, NM_006709.4, NM_025256.6, NM_006990.4, NM_001201404.2, NM_007105.3, NM_007213.2, NM_014001.4, NM_001172703.2, NM_138619.3, NM_014941.2, NM_016339.4, NM_001282850.1, NM_020463.3, NM_001271466.3, NM_032950.3, NM_144732.3, NM_178039.3, NM_178040.3, NM_181311.3, NM_181313.3, NM_000116.4, NM_181312.3, NM_198956.3, NM_198834.2, NM_001171907.2, NM_001171908.2, NM_178124.5, NM_001267547.2, NM_001267549.2, NM_003224.5, NM_001267544.2, NM_001267546.2, NM_001276282.3, NM_001290360.1, NM_001040432.2, NM_001184749.2, NM_173078.4, NM_001191058.2, NM_001191059.2, NM_001291839.1, NM_001136177.2, NM_000399.4, NM_001136179.2, NM_001313998.1, NM_001319948.1, NM_001302817.1, NM_001302816.1, NM_001324274.1, NM_001324277.1, NM_001324279.1, NM_001324275.1, NM_001324276.1, NM_001324278.1, NM_001324280.1, NM_001042734.2, NM_001306220.1, NM_001300761.1, NM_198963.2, NM_001281471.2, NM_001345872.1, NM_001300739.1, NM_001322181.1, NM_001322191.1, NM_001322190.1, NM_001322187.1, NM_001322188.1, NM_001322180.1, NM_001322179.1, NM_001322189.1, NM_001135746.2, NM_001135748.2, NM_001135747.2, NM_145043.3, NM_001349339.1, NM_001330383.1, NM_001330645.1, NM_001350884.1, NM_001351318.1, NM_001331005.1, NM_001352367.1, NM_001352349.1, NM_015011.2, NM_001198950.2, NM_001163022.2, NM_001352434.1, NM_001163023.2, NM_001352436.1, NM_018474.5, NM_001364395.1, NM_001170718.2, NM_001170720.2, NM_014567.4, NM_001170714.2, NM_001170715.2, NM_001170716.2, NM_001170717.2, NM_001355015.1, NM_001301025.2, NM_001330671.1, NM_001353205.1, NM_001353204.1, NM_001353202.1, NM_001353796.1, NM_001353794.1, NM_001353793.1, NM_001353791.1, NM_001354093.1, NM_001354600.2, NM_001354599.1, NM_000500.8, NM_000054.5, NM_001146151.2, NM_007242.6, NM_001362661.1, NM_001362660.1, NM_001363542.1, NM_001364380.1, NM_001364144.1, NM_001364140.1, NM_001364150.1, NM_001364146.1, NM_001364384.1, NM_001364389.1, NM_001364382.1, NM_001364381.1, NM_001364386.1, NM_001364383.1, NM_001364394.1, NM_001364727.1, NM_001572.4, NM_003684.6, NM_001135553.3, NM_198973.4, NM_005827.3, NM_012453.3, NM_020191.3, NM_031412.3, NM_001330520.1, NM_001330732.1, NM_182481.1, NM_001348192.1, NM_001348191.1, NM_001348266.1, NM_001130991.2, NM_006389.4, NM_001134368.3, NM_001291504.1, NM_001316325.1, NM_001291586.1, NM_001291587.1, NM_001291822.1, NM_001363740.1, NM_001293216.1, NM_001293215.1, NM_001293214.1, NM_001293212.1, NM_001293213.1, NM_001303465.1, NM_001304960.1, NM_001304959.1, NM_001304961.1, NM_001304958.1, NM_001304957.1, NM_003257.4, NM_175610.3, NM_005030.4, NM_004445.5, NM_005671.3, NM_006468.7, NM_007010.4, NM_015378.3, NM_018156.3, NM_018403.6, NM_020993.4, NM_001024808.2, NM_022553.5, NM_001278563.2, NM_001002266.2, NM_173477.4, NM_023946.3, NM_177457.4, NM_177477.3, NM_181985.3, NM_021250.3, NM_199478.2, NM_000533.4, NM_214461.2, NM_000682.6, NM_001114139.2, NM_001114135.3, NM_001114137.2, NM_001324019.1, NM_001286277.1, NM_001290207.1, NM_001290205.1, NM_001290215.1, NM_001290229.1, NM_001048201.2, NM_001282866.1, NM_001145414.3, NM_001300993.1, NM_002426.5, NM_005803.3, NM_147780.3, NM_147782.3, NM_147783.3, NM_001908.4, NM_147781.3, NM_153688.3, NM_001300842.1, NM_001329190.1, NM_001330219.1, NM_001330693.1, NM_001348081.1, NM_001349442.1, NM_001349439.1, NM_001349440.1, NM_001351289.1, NM_001351287.1, NM_001351283.1, NM_001351288.1, NM_001351285.1, NM_001351286.1, NM_001351282.1, NM_001351284.1, NM_001351291.1, NM_002073.3, NM_013244.4, NM_001201483.2, NM_001354872.1, NM_001362791.1, NM_001364278.1, NM_003211.5, NM_001300994.1, NM_001292026.1, NM_021599.3, NM_001080423.3, NM_001291581.1, NM_001297623.1, NM_001303257.1, NM_001304419.1, NM_001903.3, NM_005532.4, NM_005903.6, NM_001001419.2, NM_005931.4, NM_014262.4, NM_014518.3, NM_016180.4, NM_001012509.3, NM_021813.3, NM_030594.4, NM_153334.6, NM_182895.4, NM_176816.4, NM_182703.5, NM_001317237.1, NM_001470.3, NM_001288995.1, NM_138393.2, NM_001346748.1, NM_001346747.1, NM_001349982.1, NM_001349996.1, NM_001349991.1, NM_001349992.1, NM_001349990.1, NM_006985.3, NM_001354636.1, NM_001354075.1, NM_001354635.1, NM_024111.4, NM_001080978.3, NM_005874.4, NM_001146344.2, NM_001135769.2, NM_001135768.2, NM_006505.4, NM_001145440.2, NM_001330065.1, NM_001282398.1, NM_001291605.1, NM_001291604.1, NM_001291831.1, NM_001291986.1, NM_001297735.1, NM_001301224.1, NM_004188.5, NM_001167672.2, NM_018646.5, NM_021962.4, NM_173059.2, NM_003386.2, NM_175859.2, NM_019857.4, NM_001190965.2, NM_003453.4, NM_001190964.2, NM_001256299.2, NM_001272093.2, NM_001083585.2, NM_004703.5, NM_004518.5, NM_172109.2, NM_172108.4, NM_172106.2, NM_172107.3, NM_015425.4, NM_016529.5, NM_032631.3, NM_001001520.2, NM_001099673.2, NM_001348147.1, NM_001082578.2, NM_001135649.2, NM_006999.5, NM_001614.4, NM_001349949.1, NM_001362778.1, NM_001290135.1, NM_001330419.1, NM_001177519.2, NM_001301782.1, NM_001122665.2, NM_004879.4, NM_015430.3, NM_001001991.2, NM_018076.3, NM_000913.5, NM_033518.3, NM_080669.5, NM_001242366.2, NM_001110199.2, NM_001077446.3, NM_001145073.2, NM_001318045.1, NM_007028.4, NM_033401.4, NM_182647.3, NM_020699.3, NM_001330619.1, NM_031904.4, NM_032765.3, NM_182524.3, NM_052903.5, NM_053051.4, NM_001037144.6, NM_001362969.1, NM_001363069.1, NM_001363687.1, NM_001010855.3, NM_001030010.2, NM_001161473.2, NM_000694.3, NM_006866.3, NM_001291328.1, NM_001858.5, NM_005407.2, NM_006781.4, NM_012196.1, NM_012324.4, NM_014611.2, NM_016004.3, NM_019119.4, NM_024772.4, NM_025140.2, NM_032595.4, NM_032932.5, NM_002570.4, NM_138322.3, NM_001145303.2, NM_001128596.2, NM_153498.3, NM_178014.3, NM_178844.3, NM_014735.4, NM_001145064.2, NM_001004431.2, NM_000949.6, NM_001317796.1, NM_024596.4, NM_130901.2, NM_001110781.2, NM_080764.3, NM_001243538.2, NM_001348206.1, NM_001322103.1, NM_001355197.1, NM_001364838.1, NM_001297417.2, NM_001291642.1, NM_001291641.1, NM_001302862.1, NM_001303489.1, NM_002111.7, NM_005568.4, NM_003909.4, NM_014828.3, NM_015027.3, NM_015584.4, NM_016523.2, NM_020137.4, NM_032508.3, NM_058180.4, NM_145041.3, NM_173506.6, NM_178172.5, NM_000314.6, NM_001051.4, NM_024830.4, NM_013310.4, NM_001906.5, NM_001303523.1, NM_003575.3, NM_006890.4, NM_007231.4, NM_013364.5, NM_016429.3, NM_017570.4, NM_024302.4, NM_030930.3, NM_130445.3, NM_133373.4, NM_001002836.3, NM_006472.5, NM_004335.3, NM_032965.5, NM_001346810.1, NM_001350496.1, NM_017947.3, NM_001363511.1, NM_030783.2, NM_001012755.4, NM_001102598.2, NM_001256539.2, NM_001287062.1, NM_001287061.1, NM_001289951.1, NM_001352754.1, NM_001304457.1, NM_004866.5, NM_182700.5, NM_001134758.3, NM_001362759.1, NM_001362760.1, NM_001362761.1, NM_001289154.1, NM_001289153.1, NM_001289152.1, NM_001290309.1, NM_001290310.1, NM_182708.1, NM_001029998.4, NM_001291306.1, NM_001291307.1, NM_001317975.1, NM_001291308.2, NM_001291305.2, NM_001346049.1, NM_001346051.1, NM_001346048.1, NM_001346050.1, NM_001346140.1, NM_001349376.1, NM_001330761.1, NM_001349378.1, NM_001349381.1, NM_001349380.1, NM_001349377.1, NM_001349379.1, NM_001349427.1, NM_001350519.1, NM_001355428.1, NM_001350865.1, NM_001330995.1, NM_001330997.1, NM_001331006.1, NM_001331000.1, NM_001331002.1, NM_001330996.1, NM_001364391.1, NM_003441.3, NM_001352435.1, NM_001364379.1, NM_001301026.1, NM_001353162.1, NM_001353161.1, NM_001353203.1, NM_001353206.1, NM_001354376.1, NM_001354372.1, NM_001354375.1, NM_001354373.1, NM_001354374.1, NM_001354778.1, NM_001364385.1, NM_001330140.1, NM_001362663.1, NM_001364388.1, NM_001362781.1, NM_001362925.1, NM_001362926.1, NM_001364392.1, NM_001364377.1, NM_001363577.1, NM_001318737.2, NM_001363726.1, NM_001364390.1, NM_001322042.1, NM_001364149.1, NM_001364145.1, NM_001364148.1, NM_001364147.1, NM_001364141.1, NM_001364160.1, NM_001364393.1, NM_001364387.1, NM_001348289.1, NM_001348277.1, NM_001348280.1, NM_001348278.1, NM_001348279.1, NM_001348286.1, NM_001291505.1, NM_001289973.1, NM_001290051.1, NM_153837.2, NM_001304376.1, NM_004753.6, NM_024644.4, NM_001285445.1, NM_001285444.1, NM_001286278.1, NM_001286615.1, NM_001286633.1, NM_001287349.1, NM_001289174.1, NM_001290204.1, NM_001290206.1, NM_001290216.1, NM_001318471.1, NM_001318476.1, NM_001318472.1, NM_001318474.1, NM_001318473.1, NM_001318469.1, NM_001300833.2, NM_001330430.1, NM_001316668.1, NM_001348082.1, NM_001349441.1, NM_001330227.1, NM_018972.3, NM_001362792.1, NM_001363173.1, NM_001363172.1, NM_001363941.1, NM_001364279.1, NM_001364275.1, NM_001364277.1, NM_001364276.1, NM_001364885.1, NM_001300745.1, NM_001300744.1, NM_001318853.1, NM_001329556.1, NM_001291932.1, NM_001303018.1, NM_001304718.1, NM_001304717.2, NM_002356.6, NM_001282457.1, NM_001288960.1, NM_001288957.1, NM_001288954.1, NM_001288958.1, NM_001289161.1, NM_001290059.1, NM_001346749.1, NM_001303542.2, NM_001349995.1, NM_001349994.1, NM_001362840.1, NM_001362985.1, NM_001363700.1, NM_001080826.2, NM_001282506.1, NM_001290271.1, NM_001291274.1, NM_001352265.1, NM_001352266.1, NM_001291485.1, NM_001309516.1, NM_020422.5, NM_001346754.1, NM_001348386.1, NM_001349008.1, NM_001353498.1, NM_001363806.1, NM_003595.4, NM_001290060.1, NM_001291446.1, NM_001291447.1, NM_001301065.1, NM_015995.3, NM_024075.4, NM_001282658.1, NM_000804.3, NM_001313972.1, NM_001320049.1, NM_001348097.1, NM_001348364.1, NM_001330142.1, NM_001360869.1, NM_001361665.1, NM_001362435.1, NM_001363574.1, NM_001363876.1, NM_001291412.1, NM_005429.4, NM_144686.3, NM_001363770.1, NM_001329545.1, NM_001329544.1, NM_001282489.2, NM_001282535.1, NM_001290185.1, NM_001290264.1, NM_001291476.1, NM_001303481.1, NM_021088.3, NM_001017396.2, NM_001318511.1, NM_001321037.1, NM_001348271.1, NM_001361041.1, NM_001362177.1, NM_001318908.1, NM_001346232.1, NM_001363481.1, NM_001330124.1, NM_001330777.1, NM_001330778.1, NM_001290307.1, NM_001290342.1, NM_023078.5, NM_001330999.1, NM_001331004.1, NM_001354600.1, NM_001330228.1, NM_001330385.1, NM_001329214.1, NM_001330075.1, NM_001330611.1, NM_144694.2, NM_001291305.1, NM_001304423.1, NM_016391.6, NM_018283.2, NM_000314.5, NM_005020.3, NM_001191057.2, NM_001191056.2, NM_001353766.1, NM_001353767.1, NM_152603.3, NM_001290341.1, NM_001079807.2, NM_001304421.1, NM_001304420.1, NM_021815.3, NM_001304422.1, NM_001291308.1, NM_007040.4, NM_001142776.2, NM_003954.4, NM_006323.3, NM_001331003.1, NM_001331001.1, NM_001330998.1, NM_001353346.1, NM_001330139.1, NM_001978.3, NM_021646.2, NM_153257.3, NM_001271466.2, NM_001330394.1, NM_001301025.1, NM_001290214.1, NM_001303458.1, NM_001286479.1, NM_001303459.1, NM_032236.6, NM_001300883.1, NM_001330604.1, NM_001330670.1, NM_032138.5, NM_001306219.1, NM_153711.4, NM_001145414.2, NM_001304333.1'

and 38:

'NM_001353766.1, NM_001353767.1, NM_001329214.1, NM_001362177.1, NM_005827.3, NM_001330520.1, NM_001330605.1, NM_001330604.1, NM_001331005.1, NM_001331006.1, NM_001330997.1, NM_001330995.1, NM_001330996.1, NM_001331002.1, NM_000500.8, NM_001330998.1, NM_001331000.1, NM_001331001.1, NM_001135553.3, NM_001331004.1, NM_001330999.1, NM_003684.6, NM_198973.4, NM_001329556.1, NM_138393.2, NM_001330140.1, NM_001330139.1, NM_001330219.1, NM_001349949.1, NM_001330239.1, NM_001330385.1, NM_001330430.1, NM_001330691.1, NM_001330693.1, NM_001346140.1, NM_001349427.1, NM_001201483.2, NM_001353346.1, NM_001354075.1, NM_001354093.1, NM_001354778.1, NM_001349339.1, NM_015425.4, NM_001330383.1, NM_001330671.1, NM_001330645.1, NM_001330670.1, NM_001142776.2, NM_024111.4, NM_001330227.1, NM_001349008.1, NM_001614.4, NM_001330142.1, NM_001360869.1, NM_001361665.1, NM_001352367.1, NM_001352349.1, NM_001330394.1, NM_032236.6, NM_012196.1, NM_001350865.1, NM_001353498.1, NM_001330124.1, NM_001348097.1, NM_001351318.1, NM_001361041.1, NM_001330075.1, NM_001330228.1, NM_001330611.1, NM_001330777.1, NM_001330778.1, NM_032138.5, NM_001346232.1, NM_001331003.1, NM_000054.5, NM_001146151.2'

Will write a script to retrieve these off clingen and then the ones that don't match the length will put into data migration

TODO:

  • Go through transcripts and call length in stest / prod to find transcripts that need to be marked as alignment_gap = True
  • Check them against the UTA issue ones above

Also have a think about whether we want to add a new model that has transcript_version length etc - then we can retrieve them as we need via API and check things are good.

Thinking about it more, perhaps TranscriptVersion should be outside genome build and there is related object that holds the coordinates

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 14, 2021

Worked out a way to retrieve the transcript lengths via API, so we can pull them down and check for length, and if different from ours, mark them as alignment_gap=True

still todo:

  • Annotation instructions - optional - import_transcript_refseq_fasta
  • Transcript version page - show transcript version info if we have it - note how HGVS will be resolved etc.
  • Write a fix management command to go through and find potentially bad stuff - the UTA ones and code below:
  • manual migration - remind existing installation to run import_transcript_refseq_fasta, and after that's done - the fix command above

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 14, 2021

Deployment notes

Run in order (1 and 3 are in upgrade steps in right order, step 2 is a hack to save time)

  • import_refseq_transcript_fasta
  • Insert of genbank.txt.gz - see below:
  • fix_retrieve_transcript_version_sequence_info (this takes ages, so hopefully won't need to get much here)

genbank.txt is data I saved from my laptop, to save 1/2 an hour of API calls or so by copying around the data

And also here: sacgf.ersa.edu.au:/data/sacgf/admin/variantgrid_setup_data/genbank.txt.gz

# To write it
from Bio import SeqIO
from io import StringIO
import gzip
from genes.models import TranscriptVersionSequenceInfo

qs = TranscriptVersionSequenceInfo.objects.filter(api_response__isnull=False)
responses = list(qs.values_list("api_response", flat=True))
records = []
for api_response in responses:
    if '"id":"ENST' in api_response:
        print(f"Skipping {api_response}")
        continue
    with StringIO(api_response) as f:
        records.append(list(SeqIO.parse(f, "genbank"))[0])

with gzip.open("genbank.txt.gz", "wt") as f:
    SeqIO.write(records, f, "genbank")

To read them and insert:

from io import TextIOWrapper
import gzip
from genes.models import TranscriptVersionSequenceInfo

f = gzip.open("./genbank.txt.gz")
f2 = TextIOWrapper(f)
TranscriptVersionSequenceInfo._insert_from_genbank_handle(f2)

They have been copied to vg test /mnt/incoming - could put them in static and provide link here so that others can use them in diff environments

Testing

  • When you visit the transcript version page, we retrieve the sequence info via API if not there, and then set alignment_gap if appropriate. It then shows a warning message about how that works
  • Annotation page has a new section on transcript version sequence info

@EmmaTudini
Copy link
Contributor

Not sure if this is the right issue for this comment (so feel free to move if needed)

  • Before all of these changes, if a transcript version didn't exist, we'd bump it to the closest version for that genome build that did exist.
  • This variant - https://test.shariant.org.au/classification/classification/21230 was imported as NM_007294.6(BRCA1):c.1A>G and had originally resolved to NM_007294.4(BRCA1):c.1A>G.
  • Since the changes, it stays as version .6 and doesn't lift over. There is also no open flag against the allele or variant to show that something hasn't matched. The allele flag that was open with "
    Error: Transcript 'NM_007294.6' does not exist in the GRCh37 RefSeq database. (GRCh37)
    Error: Transcript 'NM_007294.6' does not exist in the GRCh38 RefSeq database. (GRCh38)" was closed during today's update

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 16, 2021

@EmmaTudini now we only go upwards with transcript versions.

We inserted the very latest RefSeq into Shariant test and there is no v6 in there. If you go to the RefSeq page for the transcript:

https://www.ncbi.nlm.nih.gov/nuccore/NM_007294

The most recent version is 4, like we have. There has never been a 5 or a 6

@EmmaTudini
Copy link
Contributor

EmmaTudini commented Sep 16, 2021

@davmlaw We used to go down in versions though? See text from transcript version change flag for that variant:

Admin Bot
admin
04:43 PM
In Progress
For c.hgvs NM_007294.6(BRCA1):c.1A>G the transcripts are:

NM_007294.6 (imported)
NM_007294.4 (resolved)
@TheMadBug You might be able to comment too?

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 16, 2021

We used to, now we only go up.

If someone enters a transcript version that doesn't exist (we check RefSeq via API) something went very wrong and I think it should fail.

@EmmaTudini
Copy link
Contributor

EmmaTudini commented Sep 16, 2021

@TheMadBug Is there any way to check whether this will cause issues in prod?

@davmlaw If we only check refseq for transcript versions, what happens to ensembl transcripts?

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 16, 2021

For Ensembl, we check the Ensembl API

To work out the differences in prod, I could export the HGVSs and then run some code to re-match them using the current system, and see if they go somewhere different (or fail)

@EmmaTudini
Copy link
Contributor

Ok so there shouldn't be any way that we're out of date and a new transcript version does exist?

I think that would be good. @TheMadBug unless there's another way that is easier?

@EmmaTudini
Copy link
Contributor

Also would a variant be rejected if a transcript version didn't exist at all or didn't exist for a specific genome build?

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 16, 2021

If the transcript didn't exist in our DB, we'd call the API then die with a bad transcript does not exist message.

If it doesn't exist for a particular genome build, current behavior is to skip it on the way to the ones that are in our genome build.

#481 change is to call ClinGen with that version anyway, and try to get coordinates for our build that way

@EmmaTudini
Copy link
Contributor

@davmlaw Just uploaded this example to test and it resulted in an error (it's from prod and will cause a lot of issues - there are few variants on this transcripts) - https://test.shariant.org.au/classification/classification/36847
We only store .5 for GRCh37 and .10 for GRCh38.

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 16, 2021

That works ok with ClinGen Allele Registry so should be ok once #481 is done

@EmmaTudini
Copy link
Contributor

@davmlaw Just to confirm - we now go straight to refseq or ensembl via the API for every transcript that we see? So even if a transcript and it's version has been seen before, we still check for any updates via the API?

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 20, 2021

@EmmaTudini there are no update information for transcript versions - if the sequence changed the version would.

We contact the API to see whether or not it exists, and when we do that we also pull the sequence down, which we can use to see if there are any alignment gaps

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 21, 2021

Clingen allele registry reference bases provided are not right aligned and this is causing changes compared to prod

Worked this one out, see: #492 ClinGen allele registry supplied non-normalized indel coordinates / bases at 1 point in time

Solution will be to re-retrieve the ClinGen data and then re-match everything.

@EmmaTudini
Copy link
Contributor

@davmlaw I asked James to pull out the number of transcripts with the alignment gap of true vs those where it was false just to make sure that they lined up as expected. Transcripts with a gap alignment of true represent about 11% of the total transcripts. Is this higher than expected?

85289 gap = true
667649 gap = false

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 21, 2021

Wow looks like it grew quite a lot since I retrieved all of the sequence lengths and set gaps on anything with a diff length

I have no idea how many to expect. It doesn't seem unreasonable that there's that many, though it's annoying

@EmmaTudini
Copy link
Contributor

EmmaTudini commented Sep 22, 2021

@davmlaw UCSC now has a refseq track called "Differences between NCBI RefSeq Transcripts and the Reference Genome". Which makes it super easy to look at alignment gaps.

So I looked at a couple of examples where there was an alignment gap of true in Shariant test and found that there are some transcripts that annotated as true that don't have an alignment gap according to the new UCSC track.

E.g. BAP1 NM_004656.4 (https://test.shariant.org.au/genes/view_transcript_version/NM_004656/4). Think this also applies to any of the Refseq BAP1 transcripts

This is the new track that shows gaps in GRCh37 - doesn't show any gaps (https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr3%3A52435020%2D52444121&hgsid=1170874253_RNevAq5NDinRsAGS3r9sLR61ahda)
and GRCh38 - https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr3%3A52401008%2D52410008&hgsid=1170874253_RNevAq5NDinRsAGS3r9sLR61ahda

I wonder if doing a diff of the refseq length via the API vs looking at how long it is against the genome is creating issues?

@SACGF SACGF deleted a comment from EmmaTudini Sep 22, 2021
@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 22, 2021

Moved Shariant test allele issues into:

https://github.com/SACGF/shariant-admin/issues/124

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 22, 2021

@EmmaTudini - If you look at the transcript version: https://test.shariant.org.au/genes/view_transcript_version/NM_004656/4

The sequence is 3600bp long, but the sum of the exons is 3599 for both genome builds. If there's no gap, where did the base go?

@EmmaTudini
Copy link
Contributor

EmmaTudini commented Sep 22, 2021

Where are you retrieving the total length? If you look at .3 and .2 the gap is even bigger. I was thinking that maybe the total length that you retrieve via the API might be incorrect?

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 22, 2021

The length is given as 3600bp on this page:

https://www.ncbi.nlm.nih.gov/nuccore/NM_004656.4

But the way I calculate length is to get the sequence via Fasta or API and then count the bases - it's always matched the page (as it should!)

@EmmaTudini
Copy link
Contributor

EmmaTudini commented Sep 22, 2021

@davmlaw I went back to the original file referenced in Shariant test -
GCF_000001405.39_GRCh38.p13_genomic.109.20210514.gff.gz

From having a quick look at NM_004656.4 - it seems like the bases referenced in Shariant test in the JSON dropdown are 1bp off the file. E.g. for exon 1 the coordinates in the file are - 52409842, 52410008 whereas the coordinates in shariant test are - 52409841, 52410008

Do you know why that's the case?

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 23, 2021

Coordinates used for computer are generally zero based half closed intervals.

Coordinates used for humans are often 1 based open intervals.

http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/

@EmmaTudini
Copy link
Contributor

EmmaTudini commented Sep 23, 2021

In this case, it seems to change the overall length of the transcript though as one exon does not get moved. See exon 17 in the file. If you calculate the length using the original coordinates from the GCF file, the length is 3600bp which matches the refseq website.

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 23, 2021

Good spotting - yes looks like the alignments can change over time so we should update them - moved to issue #494

@EmmaTudini
Copy link
Contributor

EmmaTudini commented Sep 23, 2021

@davmlaw Have been looking into whether having 8000 transcripts with an alignment gap makes sense and found this comment in the docs for a tool called "hgvs" - https://hgvs.readthedocs.io/en/stable/examples/using-hgvs.html?highlight=gap#projecting-in-the-presence-of-a-genome-transcript-gap

"As of Oct 2016, 1033 RefSeq transcripts in 433 genes have gapped alignments. These gaps require special handlingin order to maintain the correspondence of positions in an alignment. hgvs uses the precomputed alignments in UTA to correctly project variants in exons containing gapped alignments.".

I don't think it's feasible to believe that the number of gapped transcripts has grown by 85x in five years? What are your thoughts?

Also it looks like genomic.gff files from refseq annotate transcripts with gaps. E.g. ALMS1 there's a note saying - "Note=The RefSeq transcript has 1 non-frameshifting indel compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=ALMS1;inference=similar to RNA sequence%2C mRNA (same species)". Not sure how consistent these annotations are, but this might be an alternative to comparing the lengths?

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 27, 2021

Thanks for the HGVS link I added it to the wiki guide

Also it looks like genomic.gff files from refseq annotate transcripts with gaps

I know, see the 1st sentence of this issue :) UTA = Universal Transcript Archive - which is what's used by that HGVS project

The conversations are too long to work out what's going on - will handle everything in #494

@davmlaw
Copy link
Contributor Author

davmlaw commented Sep 27, 2021

FYI evaluated that library and it was super-complex to use and setup, and I couldn't get it to work locally - I raised a still open issue from 2018 - and some environments we can't make a direct call to their SQL database

I really wish it had worked, or I had spent time trying to fix that issue myself, I had to re-add much of that complexity to the other HGVS library and our transcripts loading etc

@davmlaw davmlaw reopened this Nov 3, 2021
@davmlaw
Copy link
Contributor Author

davmlaw commented Nov 3, 2021

Re-opened this as we need to handle these

select distinct tx_ac from uta_20180821.exon_set where exon_set_id in (select exon_set_id from uta_20180821.exon where exon_id in (select tx_exon_id from uta_20180821.exon_aln where cigar like '%D%' OR cigar like '%I%'));

And put this in a file /data/incoming I think

TODO:

  • Need to make a proper migration to set these
  • I set it to data["error"] but better to put into something that will show in alignment gap

@davmlaw
Copy link
Contributor Author

davmlaw commented Nov 11, 2021

I had run this by hand on test, first setting the version to be + 1000, then eventually setting it to error like below.

This was manually run on prod:

from classification.models import Classification                                                                            
from library.guardian_utils import admin_bot 
from classification.classification_import import reattempt_variant_matching 


f = open("/data/incoming/transcript_gaps.txt")                                                                               

transcripts = {s.strip() for s in f.readlines()}

for tv in TranscriptVersion.objects.filter(import_source__url__icontains='UTA'): 
    if tv.accession in transcripts: 
        tv.data["error"] = "UTA transcript has alignment gaps" 
        tv.save()

uta_classifications = [] 
user = admin_bot() 
for c in Classification.objects.all(): 
    if c.transcript in transcripts: 
        c.revalidate(user) 
        uta_classifications.append(c) 

qs = Classification.objects.filter(pk__in=[c.pk for c in uta_classifications])
reattempt_variant_matching(admin_bot, qs)

@davmlaw
Copy link
Contributor Author

davmlaw commented Nov 11, 2021

Before - I rematched every classification that had a transcript marked by UTA as having gaps (which was a lot)
I should have only rematched those that were from UTA in our database (we only have UTA for transcripts we don't have other copies for)

I think it set a lot of classifications to have no variant before I rematched anyway, but here's the script if we ever need to run it again:

from django.db.models import Q
from classification.models import Classification                                                                            
from library.guardian_utils import admin_bot 
from classification.classification_import import reattempt_variant_matching 
from genes.models import TranscriptVersion


f = open("/data/incoming/transcript_gaps.txt")                                                                               
uta_gap_transcripts = {s.strip() for s in f.readlines()}
our_uta_transcripts = set()


for tv in TranscriptVersion.objects.filter(import_source__url__icontains='UTA'):
    if tv.accession in uta_gap_transcripts:
        our_uta_transcripts.add(tv.accession)


uta_classifications = [] 
user = admin_bot() 
for c in Classification.objects.all(): 
    if c.transcript in our_uta_transcripts: 
        uta_classifications.append(c) 


qs = Classification.objects.filter(Q(variant__isnull=True) | Q(pk__in=[c.pk for c in uta_classifications]))
reattempt_variant_matching(user, qs)

@davmlaw
Copy link
Contributor Author

davmlaw commented Nov 12, 2021

When writing this, and setting errors:

  • Only include the transcripts that would end up being set for UTA to massively shrink the list.
  • check whether "cdna_match" appears in the transcript version data, as we may one day convert the UTA transcripts to our format so we can handle the gaps

@davmlaw
Copy link
Contributor Author

davmlaw commented Nov 12, 2021

This was patched by hand in upgrade on Nov 11

Running the data migration over the manual fix shouldn't change what's used for local HGVS resolution (assuming we haven't uploaded fixed UTA transcripts yet), but it should add the message in "alignment gap" on the transcript version page.

@davmlaw
Copy link
Contributor Author

davmlaw commented Nov 15, 2021

The data migration makes a slight change over what was done in prod manually.

If you go to a UCSC/UTA transcript with a gap, it should say in "alignment gap" what the error is eg:

https://test.shariant.org.au/genes/view_transcript_version/NM_032932/5

Not only doesn't work but it says in "Incorrectly converted UTA/UCSC transcript is missing alignment info"

@EmmaTudini
Copy link
Contributor

Think this has already been done in prod?

@EmmaTudini
Copy link
Contributor

Does anything else need to be done? Assuming that any new UTA transcripts will auto be assigned the gap

@davmlaw
Copy link
Contributor Author

davmlaw commented Dec 2, 2021

The original fix was deployed to prod, though there was a change to give slightly more info in test that isn't in prod yet, see:

#480 (comment)

@davmlaw davmlaw assigned EmmaTudini and unassigned davmlaw Dec 2, 2021
@EmmaTudini
Copy link
Contributor

EmmaTudini commented Dec 3, 2021

@davmlaw Looks good in test, but not sure I understand what you mean by "Incorrectly converted UTA/UCSC transcript is missing alignment info". Why is it incorrectly converted? I thought that we just didn't convert UTA/UCSC transcripts?

@davmlaw
Copy link
Contributor Author

davmlaw commented Dec 15, 2021

UTA has exon coordinates and alignment gaps (in their own format)

When I converted UTA transcripts to our format long ago, I didn't take the alignments, so it doesn't handle gaps properly.

So, it was correct on their end, but I converted it incorrectly

@davmlaw
Copy link
Contributor Author

davmlaw commented May 9, 2022

This has been in prod for a long time, and we'll be bringing in UTA transcripts with gap handling code in #556 (CDOT) and #515 (UTA gaps)

@davmlaw davmlaw closed this as completed May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants