discrepancy with running mhcflurry-predict on not supported alleles #210
Comments
Looks like the first two alleles you mentioned are getting canonicalized by mhcgnomes to other allele names, whereas the third is genuinely unsupported. We should take a look at the sequences in IMGT for the first two alleles and what they are getting mapped to (HLA-A02:172 -> HLA-A02:17, HLA-C02:16 -> HLA-C02:137) to understand whether this canonicalization is reasonable or a bug. I will add this to my todo list, but if you have a chance to do it first, please let us know what you see. @iskandr, who wrote mhcgnomes, may also have thoughts on this.
I'm not sure why A02:172 is getting mapped to A02:17, as they are different alleles with different protein sequences. I did note that C02:16:01 was renamed to C02:137 from IMGT's Deleted_alleles.txt. However, C02:16 shouldn't be mapped to C012:137. As for C12:139, this allele was changed from C12:139 in IMGT v3.38.0 to C*12:139Q in v3.39.0.
@iskandr do you think the following is a mhcgnomes bug:
@liviatran I am seeing that C02:16:01 is getting canonicalized to HLA-C02:137, which from your comment I think is the correct canonicalization, right? (I.e. I am not seeing it getting canonicalized to C012:137.)
This does seem like a bug.
Checking the IMGT/HLA allele history entry, I see:
I'm trying to figure out how this "normalization" happens and my guess is that it's via this entry:
...which links "A*02172" with "A*02:17:02:01". I'm going to figure out where in the logic "A*02:172" gets checked against "A*02172" and make it more cautious.
@timodonnell The answer to the C02:16 and C02:137 canonicalization question depends on what mhcflurry is evaluating. Is it evaluating the whole protein structure, which in some cases would affect peptide binding? In that case, the two proteins are different. Is it evaluating only the peptide binding domain, which is encoded by exons 2 and 3? In that case, the protein sequences encoded by exons 2 and 3 are the same for the two alleles.
@iskandr Perhaps mhcgnomes.parse should only look at version 3 (colon-delimited) HLA allele names to avoid these name collisions. For the non-colon-delimited (versions 1 and 2) allele names, the names have to be evaluated in pairs of digits. There would never be a version 1 or 2 allele name with three digits in the first field. *Caveat: if anyone is analyzing version 1 allele names (which is not recommended), almost all the time the alleles have four or five digits instead of four or six digits.
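To illustrate the pair-wise reading described above, here is a minimal sketch. It is not mhcgnomes code: `allele_fields` is a hypothetical helper, and treating a leftover single digit as its own field is an assumption made only to cover five-digit version 1 names.

```python
import re


def allele_fields(name):
    """Return (gene, fields) for an HLA allele name.

    Colon-delimited (version 3) names are split on ':'.
    Compact (version 1/2) names are read two digits at a time, so a
    three-digit field can never come out of a compact name.
    """
    match = re.fullmatch(r"(?:HLA-)?([A-Z0-9]+)\*([0-9:]+)", name)
    if match is None:
        raise ValueError(f"unrecognized allele name: {name!r}")
    gene, digits = match.groups()
    if ":" in digits:
        fields = digits.split(":")
    else:
        # Assumption: a trailing single digit (five-digit version 1
        # names) is kept as its own short field.
        fields = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    return gene, fields


old_style = allele_fields("A*02172")   # compact, read in pairs
new_style = allele_fields("A*02:172")  # colon-delimited
print(old_style)               # ('A', ['02', '17', '2'])
print(new_style)               # ('A', ['02', '172'])
print(old_style == new_style)  # False
```

Read this way, the compact name "A*02172" and the colon-delimited name "A*02:172" never compare equal, which is the collision discussed in this thread.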
I ran
mhcflurry-predict --list-supported-alleles > 'supported_alleles.txt'
to get a list of alleles supported by mhcflurry. I noted that HLA-A02:172, HLA-C02:16, and HLA-C12:139 were not in that list of supported alleles. The aforementioned alleles are all Well-Documented alleles (https://www.ihiw18.org/component-immunogenetics/download-common-and-well-documented-alleles-3-0).
I was able to successfully run mhcflurry-predict on HLA-A02:172 and HLA-C02:16, despite them not being on the list of supported alleles. HLA-C12:139 caused a ValueError.
Are the results for the unsupported alleles (HLA-A02:172, HLA-C02:16) reliable?
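The check against the saved list can be sketched as below. This is only an illustration: `split_by_support` is a hypothetical helper, the sample lines are made up, and it assumes the file holds one allele name per line. The comparison is on exact strings, so an allele that the predictor renames before lookup will still show up as missing here.

```python
def split_by_support(requested, supported_lines):
    """Split requested allele names into (found, missing).

    supported_lines is any iterable of lines, e.g. an open file.
    Blank lines and surrounding whitespace are ignored.
    """
    supported = {line.strip() for line in supported_lines if line.strip()}
    found = [name for name in requested if name in supported]
    missing = [name for name in requested if name not in supported]
    return found, missing


# Made-up sample lines standing in for supported_alleles.txt.
# With the real file: split_by_support(names, open("supported_alleles.txt"))
sample_lines = ["HLA-A*02:01\n", "HLA-A*02:17\n", "\n", "HLA-C*02:137\n"]
found, missing = split_by_support(["HLA-A*02:01", "HLA-A*02:172"], sample_lines)
print(found)    # ['HLA-A*02:01']
print(missing)  # ['HLA-A*02:172']
```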