Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capture uploaded allele correctly for VCF input #1744

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

likhitha-surapaneni
Copy link
Contributor

@likhitha-surapaneni likhitha-surapaneni commented Aug 20, 2024

Ticket: ENSVAR-5858

  • Fix Unit tests
  • Review header and output columns

To test:

  • With --allele_number for multi-allelic examples

@likhitha-surapaneni likhitha-surapaneni marked this pull request as draft August 20, 2024 10:50
@likhitha-surapaneni likhitha-surapaneni marked this pull request as ready for review August 21, 2024 09:20
@nakib103 nakib103 self-requested a review August 21, 2024 09:27
modules/Bio/EnsEMBL/VEP/Parser.pm Outdated Show resolved Hide resolved
modules/Bio/EnsEMBL/VEP/Parser.pm Show resolved Hide resolved
@nuno-agostinho nuno-agostinho removed their request for review September 6, 2024 12:47
@likhitha-surapaneni likhitha-surapaneni changed the base branch from postreleasefix/113 to main December 4, 2024 15:30
modules/Bio/EnsEMBL/VEP/Parser.pm Outdated Show resolved Hide resolved
modules/Bio/EnsEMBL/VEP/OutputFactory.pm Show resolved Hide resolved
t/OutputFactory.t Outdated Show resolved Hide resolved
Comment on lines +970 to +971
# Updating a flag to minimise multi-allelic variants in split_variants/rejoin_variants
$vf->{minimised} = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-allelic is not getting minimised for default format. For example - 1 961320 961324 GCAGG/GCA/GCAG +

But in the output still getting MINIMISED=1, (without the PR they are also not minimised but there is no MINIMISED=1).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nakib103 , can you please test this example with the latest commit. The allele is expected to be similar to when running --minimal

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @likhitha-surapaneni,

Yes, the allele are same when running with or without --minimal.
But the original problem remains. The output says it is minimised but when it is not -

1_961320_GCAGG/GCA/GCAG	1:961320-961324	-	ENSG00000188976	ENST00000327044	Transcript	upstream_gene_variant	-	-UPLOADED_ALLELE=GCAGG/GCA/GCAG;IMPACT=MODIFIER;DISTANCE=2067;STRAND=-1;MINIMISED=1

Copy link
Contributor Author

@likhitha-surapaneni likhitha-surapaneni Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nakib103 , the output is still minimised if you notice the 3rd column (correct me if I got it wrong). We have the allele "-" as the minimised representation for first alternative allele is GG/- and for the second one is G/-. The problem however is that there is no way to differentiate between the alternative alleles as both show "-". Ideally minimal representation should be GG/-/G. This is also an existing problem with the option --minimal and needs to be addressed probably in a future ticket. Please let me know if this makes sense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually looking at the Uploaded_variation column, it does not seem to be minimised as it does for bi-allelic variants (see - 1 961320 961324 GCAGG/GCA +).

But the Allele column shows the alleles are minimised. It seems Uploaded_variation string has different logic, should we address this?

Copy link
Contributor Author

@likhitha-surapaneni likhitha-surapaneni Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree @nakib103. The uploaded variation was not expected to minimise alleles but it was minimising the alleles in some cases due to the way we populate the column and it was difficult to correct this as it is also the default option, hence the approach was taken to capture original allele with a different flag (--uploaded_allele). However, I agree it may also be worth investigating why uploaded variation is minimising the allele in some cases and not doing it in other cases.

@likhitha-surapaneni likhitha-surapaneni changed the base branch from main to postreleasefix/114 December 17, 2024 13:44
@dglemos dglemos requested review from dglemos and nuno-agostinho and removed request for nuno-agostinho January 30, 2025 09:57
@dglemos dglemos self-assigned this Jan 30, 2025
@dglemos
Copy link
Contributor

dglemos commented Jan 31, 2025

The MINIMISED is missing from the header, so in the VCF output even when the variant is minimised this info is not in the file. The only outputs with this info are the default and json.

@dglemos
Copy link
Contributor

dglemos commented Jan 31, 2025

When the input is in VCF format, example:
1 961320 . G GCAGGCTCGGCC . . .

In the vep default output, the Uploaded_variation is represented with minimised alleles:
1_961321_-/CAGGCTCGGCC
In this scenario matching the alleles is more difficult.

@likhitha-surapaneni
Copy link
Contributor Author

When the input is in VCF format, example: 1 961320 . G GCAGGCTCGGCC . . .

In the vep default output, the Uploaded_variation is represented with minimised alleles: 1_961321_-/CAGGCTCGGCC In this scenario matching the alleles is more difficult.

Hi @dglemos , yes this is right. The idea is to use --uploaded_allele to extract the original allele through UPLOADED_ALLELE=G/GCAGGCTCGGCC (docs). Please let me know if that would be sufficient to match the alleles.

@dglemos
Copy link
Contributor

dglemos commented Feb 4, 2025

Yes, UPLOADED_ALLELE is enough to match the alleles. It should be returned by default though, if the variant is minimised it should return the original alleles too.

@nakib103
Copy link
Contributor

nakib103 commented Feb 4, 2025

Yes, UPLOADED_ALLELE is enough to match the alleles. It should be returned by default though, if the variant is minimised it should return the original alleles too.

I think it is generally advisable to not update the default output list so not to break existing pipelines.

@likhitha-surapaneni likhitha-surapaneni changed the base branch from postreleasefix/114 to main February 6, 2025 14:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants