Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved error messages for input data #52

Open
lczech opened this issue May 10, 2020 · 5 comments
Open

Improved error messages for input data #52

lczech opened this issue May 10, 2020 · 5 comments

Comments

@lczech
Copy link

lczech commented May 10, 2020

Hi there,
I'd suggest to improve some error messages:

panic!("there is no flag type for format");

The error simply states that

there is no flag type for format

and it took me quite a while digging through rust-htslib to figure out that it actually wants to say something like

Format "XX" was requested for inclusion in vcf-to-txt, but that format field does not appear in the input vcf file(s).

where "XX" is some format annotation shortcut for vcf files. The same probably applies for the info fields as well.

I'd provide a PR, but I don't know rust well enough to gauge whether the error message can occur in other circumstances as well, where my improved message would not be fitting.

All the best
Lucas

@lczech
Copy link
Author

lczech commented May 10, 2020

Addendum: For other types of missing fields, the errors get even more confusing. The above happens when calling rbt vcf-to-txt -g --fmt DP without having the format field DP in the vcf file.

However, if instead the AD field is ill formatted, I get a rust kernel panic:

thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', src/bcf/to_txt.rs:212:51

The behaviour can be reproduced with these files (produced by freebayes), and calling

cat bcftools-view.bcf | rbt vcf-to-txt -g --fmt DP AD --info ANN

using rust-bio-tools =0.6.0, as well as rust-bio-tools =0.10.0

The field is present in the file, so I don't know exactly what is causing the issue here... Would be nice to have a fix for that, because otherwise, I cannot use rbt with freebayes.

By the way, these errors occurred while adapting the dna-seq-gatk-variant-calling workflow to other types of callers (bcftools call, and freebayes), which seem to produce vcf files that confuse rbt. The error occurs in the stats rule for converting vcf to txt.

All the best and thanks in advance
Lucas

@johanneskoester
Copy link
Contributor

Thanks a lot for reporting. Indeed, error messages should be improved here. This is now much easier with the thiserror and anyhow crates. I currently don't have the capacity, but it is certainly something we would either be happy to get a PR on, or that we will start looking at as soon as possible.

@essut
Copy link
Contributor

essut commented Apr 17, 2022

Hi @lczech,

I tried to reproduce your error message (there is no flag type for format) by using the test file in tests/test.vcf and running the command rbt vcf-to-txt -g --fmt DP. However, rbt now prints the missing DP tag as empty instead of spitting out the error you described. Can you test on your file that emits this error and see if the fix helps in debugging or not? Thank you.

@essut
Copy link
Contributor

essut commented Apr 18, 2022

I can reproduce the error by supplying a INFO tag to --fmt, e.g. rbt vcf-to-txt --fmt T < tests/test.vcf where tag "T" is in the INFO field. As @lczech noted, the error message might not be specific to these instances.

I made a pull request (#243) to better inform the users of the error that occurred (at least now the offending tag is printed out instead of a generic message) and suggested the users to check their inputs. Hopefully @johanneskoester can chime in and give their view on this issue.

@essut
Copy link
Contributor

essut commented Apr 19, 2022

On the error about the "AD" tag, I just found out the reason why rust-bio-tools panicked.

The "AD" tag is defined as ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Number of observation for each allele">. Number=R informs the parsing library (rust-htslib) to expect two values in "AD".

However, the samples with missing values in vcf files generated from freebayes are encoded with .:.:.:.:.:.:.:. instead of .:.,.:.:.:.:.:.:.. The error thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', src/bcf/to_txt.rs:212:51 is caused by rbt trying to access this non-existent second value.

Here is the file successfully parsed after changing .:.:.:.:.:.:.:. to .:.,.:.:.:.:.:.:.:

VARIANT VARIANT VARIANT VARIANT VARIANT A       B
CHROM   POS     REF     ALT     QUAL    AD      AD
21      5038037 T       C       0.44    1       2
21      5038042 A       C       13.59   1       2
21      5082586 G       A       0.25    -2147483648     2
21      5118139 T       C       0.99    -2147483648     2
21      5119448 A       C       5.96    1       2
...

Maybe if the tag length is Alleles and an out of bounds error occurred, we can assume this is a missing value? Or maybe the upstream rust-htslib can be made to interpret a single missing value to represent the missing values for the available alleles in cases with Number=R?

As a side note, I noticed that the value for the "AD" tag is filled with the last allele value in each sample. Looking at the code, assume that this is intended? Also, the missing value is encoded as -2147483648, but I thought it should be encoded as (a blank space). Is this intended as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants