Index fails if more than one empty line #160
Comments
Thanks so much for giving me a reproducible example. I'll have to consider what the expected behavior should be. So far I've been using the behavior of samtools (htslib) as a guide. In this case samtools does not seem to do the correct thing:
So, judging by the index that samtools creates, it gets confused about how many sequence characters and total characters are on each line. In this case, however, the math works out correctly for the second entry. Whether that is purposeful or a happy accident, everything is fine when you fetch the sequence:
Let's see what happens when we use that index with pyfaidx:
So far so good. What about getting the actual sequences?
Other than the questionable naming of the empty sequence slice, it looks like the pyfaidx sequence retrieval code handles the samtools index just fine. It's also worth noting that the correct result is returned from the
I'm guessing samtools has similar logic, aside from the "long name", which is specific to this package. After looking at recent samtools releases, I noticed that since version 1.9 there has been more error checking around empty sequences (samtools/samtools#834), prompted by a similar issue (samtools/samtools#513). TL;DR: I think I need to fix the indexing code in pyfaidx to produce output similar to samtools in these cases, and should consider adding warnings to the
I also like the description of indexable FASTA files in the samtools documentation, but it says nothing about empty lines:
It's interesting that empty lines are not specified. I would think the script should ignore them, since they are invalid content, but pyfaidx should work as expected even on these weird files. Just for clarity: I do not think this is really important, as it is obviously an edge case caused by another bioinformatics tool producing the wrong output. Sadly, I cannot file a bug report with them.
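For context on the per-line arithmetic discussed above, here is a minimal sketch of how a faidx-style index entry maps a base position to a byte offset. The five columns (name, length, offset, bases per line, bytes per line) follow the samtools faidx documentation; the function name `base_offset` and the sample record are illustrative assumptions, not code from pyfaidx or samtools.

```python
def base_offset(offset: int, linebases: int, linewidth: int, i: int) -> int:
    """Return the byte position of 0-based base i within the FASTA file.

    offset    -- byte position of the record's first base
    linebases -- bases on each full sequence line
    linewidth -- bytes on each full sequence line, including the newline
    """
    return offset + (i // linebases) * linewidth + (i % linebases)


# Illustrative record (an assumption): 8 bases wrapped at 4 per line, Unix newlines.
data = b">seq1\nACGT\nTTGA\n"
# The matching index entry would be: seq1  8  6  4  5
pos = base_offset(offset=6, linebases=4, linewidth=5, i=5)
base = data[pos:pos + 1]
```

A blank line holds zero bases in one byte, so it cannot share the `linebases`/`linewidth` pair of the other lines in a record. If it sits inside a sequence, every base after it is shifted by a byte that this formula does not account for.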
Following the issue here: #159
I found a new bug with a corner-case FASTA file containing two empty lines.
I made a minimal (not) working example here (version 0.5.8)
In this version of the bug, adding an empty line to an already empty sequence causes the parser to fail.
Note: When debugging this, it sometimes seems to fail only if no .fai file exists already.
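Input of the kind described in this report can be flagged before indexing. The following is a minimal sketch using only the standard library; `find_blank_lines` is a hypothetical helper, not part of pyfaidx, and the sample bytes are an assumed stand-in for the reporter's file, which is not reproduced here.

```python
import io
from typing import BinaryIO, List


def find_blank_lines(handle: BinaryIO) -> List[int]:
    """Return 1-based line numbers of lines that contain only whitespace."""
    return [n for n, line in enumerate(handle, start=1) if not line.strip()]


# Assumed input: an empty record followed by two blank lines, then a normal record.
sample = io.BytesIO(b">empty\n\n\n>seq2\nACGT\n")
blank = find_blank_lines(sample)
```

Running a check like this before building the index makes it easy to tell whether a failure comes from blank lines or from something else in the file.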