I tried to run parseVEP on an 800M TSV covering all of mouse dbSNP, located here: /hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/Giansanti_Mouse/processed_data/mpg_snp_indel/VEP/dbSNP150_GCF_000001635.24-All.tsv.gz
parseVEP was Killed, and I assume that is because it tries to read the entire file into memory as the first step. That is probably unnecessary for parsing; could it be switched to reading the file line by line?
parseVEP does read and process records line by line; the issue is that they are not written until the very end. The only reason for this is that variants are sorted before being written to disk. However, the sorting is done at the transcript level, and the input VCF to VEP is always sorted, which means records mapped to the same gene are always next to each other. So we can actually write records to the GVF as soon as the current transcript ID or gene ID changes.
Update: it turns out this is not the case, particularly when the same position maps to multiple genes. Maybe we can sort the VEP file by gene ID before parsing.
There doesn't seem to be an out-of-the-box way to sort a very large file on disk in Python. One possible solution is to split the file into chunks, sort each chunk, and then merge the sorted chunks. The easiest solution still seems to be GNU sort. We could implement a sorted VEP parser that handles a VEP file that is already sorted, and add a flag such as --sorted to enable this behavior; if the flag is not specified, the original parser, which writes all results at the end, is used.
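For reference, here is a rough sketch of the split/sort/merge idea combined with a parser that flushes one gene at a time. It is not moPepGen code: `external_sort`, `gene_key`, `iter_gene_groups`, and `GENE_COLUMN` are hypothetical names, and it assumes the gene ID sits in the fourth tab-separated column of the VEP output and that header lines starting with `#` have already been filtered out.

```python
import heapq
import itertools
import os
import tempfile
from typing import Callable, Iterable, Iterator, List, Tuple

# Assumption: gene ID is the 4th tab-separated column of each VEP record.
GENE_COLUMN = 3


def gene_key(line: str) -> str:
    """Return the gene ID used as the sort and grouping key."""
    return line.rstrip("\n").split("\t")[GENE_COLUMN]


def _write_sorted_chunk(lines: List[str], key: Callable[[str], str],
                        tmp_dir: str) -> str:
    """Sort one in-memory chunk and write it to a temporary file."""
    lines.sort(key=key)
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".chunk", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.writelines(lines)
    return path


def _read_lines(path: str) -> Iterator[str]:
    """Lazily yield lines from a sorted chunk file."""
    with open(path) as handle:
        yield from handle


def external_sort(lines: Iterable[str], key: Callable[[str], str],
                  chunk_size: int = 100_000) -> Iterator[str]:
    """Yield lines in key order while holding at most one chunk in memory.

    Every chunk file is open during the merge, so a very small chunk_size
    on a huge input could hit the open-file limit.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths: List[str] = []
        chunk: List[str] = []
        for line in lines:
            if not line.endswith("\n"):
                line += "\n"
            chunk.append(line)
            if len(chunk) >= chunk_size:
                paths.append(_write_sorted_chunk(chunk, key, tmp_dir))
                chunk = []
        if chunk:
            paths.append(_write_sorted_chunk(chunk, key, tmp_dir))
        # k-way merge of the already sorted chunk files
        yield from heapq.merge(*(_read_lines(p) for p in paths), key=key)


def iter_gene_groups(sorted_lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (gene_id, records) once all records of a gene have been seen."""
    for gene_id, group in itertools.groupby(sorted_lines, key=gene_key):
        yield gene_id, list(group)
```

With something like this, the `--sorted` code path would only need `iter_gene_groups` and could write each gene's records to the GVF as soon as the group ends, while unsorted input could be passed through `external_sort` first (for example, lines read from `gzip.open(path, "rt")` with the `#` header lines skipped). Pre-sorting with GNU sort outside Python would make the chunking part unnecessary.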