
parseVEP crashes when reading extremely large files #683

Open
lydiayliu opened this issue Feb 28, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@lydiayliu
Collaborator

I tried to run parseVEP on an 800M TSV containing all of mouse dbSNP, here:
/hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/Giansanti_Mouse/processed_data/mpg_snp_indel/VEP/dbSNP150_GCF_000001635.24-All.tsv.gz

parseVEP was Killed, and I assume it is because it tries to read the entire file in as the first step. This is probably unnecessary for parsing — could it be switched to reading the file line by line?
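For reference, a minimal sketch of what line-by-line reading could look like — the function name, record fields, and tab-splitting here are assumptions for illustration, not parseVEP's actual implementation:

```python
import gzip

def iter_vep_records(path):
    """Yield one VEP record at a time instead of loading the whole file.

    Hypothetical sketch: the splitting and header handling are
    placeholders for whatever parseVEP actually does per record.
    """
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt') as handle:
        for line in handle:
            if line.startswith('#'):  # skip VEP header/comment lines
                continue
            yield line.rstrip('\n').split('\t')
```

Memory then stays bounded by the size of one record rather than the whole file.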

@lydiayliu lydiayliu added the enhancement New feature or request label Feb 28, 2023
@zhuchcn
Member

zhuchcn commented Feb 28, 2024

parseVEP does read and process records line by line, but the issue is that they are not written until the very end. The only reason for this is that variants are sorted before being written to disk. The sorting is done at the transcript level, and the input VCF to VEP is always sorted, which means records mapped to the same gene are always next to each other. So we could actually write records to GVF as soon as the current transcript ID or gene ID changes.

Update: it turns out this is not the case, particularly because the same position can be mapped to multiple genes. Maybe we can sort the VEP file by gene ID before parsing.
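The flush-on-ID-change idea could be sketched roughly like this (record layout and key function are hypothetical) — with the caveat from the update that it is only correct when all records for a gene are contiguous in the input, which does not always hold:

```python
from itertools import groupby

def write_gvf_streaming(records, outfile, key=lambda rec: rec[0]):
    """Sort and flush each gene's variants as soon as the gene ID changes.

    Assumes records for a gene are contiguous in the input -- the
    assumption that fails when one position maps to multiple genes.
    """
    for gene_id, group in groupby(records, key=key):
        # only one gene's records are held in memory at a time
        for rec in sorted(group):
            outfile.write('\t'.join(rec) + '\n')
```

If the input were first sorted by gene ID, this streaming writer would again become valid.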

@lydiayliu
Collaborator Author

Ahh got it. How bad is it for parseVEP to write a GVF that is not sorted?

@zhuchcn
Member

zhuchcn commented Feb 28, 2024

There doesn't seem to be an out-of-the-box solution in Python for sorting a very large file on disk. A possible approach is splitting the file into chunks, sorting each chunk, and then merging the sorted chunks. The easiest solution still seems to be GNU sort. We could implement a sorted VEP parser that handles the case where the VEP file is already sorted, enabled with a flag such as --sorted; when the flag is not given, fall back to the original parser that writes all results at the end.
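For what it's worth, the chunk-sort-merge approach described above is not hard to build from the standard library using `heapq.merge` (this is essentially what GNU sort does internally). A sketch, assuming one tab-separated line per record with the sort key in the first field:

```python
import heapq
import os
import tempfile

def _spill(sorted_chunk):
    """Write one sorted chunk of lines to a temp file; return its path."""
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, 'w') as f:
        f.writelines(sorted_chunk)
    return path

def external_sort(lines, chunk_size=100_000, key=lambda l: l.split('\t')[0]):
    """Disk-backed sort: sort fixed-size chunks in memory, spill each to a
    temp file, then lazily k-way merge them with heapq.merge."""
    chunk_paths, chunk = [], []
    for line in lines:  # each line must end with '\n'
        chunk.append(line)
        if len(chunk) >= chunk_size:
            chunk_paths.append(_spill(sorted(chunk, key=key)))
            chunk = []
    if chunk:
        chunk_paths.append(_spill(sorted(chunk, key=key)))
    files = [open(p) for p in chunk_paths]
    try:
        yield from heapq.merge(*files, key=key)
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```

Peak memory is bounded by `chunk_size` lines regardless of the total file size, which would cover the 800M-TSV case that triggered this issue.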
