
parseVEP crashes when reading extremely large files #683

Open
lydiayliu opened this issue Feb 28, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@lydiayliu
Collaborator

I tried to run parseVEP on an 800M TSV containing all of mouse dbSNP, here:
/hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/Giansanti_Mouse/processed_data/mpg_snp_indel/VEP/dbSNP150_GCF_000001635.24-All.tsv.gz

parseVEP was Killed, and I assume it is because it tries to read the entire file in as the first step. This is probably unnecessary for parsing — could it be switched to reading the file line by line?
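For reference, a minimal sketch of what line-by-line reading could look like — the function name, record fields, and tab-splitting here are assumptions for illustration, not parseVEP's actual implementation:

```python
import gzip

def iter_vep_records(path):
    """Yield one VEP record at a time instead of loading the whole file.

    Hypothetical sketch: the splitting and header handling are
    placeholders for whatever parseVEP actually does per record.
    """
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt') as handle:
        for line in handle:
            if line.startswith('#'):  # skip VEP header/comment lines
                continue
            yield line.rstrip('\n').split('\t')
```

Memory then stays bounded by the size of one record rather than the whole file.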

@lydiayliu lydiayliu added the enhancement New feature or request label Feb 28, 2023
@zhuchcn
Member

zhuchcn commented Feb 28, 2024

parseVEP does read and process records line by line, but the issue is that they are not written until the very end. The only reason for this is that variants are sorted before being written to disk. The sorting is done at the transcript level, and the input VCF to VEP is always sorted, which means records mapped to the same gene are always next to each other. So we could actually write records to GVF as soon as the current transcript ID or gene ID changes.

Update: it turns out this is not the case, particularly because the same position can be mapped to multiple genes. Maybe we can sort the VEP file by gene ID before parsing.
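The flush-on-ID-change idea could be sketched roughly like this (record layout and key function are hypothetical) — with the caveat from the update that it is only correct when all records for a gene are contiguous in the input, which does not always hold:

```python
from itertools import groupby

def write_gvf_streaming(records, outfile, key=lambda rec: rec[0]):
    """Sort and flush each gene's variants as soon as the gene ID changes.

    Assumes records for a gene are contiguous in the input -- the
    assumption that fails when one position maps to multiple genes.
    """
    for gene_id, group in groupby(records, key=key):
        # only one gene's records are held in memory at a time
        for rec in sorted(group):
            outfile.write('\t'.join(rec) + '\n')
```

If the input were first sorted by gene ID, this streaming writer would again become valid.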

@lydiayliu
Collaborator Author

Ahh got it. How bad is it for parseVEP to write a GVF that is not sorted?

@zhuchcn
Member

zhuchcn commented Feb 28, 2024

There doesn't seem to be an out-of-the-box solution in Python for sorting a very large file on disk. A possible approach is splitting the file into chunks, sorting each chunk, and then merging the sorted chunks. The easiest solution still seems to be GNU sort. We could implement a sorted VEP parser that handles the case where the VEP file is already sorted, enabled with a flag such as --sorted; when the flag is not given, fall back to the original parser that writes all results at the end.
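For what it's worth, the chunk-sort-merge approach described above is not hard to build from the standard library using `heapq.merge` (this is essentially what GNU sort does internally). A sketch, assuming one tab-separated line per record with the sort key in the first field:

```python
import heapq
import os
import tempfile

def _spill(sorted_chunk):
    """Write one sorted chunk of lines to a temp file; return its path."""
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, 'w') as f:
        f.writelines(sorted_chunk)
    return path

def external_sort(lines, chunk_size=100_000, key=lambda l: l.split('\t')[0]):
    """Disk-backed sort: sort fixed-size chunks in memory, spill each to a
    temp file, then lazily k-way merge them with heapq.merge."""
    chunk_paths, chunk = [], []
    for line in lines:  # each line must end with '\n'
        chunk.append(line)
        if len(chunk) >= chunk_size:
            chunk_paths.append(_spill(sorted(chunk, key=key)))
            chunk = []
    if chunk:
        chunk_paths.append(_spill(sorted(chunk, key=key)))
    files = [open(p) for p in chunk_paths]
    try:
        yield from heapq.merge(*files, key=key)
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```

Peak memory is bounded by `chunk_size` lines regardless of the total file size, which would cover the 800M-TSV case that triggered this issue.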
