Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vcf_to_csv -> not recognizing FORMAT #340

Open
ayadlin opened this issue Sep 11, 2020 · 3 comments
Open

vcf_to_csv -> not recognizing FORMAT #340

ayadlin opened this issue Sep 11, 2020 · 3 comments

Comments

@ayadlin
Copy link

ayadlin commented Sep 11, 2020

Hi -
I am trying to copy 3 cols from a VCF file to CSV file. 1 is ID, the second is fromat DS and format GP.
using the command below I get a warning
allel.vcf_to_csv(my_vcf.vcf', 'my_vcf.csv', fields=['calldata/*'])

UserWarning: '*' FORMAT header not found
warnings.warn('%r FORMAT header not found' % name)

Changing the command to

allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['calldata/DS'], types={'calldata/DS':'f4' )
avoids the error but I get an empty csv file - I know there is data (float in the FORMAT:DS column, is there any thing wrong with what I am doing or is there an issue on the csv writing?

allel.read_vcf('my_vcf.vcf', fileds=['DS','GP']) works, but it is a long process- and I am note sure how to go from there to the csv file.

just in case this is useful FORMAT is one of the headers of my file -
and the description section defines:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 ">

A final question , as you see, GP is a tuple (float, float, float) if I need to assign a type to it in the types dictionary - what would the correct syntax be?
Also I have not been able to find exactly what f4 means ( I know is float but is it float 32, float64?

What I am trying to build is a CSV file that conserves the samples identifiers , the SNP IDs and the Dosage (DS) and the posterior probabilities GP. I have about 5000 samples and 10000000 divided across 22 chromosomes. Would appreciate any help on how to extract and consolidate that data

Thanks,

A

PS - I am sure the issue is with reading the calldata as
allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['ID' , 'calldata/DS'], types={'calldata/DS':'f4' )
allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['ID' , 'DS'])
allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['ID'])

produce the same .csv with the ID only

@hardingnj
Copy link
Collaborator

hardingnj commented Sep 13, 2020

Also I have not been able to find exactly what f4 means ( I know is float but is it float 32, float64?

This is the number of bytes, so f4 = 8x4 = 32 bit float.

Your code looks ok to me- no obvious problems with how you have specified those fields.

It's difficult to debug without access to the file, but if you could provide a minimal example that fails I'd be happy to look in detail. If you modify the numbers to obscure anything potentially identifiable/privileged that would be good too.

vcflibcontains some useful commands to downsample VCF files.

@ayadlin
Copy link
Author

ayadlin commented Sep 13, 2020 via email

@hardingnj
Copy link
Collaborator

I think it's unlikely to be the size... there are ways of chunking the file within allel if it's very large. The sorting could be a problem maybe. It might be worth using the region argument to read a subset of the data too.

A small subset of the data that fails to work can tell us much more than the error message.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants