-
Notifications
You must be signed in to change notification settings - Fork 2
Added VCF handling #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| // std::cout << "Processing sample " << indx << ", SNP " << s << std::endl; | ||
|
|
||
| if (field[0] == '0' || field[0] == '.') | ||
| snp->one[indx] = 0; // Reference allele (0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to make this part of the logic consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In particular, this is the logic we should follow:
Line 1497 in 15ea376
| void snpdt::writeIntToGenotype(int indi, int snp, int code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed the logic you suggested. It's still giving me a different output in comparison to the bfile.
I also want to confirm if you run the program with a --bfile and --file flag for the same data would it give the same output?
Below is the code I wrote:
int code = -9; if (field[0] == '.' || field[2] == '.') { code = -9; } else if (field[0] == '0' && field[2] == '0') { code = 0; // Homozygous reference (0/0) } else if ((field[0] == '0' && field[2] == '1') || (field[0] == '1' && field[2] == '0')) { code = 1; // Heterozygous (0/1) } else if (field[0] == '1' && field[2] == '1') { code = 2; // Homozygous alternate (1/1) } else { std::cerr << "Error: unrecognized genotype format: " << field << std::endl; exit(1); }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add your code as a commit to this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@radioactive17 the internal representatino neends to be changed as well. meaning this part of the code needs to be changed as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I haven't changed that part of the code. I'll do that and add the updated code. Thanks.
This pull request introduces updates to handle genotype data from a vcf format
Please review the readVcfData function in the snpdt.cpp file. This function reads data from a VCF file and aims to store it in a format similar to BFILE data (.bim, .bed, .fam).
The primary difference between VCF and BFILE formats lies in how missing genotype data is represented and handled:
In VCF, missing data is explicitly marked with a "."
For example:
In BFILE, missing data is encoded in a binary format (e.g., "01"), which does not specify whether the reference or alternate allele is missing.
This distinction in handling missing data could be introducing discrepancies in statistical calculations, particularly in variants where missing genotypes are prevalent.
I’d like your insights on whether the handling of missing data in this function could be improved to mitigate potential discrepancies. Let me know if further clarification is needed!