Added VCF handling #2

radioactive17 · 2024-11-19T03:18:19Z

This pull request adds support for reading genotype data from VCF files.

Please review the readVcfData function in the snpdt.cpp file. This function reads data from a VCF file and aims to store it in a format similar to BFILE data (.bim, .bed, .fam).

The primary difference between VCF and BFILE formats lies in how missing genotype data is represented and handled:

In VCF, a missing allele is marked explicitly with a ".". For example:

0/. indicates one allele is present (reference) and the other is missing.
./. indicates both alleles are missing.

In BFILE, a missing genotype is stored as a single two-bit code ("01" in the .bed file). The code marks the whole genotype as missing, so it cannot record whether the reference or the alternate allele is the missing one.

This distinction in handling missing data could be introducing discrepancies in statistical calculations, particularly in variants where missing genotypes are prevalent.
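To make the comparison concrete, here is a minimal sketch of how one byte of a PLINK 1 .bed file (SNP-major mode) unpacks into genotype codes. The names `decodeBedByte` and `kMissing` are hypothetical and do not come from snpdt.cpp; the -9 sentinel is borrowed from the code later in this thread. Because the format has only one missing code, a half-missing VCF call such as 0/. has no direct equivalent and has to collapse to fully missing.

```cpp
#include <array>
#include <cstdint>

// Sentinel for a missing genotype (assumed; mirrors the -9 used in this thread).
constexpr int kMissing = -9;

// Decode one .bed data byte into four genotype codes.
// Each genotype takes two bits, starting from the low-order bits:
//   00 -> homozygous for the first allele listed in the .bim file (A1)
//   01 -> missing
//   10 -> heterozygous
//   11 -> homozygous for the second allele listed in the .bim file (A2)
// The result is the number of A2 alleles (0, 1, 2) or kMissing.
std::array<int, 4> decodeBedByte(std::uint8_t byte) {
    std::array<int, 4> out{};
    for (int i = 0; i < 4; ++i) {
        switch ((byte >> (2 * i)) & 0x3) {
            case 0x0: out[i] = 0; break;
            case 0x1: out[i] = kMissing; break;
            case 0x2: out[i] = 1; break;
            default:  out[i] = 2; break;
        }
    }
    return out;
}
```

One thing worth checking when outputs differ: A1/A2 in the .bim file need not correspond to REF/ALT in the VCF, so a 0/2 swap between the two inputs is possible even when missing data is handled identically.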

I’d like your insights on whether the handling of missing data in this function could be improved to mitigate potential discrepancies. Let me know if further clarification is needed!

changshuaiwei · 2025-02-05T21:16:59Z

code/snpdt.cpp

+				// std::cout << "Processing sample " << indx << ", SNP " << s << std::endl;
+
+				if (field[0] == '0' || field[0] == '.')
+					snp->one[indx] = 0;  // Reference allele (0)


We need to make this part of the logic consistent.

In particular, this is the logic we should follow:

hwu/code/snpdt.cpp

Line 1497 in 15ea376

void snpdt::writeIntToGenotype(int indi, int snp, int code)
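For readers without the source at hand, the following is a rough sketch of what mapping an integer genotype code onto a two-bit internal pair can look like. It assumes the PLINK 1.07-style in-memory convention for the `one`/`two` bits; that is an assumption, not something confirmed for this repository, and the real `writeIntToGenotype` may differ. `codeToBits` is a hypothetical helper name.

```cpp
#include <utility>

// Hypothetical helper: map an integer genotype code to a (one, two) bit pair.
// Assumed convention (PLINK 1.07 style, NOT verified against this repository):
//   0        -> (false, false)  homozygous, first allele
//   1        -> (false, true)   heterozygous
//   2        -> (true,  true)   homozygous, second allele
//   other    -> (true,  false)  missing
std::pair<bool, bool> codeToBits(int code) {
    switch (code) {
        case 0:  return {false, false};
        case 1:  return {false, true};
        case 2:  return {true, true};
        default: return {true, false};  // includes -9
    }
}
```

The point of the sketch is that both bits are always written together, so a reader that sets only `one[indx]` for a '0' or '.' allele leaves the pair in an inconsistent state.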

I followed the logic you suggested, but the output still differs from the bfile output.
I'd also like to confirm: if you run the program on the same data once with --bfile and once with --file, do the two runs give the same output?

Below is the code I wrote:
int code = -9;
if (field[0] == '.' || field[2] == '.') {
    code = -9;  // Missing: at least one allele is '.'
} else if (field[0] == '0' && field[2] == '0') {
    code = 0;   // Homozygous reference (0/0)
} else if ((field[0] == '0' && field[2] == '1') || (field[0] == '1' && field[2] == '0')) {
    code = 1;   // Heterozygous (0/1)
} else if (field[0] == '1' && field[2] == '1') {
    code = 2;   // Homozygous alternate (1/1)
} else {
    std::cerr << "Error: unrecognized genotype format: " << field << std::endl;
    exit(1);
}
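As a possible hardening of the same mapping, here is a sketch (C++17) that checks the field length and separator before indexing, and ignores any subfields after the genotype such as "0/1:35". The function name `parseBiallelicGt` and the use of `std::optional` are illustrative choices, not part of the PR. It assumes diploid, biallelic calls and that GT is the first colon-separated subfield; anything else is reported as unsupported rather than terminating the program.

```cpp
#include <optional>
#include <string>

// Sentinel for a missing genotype, matching the -9 used above.
constexpr int kMissingGenotype = -9;

// Parse a diploid, biallelic GT value such as "0/1", "1|0" or "./.".
// Returns 0, 1 or 2 (number of alternate alleles), kMissingGenotype if either
// allele is '.', or std::nullopt if the field is not in a supported form.
std::optional<int> parseBiallelicGt(const std::string& field) {
    // Keep only the text before the first ':' (the GT subfield).
    const std::string gt = field.substr(0, field.find(':'));
    // Expect exactly allele, separator, allele; '/' is unphased, '|' is phased.
    if (gt.size() != 3 || (gt[1] != '/' && gt[1] != '|')) {
        return std::nullopt;
    }
    const char a = gt[0];
    const char b = gt[2];
    if (a == '.' || b == '.') {
        return kMissingGenotype;  // half-missing calls collapse to missing
    }
    if ((a != '0' && a != '1') || (b != '0' && b != '1')) {
        return std::nullopt;  // multi-allelic or malformed
    }
    return (a - '0') + (b - '0');
}
```

Returning `std::nullopt` leaves the decision to the caller, which can log the offending record and skip it or abort, depending on how strict the reader should be.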

Can you add your code as a commit to this PR?

@radioactive17 the internal representation needs to be changed as well, meaning this part of the code needs to change too.

Yes, I haven't changed that part of the code. I'll do that and add the updated code. Thanks.

Added VCF handling

1708c2e

radioactive17 commented Nov 19, 2024
changshuaiwei Feb 5, 2025
changshuaiwei Feb 5, 2025
radioactive17 Feb 24, 2025 (edited)
changshuaiwei Feb 24, 2025
changshuaiwei Mar 2, 2025
radioactive17 Mar 4, 2025
