Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Functions to facilitate VCF comparison using sgkit #95

Merged
merged 1 commit into from
Jun 20, 2023

Conversation

szhan
Copy link
Owner

@szhan szhan commented Jun 14, 2023

Addresses #94

Copy link

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!
Great that we have this working for small datasets, but this isn't going to scale as all the genotypes are bought into RAM as your return type is a list of lists.

For this to work on large datasets, I think you'll need to return copies of ds1 and ds2, where they are filtered down to the common sites and ds2 is remapped. Does that make sense for what you are trying to do?

Lets talk Monday in the office about how we can modify this code to achieve that.

@szhan
Copy link
Owner Author

szhan commented Jun 16, 2023

Thanks, @benjeffery !

Just for my understanding, the function should return xarray.Dataset, so that zarr can act upon it?

I'd just like to compare call_genotype of ds1 and ds2, so it makes sense to return ds1 and ds2 remapped with only the common sites.

@szhan
Copy link
Owner Author

szhan commented Jun 16, 2023

Should we just do that all within remap_genotypes? Or is that too much going on for a single function?

@szhan
Copy link
Owner Author

szhan commented Jun 16, 2023

Since sgkit xarray datasets have the variable variant_contig, the function should check both variant_contig and variant_positions.

@szhan
Copy link
Owner Author

szhan commented Jun 16, 2023

I'm thinking that get_matching_indices should return xarray.DataArray as well. It should help analyse big data sets, right? @benjeffery

@szhan
Copy link
Owner Author

szhan commented Jun 17, 2023

I've modified get_matching_indices to return numpy.ndarray instead of list.

@szhan
Copy link
Owner Author

szhan commented Jun 17, 2023

The "compatible" ds1 and ds2 should have the same allele lists.

@szhan
Copy link
Owner Author

szhan commented Jun 17, 2023

What i'd like to do is multi-way VCF comparisons, i.e., imputed genotypes from lshmm (ds_lshmm), imputed genotypes from BEAGLE (ds_beagle), and ground-truth genotypes (ds_truth). I think the easiest way is to just make ds_lshmm and ds_beagle compatible with ds_truth by calling make_compatible_genotypes twice.

@szhan szhan force-pushed the compare_vcf_using_sgkit branch 4 times, most recently from bd4d670 to 752fe39 Compare June 19, 2023 16:05
@szhan
Copy link
Owner Author

szhan commented Jun 19, 2023

For now, let's assume that only ACGT are allowed, to keep things simple.

@szhan
Copy link
Owner Author

szhan commented Jun 19, 2023

Also, I still need to add tests for make_compatible_genotypes.

@szhan
Copy link
Owner Author

szhan commented Jun 20, 2023

There are still more tests that could be added, but the current tests are good enough, I think. Will continue adding more tests in a separate issue.

@szhan szhan closed this Jun 20, 2023
@szhan szhan reopened this Jun 20, 2023
@szhan szhan merged commit c0a6a2d into main Jun 20, 2023
@szhan szhan deleted the compare_vcf_using_sgkit branch June 20, 2023 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants