Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exact or approximate number of IBD segments #5

Open
sdtemple opened this issue Jun 4, 2024 · 4 comments
Open

Exact or approximate number of IBD segments #5

sdtemple opened this issue Jun 4, 2024 · 4 comments

Comments

@sdtemple
Copy link

sdtemple commented Jun 4, 2024

Does your program output all of the IBD segments? I am curious about the <sampling_window> parameter. Is this parameter designed to manage memory and runtime resources? Could using too small of a value for this parameter lead to not every true IBD segment being reported?

@bguo068
Copy link
Owner

bguo068 commented Jun 4, 2024

Thanks for the question. The <sample_window> control how frequent local trees along the genome are being checked for breaks of ancestral segments. The smaller <sample_window> is, the more trees are checked, which means your ends (start and end coordinates) of your IBD segments are more precise, and it takes more time to run.

On the one hand, (1) if you want very accurate IBD segments (when gene conversion is not simulated = all breaks are due to sexual recombination), you can set <sample_window> is set 1. (2) However, if gene conversion is simulated, breaks can be caused by not only recombination but also gene conversion. In this case, using a small value of <sample_window> will break down a long true IBD segments into smaller pieces as a result of gene conversion. When these small pieces are shorter than the threshold say 2cM, they won't be reported. You can check whether this is true, by setting <mincm> to a smaller number so these smaller pieces can be written to the output. For point (2), I would avoid setting <sample_window> to a very small number (such as less than gene conversion length).

On the other hand, of course, using a large <sample_window> will ignore breaks that happened in local trees that are between adjacent sampled trees. You might see fewer breaks (thus fewer IBD) than expected.

@sdtemple
Copy link
Author

sdtemple commented Jun 4, 2024

Thanks! This is a great and helpful response. I did realize that I had to simulate without gene conversion to the get the true IBD segments in the Browning-esque defintion.

In this pre-print (here), Supplementary Figure 9, I got the tskibd "true" IBD segments to give similar results to hap-ibd inferred IBD segments (for selection coefficient estimation). On the other hand, I would have expected (Table 1 and Supplementary Table 2 and other unpublished results) that the tskibd true IBD segments would not have biased selection coefficient estimates for s <= 0.02.

These observations are why I would agree with you: to get exact / highly accurate true IBD segment lengths you would have to have no gene conversion and set the sampling window to 1 or close to it.

Do you have any idea about the runtime and memory trade-offs as the <sample_window> decreases?

@bguo068
Copy link
Owner

bguo068 commented Jun 4, 2024

I am glad my explanation was helpful. I have not spend enough time on characterizing the relationship of run time with sample-window argument. But for small sample size, it won't be too bad. For large sample size, there will be more local trees and it can be slow.

Actually, I have been working on a separate project to implement the same function (calling true IBD) but fully take advantage of the high correlation between adjacent trees: for adjacent local trees, only a few edges are added, removed or rearranged . The new project aims to go through tree by tree (without skipping trees) so it can be extremely accurate but also very fast. The main algorithm has been coded but I will need to write more tests and make it highly parallelizable. I can share the code when I have it tested, maybe within a month.

@sdtemple
Copy link
Author

sdtemple commented Jun 4, 2024

Great. Thanks for all the responses and I look forward to your updates. I'll be doing some benchmarks on the tskibd runtime and accuracy for a project of mine as well, particularly as I play around with the sample window parameter. For accuracy, I would expect that the msprime + tskibd approach (w/o gene conversion) should get similar numbers of true IBD segments >= Y cM than the function simulate_ibd in isweep when you set the chromosome to be of size 2 x Y cM and calculate the number of IBD segments overlapping the middle of such chromosomes.

In some preliminary results, I am finding that the msprime + tskibd approach is much faster at getting IBD segments than ARGON.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants