entropy vs sequencing depth/CG content #214

Open
ywang285 opened this issue Jun 20, 2024 · 3 comments
Labels
question Further information is requested

Comments

@ywang285

Hi, thank you for providing this great tool! I am interested in calculating CpG entropy, and I am wondering how sequencing depth (for example, 10X vs. 30X coverage) and CG content (for example, reads from a high-CG-content region vs. a low-CG-content region that still has >=4 CGs within the window size) affect the entropy value. Thank you!

@ArtRand
Contributor

ArtRand commented Jun 20, 2024

Hello @ywang285,

If your coverage is not deep enough to sample all of the $\textbf{N}$ possible patterns (referring to the documentation on the calculation), the calculated entropy will be biased low. In the extreme case, with a coverage of 1, only one pattern is observed and the entropy will always be 0.0. As coverage increases and you sample all of the methylation patterns, I would expect the calculation to approach the "true value". The entropy may continue to grow with coverage as reads with methylation call errors accumulate, but I would expect this growth to be small since the $Pr(n_i)$ of each erroneous pattern will be small.
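As a rough illustration of the coverage effect (a toy sketch, not modkit's implementation), the following simulates reads over a window of CpG positions and computes the Shannon entropy of the observed patterns. The function names, the normalization by the number of positions, and the model of independent positions with methylation probability `p_meth` are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def pattern_entropy(patterns, num_positions):
    """Shannon entropy (bits) of observed patterns, divided by num_positions.

    The normalization is an assumption for this sketch; check the modkit
    documentation for the exact formula the tool uses.
    """
    counts = Counter(patterns)
    total = sum(counts.values())
    h = sum(-(c / total) * math.log2(c / total) for c in counts.values())
    return h / num_positions


def simulate(coverage, num_positions=4, p_meth=0.5, seed=0):
    """Draw `coverage` reads; each position is methylated with prob. p_meth.

    Positions are sampled independently, which is a simplification: real
    CpGs tend to be spatially correlated.
    """
    rng = random.Random(seed)
    reads = [
        tuple(rng.random() < p_meth for _ in range(num_positions))
        for _ in range(coverage)
    ]
    return pattern_entropy(reads, num_positions)


if __name__ == "__main__":
    for cov in (1, 5, 10, 30, 100, 1000):
        print(f"coverage={cov:5d}  entropy={simulate(cov):.3f}")
```

With a single read there is only one pattern, so the estimate is exactly 0.0. More generally, the estimate cannot exceed log2(coverage) / num_positions, so shallow coverage caps the value regardless of the underlying biology.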

I have less of an intuition for how CpG density will change entropy. Given a constant number of CpGs (--num-positions) within a smaller interval of the genome (--window-size), you may expect the entropy to be lower since CpGs tend to be spatially correlated (for example, all modified or all unmodified). As you increase the interval size (keeping the number of CpGs constant), you have more opportunity to sample different biological processes. On the other hand, I could also see dense CpG regions with a lot of competition for binding having very high entropy.

So in general, I would recommend getting as much coverage as possible (up to a point, around 30X per haplotype) and trying multiple settings for --num-positions and --window-size. Let me know if this isn't clear; I'm happy to answer any follow-up questions.

ArtRand added the question (Further information is requested) label Jun 20, 2024
@ywang285
Author

Thank you so much for your reply! I plan to compare the entropy of genomic regions without amplification (10X-20X depth) vs. amplicon regions (>100X depth). However, it sounds like the comparison won't be fair unless I subsample the amplicon regions. Is that the case, or do you have any suggestions?

@ArtRand
Contributor

ArtRand commented Jun 24, 2024

Hello @ywang285,

I don't think you'll need to subsample the amplified reads, since they should have very low modification rates. If there is high native methylation entropy in a genomic region, it should be noticeable above an amplified background.


2 participants