Skip to content

Commit

Permalink
Merge branch 'ar/hmm-dynamic-transitions' into 'master'
Browse files Browse the repository at this point in the history
[dmr] add segmenting HMM.

See merge request machine-learning/modkit!163
  • Loading branch information
ArtRand committed May 18, 2024
2 parents c1d6897 + 51e06c8 commit 92e65f2
Show file tree
Hide file tree
Showing 10 changed files with 980 additions and 20 deletions.
51 changes: 50 additions & 1 deletion book/src/dmr_scoring_details.md
Original file line number Diff line number Diff line change
Expand Up @@ -90,7 +90,7 @@ Starting with a prior of \\(\text{Beta}(0.5, 0.5)\\), we can calculate the poste
![posterior_distributions](./images/beta_distributions.png)

What we need to calculate is the probability distribution of the _difference_ (the effect size) between the two conditions (high-modification and low-modification).
This can be done using a piecewise solution described by [Pham-Gia, Turkkan, and Eng in 1993](https://www.tandfonline.com/doi/abs/10.1080/03610929308831114), shown below:
This distribution can be done using a piecewise solution described by [Pham-Gia, Turkkan, and Eng in 1993](https://www.tandfonline.com/doi/abs/10.1080/03610929308831114), the distribution is shown below:

![beta_diff](./images/estimated_map_pvalue2.png)

Expand All @@ -108,3 +108,52 @@ d = \hat{p}_1 - \hat{p}_2 \\
\\[
\hat{p} = \frac{ N_{\text{mod}} }{ N_{\text{canonical}} } \\
\\]

## DMR segmentation hidden Markov model

When performing "single-site" analysis with `modkit dmr pair` (by omitting the `--regions` option) you can optionally run the "segmentation" model at the same time by passing the `--segment` option with a filepath to write the segments to.
The model is a simple 2-state hidden Markov model, shown below, where the two hidden states, "Different" and "Same" indicate that the position is either differentially methylated or not.
<div style="text-align: center;">

![hmm](./images/hmm2.png "2-state segmenting HMM")

</div>
The model is run over the intersection of the modified positions in a [pileup](https://nanoporetech.github.io/modkit/intro_bedmethyl.html#description-of-bedmethyl-output) for which there is enough coverage, from one or more samples.

## Transition parameters
There are two transition probability parameters, \\(p\\) and \\(d\\).
The \\(p\\) parameter is the probability of transitioning to the "Different" state, and can be roughly though of as the probability of a given site being differentially modified without any information about the site.
The \\(d\\) parameter is the maximum probability of remaining in the "Different" state, it is a maximum because the value of \\(d\\) will change dynamically depending on the proximity of the next modified site.
The model proposes that modified bases in close proximity will share modification characteristics.
Specifically, when a site is differentially modified the probability of the next site also being differentially modified depends on how close the next site happens to be.
For example, if a CpG dinucleotide is differentially modified and is immediately followed by another CpG (sequence is `CGCG`) we have the maximum expectation that the following site is also differentially modified.
However, as the next site becomes farther away (say the next site is one thousand base pairs away, `CG[N x 1000]CG`) these sites are not likely correlated and the probability of the next site being differentially modified decreases towards \\(p\\).
The chart below shows how the probability of staying in the "Different" state, \\(d\\), decreases as the distance to the next modified base increases.

<div style="text-align: center;">

![hmm](./images/dynamic_probs.png "dynamic transition probabilities")

</div>

In this case, the maximum value of \\(d\\) is 0.9, \\(p\\) is 0.1, and the `decay_distance` is 500 base pairs (these also happen to be the defaults).
This can be seen as the maximum value of both curves is 0.9, and the minimum value, reached at 500 base pairs, is 0.1.
These parameters can be set with the `--diff-stay`, `--dmr-prior`, and `--decay-distance`, parameters, respectively.
The two curves represent two different ways of interpolating the decay between the minimum (1) and the `decay_distance`, `linear` and `logistic`.
The `--log-transition-decay` flag will use the orange curve whereas the default is to use the blue curve.

In general, these settings don't need to be adjusted.
However, if you want very fine-grained segmentation, use the `--fine-grained` option which will produce smaller regions but also decrease the rate at which sites are classified as "Different" when they are in fact not different.

## Emission parameters
The emissions of the model are derived from the [likelihood ratio score](https://nanoporetech.github.io/modkit/dmr_scoring_details.html#likelihood-ratio-scoring-details).
One advantage to using this score is that differences in methylation type (i.e. changes from 5hmC to 5mC) will be modeled and detected.
The score is transformed into a probability by \\( p = e^{\text{-score}} \\).
The full description of the emission probabilities for the two states is:

\\[
p_{\text{Same}} = e^{\text{-score}}
\\]
\\[
p_{\text{Different}} = 1 - p_{\text{same}}
\\]
Binary file added book/src/images/dynamic_probs.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added book/src/images/hmm2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
44 changes: 43 additions & 1 deletion book/src/intro_dmr.md
Original file line number Diff line number Diff line change
Expand Up @@ -93,7 +93,7 @@ chr20 10034962 10035266 CpG: 35 1.294227443419004 C:7 1513 C:14

The full schema is described [below](#differential-methylation-output-format).

### 2. Perform differential methylation detection on all pairs of samples over regions from the genome.
## 2. Perform differential methylation detection on all pairs of samples over regions from the genome.
The `modkit dmr multi` command runs all pairwise comparisons for more than two samples for all regions provided in the regions BED file.
The preparation of the data is identical to that for the [previous section](#preparing-the-input-data) (for each sample, of course).
An example command could be:
Expand Down Expand Up @@ -233,3 +233,45 @@ modkit dmr pair \
```
these columns will not be present.


## Segmenting on differential methylation

When running `modkit dmr` without `--regions` (i.e. [single-site analysis](#3-detecting-differential-modification-at-single-base-positions)) you can generate regions of differential methylation on-the-fly using the segmenting [hidden Markov model](./dmr_scoring_details.html#dmr-segmentation-hidden-markov-model) (HMM).
To run segmenting on the fly, add the `--segments $segments_bed_fp` option to the command such as:


```bash
dmr_result=single_base_haplotype_dmr.bed
dmr_segments=single_base_segements.bed

modkit dmr pair \
-a ${hp1_pileup}.gz \
-b ${hp2_pileup}.gz \
-o ${dmr_result} \
--segments ${dmr_segments} \ # indicates to run segmentation
--ref ${ref} \
--base C \
--threads ${threads} \
--log-filepath dmr.log
```

The default settings for the HMM are to run in "coarse-grained" mode which will more eagerly join neighboring sites, potentially at the cost of including sites that are not differentially modified within "Different" blocks.
To activate "fine-grained" mode, pass the `--fine-grained` flag.
The output schema for the segments is:

| column | name | description | type |
|--------|------------------------------|-------------------------------------------------------------------------------------------|-------|
| 1 | chrom | name of reference sequence from bedMethyl input samples | str |
| 2 | start position | 0-based start position, from `--regions` argument | int |
| 3 | end position | 0-based exclusive end position, from `--regions` argument | int |
| 4 | state-name | "different" when sites are differentially modified, "same" otherwise | str |
| 5 | score | Difference score, more positive values have increased difference | float |
| 6 | N_<sub>sites<\sub> | Number of sites (bedmethyl records) in the segment | float |
| 7 | sample<sub>a</sub> counts | Counts of each base modification in the region, comma-separated, for sample A | str |
| 8 | sample<sub>a</sub> total | Total number of base modification calls in the region, including unmodified, for sample A | str |
| 9 | sample<sub>b</sub> counts | Counts of each base modification in the region, comma-separated, for sample B | str |
| 10 | sample<sub>b</sub> total | Total number of base modification calls in the region, including unmodified, for sample B | str |
| 11 | sample<sub>a</sub> fractions | Fraction of calls for each base modification in the region, comma-separated, for sample A | str |
| 12 | sample<sub>b</sub> fractions | Fraction of calls for each base modification in the region, comma-separated, for sample B | str |
| 13 | effect size | Percent modified in sample A (col 12) minus percent modified in sample B (col 13) | float |

Loading

0 comments on commit 92e65f2

Please sign in to comment.