General improvements and fixes #133
Conversation
…red_loglikelihood
No comments here. Looks clear & seems to follow good practice -- wonderful work as per usual, Tor!
src/stepping.jl (Outdated)

```julia
Δ_history :: Array{<:Real, 2}
Δ_index_history :: Array{<:Integer, 2}
Ρ :: Vector{AdaptiveState}
@concrete struct TemperedState
```
Not quite sure about the use of ConcreteStructs. I found the type annotations helpful for understanding the code. Also, we would like to keep the dependency absolutely minimal where possible.
Feel free to copy and paste my macro from https://github.com/JuliaNonconvex/NonconvexCore.jl/blob/master/src/utilities/params.jl where you can keep the type annotations and make it concrete at the same time.
Not quite sure about the use of ConcreteStructs. I found the type annotations helpful for understanding the code. Also, we would like to keep the dependency absolutely minimal where possible.
With regards to minimizing dependencies, ConcreteStructs seems to take .004 seconds to load on my computer, so I don't think it's that big a deal.
@mohamed82008 Do you think that might be a good macro to add to ConcreteStructs.jl (so it can be used more easily outside of NonconvexCore)?
Yes if the authors of ConcreteStructs are ok with the move.
> I found the type annotations helpful for understanding the code. Also, we would like to keep the dependency absolutely minimal where possible.

Though I get this sentiment, I think in general you want to avoid putting explicit type constraints on parametric types, in particular here, where IMO the constraints were already stronger than necessary.

> Feel free to copy and paste my macro from

Oh, this looks quite nice! Might be worth doing that instead :)

> Yes if the authors of ConcreteStructs are ok with the move.

Highly recommend you propose it! From a quick glance it just seems superior to `@concrete` in every way, no?
I didn't know about `@concrete` when I did this. I think mine is older (from years ago).
But IMO we shouldn't even restrict to something like an `AbstractVector`; e.g. for a small number of temperatures it might be better to use a `Tuple`, etc. Instead we should just document the fields properly.
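Leaving the field's type parameter unconstrained (and documenting it instead) keeps both options open; a minimal sketch with a hypothetical `Schedule` type, not this package's actual struct:

```julia
# Hypothetical `Schedule` type: the container holding the inverse
# temperatures is a free type parameter, so callers can pick a Tuple
# for a small fixed number of temperatures or a Vector otherwise.
struct Schedule{T}
    Δ::T  # inverse temperatures; documented rather than type-constrained
end

total(s::Schedule) = sum(s.Δ)

total(Schedule((0.25, 0.5, 1.0)))  # a Tuple works ...
total(Schedule([0.25, 0.5, 1.0]))  # ... and so does a Vector
```

An `AbstractVector{<:Real}` constraint would reject the `Tuple` case outright, even though `sum` and friends work fine on both.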
src/stepping.jl (Outdated)

```julia
    kwargs...
)
for i in 1:length(spl.Δ)
]
return (
    states[sortperm(spl.Δ_init)[1]][1],
    first(states[argmax(spl.Δ_init)]),
```
Should this be `argmax` or `argmin`?
I believe it should be `argmin`. @torfjelde?
@torfjelde once I get a confirmation on whether it should be `argmin` or `argmax`, is it cool if I merge this and register an update to 0.2.0 (assuming tests pass)? Should simplify TuringLang/Turing.jl#1628
`argmax` should result in the index corresponding to 1, which is what we want, right? We want the returned transition to correspond to the posterior, i.e. when delta = 1.

But IMO we should just make it `[1]` so that it's also possible to anneal in any way you want 🤷 I did this because I believe I saw this somewhere else in the code.

> Should simplify TuringLang/Turing.jl#1628

Yep. There are several things that IMO should be done differently in that PR, e.g. we should be using the `MiniBatchContext` to temper the likelihoods, as it will be faster. But I'm currently running into issues getting some tests to pass; I essentially can't get it to even fit a simple `MvNormal` properly. I'm uncertain whether it's just me not knowing how to set parameters for tempering or whether there's something wrong in the code.
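For context, tempering the likelihood just means scaling the log-likelihood term by β while leaving the prior untouched; a minimal sketch with hypothetical one-dimensional densities (not Turing's actual `MiniBatchContext` machinery):

```julia
# Tempered log-density: logprior(x) + β * loglikelihood(x).
# The densities below are hypothetical, purely for illustration.
logprior(x) = -x^2 / 2
loglikelihood(x) = -(x - 1)^2 / 2
tempered_logdensity(x, β) = logprior(x) + β * loglikelihood(x)

tempered_logdensity(0.0, 0.0)  # β = 0: the prior alone
tempered_logdensity(0.0, 1.0)  # β = 1: the full posterior (up to a constant)
```

`MiniBatchContext` achieves the same β-scaling of the likelihood inside DynamicPPL's model evaluation, which is why it avoids rebuilding a tempered model from scratch.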
@torfjelde if you think `[1]` works better, I would go with that. I think it's less confusing and more general.
> argmax should result in the index corresponding to 1

No, it's `argmin`.
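A quick sanity check (with a hypothetical `Δ_init`): `sortperm(x)[1]` and `argmin(x)` both return the index of the smallest element, so `argmin` reproduces the original `sortperm(spl.Δ_init)[1]` behaviour while `argmax` picks the opposite end of the ordering:

```julia
Δ_init = [0.5, 1.0, 0.25]  # hypothetical values
# The original expression and argmin agree (both are stable under ties):
sortperm(Δ_init)[1] == argmin(Δ_init)  # true; both give 3 here
# argmax instead returns the index of the largest entry:
argmax(Δ_init)  # 2 here
```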
Done-diddit 👍
src/swapping.jl (Outdated)

```
taken traverses even chain indices, proposing swaps between neighbors.

Note that this method is _not_ reversible, and does not satisfy detailed balance.
As a result, this method is asymptotically biased.
```
Sorry, not sure if this is the right place to put this, but this statement is not true. It is not reversible, but it preserves the correct distribution. This non-reversibility is actually the main reason why a swapping strategy like this is better: it prevents the swapping from devolving into a diffusion-type process. The Syed et al. 2019 paper goes into this in a lot of detail.
@ptiede Do you have a link to that paper?
Yes, sorry! I should have included it. Here is the arXiv link: https://arxiv.org/abs/1905.02939

A follow-up paper, https://arxiv.org/abs/2102.07720, also goes into detail on how to further optimize PT. The primary author of that paper, @s-syed, and I were actually hoping to implement something like this in this package, so if this repo is picking up again, we would love to help!
Hi! I am more than happy to answer any questions. This odd-even scheme is actually extremely important: in some sense it achieves optimal performance and should always be used over the reversible counterpart. Furthermore, it leads to tuning guidelines that are very different from those for reversible schemes.

In practice we find a 10-100x boost in performance compared to reversible swapping schemes and their corresponding tuning guidelines that have been implemented so far.
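For concreteness, the odd-even (DEO) communication scheme alternates deterministically between the two sets of neighbouring chain pairs, instead of proposing a single random pair per iteration; a small sketch (the helper function is hypothetical, not this package's API):

```julia
# Hypothetical helper: which neighbouring pairs get swap proposals on
# each iteration under the deterministic odd-even (DEO) scheme.
function swap_candidates(niters::Int, nchains::Int)
    pairs = Vector{Vector{Tuple{Int,Int}}}()
    for t in 1:niters
        # Odd sweep proposes (1,2), (3,4), …; even sweep (2,3), (4,5), …
        offset = isodd(t) ? 1 : 2
        push!(pairs, [(i, i + 1) for i in offset:2:(nchains - 1)])
    end
    return pairs
end

swap_candidates(2, 4)  # [[(1, 2), (3, 4)], [(2, 3)]]
```

Because the two sweeps are disjoint, every pair in a sweep can be proposed simultaneously, which is also what makes the scheme attractive for parallel implementations.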
> Sorry, not sure if this is the right place to put this, but this statement is not true. It is not reversible, but it preserves the correct distribution.

Ah, thank you for pointing this out! This is indeed an incorrect statement.

And thank you for pointing me to those papers! Just read the first one, and it's real neat stuff. And we'd be happy to collaborate on this package. Personally I'm just starting to get into PT samplers, so it would be awesome to include someone with both practical and theoretical experience with these things.
Awesome! I'd love to help. I have some experience implementing PT samplers for use with HPC and clusters, so I would be thrilled to help in any way I can! My old code was written in C++, so moving to Julia would be great.
Awesome!:) I'll push some changes I've accumulated locally very soon.
Aight, so something we need to address here before this PR gets merged.

Arguably, the most straight-forward approach to handling a swap between chains `i` and `i + 1` is to swap their states directly:

```julia
states[i], states[i + 1] = states[i + 1], states[i]
```

But since the states are sampler-specific, what we really want is something like

```julia
states[i], states[i + 1] = (
    setparams!!(states[i], getparams(states[i + 1])),
    setparams!!(states[i + 1], getparams(states[i]))
)
```

Unfortunately, no such interface exists at the moment. Of course, in the case where these states are completely equivalent, e.g. when every chain uses the same sampler, swapping the states directly is fine.

An alternative approach, which is what's currently done in this project, is to swap the temperatures rather than the states, i.e. we do

```julia
β[i], β[i + 1] = β[i + 1], β[i]
```

This then means that after such a swap, chain `i` targets the tempered model that chain `i + 1` was previously targeting. But this has the issue that we're completely restricted to using the same sampler for every temperature. In theory, AFAIK, there's nothing stopping us from using, say, a MH kernel for some temperatures and a different kernel for others.

So, I think the plan should be as follows: (1) stick with the current temperature-swapping design but make it functional, and (2) later add support for the more general state-swapping approach.

Then we have support for both the single-sampler-cheap-to-parallelize impl of PT and the super-general-but-not-so-cheap-to-parallelize impl of PT. In most cases we can "automate" the selection of the internal implementation by simply choosing the temperature-swap approach when we're working with a single sampler and the other when not.

Sounds like a plan? @ParadaCarleton @yebai @mohamed82008 (the rest of you are already pinged)
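The temperature-swapping approach is cheap precisely because a swap only touches an index-to-temperature mapping, never the (possibly large) chain states; a runnable sketch with hypothetical type and function names, not this package's API:

```julia
# Hypothetical mapping from chain index to inverse temperature.
mutable struct ChainToBeta
    chain_to_beta::Vector{Float64}
end

# A swap between chains i and i + 1 exchanges only which tempered
# target each chain sees; the states and samplers stay put.
function swap_betas!(m::ChainToBeta, i::Int)
    m.chain_to_beta[i], m.chain_to_beta[i + 1] =
        m.chain_to_beta[i + 1], m.chain_to_beta[i]
    return m
end

m = ChainToBeta([0.25, 0.5, 1.0])
swap_betas!(m, 2)
m.chain_to_beta  # [0.25, 1.0, 0.5]
```

This is also why the approach parallelizes well: a swap is two scalar writes, with no state serialization or inter-process transfer of samples.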
Can't this be done by defining a new Gibbs-like sampler instead of redesigning the package? I think with the right abstraction for "group stepping" we can overload a function to do the right thing for our Gibbs-like sampler. Then another function can specialise the temperature swapping for this special sampler. I am not sure if this will turn out to be more work than what you propose, though.
Overall I like the plan, but the specifics can be discussed in a separate thread. If nothing else is holding this PR back, I suggest we merge and release so it can be "field tested".
I mean, you can easily make the same argument for the current implementation, no? Why isn't the temperature swapping just another kernel working on a product space, which could then be used in a composition kernel (Gibbs is just an instance of this)? I'm guessing the answer is convenience, in the sense that it's easier to maintain and implement, plus "we're not there yet" with more general interfaces for AbstractMCMC, right? And we're not talking a major redesign here; it's just adding additional implementations for particular instantiations.

With all that being said, and I realize I wasn't quite clear about this, I meant that step (1), i.e. sticking with the current design but making it functional, is this PR, and the rest is for a later PR 👍
All sounds good.
Yes, you can do this. In fact, what I have done in the past is use exact sampling for the β=0 chain, since this helps mixing in all the other tempering levels. I also think this swapping question is a really good one. For HPC, the big thing you need to keep in mind is communication time, so something like swapping indices is preferred since that has minimal overhead. So I think the current plan is a great one.
Everything above sounds good to me. (Assuming that kernel means proposal distribution? There's way too many things called a kernel in math and I'm pretty new to this 😅 .)
I've been watching this repo for a while and I'm keen to try it out with AdvancedHMC on a tricky model of mine.
Hi all, are there any updates on this PR?
Sorry for the delay here! As mentioned above, it's a bit stuck because it's really unclear to me whether the adaptation strategies are useful or not. I'll try to have a look at this over the weekend. Given the delay, I might just remove the adaptation completely and require temperatures to be specified by hand for now, or just warn the user. Either way, it'll probably not be a very satisfying solution. But everything but the adaptation should be working.
Hi @torfjelde and @fkrauer,

I am the first author of "Non-reversible parallel tempering", published in JRSS-B last year, where we designed a theoretically optimal and scalable algorithm for tuning parallel tempering (linked below). It turns out parallel tempering is extremely sensitive to how swaps between chains are proposed, as well as to the choice of annealing parameters. Not only is tuning PT useful, we would argue that for hard problems PT is actually useless without proper tuning.

The odd-even swapping scheme discussed earlier may seem like an innocuous algorithmic change, but it leads to dramatically different scaling behaviour and eliminates the diffusivity presumed to be endemic to parallel tempering. A consequence is that the odd-even swap non-asymptotically dominates other communication schemes and actually improves in performance with more chains. This makes it scalable to large-scale parallel implementations. We can also leverage this to cheaply and reliably learn the optimal schedule without making any structural assumptions on the target distribution or the state space. The optimal schedule of annealing parameters also varies dramatically from problem to problem.

In the JRSS-B paper we benchmarked our method against 16 diverse models, including a 190-thousand-dimensional Bayesian phylogenetic inference problem. We empirically observed significant improvements in communication between the reference and target, and 10-100x gains in ESS/second compared to state-of-the-art tuning methods before our work. The method is simple to implement and general purpose, making it compatible with PPLs such as Turing.jl.

As a concrete example, @ptiede and the Event Horizon Telescope used parallel tempering to generate the first image of the supermassive black hole in M87 in 2019. After implementing our work, they were able to more reliably replicate the 2019 photo in about 2% of the computation budget. This algorithmic efficiency from our tuning guidelines was essential to the discovery of magnetic polarization in M87 in 2021, and most recently generated the black hole photo of Sagittarius A* released in May 2022.

@ptiede and I were in talks with Turing.jl last year about implementing this work, but unfortunately we got busy finishing our theses and moving to Harvard and Oxford for our postdocs, respectively. If you are interested in working on this, @ptiede and I would love to chat with you and answer any questions. I am an expert on the algorithmic design of parallel tempering algorithms, and @ptiede has expertise in implementing them at scale with Julia, so we think we would be able to make a meaningful contribution. We are passionate about Julia and would love to use Turing.jl for our tempering needs.

Cheers,

JRSS-B paper:
My thesis, for more details:
@ptiede's thesis: See chapter 4 for the analysis comparing our work to the EHT's prior implementation
Added a `tempered_sample` helper function and no-swap burn-in functionality
Many thanks, @HarrisonWilde and @torfjelde, for the improvements. I am merging this PR because it is getting too big and taking too long. Let's add further improvements in separate PRs!
I had some issues getting MCMCTempering to work and this PR is an attempt at cleaning up the interface a bit in addition to dropping use of non-concrete types, mutating types, etc.
I have some tests locally that I just need to clean up a bit, and then I'll add those too.
Changes

- Introduced `compute_tempered_logdensities(model, sampler, transition, transition_other, β)`.
  - This can be overloaded for specific `model` and `transition`, possibly even specializing on `sampler`.
  - It replaces `make_tempered_loglikelihood` and `get_params`, making it so that we only have to implement a single function rather than multiple.
  - Often `transition` contains an already computed `logdensity`, and so this allows us to re-use this rather than re-compute, whenever it makes sense.
- Removed `make_tempered_loglikelihood` in favor of `compute_tempered_logdensities`.
- Removed `get_params` in favor of `compute_tempered_logdensities`.
- `swap_step` now also takes an `AbstractSwapStrategy` as the first argument, and thus we can specify the different strategies using multiple dispatch rather than if-statements with `Symbol`, thus easily allowing extensions + it reads better.
- Fixed a bug where the `logscale` field was used as if it was indeed `scale`.
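The `Symbol`-vs-dispatch point in the `swap_step` change can be illustrated with a small sketch (the concrete strategy types and `swap_step` bodies below are hypothetical, not the package's actual implementations):

```julia
# Each swap strategy is a type, so adding a new strategy means adding a
# new method rather than another branch in an if/elseif chain on a Symbol.
abstract type AbstractSwapStrategy end
struct StandardSwap <: AbstractSwapStrategy end
struct NonReversibleSwap <: AbstractSwapStrategy end

# Hypothetical: propose one random neighbouring pair per iteration.
swap_step(::StandardSwap, nchains, t) = [rand(1:(nchains - 1))]
# Hypothetical: deterministic odd-even sweeps.
swap_step(::NonReversibleSwap, nchains, t) =
    collect((isodd(t) ? 1 : 2):2:(nchains - 1))

swap_step(NonReversibleSwap(), 4, 1)  # [1, 3]
swap_step(NonReversibleSwap(), 4, 2)  # [2]
```

A user-defined strategy in another package can then hook in by subtyping `AbstractSwapStrategy` and adding a `swap_step` method, which a `Symbol`-based if-statement cannot offer.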