
General improvements and fixes #133

Merged
merged 52 commits into from
Dec 4, 2022

Conversation

torfjelde
Member

I had some issues getting MCMCTempering to work and this PR is an attempt at cleaning up the interface a bit in addition to dropping use of non-concrete types, mutating types, etc.

I have some tests locally that I just need to clean up a bit, and then I'll add those too.

Changes

  • Added compute_tempered_logdensities(model, sampler, transition, transition_other, β).
    • A function to be implemented for a particular model and transition, possibly even specializing on sampler.
    • This replaces make_tempered_loglikelihood and get_params, making it so that we only have to implement a single function rather than multiple.
    • Often transition contains an already computed logdensity, and so this allows us to re-use this rather than re-compute, whenever it makes sense.
  • Removed make_tempered_loglikelihood.
    • This should now be done in compute_tempered_logdensities
  • Removed get_params.
    • Extracting the parameters is no longer necessary since we now have compute_tempered_logdensities.
  • swap_step now also takes an AbstractSwapStrategy as its first argument, so the different strategies are selected via multiple dispatch rather than if-statements on Symbols; this makes extensions easy and reads better.
  • Everything is now immutable: we use immutable (and concretely typed) structs and Setfield.jl to update them.
    • Motivation: it is highly unlikely that these structs will ever cause memory issues, and immutable structs open up the possibility of stack allocation, etc.
  • Added packages to the test environment.
  • Fixed some bugs (at least I believe they were bugs) in the adaptation; in particular, some pieces of code were (seemingly) using the logscale field as if it were scale.
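As a minimal illustration of the dispatch-based design described above (all names here are hypothetical sketches, not the package's actual API):

```julia
# Hypothetical sketch of dispatching on swap strategies instead of
# `if`-statements over `Symbol`s; the package's actual types/methods may differ.
abstract type AbstractSwapStrategy end
struct StandardSwap <: AbstractSwapStrategy end
struct NonReversibleSwap <: AbstractSwapStrategy end

# One method per strategy; adding a strategy is just a new subtype + method.
swap_candidates(::StandardSwap, nchains::Int, step::Int) = [rand(1:nchains - 1)]
swap_candidates(::NonReversibleSwap, nchains::Int, step::Int) =
    isodd(step) ? collect(1:2:nchains - 1) : collect(2:2:nchains - 1)
```

Downstream code can then extend swap_candidates for its own strategy types without touching the package internals, which is the extensibility win over Symbol-based branching.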

Member

@cpfiffer cpfiffer left a comment

No comments here. Looks clear & seems to follow good practice -- wonderful work as per usual, Tor!

@yebai yebai requested a review from HarrisonWilde October 4, 2021 19:19
src/stepping.jl Outdated
Δ_history :: Array{<:Real, 2}
Δ_index_history :: Array{<:Integer, 2}
Ρ :: Vector{AdaptiveState}
@concrete struct TemperedState
Member
Not quite sure about the use of ConcreteStructs. I found the type annotations helpful for understanding the code. Also, we would like to keep the dependency absolutely minimal where possible.

Member
Feel free to copy and paste my macro from https://github.com/JuliaNonconvex/NonconvexCore.jl/blob/master/src/utilities/params.jl where you can keep the type annotations and make it concrete at the same time.

Member
Not quite sure about the use of ConcreteStructs. I found the type annotations helpful for understanding the code. Also, we would like to keep the dependency absolutely minimal where possible.

With regards to minimizing dependencies, ConcreteStructs seems to take .004 seconds to load on my computer, so I don't think it's that big a deal.

Member

@mohamed82008 Do you think that might be a good macro to add to ConcreteStructs.jl (so it can be used more easily outside of NonconvexCore)?

Member

Yes if the authors of ConcreteStructs are ok with the move.

Member Author

I found the type annotations helpful for understanding the code. Also, we would like to keep the dependency absolutely minimal where possible.

Though I get this sentiment, I think in general you want to avoid putting explicit type-constraints on parametric types, in particular here where IMO the constraints were already stronger than necessary.

Feel free to copy and paste my macro from

Oh this looks quite nice! Might be worth doing that instead:)

Yes if the authors of ConcreteStructs are ok with the move.

Highly recommend you propose it! From a quick glance it just seems superior to @concrete in every way, no?

Member

I didn't know about @concrete when I did this. I think mine is older (from years ago).

Member Author

@torfjelde torfjelde Oct 18, 2021

But IMO we shouldn't even restrict to something like an AbstractVector; e.g. for a small number of temperatures it might be better to use a Tuple, etc. Instead we should just document the fields properly.
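For instance (a hypothetical sketch, not the actual TemperedState), leaving the type parameter unconstrained keeps the field concretely typed while accepting either container:

```julia
# Hypothetical sketch: the field is documented rather than type-constrained,
# so both a `Vector` and a `Tuple` of inverse temperatures work, and the
# struct is concretely typed either way.
struct InverseTemperatures{B}
    "collection of inverse temperatures, e.g. a `Vector` or a `Tuple`"
    betas::B
end

InverseTemperatures([1.0, 0.5, 0.1])  # Vector-backed
InverseTemperatures((1.0, 0.5, 0.1))  # Tuple-backed, cheap for few temperatures
```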

src/stepping.jl Outdated
kwargs...
)
for i in 1:length(spl.Δ)
]
return (
states[sortperm(spl.Δ_init)[1]][1],
first(states[argmax(spl.Δ_init)]),
Member

should this be argmax or argmin?

Member

@ParadaCarleton ParadaCarleton Oct 11, 2021

I believe it should be argmin. @torfjelde ?

Member

@ParadaCarleton ParadaCarleton Oct 12, 2021

@torfjelde once I get a confirmation on whether it should be argmin or argmax, is it cool if I merge this and register an update to 0.2.0 (assuming tests pass)? Should simplify TuringLang/Turing.jl#1628

Member Author

argmax should result in the index corresponding to 1, which is what we want, right? We want the returned transition to correspond to the posterior, i.e. when delta = 1.

But IMO we should just make it [1] so that it's also possible to anneal in any way you want 🤷 I did this because I believe I saw this somewhere else in the code.

Should simplify TuringLang/Turing.jl#1628

Yep. There are several things that IMO should be done differently in that PR, e.g. we should be using the MiniBatchContext to temper the likelihoods as it will be faster. But I'm currently running into issues getting some tests to pass; I essentially can't get it to even fit a simple MvNormal properly. I'm uncertain if it's just me not knowing how to set parameters for tempering or if there's something wrong in the code.

Member

@ParadaCarleton ParadaCarleton Oct 12, 2021

@torfjelde if you think [1] works better I would go with that. I think it's less confusing and more general.

Member

argmax should result in the index corresponding to 1

no, it's argmin

Member Author

Done-diddit 👍

src/swapping.jl Outdated
taken traverses even chain indices, proposing swaps between neighbors.

Note that this method is _not_ reversible, and does not satisfy detailed balance.
As a result, this method is asymptotically biased.

Sorry, not sure if this is the right place to put this, but this statement is not true. It is not reversible, but it preserves the correct distribution. This non-reversibility is actually the main reason why something like this swapping strategy is better: the non-reversibility prevents the swapping from devolving into a diffusion-type process. The Syed 2019 paper goes into this in a lot of detail.

Member

@ptiede Do you have a link to that paper?


Yes sorry! I should have included it. Here is the arXiv link https://arxiv.org/abs/1905.02939

A follow-up paper https://arxiv.org/abs/2102.07720 also goes into detail for how to further optimize PT. The primary author of that paper, @s-syed, and I were actually hoping to implement something like this into this package so if this repo is picking up again, we would love to help!


Hi! I am more than happy to answer any questions. This odd-even scheme is actually extremely important and in some sense achieves the optimal performance, so it should always be used over the reversible counterpart. Furthermore, the tuning guidelines differ significantly from those for the reversible case.

In practice we find a 10-100x boost in performance compared to reversible swapping schemes and their corresponding tuning guidelines that have been implemented so far.

Member Author

Sorry, not sure if this is the right place to put this, but this statement is not true. It is not reversible, but it preserves the correct distribution.

Ah, thank you for pointing this out! This is indeed an incorrect statement.

And thank you for pointing me to those papers! Just read the first one, and it's real neat stuff. And we'd be happy to collaborate on this package. Personally I'm just starting to get into PT samplers, so it would be awesome to include someone with both practical and theoretical experience with these things.


Awesome! I'd love to help. I have some experience implementing PT samplers for use with HPC and clusters, so I would be thrilled to help in any way I can! My old code was written in C++, so moving to Julia would be great.

Member Author

Awesome!:) I'll push some changes I've accumulated locally very soon.

@torfjelde
Member Author

torfjelde commented Oct 18, 2021

Aight, so here's something we need to address before this PR gets merged.

Arguably, the most straightforward approach to handling a swap between i and i + 1 is to swap the states of the samplers, i.e.

states[i], states[i + 1] = states[i + 1], states[i]

But since states[i] contains both the sampler's state and the current state of the Markov chain/sample for the parameters, e.g. AdvancedHMC also keeps information such as the PhasePoint in the state to allow not refreshing the momentum between every iteration, we also need to (possibly) update the parameters in the corresponding states:

states[i], states[i + 1] = (
    setparams!!(states[i], getparams(states[i + 1])),
    setparams!!(states[i + 1], getparams(states[i]))
)

Unfortunately, no such setparams!! and getparams interface exists in AbstractMCMC.jl at the moment.

Of course, in the case where these states are completely equivalent, e.g. in the case of AdvancedMH.RWMH which has a simple AdvancedMH.Transition as the state and no sampler-specific state, we can actually just swap the states without any issues.
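For a simple Transition-like state, such an interface could look like the following (a hedged sketch with hypothetical method names and fields; AbstractMCMC defines none of this at the time of writing):

```julia
# Hypothetical sketch of the missing interface for a simple MH-style state
# that holds only the parameters and their log-density.
struct SimpleTransition{P,L}
    params::P
    lp::L
end

getparams(t::SimpleTransition) = t.params

# `!!`-convention: mutate if possible, otherwise return a new object.
# Note: a real implementation would also need to update/recompute the
# cached log-density `lp` for the new parameters, not just carry it over.
setparams!!(t::SimpleTransition, params) = SimpleTransition(params, t.lp)
```

For states with sampler-specific caches (like AdvancedHMC's PhasePoint), setparams!! would additionally have to refresh those caches, which is exactly why a generic interface is needed.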

An alternative approach, which is what's currently done in this project is to swap the temperatures rather than the states, i.e. we do

β[i], β[i + 1] = β[i + 1], β[i]

This then means that after such a swap, states[i] no longer corresponds to the tempered model with temperature β[i] but with β[i + 1], and vice versa. This avoids the issue of updating parameters, etc., i.e. we can also run HMC samplers from AdvancedHMC just fine. It also has the added benefit that parallelizing across temperatures is much cheaper, since we only need to transport the temperatures between chains rather than the chains' full states.
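Concretely, the bookkeeping for the temperature-swap approach can be sketched as follows (hypothetical names, not the package's actual implementation):

```julia
# Hypothetical sketch: swap inverse temperatures between neighboring chains
# rather than swapping the chains' states.
# `chain_to_beta[k]` is the inverse temperature currently assigned to chain k.
function swap_betas(chain_to_beta::Vector{<:Real}, i::Int)
    chain_to_beta = copy(chain_to_beta)  # keep the caller's vector untouched
    chain_to_beta[i], chain_to_beta[i + 1] = chain_to_beta[i + 1], chain_to_beta[i]
    return chain_to_beta
end

betas = [1.0, 0.5, 0.1]
swap_betas(betas, 1)  # swaps the temperatures of chains 1 and 2; `betas` is unchanged
```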

But this has the issue that we're completely restricted to using the same sampler for every temperature. In theory, AFAIK, there's nothing stopping us from using, say, a MH kernel for all β < 1 and then HMC for β = 1 (@ptiede @s-syed can you confirm/debunk this?). We also want this package to support other methods, e.g. annealed importance sampling (AIS), which would also require running a collection of samplers in "parallel" and moving parameters between the chains.

So. I think the plan should be as follows:

  1. Stick with the current approach of swapping temperatures rather than states because it works for a single sampler.
  2. Push for adding something like setparams!! and getparams to AbstractMCMC (there are lots of other good reasons for wanting this too, so other than the naming, I don't think there's going to be a lot of resistance here). EDIT: Ref Improving interaction between different implementations of the interface AbstractMCMC.jl#85
  3. Once (2) is done, we add an additional implementation which swaps the full states rather than just the temperatures.

Then we have support for both the single-sampler-cheap-to-parallelize impl of PT and the super-general-but-not-so-cheap-to-parallelize impl of PT. In most cases we can "automate" the selection of the internal implementation by simply choosing the temperature-swap approach when we're working with a single sampler and then the other when not.

Sounds like a plan? @ParadaCarleton @yebai @mohamed82008 (the rest of you are already pinged)

@mohamed82008
Member

But this has the issue that we're completely restricted to using the same sampler for every temperature. In theory, AFAIK, there's nothing stopping us from using, say, a MH kernel for all β < 1 and then HMC for β = 1

Can't this be done by defining a new Gibbs-like sampler instead of redesigning the package? I think with the right abstraction for "group stepping" we can overload a function to do the right thing for our Gibbs-like sampler. Then another function can specialise the temperature swapping for this special sampler. I am not sure if this will turn out to be more work than what you propose though.

@mohamed82008
Member

mohamed82008 commented Oct 18, 2021

Overall I like the plan but the specifics can be discussed in a separate thread. If nothing else is holding this PR back, I suggest we merge and release so it can be "field tested".

@torfjelde
Member Author

Can't this be done by defining a new Gibbs-like sampler instead of redesigning the package? I think with the right abstraction for "group stepping" we can overload a function to do the right thing for our Gibbs-like sampler. Then another function can specialise the temperature swapping for this special sampler. I am not sure if this will turn out to be more work than what you propose though.

I mean, you can easily make the same argument for the current implementation, no? Why isn't the temperature swapping just another kernel working on a product space, which could then be used in a composition kernel (Gibbs is just an instance of this)? I'm guessing the answer is convenience, in the sense of being easier to maintain, easier to implement, etc., plus "we're not there yet" with more general interfaces for AbstractMCMC, right?

And we're not talking a major redesign here; it's just adding additional implementations for particular instantiations of TemperedSampler.

With all that being said, and I realize I wasn't quite clear about this, I meant that step (1), i.e. sticking with the current design but making it functional, is this PR, and the rest is for a later PR 👍

@mohamed82008
Member

All sounds good.

@ptiede
Copy link

ptiede commented Oct 18, 2021

But this has the issue that we're completely restricted to using the same sampler for every temperature. In theory, AFAIK, there's nothing stopping us from using, say, a MH kernel for all β < 1 and then HMC for β = 1

Yes, you can do this. In fact, what I have done in the past is use exact sampling for the β=0 chain, since this helps mixing in all the other tempering levels. I also think this swapping question is a really good one. For HPC, the big thing you need to keep in mind is communication time, so something like swapping indices is preferred since it has minimal overhead. So I think the current plan is a great one.

@ParadaCarleton
Member

ParadaCarleton commented Oct 18, 2021

Everything above sounds good to me. (Assuming that kernel means proposal distribution? There are way too many things called a kernel in math and I'm pretty new to this 😅 .)

@sefffal

sefffal commented Apr 3, 2022

I've been watching this repo for a while and I'm keen to try it out with AdvancedHMC on a tricky model of mine.
Am I better off sticking to the main branch, or is the PR now at a stage where I could use it? If so, is there an example somewhere I could adapt?
Thanks!

@fkrauer

fkrauer commented Aug 19, 2022

Hi all, are there any updates on this PR?

@torfjelde
Member Author

Sorry for the delay here! As mentioned above, it's a bit stuck because it's really unclear to me if the adaptation strategies are useful or not.

I'll try to have a look at this over the weekend. Given the delay, I might just remove the adaptation completely and require temperatures to be specified by hand for now. Or just warn the user. Either way, it probably won't be a very satisfying solution. But everything except the adaptation should be working.

@s-syed

s-syed commented Aug 19, 2022

Hi @torfjelde and @fkrauer

I am the first author of "Non-reversible parallel tempering", published in JRSS-B last year, where we designed a theoretically optimal and scalable algorithm for tuning parallel tempering (linked below). It turns out parallel tempering is extremely sensitive to how swaps between chains are proposed, as well as to the choice of the annealing parameters. Not only is tuning PT useful, we would argue that for hard problems PT is actually useless without proper tuning. The odd-even swapping scheme discussed earlier may seem like an innocuous change algorithmically, but it leads to dramatically different scaling behaviour and eliminates the diffusivity presumed to be endemic to parallel tempering.

A consequence is that the odd-even swap non-asymptotically dominates other communication schemes and actually improves in performance with more chains. This makes it scalable to large-scale parallel implementations. We can also leverage this to cheaply and reliably learn the optimal schedule without making any structural assumptions on the target distribution or the state space. The optimal schedule of annealing parameters also varies dramatically from problem to problem.

In the JRSS-B paper we benchmarked our method against 16 diverse models, including a 190-thousand-dimensional Bayesian phylogenetic inference problem. We empirically observed significant improvements in communication between the reference and target, and 10-100x gains in ESS/second compared to state-of-the-art tuning methods preceding our work. The method is simple to implement and general purpose, making it compatible with PPLs such as Turing.jl.

As a concrete example, @ptiede and the Event Horizon Telescope used parallel tempering to generate the first image of the supermassive black hole in M87 in 2019. After implementing our work, they were able to more reliably replicate the 2019 photo in about 2% of the computation budget. This algorithmic efficiency from our tuning guidelines was essential to the discovery of magnetic polarization in M87 in 2021, and most recently to the black hole photo of Sagittarius A* released in May 2022.

@ptiede and I were in talks with Turing.jl last year about implementing this work, but unfortunately we got busy finishing our theses and moving to Harvard and Oxford for our postdocs, respectively. If you are interested in working on this, @ptiede and I would love to chat with you and answer any questions. I am an expert on the algorithmic design of parallel tempering algorithms, and @ptiede has expertise in implementing them at scale in Julia, so we think we could make a meaningful contribution. We are passionate about Julia and would love to use Turing.jl for our tempering needs.

Cheers,
Saif

JRSS-B paper:
https://arxiv.org/abs/1905.02939

My thesis, for more details:
https://open.library.ubc.ca/soa/cIRcle/collections/ubctheses/24/items/1.0413120

@ptiede's thesis: see chapter 4 for the analysis comparing our work to the EHT's prior implementation
https://uwspace.uwaterloo.ca/handle/10012/17199

@yebai yebai merged commit 1fdea5f into main Dec 4, 2022
@delete-merged-branch delete-merged-branch bot deleted the tor/improvements branch December 4, 2022 22:15
@yebai
Member

yebai commented Dec 4, 2022

Many thanks, @HarrisonWilde and @torfjelde, for the improvements. I am merging this PR because it is getting too big and taking too long. Let's add further improvements in separate PRs!
