General improvements and fixes #133
Conversation
…red_loglikelihood
No comments here. Looks clear & seems to follow good practice -- wonderful work as per usual, Tor!
src/stepping.jl (Outdated)

```julia
Δ_history :: Array{<:Real, 2}
Δ_index_history :: Array{<:Integer, 2}
Ρ :: Vector{AdaptiveState}
@concrete struct TemperedState
```
Not quite sure about the use of ConcreteStructs. I found the type annotations helpful for understanding the code. Also, we would like to keep the dependency absolutely minimal where possible.
Feel free to copy and paste my macro from https://github.com/JuliaNonconvex/NonconvexCore.jl/blob/master/src/utilities/params.jl where you can keep the type annotations and make it concrete at the same time.
Not quite sure about the use of ConcreteStructs. I found the type annotations helpful for understanding the code. Also, we would like to keep the dependency absolutely minimal where possible.
With regards to minimizing dependencies, ConcreteStructs seems to take .004 seconds to load on my computer, so I don't think it's that big a deal.
@mohamed82008 Do you think that might be a good macro to add to ConcreteStructs.jl (so it can be used more easily outside of NonconvexCore)?
Yes if the authors of ConcreteStructs are ok with the move.
> I found the type annotations helpful for understanding the code. Also, we would like to keep the dependency absolutely minimal where possible.

Though I get this sentiment, I think in general you want to avoid putting explicit type constraints on parametric types, in particular here, where IMO the constraints were already stronger than necessary.

> Feel free to copy and paste my macro from

Oh, this looks quite nice! Might be worth doing that instead :)

> Yes if the authors of ConcreteStructs are ok with the move.

Highly recommend you propose it! From a quick glance it just seems superior to `@concrete` in every way, no?
I didn't know about `@concrete` when I did this. I think mine is older (from years ago).
But IMO we shouldn't even restrict to something like an `AbstractVector`; e.g. for a small number of temperatures it might be better to use a `Tuple`, etc. Instead we should just document the fields properly.
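Leaving the field's type parameter unconstrained (and documenting it instead) keeps both options open; a minimal sketch with a hypothetical `Schedule` type, not this package's actual struct:

```julia
# Hypothetical `Schedule` type: the container holding the inverse
# temperatures is a free type parameter, so callers can pick a Tuple
# for a small fixed number of temperatures or a Vector otherwise.
struct Schedule{T}
    Δ::T  # inverse temperatures; documented rather than type-constrained
end

total(s::Schedule) = sum(s.Δ)

total(Schedule((0.25, 0.5, 1.0)))  # a Tuple works ...
total(Schedule([0.25, 0.5, 1.0]))  # ... and so does a Vector
```

An `AbstractVector{<:Real}` constraint would reject the `Tuple` case outright, even though `sum` and friends work fine on both.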
src/stepping.jl (Outdated)

```julia
    kwargs...
)
for i in 1:length(spl.Δ)
]
return (
    states[sortperm(spl.Δ_init)[1]][1],
    first(states[argmax(spl.Δ_init)]),
```
Should this be `argmax` or `argmin`?
I believe it should be `argmin`. @torfjelde?
@torfjelde once I get a confirmation on whether it should be `argmin` or `argmax`, is it cool if I merge this and register an update to 0.2.0 (assuming tests pass)? Should simplify TuringLang/Turing.jl#1628
`argmax` should result in the index corresponding to 1, which is what we want, right? We want the returned transition to correspond to the posterior, i.e. when delta = 1.

But IMO we should just make it `[1]` so that it's also possible to anneal in any way you want 🤷 I did this because I believe I saw this somewhere else in the code.

> Should simplify TuringLang/Turing.jl#1628

Yep. There are several things that IMO should be done differently in that PR, e.g. we should be using the `MiniBatchContext` to temper the likelihoods, as it will be faster. But I'm currently running into issues getting some tests to pass; I essentially can't get it to even fit a simple `MvNormal` properly. I'm uncertain whether it's just me not knowing how to set parameters for tempering or whether there's something wrong in the code.
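For context, tempering the likelihood just means scaling the log-likelihood term by β while leaving the prior untouched; a minimal sketch with hypothetical one-dimensional densities (not Turing's actual `MiniBatchContext` machinery):

```julia
# Tempered log-density: logprior(x) + β * loglikelihood(x).
# The densities below are hypothetical, purely for illustration.
logprior(x) = -x^2 / 2
loglikelihood(x) = -(x - 1)^2 / 2
tempered_logdensity(x, β) = logprior(x) + β * loglikelihood(x)

tempered_logdensity(0.0, 0.0)  # β = 0: the prior alone
tempered_logdensity(0.0, 1.0)  # β = 1: the full posterior (up to a constant)
```

`MiniBatchContext` achieves the same β-scaling of the likelihood inside DynamicPPL's model evaluation, which is why it avoids rebuilding a tempered model from scratch.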
@torfjelde if you think `[1]` works better, I would go with that. I think it's less confusing and more general.
> argmax should result in the index corresponding to 1

No, it's `argmin`.
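A quick sanity check (with a hypothetical `Δ_init`): `sortperm(x)[1]` and `argmin(x)` both return the index of the smallest element, so `argmin` reproduces the original `sortperm(spl.Δ_init)[1]` behaviour while `argmax` picks the opposite end of the ordering:

```julia
Δ_init = [0.5, 1.0, 0.25]  # hypothetical values
# The original expression and argmin agree (both are stable under ties):
sortperm(Δ_init)[1] == argmin(Δ_init)  # true; both give 3 here
# argmax instead returns the index of the largest entry:
argmax(Δ_init)  # 2 here
```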
Done-diddit 👍
src/swapping.jl (Outdated)

```
taken traverses even chain indices, proposing swaps between neighbors.

Note that this method is _not_ reversible, and does not satisfy detailed balance.
As a result, this method is asymptotically biased.
```
Sorry, not sure if this is the right place to put this, but this statement is not true. It is not reversible, but it preserves the correct distribution. This non-reversibility is actually the main reason why a swapping strategy like this is better: it prevents the swapping from devolving into a diffusion-type process. The Syed et al. 2019 paper goes into this in a lot of detail.
@ptiede Do you have a link to that paper?
Yes, sorry! I should have included it. Here is the arXiv link: https://arxiv.org/abs/1905.02939

A follow-up paper, https://arxiv.org/abs/2102.07720, also goes into detail on how to further optimize PT. The primary author of that paper, @s-syed, and I were actually hoping to implement something like this in this package, so if this repo is picking up again, we would love to help!
Hi! I am more than happy to answer any questions. This odd-even scheme is actually extremely important: in some sense it achieves optimal performance and should always be used over the reversible counterpart. Furthermore, it leads to tuning guidelines that are very different from those for reversible schemes.

In practice we find a 10-100x boost in performance compared to reversible swapping schemes and their corresponding tuning guidelines that have been implemented so far.
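For concreteness, the odd-even (DEO) communication scheme alternates deterministically between the two sets of neighbouring chain pairs, instead of proposing a single random pair per iteration; a small sketch (the helper function is hypothetical, not this package's API):

```julia
# Hypothetical helper: which neighbouring pairs get swap proposals on
# each iteration under the deterministic odd-even (DEO) scheme.
function swap_candidates(niters::Int, nchains::Int)
    pairs = Vector{Vector{Tuple{Int,Int}}}()
    for t in 1:niters
        # Odd sweep proposes (1,2), (3,4), …; even sweep (2,3), (4,5), …
        offset = isodd(t) ? 1 : 2
        push!(pairs, [(i, i + 1) for i in offset:2:(nchains - 1)])
    end
    return pairs
end

swap_candidates(2, 4)  # [[(1, 2), (3, 4)], [(2, 3)]]
```

Because the two sweeps are disjoint, every pair in a sweep can be proposed simultaneously, which is also what makes the scheme attractive for parallel implementations.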
> Sorry, not sure if this is the right place to put this, but this statement is not true. It is not reversible, but it preserves the correct distribution.

Ah, thank you for pointing this out! This is indeed an incorrect statement.

And thank you for pointing me to those papers! Just read the first one, and it's real neat stuff. And we'd be happy to collaborate on this package. Personally I'm just starting to get into PT samplers, so it would be awesome to include someone with both practical and theoretical experience with these things.
Awesome! I'd love to help. I have some experience implementing PT samplers for use with HPC and clusters, so I would be thrilled to help in any way I can! My old code was written in C++, so moving to Julia would be great.
Awesome!:) I'll push some changes I've accumulated locally very soon.
Aight, so something we need to address here before this PR gets merged.

Arguably, the most straight-forward approach to handling a swap between chains `i` and `i + 1` is to swap their states directly:

```julia
states[i], states[i + 1] = states[i + 1], states[i]
```

But since the states are sampler-specific, what we really want is something like

```julia
states[i], states[i + 1] = (
    setparams!!(states[i], getparams(states[i + 1])),
    setparams!!(states[i + 1], getparams(states[i]))
)
```

Unfortunately, no such interface exists at the moment. Of course, in the case where these states are completely equivalent, e.g. when every chain uses the same sampler, swapping the states directly is fine.

An alternative approach, which is what's currently done in this project, is to swap the temperatures rather than the states, i.e. we do

```julia
β[i], β[i + 1] = β[i + 1], β[i]
```

This then means that after such a swap, chain `i` targets the tempered model that chain `i + 1` was previously targeting. But this has the issue that we're completely restricted to using the same sampler for every temperature. In theory, AFAIK, there's nothing stopping us from using, say, a MH kernel for some temperatures and a different kernel for others.

So, I think the plan should be as follows: (1) stick with the current temperature-swapping design but make it functional, and (2) later add support for the more general state-swapping approach.

Then we have support for both the single-sampler-cheap-to-parallelize impl of PT and the super-general-but-not-so-cheap-to-parallelize impl of PT. In most cases we can "automate" the selection of the internal implementation by simply choosing the temperature-swap approach when we're working with a single sampler and the other when not.

Sounds like a plan? @ParadaCarleton @yebai @mohamed82008 (the rest of you are already pinged)
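The temperature-swapping approach is cheap precisely because a swap only touches an index-to-temperature mapping, never the (possibly large) chain states; a runnable sketch with hypothetical type and function names, not this package's API:

```julia
# Hypothetical mapping from chain index to inverse temperature.
mutable struct ChainToBeta
    chain_to_beta::Vector{Float64}
end

# A swap between chains i and i + 1 exchanges only which tempered
# target each chain sees; the states and samplers stay put.
function swap_betas!(m::ChainToBeta, i::Int)
    m.chain_to_beta[i], m.chain_to_beta[i + 1] =
        m.chain_to_beta[i + 1], m.chain_to_beta[i]
    return m
end

m = ChainToBeta([0.25, 0.5, 1.0])
swap_betas!(m, 2)
m.chain_to_beta  # [0.25, 1.0, 0.5]
```

This is also why the approach parallelizes well: a swap is two scalar writes, with no state serialization or inter-process transfer of samples.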
Can't this be done by defining a new Gibbs-like sampler instead of redesigning the package? I think with the right abstraction for "group stepping" we can overload a function to do the right thing for our Gibbs-like sampler. Then another function can specialise the temperature swapping for this special sampler. I am not sure if this will turn out to be more work than what you propose, though.
Overall I like the plan, but the specifics can be discussed in a separate thread. If nothing else is holding this PR back, I suggest we merge and release so it can be "field tested".
I mean, you can easily make the same argument for the current implementation, no? Why isn't the temperature swapping just another kernel working on a product space, which could then be used in a composition kernel (Gibbs is just an instance of this)? I'm guessing the answer is convenience, in the sense that it's easier to maintain and implement, plus "we're not there yet" with more general interfaces for AbstractMCMC, right? And we're not talking a major redesign here; it's just adding additional implementations for particular instantiations.

With all that being said, and I realize I wasn't quite clear about this, I meant that step (1), i.e. sticking with the current design but making it functional, is this PR, and the rest is for a later PR 👍
All sounds good.
Yes, you can do this. In fact, what I have done in the past is use exact sampling for the β=0 chain, since this helps mixing in all the other tempering levels. I also think this swapping question is a really good one. For HPC, the big thing you need to keep in mind is communication time, so something like swapping indices is preferred since that has minimal overhead. So I think the current plan is a great one.
Everything above sounds good to me. (Assuming that kernel means proposal distribution? There's way too many things called a kernel in math and I'm pretty new to this 😅 .)
I've been watching this repo for a while and I'm keen to try it out with AdvancedHMC on a tricky model of mine.
Hi all, are there any updates on this PR?
Sorry for the delay here! As mentioned above, it's a bit stuck because it's really unclear to me whether the adaptation strategies are useful or not. I'll try to have a look at this over the weekend. Given the delay, I might just remove the adaptation completely and require temperatures to be specified by hand for now, or just warn the user. Either way, it'll probably not be a very satisfying solution. But everything but the adaptation should be working.
Hi @torfjelde and @fkrauer,

I am the first author of "Non-reversible parallel tempering", published in JRSS-B last year, where we designed a theoretically optimal and scalable algorithm for tuning parallel tempering (linked below). It turns out parallel tempering is extremely sensitive to how swaps between chains are proposed, as well as to the choice of annealing parameters. Not only is tuning PT useful, we would argue that for hard problems PT is actually useless without proper tuning.

The odd-even swapping scheme discussed earlier may seem like an innocuous algorithmic change, but it leads to dramatically different scaling behaviour and eliminates the diffusivity presumed to be endemic to parallel tempering. A consequence is that the odd-even swap non-asymptotically dominates other communication schemes and actually improves in performance with more chains. This makes it scalable to large-scale parallel implementations. We can also leverage this to cheaply and reliably learn the optimal schedule without making any structural assumptions on the target distribution or the state space. The optimal schedule of annealing parameters also varies dramatically from problem to problem.

In the JRSS-B paper we benchmarked our method against 16 diverse models, including a 190-thousand-dimensional Bayesian phylogenetic inference problem. We empirically observed significant improvements in communication between the reference and target, and 10-100x gains in ESS/second compared to state-of-the-art tuning methods before our work. The method is simple to implement and general purpose, making it compatible with PPLs such as Turing.jl.

As a concrete example, @ptiede and the Event Horizon Telescope used parallel tempering to generate the first image of the supermassive black hole in M87 in 2019. After implementing our work, they were able to more reliably replicate the 2019 photo in about 2% of the computation budget. This algorithmic efficiency from our tuning guidelines was essential to the discovery of magnetic polarization in M87 in 2021, and most recently generated the black hole photo of Sagittarius A* released in May 2022.

@ptiede and I were in talks with Turing.jl last year about implementing this work, but unfortunately we got busy finishing our theses and moving to Harvard and Oxford for our postdocs, respectively. If you are interested in working on this, @ptiede and I would love to chat with you and answer any questions. I am an expert on the algorithmic design of parallel tempering algorithms, and @ptiede has expertise in implementing them at scale with Julia, so we think we would be able to make a meaningful contribution. We are passionate about Julia and would love to use Turing.jl for our tempering needs.

Cheers,

JRSS-B paper:
My thesis, for more details:
@ptiede's thesis: See chapter 4 for the analysis comparing our work to the EHT's prior implementation
Added a `tempered_sample` helper function and no-swap burn-in functionality
Many thanks, @HarrisonWilde and @torfjelde, for the improvements. I am merging this PR because it is getting too big and taking too long. Let's add further improvements in separate PRs!
I had some issues getting MCMCTempering to work and this PR is an attempt at cleaning up the interface a bit in addition to dropping use of non-concrete types, mutating types, etc.
I have some tests locally that I just need to clean up a bit, and then I'll add those too.
Changes

- Introduced `compute_tempered_logdensities(model, sampler, transition, transition_other, β)`.
  - This can be overloaded for specific `model` and `transition`, possibly even specializing on `sampler`.
  - It replaces `make_tempered_loglikelihood` and `get_params`, making it so that we only have to implement a single function rather than multiple.
  - Often `transition` contains an already computed `logdensity`, and so this allows us to re-use this rather than re-compute, whenever it makes sense.
- Removed `make_tempered_loglikelihood` in favor of `compute_tempered_logdensities`.
- Removed `get_params` in favor of `compute_tempered_logdensities`.
- `swap_step` now also takes an `AbstractSwapStrategy` as the first argument, and thus we can specify the different strategies using multiple dispatch rather than if-statements with `Symbol`, thus easily allowing extensions + it reads better.
- Fixed a bug where the `logscale` field was used as if it was indeed `scale`.
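The `Symbol`-vs-dispatch point in the `swap_step` change can be illustrated with a small sketch (the concrete strategy types and `swap_step` bodies below are hypothetical, not the package's actual implementations):

```julia
# Each swap strategy is a type, so adding a new strategy means adding a
# new method rather than another branch in an if/elseif chain on a Symbol.
abstract type AbstractSwapStrategy end
struct StandardSwap <: AbstractSwapStrategy end
struct NonReversibleSwap <: AbstractSwapStrategy end

# Hypothetical: propose one random neighbouring pair per iteration.
swap_step(::StandardSwap, nchains, t) = [rand(1:(nchains - 1))]
# Hypothetical: deterministic odd-even sweeps.
swap_step(::NonReversibleSwap, nchains, t) =
    collect((isodd(t) ? 1 : 2):2:(nchains - 1))

swap_step(NonReversibleSwap(), 4, 1)  # [1, 3]
swap_step(NonReversibleSwap(), 4, 2)  # [2]
```

A user-defined strategy in another package can then hook in by subtyping `AbstractSwapStrategy` and adding a `swap_step` method, which a `Symbol`-based if-statement cannot offer.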