A probability distribution on the halflife itself? #32
Tried this law to compute …
Nice work! Explicitly tracking the halflife is what Duolingo's halflife regression does. Read their paper here: https://github.com/duolingo/halflife-regression/blob/master/settles.acl16.pdf (they also have a blog post about it). It needs machine learning to update halflives, and you'll see in that repository's issues where I, along with others, have described serious concerns with their training algorithm. When I began work on Ebisu I wanted to see if we could do something like halflife regression purely Bayesianly (I talk a bit about this in my blog post about Ebisu, where I mention considering the Gamma distribution for halflife), and it turned out that it was easy enough to just put a prior on recall at some time, and then deterministically push that through the exponential to get the prior at any other time. Now I'm intrigued though. A concern here is whether different variances on the halflife prior behave sensibly. A side note: I do somewhat disparage this in the Ebisu docs, but if we store the model's halflife, then you could indeed replace the exact `predictRecall` with a cheap exponential. I'll look more later.
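(For reference, a sketch of that "prior on recall at some time, pushed through the exponential" idea, in the Beta-on-recall notation Ebisu's docs use; the GB1 distribution it produces comes up again later in this thread:)

$$
p_t \sim \mathrm{Beta}(\alpha, \beta), \qquad p_{t'} = p_t^{t'/t},
$$

so the prior on recall at any other elapsed time $t'$ follows deterministically (as a GB1-distributed quantity) from the prior at the anchor time $t$, with no extra assumptions beyond exponential forgetting.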
And here I am, discussing statistics while I could (and should) be working on my quiz app instead... but it's just way too dang interesting 😄 Thank you for (another) in-depth reply, and especially the link to your old blog post, which I hadn't seen before. So this ground has indeed been trodden before... although not with a log-normal distribution, perhaps? I played around with a gamma distribution just now too; at first I thought we couldn't position and scale it the way we needed, but we can; the usual parameters are just a bit unintuitive to work with. Also I didn't even know that a log-normal distribution was an official thing, with a Wikipedia page and all. The value of the mean, which I failed to derive, has kindly been provided already (and is simple to compute), and maybe some of the other equations are useful too. A log-normal would be nice because it doesn't need a complicated, expensive, and numerically tricky gamma (or gammaln) function.
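(For reference, the mean in question: if $\log h \sim N(\mu, \sigma^2)$, then $E[h] = e^{\mu + \sigma^2/2}$, which is strictly larger than the median $e^{\mu}$ because of the right-skewed tail.)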
I can see how this arises, but it's an inconvenient property, because it makes `predictRecall` more complicated than the bare exponential we were hoping for.
A comment: Monte Carlo can help you quickly get a sense of how different priors on different variables behave. For example, here's a snippet that simulates a log-normal prior on halflife and shows what happens for various parameters:

```python
import numpy as np
from scipy.stats import norm
def weighted(samples, likelihood=[1]):
sum = np.sum(likelihood)
mean = np.sum(likelihood * samples) / sum
var = np.sum(likelihood * (samples - mean)**2) / sum
return dict(mean=mean, std=np.sqrt(var))
def monteCarlo(priorHalflifeMean,
priorLogHalflifeStd,
tnow,
result,
size=1_000_000,
random_state=1):
# generate some Monte Carlo samples
logHalflife1 = norm.rvs(
np.log(priorHalflifeMean), priorLogHalflifeStd, size=size, random_state=random_state)
halflife1 = np.exp(logHalflife1) # also Monte Carlo
prior = 2**(-tnow / halflife1) # also Monte Carlo! Probability of recall at `tnow`
# This is the Bernoulli likelihood: a *probability*
likelihood = prior if result else (1 - prior)
print('### prior halflife mean={}, prior log-halflife std={}, tnow={}, result={}'.format(
priorHalflifeMean, priorLogHalflifeStd, tnow, result))
print('- probability of recall before quiz', np.mean(prior))
# print('- prior halflife', np.mean(halflife1))
# print('- posterior halflife moments', weighted(halflife1, likelihood))
print('- prior LOG-halflife', np.mean(logHalflife1))
print('- posterior LOG-halflife moments', weighted(logHalflife1, likelihood))
priorHalflifeMean = 1.0
priorLogHalflifeStd = 0.5
monteCarlo(priorHalflifeMean, priorLogHalflifeStd, 1, True)
monteCarlo(priorHalflifeMean, priorLogHalflifeStd, 10., True)
monteCarlo(priorHalflifeMean, priorLogHalflifeStd, 10., False)
monteCarlo(priorHalflifeMean, priorLogHalflifeStd * 3, 10., False)
```

This prints out:

```
### prior halflife mean=1.0, prior log-halflife std=0.5, tnow=1, result=True
...
### prior halflife mean=1.0, prior log-halflife std=0.5, tnow=10.0, result=True
...
### prior halflife mean=1.0, prior log-halflife std=0.5, tnow=10.0, result=False
...
### prior halflife mean=1.0, prior log-halflife std=1.5, tnow=10.0, result=False
...
```
This shows us that the scheme has nice properties: it responds as expected to the elapsed time, to the quiz result, and to the width of the prior.
Hmm, if this is the goal, I'm not sure a Bayesian approach would be fruitful. The expected recall probability is not just a function of the expected halflife. You can use the above Monte Carlo snippet to check that indeed, as you change the sigma of the log-halflife prior (keeping its mean fixed), the expected recall probability changes too. However, regarding what you mentioned in your earlier issue:
it seems like this should work without bringing in Bayesian machinery. You could store `predictRecall` evaluated at a handful of future times when you process a quiz. If the goal is just to have blazing-fast lookups, you can then interpolate between those stored values whenever you need a prediction. If you're concerned about accuracy, …
Nevertheless, I do think it's worth digging into how to do a proper Bayesian update with a prior on the halflife (or log-halflife).
Edit: Oops, I stopped too soon in the derivation of the posterior when you model log-halflife as normal; here's a corrected version. It's still not clear how to analytically find the mean and variance of the posterior on log-halflife in this case, but maybe someone can point out a solution. (The rendered derivation and its Markdown/LaTeX source are omitted here.)
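(A sketch of the shape of that posterior, written here since the rendered derivation is missing: take $\theta = \log h$ with prior $\theta \sim N(\mu, \sigma^2)$, and for simplicity use natural-log forgetting, so a success has likelihood $e^{-t e^{-\theta}}$. Then

$$
P(\theta \mid x = 1) \propto \exp\!\left(-t\, e^{-\theta}\right) \exp\!\left(-\frac{(\theta - \mu)^2}{2 \sigma^2}\right),
$$

and the moments $\int \theta^N P(\theta \mid x=1)\, d\theta$ involve an exponential of an exponential multiplied by a Gaussian, which is what blocks a closed-form mean and variance.)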
It's probably worth checking with a few other priors on the halflife or log-halflife, to see if the posterior's form is less fearsome.
Oops, my previous post was incomplete; I updated it to be more correct.
Thank you for looking into this! Pity it didn't work out as nicely as I hoped. Exponentials-in-exponentials... that looks daunting indeed, although I was never much good at integration.
I'm not at all convinced that this is impossible. In fact I feel that the overall shape of …
Good news! I was able to simplify aggressively and found that with a Gamma prior on the halflife directly, the posterior turns out to have tractable moments and can be fit back to a Gamma! It uses the modified Bessel function of the second kind, which thank goodness Scipy provides, and which will be fun to find JS/Java/Dart versions of 😇. I'm putting the finishing touches on the negative (failed-quiz) boolean update and will post an update. I haven't looked at how noisy-binary and binomial quizzes will work with it, though I'm tentatively confident they should be tractable too. There's an elegant way to reframe the posterior when you're working with halflife and log-halflife, which could help us evaluate other priors (I tried log-normal, logistic, chi-square, and Gamma distributions).
That's good news indeed!
Does that mean that the result is actually only an approximation of the posterior, and not the real thing? In fact I've been wondering the same about the original Ebisu algorithm. Not that it really matters, as the whole Ebbinghaus curve is already a very simplified approximation anyway 😄
There are beta, gamma, and Bessel-function-of-the-first-kind implementations in Apache Commons, if that helps. In fact their gammaln was initially the basis for my port... but it was pretty complicated and didn't pass one of the unit tests. Whether that was an issue with that implementation, with yours, or with my port, I daren't say; I ended up just porting your much simpler version. So what are your thoughts on this model compared to the original beta distribution?
Correct: in general none of these posteriors are conjugate, meaning the posterior is a different distribution than the prior after the prior goes through the binary update. For both Beta-on-recall and Gamma-on-halflife, we have an analytical expression for the posterior, but in order to return a new model that's still Beta or Gamma, we have to moment-match, which is the one and only approximation. (The one exception: with Beta-on-recall, the time-traveled prior (a GB1 distribution) is conjugate with the binary update if the quiz was successful, but not otherwise. So not really super-useful.)
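(A minimal sketch of the moment-matching step being described, assuming you already have the posterior's first two raw moments `m1` and `m2`; the Gamma case matches the $(\alpha', \beta') = (m_1^2/\sigma^2,\ m_1/\sigma^2)$ formula given later in this thread:)

```python
def gammaFromMoments(m1: float, m2: float) -> tuple[float, float]:
    """Moment-match a Gamma(alpha, beta) (rate parameterization) to a
    distribution whose first two raw moments are m1 and m2."""
    var = m2 - m1**2  # posterior variance
    return (m1**2 / var, m1 / var)  # (alpha, beta): mean alpha/beta, variance alpha/beta**2
```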
Edit: corrected the code below after initially posting. I think we may have to give up on the Gamma prior on halflife. While my attention was distracted making sure that updates after binary quizzes are doable, I failed to notice that the expected recall probability under this model, $E[\exp(-t/h)]$, is not the simple exponential $\exp(-t / E[h])$ that we were hoping `predictRecall` could be. To see this, consider:

```python
import numpy as np
from scipy.stats import gamma as gammarv
from scipy.special import gamma
from scipy.special import kv
a = 2.
b = 1.5
t = (a / b) # mean halflife!
halflifes = gammarv.rvs(a, scale=1 / b, size=1_000_000)
prob = np.exp(-t / halflifes)
predictRecall = lambda a, b, t: 2 * (t * b)**(a / 2) * kv(a, 2 * np.sqrt(t * b)) / gamma(a)
predicted = predictRecall(a, b, t)
print('predictRecall @ time={}: analytical={}, Monte Carlo={}'.format(t, predicted, np.mean(prob)))
# predictRecall @ time=1.3333333333333333: analytical=0.3092345700088991, Monte Carlo=0.30936420184691676
```

(I use `np.exp` here instead of `2**`; that only rescales `t`.) Anyway, the point is, it's not analytically straightforward to calculate what the predicted recall is for a given elapsed time: you need the Bessel-function expression, not a bare exponential. Actually, it's helpful to see this visually:

```python
ts = np.linspace(0, 10)
import matplotlib.pylab as plt
plt.ion()
plt.figure()
plt.plot(ts, np.exp(-ts / (a / b)), label='approx', linewidth=4)
plt.plot(ts, np.vectorize(lambda t: predictRecall(a, b, t))(ts), label='pred1', linewidth=3)
plt.plot(ts, np.vectorize(lambda t: predictRecall(a * 3, b * 3, t))(ts), label='pred2', linewidth=2)
plt.plot(ts, np.vectorize(lambda t: predictRecall(a / 3, b / 3, t))(ts), label='pred3', linewidth=1)
plt.legend()
plt.xlabel('time')
plt.ylabel('recall probability')
plt.savefig('gamma.png', dpi=300)
```

produces the following (figure omitted: recall probability versus time, comparing the `exp(-t / (a / b))` approximation to the exact Bessel-function `predictRecall` for three Gamma priors sharing the same mean).

So since the Gamma-on-halflife doesn't really buy us anything over the Beta-on-probability (other than possibly simplifying the update step by getting rid of rebalancing), and might make it more complicated to do noisy-binary and binomial quizzes, I'm going to try and focus on quantifying the risk of just replacing the exact `predictRecall` with the cheap `2**(-tnow / halflife)`. If we can show that that approximation is reasonable, then quiz apps can benefit from having a super-fast alternative to the exact `predictRecall`. What do you think?
Good call. Maybe we should call it "elflife"? It's fitting, because elves do have a longer lifespan than halflings ;)

Thinking about Jensen's inequality: this seems to dash my hopes of finding any distribution for which the cheap formula is exact, because this is essentially asking (setting $f(h) = e^{-t/h}$) for

$$
E[f(h)] = f(E[h]),
$$

but Jensen gives, for nonlinear $f$,

$$
E[f(h)] \neq f(E[h])
$$

in general, meaning it can never be equal. That seems to be pretty much what you said above as well. I also thought, at first glance, that your plot seems to contradict that conclusion. But maybe we can quantify the effect that changes to $a$ and $b$ (keeping the mean $a/b$ fixed) have on the error:

```python
def approxRecall(a, b, t):
h = a / b
return np.exp(-t / h)
ts = np.linspace(0, 5, 100)
plt.ion()
plt.figure(figsize=(20, 10))
for i in (0.1, 0.3, 1, 3, 10, 30):
#plt.plot(ts, np.vectorize(lambda t: predictRecall(a * i, b * i, t))(ts), label='pred%.1f' % i, linewidth=4)
#plt.plot(ts, np.vectorize(lambda t: approxRecall(a * i, b * i, t))(ts), label='approx%.1f' % i, linewidth=2)
plt.plot(ts, np.vectorize(lambda t: approxRecall(a * i, b * i, t) / predictRecall(a * i, b * i, t))(ts), label='error%.1f' % i, linewidth=2)
plt.legend()
plt.xlabel('time')
plt.ylabel('error in approx recall probability')
plt.savefig('error.png', dpi=96)
```

(figure omitted: the ratio of the approximate to the exact recall probability over time, for several Gamma priors sharing the same mean)

Those don't look like any basic function that I know of. I thought that they would at least agree at $t = 0$. Edit: inverted the graph because it makes more sense this way.
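(A side calculation on the behavior near $t = 0$, assuming the Gamma prior has rate $\beta$ and $\alpha > 1$ so that $E[1/h]$ is finite: to first order in small $t$,

$$
E[e^{-t/h}] \approx 1 - t\, E[1/h] = 1 - \frac{t \beta}{\alpha - 1}, \qquad e^{-t/E[h]} \approx 1 - \frac{t \beta}{\alpha},
$$

so both curves start at 1 but with different initial slopes, which is why the plotted ratio departs from 1 as soon as $t > 0$.)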
I'm not following. As you say, the biggest blue "approx" curve is the cheap $e^{-t/E[h]}$, and the thinner curves are the exact expected recall; they clearly differ, so I don't see the contradiction. We can show the same thing with the Monte Carlo samples. After running my first code snippet, I can do the following in IPython:

```
In [7]: jensen = lambda t: [np.mean(np.exp(-t / halflifes)), np.exp(-t / np.mean(halflifes))]
In [8]: jensen(0.5)
Out[8]: [0.5855561840096577, 0.6874954822547535]
In [9]: jensen(10)
Out[9]: [0.007355087627151486, 0.0005564126211527937]
```

which show that the two quantities differ, and not always in the same direction: at t=0.5 the true expected recall (≈0.59) is below the `exp(-t / mean)` approximation (≈0.69), while at t=10 it is well above it.
We know analytically what those are though, because we know analytically what $E[\exp(-t/h)]$ is under a Gamma prior (the Bessel-function expression above). I'll write up the math for the Gamma-on-halflife model below. But as I mentioned earlier, since we don't have a cheap closed form for the predicted recall, this prior doesn't obviously buy us much over Beta-on-recall.
Errata: a few corrections have to be made in the writeup below (I don't want to edit the LaTeX and upload a new image, forgive me).
Here's a quick writeup of the math for Gamma-on-time-constant. The rendered PDF is omitted; here is the Markdown+LaTeX source:

---
geometry: "left=3cm,right=3cm,top=2cm,bottom=2cm"
---
Suppose we have a flashcard with time constant (in arbitrary units) $h ∼ Γ(α, β)$, that is, drawn from a Gamma distribution with known parameters $α$ and $β$ (we use "h" here to allude to "halflife", though that strictly requires base 2, not $e$ as we have below); and $t$ the time since last review, in units consistent with $h$. Then the expected recall probability of this flashcard at time $t$ is
$$
E[\exp(-t/h)] = \frac{β^α}{Γ(α)} \int_0^∞ \exp(-t/h) ⋅ h^{α - 1} \exp(-β h) \, dh = 2 (t β)^{α/2} K_α(2 \sqrt{t β}) / Γ(α),
$$
where we use $E[f(h)] = \int_0^∞ f(h) P(h) \, dh$ (since $h$ positive), and where, apologies, $Γ$ is reused as the gamma function (the generalization of the factorial to the reals), and where $K_α$ denotes the modified Bessel function of the second kind of order $α$.
I found this via Wolfram Alpha: `Integrate[exp(-3 / h) * h^(a - 1) * exp(-b * h) * b^a / Gamma[a], h, 0, inf]` (it doesn't work if you leave the time symbolic as `t`, but does work when you substitute a number, e.g., `t=3`).
Next, after a quiz $x ∼ Bernoulli(\exp(-t/h))$, i.e., a coin toss, we can ask for the posterior on the halflife $h$ in light of the quiz $x$:
$$
P(h | x) = \frac{P(x | h) P(h)}{\int_0^∞ P(x | h) P(h) \, dh}
$$
where the denominator is simply a normalizing constant equal to the integral of the numerator (so the resulting posterior has integral 1); the numerator is the likelihood $P(x | h)$ and the prior $P(h)$. We can break this up into the two cases of $x$:
$$
P(h|x=1) = \frac{ \exp(-t/h) ⋅ h^{α - 1} \exp(-β h) }{\int_0^∞ \exp(-t/h) ⋅ h^{α - 1} \exp(-β h) \, dh}.
$$
The $N$th moment of this distribution, i.e., $E[(h|x=1)^N] = \int_0^∞ h^N P(h|x=1) \, dh$, turns out to be, via straightforward application of the expectation above, equal to
$$
m_{N, x=1} = \frac{
  t^{(α + N)/2} β^{-(α+N)/2} K_{α + N}(2 \sqrt{t β})
}{
  t^{α/2} β^{-α/2} K_{α}(2 \sqrt{t β})
}
$$
The first moment $m_1$ is the posterior *mean*, and the posterior variance is simply $\sigma^2 = m_2 - m_1^2$. If you moment-match these to a new Gamma distribution, your new $(α', β') = (m_1^2 / σ^2, m_1 / σ^2)$. For lightness, I've omitted the $x=1$ subscript here.
When $x=0$, i.e., a quiz failure, the likelihood is $P(x=0|h) = (1-\exp(-t/h))$. Repeating the exercise above, if we haven't flubbed the bookkeeping, the moments of this posterior are
$$
m_{N,x=0} = \frac{
  m'_N Γ(α) β^{-α} - 2\, t^{(α + N)/2} β^{-(α+N)/2} K_{α + N}(2 \sqrt{t β})
}{
  Γ(α) β^{-α} - 2\, t^{α/2} β^{-α/2} K_{α}(2 \sqrt{t β})
}
$$
where $m'_N$ are the raw moments of the original Gamma distribution: via Wikipedia etc., $m'_1 = α / β$ and $m'_2 = α / β^2 + {m'_1}^2$. In this numerator and denominator you can see the numerator and denominator of the previous $m_{N,x=1}$ case, and we may be able to simplify further. As before, one can moment-match a new Gamma to these.

I implemented this and verified it via numerical integration in https://github.com/fasiha/ebisu/blob/889588a5774a24a0820b46517c22ea2fbfa1ca07/gam.py.
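(A minimal sketch of these update formulas in Python, written here as an illustration: it follows the moment expressions above, with the factor of 2 carried by the Bessel term, and is not the linked `gam.py` implementation, nor is it numerically hardened for large $α$:)

```python
import numpy as np
from scipy.special import kv, gamma as gammafn

def besselTerm(alpha: float, beta: float, t: float, n: int) -> float:
    """Integral of h**(alpha + n - 1) * exp(-t/h - beta*h) over h in (0, inf),
    i.e., 2 * (t/beta)**((alpha + n)/2) * K_{alpha+n}(2*sqrt(t*beta))."""
    nu = alpha + n
    return 2 * (t / beta)**(nu / 2) * kv(nu, 2 * np.sqrt(t * beta))

def updateGammaHalflife(alpha: float, beta: float, t: float, success: bool) -> tuple[float, float]:
    """Moment-match the posterior on halflife back to a Gamma after a binary quiz at elapsed time t."""
    if success:
        m1 = besselTerm(alpha, beta, t, 1) / besselTerm(alpha, beta, t, 0)
        m2 = besselTerm(alpha, beta, t, 2) / besselTerm(alpha, beta, t, 0)
    else:
        priorMoment = lambda n: gammafn(alpha + n) / gammafn(alpha) / beta**n  # raw Gamma moments
        denom = gammafn(alpha) * beta**-alpha - besselTerm(alpha, beta, t, 0)
        m1 = (priorMoment(1) * gammafn(alpha) * beta**-alpha - besselTerm(alpha, beta, t, 1)) / denom
        m2 = (priorMoment(2) * gammafn(alpha) * beta**-alpha - besselTerm(alpha, beta, t, 2)) / denom
    var = m2 - m1**2
    return (m1**2 / var, m1 / var)  # new (alpha, beta)
```

For example, `updateGammaHalflife(2.0, 1.5, 1.0, True)` gives the moment-matched Gamma parameters after a success one time unit after review.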
Nice work! But I agree that it seems to have no advantage over the beta-on-recall model with rebalancing. Still, it was interesting and educational. Thank you for investigating!
Hi, thank you very much for the amazing library. @fasiha, could you please explain a little bit more about the technique of calculating 10 `predictRecall` values at different time intervals? Are they fixed timestamps for all terms (like 1/1/21, 2/1/21, etc.) or something like review_time + delta (t = 1 day, 1 week, etc.)? And then, when we need to select the terms to study, how should we go about it? From my understanding of your reply: for each term, after a review we update the term's record with `predictRecall` at 10 different time intervals (review_time + delta(t)); then at time T0, when we want to find which terms to review, we would sort terms by (predict_recall_timestamp, probability), take say the lower 10%, and pick a random term to review? Thanks!
@jwtnb thanks for the questions! First thing I should say is, as I mention in #53, I'm working on a rewrite of Ebisu and it's going to embrace @ttencate's idea in this issue: the Bayesian prior is on halflife, and even though it's an approximation, `predictRecall` will essentially be that cheap exponential.

The really simple scheme I proposed works just like you describe. The idea is, at quiz time you run `predictRecall` for a list of future times, something like `4**numpy.arange(-1, 3.6, 0.5)` or some similar exponentially-increasing list of times, and store the results in a database. Then at some point in the future, you look for the two (maybe one) time slots that bracket the current time, and do a linear interpolation. … That is probably very confusing, sorry; hopefully this code snippet is more clear 😅:

```python
import ebisu
# Courtesy of https://stackoverflow.com/a/62314804
def lerp(x1: float, x2: float, y1: float, y2: float, x: float):
"""Perform linear interpolation for x between (x1,y1) and (x2,y2) """
return ((y2 - y1) * x + x2 * y1 - x1 * y2) / (x2 - x1)
# at quiz time, put these in the database
halflife = 1.0
model = (3., 3., halflife)
# equivalent to 4**numpy.arange(-1, 3.6, 0.5)
futures = [0] + [4**(-1 + i * 0.5) for i in range(10)]
pRecalls = [ebisu.predictRecall(model, future) for future in futures]
# now let's figure out the lerp step using these lists
def predictRecallCached(model, tnow: float, futures: list[float], pRecalls: list[float]) -> float:
"""Try to linearly interpolate `predictRecall`"""
rightidx = next(filter(lambda idx: futures[idx] > tnow, range(len(futures))), None)
if rightidx is not None:
# tnow <= futures[-1], we can do linear interpolation!
leftidx = rightidx - 1
# We know leftidx >= 0 because `futures[0]==0`!
return lerp(futures[leftidx], futures[rightidx], pRecalls[leftidx], pRecalls[rightidx], tnow)
else:
# tnow > futures[-1]
return ebisu.predictRecall(model, tnow)
# some time later
tnow = 50 # time since last quiz
pRecall = predictRecallCached(model, tnow, futures, pRecalls)
print(f'pRecall = {pRecall:0.2g}')
```

Hopefully that makes sense! You can tweak how many samples of `predictRecall` you precompute and how the future times are spaced. (Ebisu v3 that I mentioned above is going to remove all this headache, because everything will be in terms of halflife.)

Regarding the other tweak I mentioned, you asked:
this is an app-level decision beyond Ebisu, but yes, I think it's better to (slightly) randomize quizzes with low probability of recall. So you can, for example, sort terms by predicted recall, take the lowest slice (say the bottom 10%), and pick one of those at random.
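(A tiny sketch of that selection step, with a hypothetical list of term records each carrying a `model` and an `elapsed` time since last review; the helper name and fields are made up for illustration:)

```python
import random
import ebisu

def pickTermToReview(terms, lowFraction=0.1):
    """Sort terms by predicted recall and pick randomly among the hardest slice.
    Each term is assumed to have `.model` (an Ebisu model) and `.elapsed` (time since last review)."""
    scored = sorted(terms, key=lambda term: ebisu.predictRecall(term.model, term.elapsed))
    lowest = scored[:max(1, int(len(scored) * lowFraction))]
    return random.choice(lowest)
```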
But again, these issues have more to do with your quiz app; Ebisu doesn't care 😄.
First off, thank you for reading my ramblings thus far! I'm learning something new every day.
This isn't really an issue with Ebisu, more of a discussion topic, but I'm opening it as an issue so it can be preserved for posterity and easily linked to if this topic comes up again.
My thought was: instead of assuming a beta distribution at a particular time t, could we assume a distribution on the halflife itself? I haven't yet tried to do the math, because maybe you've tried this already and it leads nowhere, so it seemed best to discuss it first.
The advantage would be that `predictRecall` could simply take the expected value of the halflife distribution and boil down to a very cheap `return 0.5**(tnow / h)`, which can even be done in SQL if needed.

What would such a distribution look like? We need something whose "mass peak" can be shifted around arbitrarily to represent the expected halflife, and whose "width" can be tuned to represent the certainty. We can do both of these things with a normal (Gaussian) distribution, as in `h ~ N(mu, sigma)`, but the problem is that it has a nonzero tail to the left of the y axis. Really our distribution should only exist for `t >= 0` and be zero at `t = 0`.

So how about `log(h) ~ N(log(mu), sigma)`? Technically it's undefined for `h = 0`, but it converges to 0 when approached from `h > 0`, so I can live with that. It seems to have the desired properties.

This is how far I got. Here are the things I'm concerned about, which an experienced statistician could probably answer easily:

- Should we be working with `h` instead of `log(h)`?
- The expected value of `log(h)` is obviously `log(mu)`, but the expected value of `h` is probably not `mu`, as the tail is skewed to the right. So maybe `predictRecall` still needs to do some more work to compute the actual `0.5**(tnow / h)` from the parameters `mu, sigma`. Or maybe we can re-parameterize the distribution so that one of the parameters is equal to the expected value.
- How to do `updateRecall` given some observation (quiz result) using Bayes' rule? For all I know, the reason people use all these well-known distributions is that they're the only ones that can be dealt with analytically.