
non-boolean quiz results #23

Closed
garfieldnate opened this issue Mar 7, 2020 · 24 comments
@garfieldnate

Many quiz systems do not assign a simple pass-fail to study events. Some systems, like Anki, simply ask the user how well they think they know the answer. Others, like Duolingo, assign a score based on performance in a study session with several exercises. It would be great if Ebisu could be extended to handle this case, so `updateRecall(prior: tuple, result: bool, tnow: float)` would be changed to `updateRecall(prior: tuple, result: float, tnow: float)`.

This would also enable comparison with the other systems evaluated in the half-life regression paper from Duolingo, meaning this ticket may be a prerequisite for #22.

@fasiha (Owner) commented Mar 7, 2020

Thanks for commenting! I talked about this in another issue: #19 (comment)

In a nutshell, yes, I think Ebisu can handle the binomial quiz case, like Duolingo, instead of a Bernoulli quiz case like it currently does. I'm just not sure when I can promise to make time to look into this, but it's definitely on my radar.

@fasiha (Owner) commented Mar 9, 2020

Closed by version 2.0.0: https://github.com/fasiha/ebisu/blob/gh-pages/CHANGELOG.md

@fasiha fasiha closed this as completed Mar 9, 2020
@garfieldnate (Author)

Wow. That was fast! 😄

So the binomial case. This models the quiz as a bunch of boolean results, right? Is there any model mismatch between that and the Anki case? Anki doesn't give the result of a series of boolean quizzes. It gives the user's confidence score about how well they think they know the answer. I'm working on a quiz app that uses this mechanism.

@fasiha (Owner) commented Mar 9, 2020

Your understanding is accurate: there’s not an exact match between binomial quizzing and Anki’s confidence ratings (in contrast, the binomial quiz corresponds well to Duolingo’s quizzes 😝).

Nonetheless I am tentatively confident that you can use the new API to fake Anki-style confidence; we just don't yet have guidelines on how best to do so (whereas we've built up a lot of confidence in Ebisu's handling of binary quizzes).

For example you could do this (sketched in code after the list):

  • fail: success=0, total=1
  • normal ease: success=1, total=1
  • hard ease: success=1, total=2
  • easy: success=2, total=2
  • (not in Anki but available: total hard fail: success=0, total=2)
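
A minimal sketch of that mapping (assuming the version-2 signature `updateRecall(prior, successes, total, tnow)`; the helper name is hypothetical):

import ebisu

def updateFromEase(model, ease, tnow):
  # Map the ease levels listed above to binomial (successes, total) pairs.
  successes, total = {
      "fail": (0, 1),
      "normal": (1, 1),
      "hard": (1, 2),
      "easy": (2, 2),
  }[ease]
  return ebisu.updateRecall(model, successes, total, tnow)

# e.g., a "hard" pass 2.5 halflives after the last review:
# newModel = updateFromEase(model, "hard", tnow=2.5)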

But do note that there are risks to using a large number of successes:

  1. the unit tests cover total=5 and below for a range of other parameters, but you can still encounter numerical instability and assertion failures in extreme edge cases.
  2. Moreover, the math is very aggressive when successes is either close to 0 or close to total. I will update this thread with examples, but qualitatively: if a successful quiz increased the halflife by 50% for the total=1 case, it might increase it by 2x for total=2, and 4x for `total=3`. There's some kind of exponential factor that I noticed during testing, but this is what I mean when I say we don't have guidelines on how to hack this use case into the framework currently, and your experimentation will be a very useful part of that 😁.

One thing I’ll add: so far I’ve built quiz apps with automatically-graded binary quizzes and really liked that. I personally will continue designing quiz apps like that, only allowing the use of the true binomial case (with total>1) when the student feels very strongly about going back and changing the rating of their quiz, like “I really really got that flashcard, go back and mark it as super-easy” or “crap that was really hard to remember and yes, I know that exercising near-forgotten memories strengthens them, but please mark that as a near-miss”.

Does all this make sense? I’m happy to elaborate. I’ll update this with some numbers in roughly twelve hours.

@fasiha (Owner) commented Mar 9, 2020

the binomial case. This models the quiz as a bunch of boolean results, right?

A hyperfine quibble here: that is only somewhat the case. A binary (Bernoulli) experiment is a coin flip. A binomial quiz is N simultaneous coin flips (or actually, independent coin flips; they don’t have to be simultaneous). As long as you keep this in mind when calling them “a bunch of booleans” you are good to go.

I mention this only because, if you actually have a sequence of boolean quizzes (where you show the user the correct answer after each quiz) then that’s not a binomial quiz, that’s an entirely normal sequence of binary quizzes 😆. Let me know if I should explain more.

@garfieldnate (Author) commented Mar 11, 2020

One thing I’ll add: so far I’ve built quiz apps with automatically-graded binary quizzes and really liked that.

I personally have enjoyed the slight interactivity that comes with grading yourself on the strength of the memory (like Anki, or my favorite, CleverDeck), but admittedly I don't know if it's actually beneficial to my learning. As a user it feels a little bit motivating to be able to declare to the app, "I almost got that perfectly!", and the added interactivity may contribute to using the app more often, which would mean more learning in the long run. But I don't know if this is actually useful information for scheduling reviews, especially since the currently-available apps leave it up to the user to decide how to grade themselves. Doing this with a proper Bayesian statistical model would require the app to learn this on a per-user basis.

when the student feels very strongly about going back and changing the rating of their quiz

So this is like if the user accidentally clicks the wrong button or something, right? Is this feature really the proper use for that? It seems like the app author should provide some kind of rewind/undo functionality for this that in the end just recalculates using total=1. Wouldn't total=2, successes=1 lead to different model parameters?

if you actually have a sequence of boolean quizzes (where you show the user the correct answer after each quiz) then that’s not a binomial quiz, that’s an entirely normal sequence of binary quizzes

The second one is actually the Duolingo case, then; although the data they provide has floating point numbers, in actuality their app presents several quizzes in a row, and what they report in the dataset is the final score over these individual quizzes. Now that I write that (out loud? XD), that actually seems like an important problem with their dataset.

But if that's the Duolingo case, what is the actual case that's being modeled by the new feature? Or is the new feature even designed for the modeling of a specific case? The user must be quizzed several times with no dependencies between the answers on each quiz. I know it's common to assume independence for modeling purposes, but this really seems unlikely in a quiz setting like this; apps give feedback to correct a user between quizzes, or the user consistently gets a fact right or wrong.

@fasiha (Owner) commented Mar 15, 2020

I personally have enjoyed the slight interactivity that comes with grading yourself on the strength of the memory (like Anki, or my favorite, CleverDeck)

Excellent, this is good to know—prevents author tunnel vision 😇! For me, the extra cognitive burden of interrupting my reviews to make the meta-decision is draining, so thank you for the counterexample.

I don't know if this is actually useful information for scheduling reviews,

Intuitively, if something was easy, then its updated halflife should be scaled more aggressively than if it was very difficult to remember, right?

But I wonder how accurate that intuition is, and if the reality is more complicated than that. Occasionally I have the experience where I see a review and think to myself, "Wuh, when did the app teach me this?" I know the answer but can't recall learning it.

Other times I contrast the experience of (1) having to review something for which I didn't create a strong enough mnemonic, so I'm floundering for the correct answer, versus (2) a similar situation—relatively young flashcard with a hastily-made mnemonic—but where I remember the answer easily. Does this mean case 1 was harder than case 2? Or just that brain performance, like physical performance, is related to various factors like time of day, food, sleep, stress, etc.?

Thinking about these things tempers my rush to convert all my apps from binary quizzes to faking ease with binomial quizzes, except when I feel strongly about it, which brings us to…

So this is like if the user accidentally clicks the wrong button or something, right?

No, more like, "I'm annoyed at how often I've been asked this, seriously, increase its halflife aggressively" (set success=total=2) or "O m g, I really struggled to remember this, please ask me to review this sooner than otherwise" (set success=1, total=2). In case of typo, yes, the app should provide a simpler mechanism to correct.

The second one is actually the Duolingo case, then; although the data they provide has floating point numbers, in actuality their app presents several quizzes in a row, and what they report in the dataset is the final score over these individual quizzes. Now that I write that (out loud? XD), that actually seems like an important problem with their dataset.

I forget if the Duolingo app gives you feedback for each sentence? I hate it so much I don't want to reinstall it to find out, forgive me. When I first wrote this, I thought it maybe didn't give detailed feedback on what you missed until the end, but I am likely misremembering.

But if that's the Duolingo case, what is the actual case that's being modeled by the new feature? Or is the new feature even designed for the modeling of a specific case?

The quiz style that would most closely match the binomial quizzes Ebisu now handles would be one where you're asked to recall the same flashcard multiple times in a single review session, in a context where you don't particularly focus on that flashcard itself but are instead occupied with the broader context of the review task at hand.

For example: if Duolingo asked you to translate a few sentences into French, and then at the end of the review session, it reviewed what you'd missed, showing you, e.g., that you'd misconjugated écrire once but got it right once, then that'd be a great match for binomial quizzes.

As you say, if you give feedback on a per-flashcard level after each quiz, then you probably have a series of binary quizzes. Such an app would also be very boring, since you're over-reviewing things too frequently (assuming you got them right, Ebisu would barely change the model after the first quiz because so little time elapsed since the previous quiz).

So supporting that was one goal for allowing binomial quizzes. The other was to see if it turned out to be a good way to hack user-reported ease.

@ttencate (Contributor) commented Jun 10, 2020

Hi, I'm also interested in this topic. If it all works out, I'll send my Dart port of Ebisu your way in due time :) [Update 2020-09-05: see #36.]

I forget if the Duolingo app gives you feedback for each sentence? I hate it so much I don't want to reinstall it to find out, forgive me. When I first wrote this, I thought it maybe didn't give detailed feedback on what you missed until the end, but I am likely misremembering.

Duolingo gives you immediate feedback on your answers; it doesn't wait until the end of the session. So @garfieldnate is right: it's just like a regular repeat, albeit with a very short interval.

Is that also a good way to treat it in Ebisu though? The second repeat happens only minutes after the first, so the recall probability will be close to 1. If the right answer is given the second time, the algorithm won't infer much from this; if the wrong answer is given, half-life is severely penalized. OK, I guess this is what we want actually :)
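
One way to see this numerically with Ebisu's version-2 API (a sketch; signatures assumed from the docs: `defaultModel`, `predictRecall`, `updateRecall`, `modelToPercentileDecay`):

import ebisu

model = ebisu.defaultModel(24.)  # initial halflife of 24 hours
tnow = 0.1                       # a repeat six minutes after the first quiz
ebisu.predictRecall(model, tnow, exact=True)  # very close to 1.0
passHl = ebisu.modelToPercentileDecay(ebisu.updateRecall(model, 1, 1, tnow))
failHl = ebisu.modelToPercentileDecay(ebisu.updateRecall(model, 0, 1, tnow))
# passHl should barely budge from 24 hours; failHl should drop well below it.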

@fasiha (Owner) commented Jun 10, 2020

I'll send my Dart port of Ebisu your way in due time :)

Wow, most kind, thank you @ttencate!

Is that also a good way to treat it in Ebisu though? The second repeat happens only minutes after the first, so the recall probability will be close to 1. If the right answer is given the second time, the algorithm won't infer much from this; if the wrong answer is given, half-life is severely penalized. OK, I guess this is what we want actually :)

I agree with your analysis. As the quiz app author, you'll have to decide how to structure your quizzes, if you want to have closely-spaced repeats, and how to handle failures soon after successes. You should feel free to cheat Ebisu for the greater good.

I might add here that I'm working on a solution to the numerical instability problem for low-successes, high-total cases after very little time has elapsed. It's a pathological edge case that shouldn't happen often in real life, but I'm hoping to find a proper workaround; maybe it will help experimentation with binomial quizzes.

@ttencate (Contributor)

Cool, thanks for confirming.

Back on the topic of non-binary results: how hard would it be to implement this? Would the model need to be changed drastically, or could we come up with a way to interpret a fractional result as "partially forgotten"?

Faking it with the successes and total arguments might achieve the desired effect to some extent, but is it theoretically correct to just substitute x% by a successes/total ratio that approximates x%? If that is indeed correct, then a real-valued correctness score would be strictly more general, and thus better from an API point of view.

@fasiha (Owner) commented Jun 10, 2020

Hmm, good questions. A couple of things.

The binary-to-binomial extension was straightforward mathematically (even if a chore to get the derivation and implementation right!), since a binary quiz is a binomial quiz, with integer successes and total.

But I will certainly think about how to do updates with a "soft-binary" result, i.e., a percent. Likely we can keep the same form of the model. Maybe we can make quizzes be noisy-binary… will think!

Edit—I think this is very relevant: https://stats.stackexchange.com/questions/419197/bayesian-inference-for-beta-distribution-after-an-uncertain-outcome though there's no guarantee we can incorporate it into our GB1 prior (Beta prior after exponential forgetting) cleanly.

@fasiha (Owner) commented Jun 12, 2020

I found a tidy way to allow quiz results to be a float between 0 and 1 (inclusive). Code at branch abecdbb#diff-35ff1833326ddbe951882636c4cbc678R131

It uses the noisy binary model linked earlier and I describe it in detail below to invite comments on the API to design around it, because the mathematical model is more flexible than I want to expose programmatically.

Mathematically, a noisy-binary quiz consists of a normal binary quiz (drawn from the prior probability distribution on recall probability; in the README I call this x) that goes through a scrambling process, giving you the observed noisy-binary quiz result, which we can call z. You don't get to see the original x, only z. Two parameters govern this scrambling process:

  • q_1 = Probability(z = 1 | x = 1) so 1 - q_1 = Probability(z = 0 | x = 1),
  • q_0 = Probability(z = 1 | x = 0) so 1 - q_0 = Probability(z = 0 | x = 0)

So a noisy-binary quiz result consists of:

  1. the boolean result, True or False (this is z),
  2. q_1 between 0.5 and 1, and
  3. q_0 between 0 and 0.5.

In Ebisu's terms, I'm thinking of the original quiz result (original as in, before scrambling) as the result you would have gotten if your recall was purely a function of the strength of the memory in your mind (which we have a probability distribution over). Meanwhile, the scrambled noisy-binary quiz result is whether you actually passed the quiz or not, and is related to any number of other factors beyond the strength of your memory (sleep, hunger, motivation, focus, etc.).

I admit it's tricky to carefully map probability into the real world, and this explanation might be forced; feel free to weigh in.
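
Concretely, the two q's above combine via total probability into the likelihood of the observed z given recall probability p (a sketch of the math, not Ebisu's actual implementation):

# Probability(z=1) = q1 * p + q0 * (1 - p), marginalizing over the
# true-but-unobserved binary result x.
def noisyLikelihood(z, p, q1, q0):
  pZ1 = q1 * p + q0 * (1 - p)
  return pZ1 if z else 1 - pZ1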

As mentioned above, the math gives us two knobs to independently turn, q_0 and q_1, which, again, are the probabilities that you actually passed the quiz (z=1) given that the true unscrambled quiz result x=0 or x=1, respectively. If you don't want to think too much about this, I think it's reasonable to link both so q_0 = 1 - q_1. The table below shows the halflife after a noisy-binary quiz where the "Noise" column = q_0 = 1 - q_1. Noise of 0 means the no-noise binary case that we've known this whole time. The initial model is (3.3, 3.3, 1), i.e., an initial halflife of 1. The table also shows different quiz times (tnow) to help gauge the algorithm's behavior:

Noise | Observed result | Quiz time | New halflife
----- | --------------- | --------- | ------------
0     | True            | 0.25      | 1.061
0     | True            | 1.0       | 1.241
0     | True            | 3.0       | 1.718
0     | False           | 0.25      | 0.762
0     | False           | 1.0       | 0.816
0     | False           | 3.0       | 0.902
0.1   | True            | 0.25      | 1.052
0.1   | True            | 1.0       | 1.188
0.1   | True            | 3.0       | 1.362
0.1   | False           | 0.25      | 0.852
0.1   | False           | 1.0       | 0.849
0.1   | False           | 3.0       | 0.914
0.25  | True            | 0.25      | 1.037
0.25  | True            | 1.0       | 1.113
0.25  | True            | 3.0       | 1.143
0.25  | False           | 0.25      | 0.931
0.25  | False           | 1.0       | 0.901
0.25  | False           | 3.0       | 0.937
0.5   | True            | 0.25      | 1.000
0.5   | True            | 1.0       | 1.000
0.5   | True            | 3.0       | 1.000
0.5   | False           | 0.25      | 1.000
0.5   | False           | 1.0       | 1.000
0.5   | False           | 3.0       | 1.000

The noise dial serves to dampen the impact of the review. When the noise level is 0, you get the normal Ebisu behavior. At noise level 0.5 (the highest noise level, meaning z is a pure coin flip, without any dependence on x), the quiz is completely uninformative, and gives you a totally unchanged updated model. In between, you get an updated model whose halflife is between these two.

I am considering updating updateRecall's API to take a single float between 0 and 1 for noisy-binary results, and parsing it as follows:

# Inside updateRecall: map a single float in [0, 1] to (result, q_1, q_0).
if noisyResult > 0.5:
  result = True         # treat it as a (noisy) pass
  q_1 = noisyResult
  q_0 = 1 - noisyResult
else:
  result = False        # treat it as a (noisy) fail
  q_1 = 1 - noisyResult
  q_0 = noisyResult

Might I propose the following matches to Anki's levels (usage sketched after the list):

  • "good", meaning, "SRS, continue doing your thing": noisyResult = 1
  • "hard", meaning, "SRS, you were too aggressive, I'm not comfortable how close I was to forgetting": noisyResult = 0.9 (see table above: pass, noise=0.1)
  • "fail", meaning, "Please have mercy": noisyResult = 0

I will add another function to the API to allow you to do the equivalent of Anki's "easy" and its inverse, "epic fail": it will take a model and a number to scale the halflife by, and return a new model with the same spread as the original but calibrated to the new halflife. This new function totally side-steps the Bayesian update process, and is intended to be used sparingly, for flashcards you really want to delay reviewing (e.g., scale the halflife by 2x) or that you want to review more frequently (scale the halflife by 0.5x).

Comments welcome. Related #19.

@fasiha fasiha reopened this Jun 12, 2020
@garfieldnate (Author)

Thanks for doing all this! This is such great work using a pretty unique skill. Plenty of us can hack, but I don't know many who can apply this kind of mathematical skill while doing it.

In the code block you provided, I didn't understand the meaning of setting result to True or False. Shouldn't the value be determined by input to the API?

The API you present makes sense. Result plus a noise parameter. I do have some comments on the suggested usage, though.

Specifically for the Anki case, I think a better strategy would be 0/0.1/0.9/1 for the user judgement inputs. Since Anki presents the "Again" and "Easy" buttons like they are regular judgements and not special cases, I don't think it would make sense to step outside of the model for those inputs. Perhaps if the user were presented with three choices ("got it", "almost got it", "don't got it"), then the 0/0.9/1 input would make sense. Then as a separate feature we could allow the user to manually change the review times when they think to themselves, "I'm sick of this one, please stop showing it!" or "I don't remember ever seeing this before, better show me that again soon", using UI that makes it clear that this is an exceptional case.

Also, it may be worth learning the noise parameter (outside of Ebisu, not within it) for these judgements for each user, as they are quite subjective and can be interpreted differently by different users.

@fasiha (Owner) commented Jun 12, 2020

In the code block you provided, I didn't understand the meaning of setting result to True or False. Shouldn't the value be determined by input to the API?

Ah, I should clarify: the code snippet in my comment above, starting with if noisyResult > 0.5, would be inside updateRecall: you would only need to provide a float between 0 and 1. So you are free to construct any mapping between user responses and Ebisu for your app. I have only very vague memories of Anki, so your instinct about matching Anki's approach to Ebisu would doubtless be better than mine.

One thought I did have is, there's no way to use the noisy-binary update to dramatically change the halflife—it might have been nice if you could give 2.0 to the noisy-binary update to indicate "easy", but that won't work. Noisy-binary can only dial between the binary 0 and 1 cases.

Side note though—the Ebisu version 2 binomial quiz model, with integer successes and total, does provide this: giving successes of 0 or 2 when total=2 serves to exponentially decrease or increase the halflife, since it models multiple independent trials of memory, and can push the halflife beyond what you'd get with total=1 (the binary quiz case). I'm hesitant to officially recommend using this to achieve this effect, though, since it is a modeling error… But at the same time, it's more principled than just rescaling the halflife (more below).
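
A sketch grounded in the table above (same (3.3, 3.3, 1) model; version-2 API assumed):

import ebisu

model = (3.3, 3.3, 1.0)  # halflife of 1 time-unit
# one success at tnow=1 lifts the halflife to about 1.24 (noise=0 row above):
h1 = ebisu.modelToPercentileDecay(ebisu.updateRecall(model, 1, 1, 1.0))
# two-out-of-two "easy" pushes the halflife out further than total=1 can:
h2 = ebisu.modelToPercentileDecay(ebisu.updateRecall(model, 2, 2, 1.0))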

I'm still sorting out my feelings about offering three ways to change models:

  1. noisy-binary quiz, where you provide a single float to updateRecall,
  2. binomial quiz session, where you provide integer successes and totals to updateRecall (though, having high totals-successes with tnow much lower than halflife can cause numerical instability 😣 I'm working on it),
  3. quiz rescaling (more below)

Ebisu used to offer one method; now it will offer three. I'm somewhat concerned that this makes the API harder to learn to use effectively, and constrains the future evolution of the code. But I think it's reasonable to offer this menu to quiz app authors (all three sketched below).
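
Side by side, the menu might look like this (a sketch; final names and signatures may differ):

import ebisu

model = (3.3, 3.3, 1.0)
m1 = ebisu.updateRecall(model, 0.85, 1, tnow=2.0)  # 1. noisy-binary (float)
m2 = ebisu.updateRecall(model, 2, 3, tnow=2.0)     # 2. binomial session
m3 = ebisu.rescaleHalflife(model, 2.0)             # 3. rescale halflife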

Here's the tentative API and docstring for rescaleHalflife (full source):

def rescaleHalflife(prior, scale=1.):
  """Given any model, return a new model with the original's halflife scaled.

  Use this function to adjust the halflife of a model.

  Perhaps you want to see this flashcard far less, because you *really* know it.
  `newModel = rescaleHalflife(model, 5)` to shift its memory model out to five
  times the old halflife.

  Or if there's a flashcard that suddenly you want to review more frequently,
  perhaps because you've recently learned a confuser flashcard that interferes
  with your memory of the first, `newModel = rescaleHalflife(model, 0.1)` will
  reduce its halflife by a factor of one-tenth.

  Useful tip: the returned model will have matching α = β, where `alpha, beta,
  newHalflife = newModel`. This happens because we first find the old model's
  halflife, then we time-shift its probability density to that halflife. That's
  the distribution this function returns, except at the *scaled* halflife.
  """

Comments and questions and thrown tomatoes welcome.

@garfieldnate (Author)

I don't have all of the math expertise you have, but it seems like the three methods of updating are fairly generalizable to new models, if you choose to evolve the code. I can understand the hesitation, though, given you have several different language implementations to maintain.

@garfieldnate (Author)

Hi Ahmed, I don't mean to bother you, but I want to express my continued interest in this topic :) I'm working on a quiz app and would love to be able to specify floats for the recall update.

@fasiha (Owner) commented Mar 29, 2021

Thanks for pinging @garfieldnate! I will aim to package up the changes we talked about in this thread this week.

We also have #41 with a better way to initialize rebalance and #31 to always rebalance that I'd like to push out but those are behind-the-scenes changes I can work on whenever. I'd like to avoid delaying releasing fuzzy reviews and the rescaling API!

(Part of the reason I delayed releasing these was because of #43, which raised a crucial modeling issue that made me go back to the drawing board for much of Ebisu 😅. That issue, though brand new, was raised on Reddit in late-November 2020, so apologies, I still delayed almost six months on this issue!)

@garfieldnate (Author)

No need to apologize! It's volunteer work 😄 I really appreciate what you have been sharing here and also how responsive you are.

@fasiha (Owner) commented Apr 11, 2021

@garfieldnate I haven't forgotten about this, please expect this to land within a couple of days, and please feel free to ping if it doesn't and you get tired of waiting. It's the usual thing—prototyping something is often the easy part; productionizing it with unit tests and the insane standard for documentation I'm holding myself to for this repo, etc., means things take 10x to 100x longer.

@fasiha fasiha closed this as completed in df1a526 Apr 16, 2021
@fasiha (Owner) commented Apr 16, 2021

@garfieldnate thanks for your patience! Pushed to PyPI 😄!

@ernesto-butto

Hello @fasiha! I was wondering if you were planning to publish a changelog for version 2.1.0? Thank you!

@fasiha (Owner) commented Apr 16, 2021

@poolebu yes! It’s at https://github.com/fasiha/ebisu/blob/gh-pages/CHANGELOG.md

@fasiha (Owner) commented Apr 16, 2021

@poolebu in short, and maybe I should highlight this more in the changelog: no breaking changes, hence 2.0 -> 2.1, just more functionality. But now that I think about it, the underlying behavior of updateRecall changed, so calling it now at 2.1 with the same arguments will result in different numbers than calling it before at 2.0. Does that mean it should have been a major version update? 🤔

@fasiha fasiha unpinned this issue Apr 16, 2021
@ernesto-butto

Hello @fasiha! Thank you for all the changes and documentation. I will be trying rescaleHalflife in the upcoming months. Probably my app will have a beta user group and we will get user feedback and stats.

I do not think the new updateRecall behavior should result in a new major version update if the differences are statistically very minor.

Thank you again, congrats on the new updates, and I wish you a great day!
