non-boolean quiz results #23
Thanks for commenting! I talked about this in another issue: #19 (comment). In a nutshell, yes, I think Ebisu can handle the binomial quiz case, like Duolingo, instead of a Bernoulli quiz case like it currently does. I'm just not sure when I can promise to make time to look into this, but it's definitely on my radar.
Closed by version 2.0.0: https://github.com/fasiha/ebisu/blob/gh-pages/CHANGELOG.md
Wow. That was fast! 😄 So the binomial case. This models the quiz as a bunch of boolean results, right? Is there any model mismatch between that and the Anki case? Anki doesn't give the result of a series of boolean quizzes. It gives the user's confidence score about how well they think they know the answer. I'm working on a quiz app that uses this mechanism.
Your understanding is accurate: there’s not an exact match between binomial quizzing and Anki’s confidence ratings (in contrast, the binomial quiz corresponds well to Duolingo’s quizzes 😝). Nonetheless I am tentatively confident that you can use the new API to fake Anki-style confidence; we just don’t yet have guidelines on how to best do so (whereas we’ve built up a lot of confidence in Ebisu’s handling of binary quizzes). For example you could do this:
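(A sketch of the idea, assuming version 2's `updateRecall(prior, successes, total, tnow)` signature; the exact confidence-to-counts mapping below is illustrative, not a recommendation from this thread.)

```python
import ebisu

model = ebisu.defaultModel(24.)  # new flashcard with a 24-hour starting halflife

# Map a user-reported confidence to binomial counts (illustrative mapping):
# "easy" -> 2 of 2, "hard" -> 1 of 2, "failed" -> 0 of 2
successes, total = 1, 2  # the user said "hard"
model = ebisu.updateRecall(model, successes, total, tnow=50.)  # quizzed 50 hours later
```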
But do note that there are risks to using a large number of successes.
One thing I’ll add: so far I’ve built quiz apps with automatically-graded binary quizzes and really liked that. I personally will continue designing quiz apps like that, and only allowing the use of the true binomial case (with `total > 1`). Does all this make sense? I’m happy to elaborate. I’ll update this with some numbers in roughly twelve hours.
A hyperfine quibble here: that is only somewhat the case. A binary (Bernoulli) experiment is a coin flip. A binomial quiz is N simultaneous coin flips (or actually, independent coin flips; they don’t have to be simultaneous). As long as you keep this in mind when calling them “a bunch of booleans” you are good to go. I mention this only because, if you actually have a sequence of boolean quizzes (where you show the user the correct answer after each quiz), then that’s not a binomial quiz, that’s an entirely normal sequence of binary quizzes 😆. Let me know if I should explain more.
I personally have enjoyed the slight interactivity that comes with grading yourself on the strength of the memory (like Anki, or my favorite, CleverDeck), but admittedly I don't know if it's actually beneficial to my learning. As a user it feels a little bit motivating to be able to declare to the app, "I almost got that perfectly!", and the added interactivity may contribute to using the app more often, which would mean more learning in the long run. But I don't know if this is actually useful information for scheduling reviews, especially since the currently-available apps leave it up to the user to decide how to grade themselves. Doing this with a proper Bayesian statistical model would require the app to learn this on a per-user basis.
So this is like if the user accidentally clicks the wrong button or something, right? Is this feature really the proper use for that? Seems like the app author should provide some kind of rewind/undo functionality for this that in the end just recalculates using the correct result.
The second one is actually the Duolingo case, then; although the data they provide has floating point numbers, in actuality their app presents several quizzes in a row, and what they report in the dataset is the final score over these individual quizzes. Now that I write that (out loud? XD), that actually seems like an important problem with their dataset. But if that's the Duolingo case, what is the actual case that's being modeled by the new feature? Or is the new feature even designed for the modeling of a specific case? The user must be quizzed several times with no dependencies between the answers on each quiz. I know it's common to assume independence for modeling purposes, but this really seems unlikely in a quiz setting like this; apps give feedback to correct a user between quizzes, or the user consistently gets a fact right or wrong.
Excellent, this is good to know—prevents author tunnel vision 😇! For me, the extra cognitive burden of interrupting my reviews to make the meta-decision is draining, so thank you for the counterexample.
Intuitively, if something was easy, then its updated halflife should be scaled more aggressively than if it was very difficult to remember, right? But I wonder how accurate that intuition is, and if the reality is more complicated than that. Occasionally I have the experience where I see a review and think to myself, "Wuh, when did the app teach me this?" I know the answer but can't recall learning it. Other times I contrast the experience of (1) having to review something for which I didn't create a strong enough mnemonic, so I'm floundering for the correct answer, versus (2) a similar situation—relatively young flashcard with a hastily-made mnemonic—but where I remember the answer easily. Does this mean case 1 was harder than case 2? Or just that brain performance, like physical performance, is related to various factors like time of day, food, sleep, stress, etc.? Thinking about these things tempers my rush to convert all my apps from binary quizzes to faking ease with binomial quizzes, except when I feel strongly about it, which brings us to…
No, more like, "I'm annoyed at how often I've been asked this, seriously, increase its halflife aggressively" (e.g., set a large `successes = total` in the binomial update).
I forget if the Duolingo app gives you feedback for each sentence? I hate it so much I don't want to reinstall it to find out, forgive me. When I first wrote this, I thought it maybe didn't give detailed feedback on what you missed until the end, but I am likely misremembering.
The quiz style that would most closely match the binomial quizzes Ebisu now handles is one where you're asked to recall the same flashcard multiple times in a single review session, in a context where you don't particularly focus on that flashcard itself but are instead occupied with the broader review task at hand. For example, if Duolingo asked you to translate a few sentences into French, and then at the end of the review session showed you that you'd misconjugated écrire once but got it right once, that'd be a great match for binomial quizzes. As you say, if you give feedback at the per-flashcard level after each quiz, then you probably have a series of binary quizzes. Such an app would also be very boring, since you're over-reviewing things too frequently (assuming you got them right, Ebisu would barely change the model after the first quiz because so little time elapsed since the previous one). So supporting that was one goal for allowing binomial quizzes. The other was to see if it turned out to be a good way to hack user-reported ease.
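Following the same hypothetical v2 usage as the earlier sketch, the écrire scenario would be a single binomial update (numbers invented):

```python
import ebisu

model = ebisu.defaultModel(24.)  # illustrative 24-hour halflife
# One session, 48 hours later: écrire conjugated right once, wrong once
model = ebisu.updateRecall(model, successes=1, total=2, tnow=48.)
```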
Hi, I'm also interested in this topic. If it all works out, I'll send my Dart port of Ebisu your way in due time :) [Update 2020-09-05: see #36.]
Duolingo gives you immediate feedback on your answers; it doesn't wait until the end of the session. So @garfieldnate is right: it's just like a regular repeat, albeit with a very short interval. Is that also a good way to treat it in Ebisu though? The second repeat happens only minutes after the first, so the recall probability will be close to 1. If the right answer is given the second time, the algorithm won't infer much from this; if the wrong answer is given, half-life is severely penalized. OK, I guess this is what we want actually :)
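To see that asymmetry concretely, here's a sketch with illustrative numbers, assuming ebisu 2.x's `updateRecall` and its halflife helper `modelToPercentileDecay`:

```python
import ebisu

model = ebisu.defaultModel(24.)  # ~24-hour halflife
tnow = 5 / 60.                   # second quiz five minutes after the first

passed = ebisu.updateRecall(model, 1, 1, tnow)
failed = ebisu.updateRecall(model, 0, 1, tnow)

# A pass this soon barely moves the halflife; a fail drags it down sharply.
print(ebisu.modelToPercentileDecay(passed))
print(ebisu.modelToPercentileDecay(failed))
```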
Wow, most kind, thank you @ttencate!
I agree with your analysis. As the quiz app author, you'll have to decide how to structure your quizzes, whether you want closely-spaced repeats, and how to handle failures soon after successes. You should feel free to cheat Ebisu for the greater good. I might add here that I'm working on a solution to the numerical instability problem for low `alpha` and `beta`.
Cool, thanks for confirming. Back on the topic of non-binary results: how hard would it be to implement this? Would the model need to be changed drastically, or could we come up with a way to interpret a fractional result as "partially forgotten"? Faking it with the binomial `successes`/`total` doesn't feel like quite the right fit.
Hmm, good questions. A couple of things. The binary-to-binomial extension was straightforward mathematically (even if a chore to get the derivation and implementation right!), since a binary quiz is just a binomial quiz with integer `total = 1`. But I will certainly think about how to do updates with a "soft-binary" result, i.e., a percent. Likely we can keep the same form of the model. Maybe we can make quizzes be noisy-binary… will think! Edit—I think this is very relevant https://stats.stackexchange.com/questions/419197/bayesian-inference-for-beta-distribution-after-an-uncertain-outcome though there's no guarantee we can incorporate it into our GB1 prior (Beta prior after exponential forgetting) cleanly.
I found a tidy way to allow quiz results to be a float between 0 and 1 (inclusive). Code at branch abecdbb#diff-35ff1833326ddbe951882636c4cbc678R131 It uses the noisy-binary model linked earlier, and I describe it in detail below to invite comments on the API to design around it, because the mathematical model is more flexible than I want to expose programmatically. Mathematically, a noisy-binary quiz consists of a normal binary quiz (drawn from the prior probability distribution on recall probability; in the README I call this `p`), whose result is then scrambled before it's reported.

So a noisy-binary quiz result consists of:

- the reported (scrambled) binary result,
- `q_1`, the probability of reporting a success when the underlying binary quiz was a success, and
- `q_0`, the probability of reporting a success when the underlying binary quiz was a failure.
In Ebisu's terms, I'm thinking of the original quiz result (original as in, before scrambling) as the result you would have gotten if your recall was purely a function of the strength of the memory in your mind (which we have a probability distribution over). Meanwhile, the scrambled noisy-binary quiz result is whether you actually passed the quiz or not, and is related to any number of other factors beyond the strength of your memory (sleep, hunger, motivation, focus, etc.).
As mentioned above, the math gives us two knobs to independently turn, `q_1` and `q_0`.
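In symbols (my notation, not from the original comment): writing $x$ for the true binary result and $z$ for the reported one,

$$q_1 = P(z = 1 \mid x = 1), \qquad q_0 = P(z = 1 \mid x = 0),$$

so $q_1 = 1$, $q_0 = 0$ recovers an ordinary binary quiz, and shrinking the gap between $q_1$ and $q_0$ makes the report less informative about the underlying memory.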
The noise dial serves to dampen the impact of the review. When the noise level is 0, you get the normal Ebisu behavior. At noise level 0.5 (the highest noise level), the reported result carries no information about actual recall, so the update leaves the model unchanged. I am considering updating via the following mapping from a float `noisyResult` to Ebisu's inputs:

```python
if noisyResult > 0.5:
    result = True
    q_1 = noisyResult
    q_0 = 1 - noisyResult
else:
    result = False
    q_1 = 1 - noisyResult
    q_0 = noisyResult
```

Might I propose matching Anki's levels to noisy results along these lines.
I will add another function to the API to let you do the equivalent of Anki's "easy" and its inverse, "epic fail": it will take a model and a number to scale the halflife by, and return a new model with the same spread as the original but calibrated to the new halflife. This new function totally side-steps the Bayesian update process, and is intended to be used sparingly, for flashcards you really want to delay reviewing (e.g., scale the halflife by 2x) or want to review more frequently (scale the halflife by 0.5x). Comments welcome. Related: #19.
Thanks for doing all this! This is such great work using a pretty unique skill. Plenty of us can hack, but I don't know many who can apply this kind of mathematical skill while doing it. In the code block you provided, I didn't at first understand the meaning of setting `q_1` and `q_0` from `noisyResult` that way. The API you present makes sense: a result plus a noise parameter. I do have some comments on the suggested usage, though. Specifically for the Anki case, I think a better strategy would be 0/0.1/0.9/1 for the user judgement inputs. Since Anki presents the "Again" and "Easy" buttons like they are regular judgements and not special cases, I don't think it would make sense to step outside of the model for those inputs. Perhaps if the user were presented with three choices ("got it", "almost got it", "don't got it"), then the 0/0.9/1 input would make sense. Then as a separate feature we could allow the user to manually change the review times when they think to themselves, "I'm sick of this one, please stop showing it!" or "I don't remember ever seeing this before, better show me that again soon", using UI that makes it clear that this is an exceptional case. Also, ideally it may be worth learning the noise parameter (outside of Ebisu, not within it) for these judgements for each user, as they are quite subjective and can be interpreted differently by different users.
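A sketch of how this 0/0.1/0.9/1 suggestion would flow through the pseudocode above (the dict and function names here are hypothetical glue, not part of Ebisu):

```python
# Hypothetical mapping from Anki-style buttons to a noisy result
ANKI_TO_NOISY = {"again": 0.0, "hard": 0.1, "good": 0.9, "easy": 1.0}

def to_noisy_params(button):
    noisyResult = ANKI_TO_NOISY[button]
    if noisyResult > 0.5:
        return True, noisyResult, 1 - noisyResult   # result, q_1, q_0
    else:
        return False, 1 - noisyResult, noisyResult

# e.g. "hard" -> (False, 0.9, 0.1): a failure, but heavily dampened
```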
Ah, I should clarify: the code snippet in my comment above, starting with `if noisyResult > 0.5`, is just a proposed convention for how an app might map a float grade to the `result`, `q_1`, and `q_0` that the update consumes; it's not baked into the math. One thought I did have is, there's no way to use the noisy-binary update to dramatically change the halflife—it might have been nice if you could give 2.0 to the noisy-binary update to indicate "easy", but that won't work. Noisy-binary can only dial between the binary 0 and 1 cases. Side note though—the Ebisu version 2 binomial quiz model, with integer `successes` out of `total`, can move the model more than a single binary success can. I'm still sorting out my feelings about offering three ways to change models:

- the binomial update (integer `successes` out of `total`),
- the noisy-binary update (a float result with `q_1` and `q_0`), and
- directly rescaling the halflife.
Ebisu used to offer one method, now it will offer three. I'm somewhat concerned that this makes the API harder to learn to use effectively, and constrains the future evolution of the code. But I think it's reasonable to offer this menu to quiz app authors. Here's the tentative API and docstring for `rescaleHalflife`:

```python
def rescaleHalflife(prior, scale=1.):
    """Given any model, return a new model with the original's halflife scaled.

    Use this function to adjust the halflife of a model.

    Perhaps you want to see this flashcard far less, because you *really* know it.
    `newModel = rescaleHalflife(model, 5)` to shift its memory model out to five
    times the old halflife.

    Or if there's a flashcard that suddenly you want to review more frequently,
    perhaps because you've recently learned a confuser flashcard that interferes
    with your memory of the first, `newModel = rescaleHalflife(model, 0.1)` will
    reduce its halflife to one-tenth of its old value.

    Useful tip: the returned model will have matching α = β, where `alpha, beta,
    newHalflife = newModel`. This happens because we first find the old model's
    halflife, then we time-shift its probability density to that halflife. That's
    the distribution this function returns, except at the *scaled* halflife.
    """
```

Comments and questions and thrown tomatoes welcome.
I don't have all of the math expertise you have, but it seems like the three methods of updating are fairly generalizable to new models, if you choose to evolve it. I can understand the hesitation, though, given you have several different language implementations to maintain.
Hi Ahmed, I don't mean to bother you, but I want to express my continued interest in this topic :) I'm working on a quiz app and would love to be able to specify floats for the recall update.
Thanks for pinging @garfieldnate! I will aim to package up the changes we talked about in this thread this week. We also have #41 with a better way to initialize rebalance and #31 to always rebalance that I'd like to push out, but those are behind-the-scenes changes I can work on whenever. I'd like to avoid delaying releasing fuzzy reviews and the rescaling API! (Part of the reason I delayed releasing these was #43, which raised a crucial modeling issue that made me go back to the drawing board for much of Ebisu 😅. That issue, though brand new, was raised on Reddit in late-November 2020, so apologies, I still delayed almost six months on this issue!)
No need to apologize! It's volunteer work 😄 I really appreciate what you have been sharing here and also how responsive you are.
@garfieldnate I haven't forgotten about this, please expect this to land within a couple of days, and please feel free to ping if it doesn't and you get tired of waiting. It's the usual thing—prototyping something is often the easy part; productionizing it with unit tests and the insane standard for documentation I'm holding myself to for this repo, etc., means things take 10x to 100x longer.
@garfieldnate thanks for your patience! Pushed to PyPI 😄!
Hello @fasiha! I was wondering if you were planning to publish a changelog for version 2.1.0? Thank you!
@poolebu yes! It’s at https://github.com/fasiha/ebisu/blob/gh-pages/CHANGELOG.md
@poolebu in short, and maybe I should highlight this more in the changelog: no breaking changes, hence the 2.0 -> 2.1, just more functionality. But now that I think about it, the underlying behavior of `updateRecall` did change, even without API breakage; I wonder if that alone warrants a bigger version bump.
Hello @fasiha! Thank you for all the changes and documentation. I will be trying them out. I do not think the new `updateRecall` behavior should result in a new version bump if it remains backwards-compatible.
Thank you again, congrats on the new updates, and I wish you a great day!
Many quiz systems do not assign a simple pass-fail to study events. Some systems, like Anki, simply ask the user how well they think they know the answer. Others, like Duolingo, assign a score based on performance in a study session with several exercises. It would be great if Ebisu could be extended to handle this case, so

```python
updateRecall(prior: tuple, result: bool, tnow: float)
```

would be changed to

```python
updateRecall(prior: tuple, result: float, tnow: float)
```

This would also enable comparison with the other systems compared in the half-life regression paper from Duolingo, meaning this ticket may be a prerequisite for #22.