Would you be interested in benchmarking Ebisu against FSRS? #23
Ooops, sorry, I referenced the wrong issue. Don't pay attention to the "mentioned this issue" thing above. |
Very nice, thanks so much for sharing! I will figure out how to use Hugging Face's benchmarking suite. I see the notes about the challenges of converting Anki ratings (1-4) into something Ebisu handles 😅 I have a similar converter. I know that, for example, Language Reactor (which makes Language Learning With Netflix) and Satori Reader (and no doubt others) have a lot of data from passive reading, i.e., you're reading a story or paragraph in the language you are learning, and you can click on a word to read its definition if you want to. I'd looooove to get some of that data as well, so we can benchmark algorithms that handle cases like "the user didn't click this word when they read it but they failed the flashcard review the next day", or more generally how to adjust review schedules when the user clicks or doesn't click on a word to review it while reading. The Anki model of flashcards with pass/fail is only somewhat useful in this context 😁 |
I'm not great at math, but how about just assigning different values of q0 for each button? In other words, each button would have its own false positive rate. |
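A minimal sketch of that idea, assuming Ebisu v2.1's noisy-binary quiz support (where `updateRecall` can take a fractional `successes` with `total=1` plus an optional `q0` false-positive rate); all of the per-button numbers below are illustrative guesses, not tuned or recommended values:

```python
import ebisu  # Ebisu v2.1+, which supports noisy-binary quizzes

# Sketch: map Anki's four buttons to a (fractional success, q0) pair.
# `successes` strictly between 0 and 1 marks the quiz as noisy-binary;
# `q0` is the chance of observing a "pass" even though the fact was actually forgotten.
BUTTON_TO_QUIZ = {
    1: (0.2, 0.05),   # "again"
    2: (0.6, 0.20),   # "hard"
    3: (0.85, 0.10),  # "good"
    4: (0.95, 0.05),  # "easy"
}

def update_from_button(model, button, elapsed_hours):
    successes, q0 = BUTTON_TO_QUIZ[button]
    return ebisu.updateRecall(model, successes, 1, elapsed_hours, q0=q0)
```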
I wrote a script to output stability from Ebisu:

```python
import ebisu
import numpy as np
from datetime import datetime, timedelta

def stability2halflife(s):
    return s * np.log(0.5) / np.log(0.9) * 24

def halflife2stability(h):
    return h / 24 / np.log(0.5) * np.log(0.9)

r_history = [1, 1, 1, 1, 1, 0]
t_history = [1, 2, 4, 8, 16, 32]
date0 = datetime(2017, 4, 19, 22, 0, 0)
defaultModel = (3.0, 3.0, stability2halflife(1))  # alpha, beta, and half-life in hours
item = dict(factID=1, model=defaultModel, lastTest=date0)
oneHour = timedelta(hours=1)
print("Default model:", defaultModel)
total = 0
result = 0
now = date0
for r, t in zip(r_history, t_history):
    now += timedelta(days=t)
    total += 1
    result += 1 if r == 1 else 0
    p_recall = ebisu.predictRecall(
        item["model"], (now - item["lastTest"]) / oneHour, exact=True
    )
    print(
        "Fact #{} probability of recall: {:0.1f}%".format(
            item["factID"], p_recall * 100
        )
    )
    newModel = ebisu.updateRecall(
        item["model"], result, total, (now - item["lastTest"]) / oneHour
    )
    print("New model for fact #1:", newModel)
    item["model"] = newModel
    item["lastTest"] = now
    meanHalflife = ebisu.modelToPercentileDecay(item["model"])
    print(
        "Fact #{} has stability of ≈{:0.1f} days".format(
            item["factID"], halflife2stability(meanHalflife)
        )
    )
```

The output:
I find that the last stability still increases even when the last recall fails. Is that intended? Or is my implementation incorrect? |
By the way, are alpha and beta trainable? |
Thanks @L-M-Sherlock for putting this together!!! A couple of fixes for your script: mainly, pass each individual quiz result to `updateRecall` (i.e., `r, 1`) instead of the running `result`/`total` counts.

```python
import ebisu
import numpy as np
from datetime import datetime, timedelta

def stability2halflife(s):
    return s * np.log(0.5) / np.log(0.9) * 24

def halflife2stability(h):
    return h / 24 / np.log(0.5) * np.log(0.9)

r_history = [1, 1, 1, 1, 1, 0]
t_history = [1, 2, 4, 8, 16, 32]
date0 = datetime(2017, 4, 19, 22, 0, 0)
defaultModel = (3.0, 3.0, stability2halflife(1))  # alpha, beta, and half-life in hours
item = dict(factID=1, model=defaultModel, lastTest=date0)
oneHour = timedelta(hours=1)
print("Default model:", defaultModel)
now = date0
for r, t in zip(r_history, t_history):
    now += timedelta(days=t)
    p_recall = ebisu.predictRecall(item["model"], (now - item["lastTest"]) / oneHour, exact=True)
    print("Fact #{} probability of recall: {:0.1f}%".format(item["factID"], p_recall * 100))
    newModel = ebisu.updateRecall(item["model"], r, 1, (now - item["lastTest"]) / oneHour)
    print("New model for fact #1:", newModel)
    item["model"] = newModel
    item["lastTest"] = now
    meanHalflife = ebisu.modelToPercentileDecay(item["model"])
    print("Fact #{} has stability of ≈{:0.1f} days".format(item["factID"],
                                                           halflife2stability(meanHalflife)))
```

Output:
The above is with Ebisu v2.1. For Ebisu v3-rc1 (you can install the latest development version via `python -m pip install "git+https://github.com/fasiha/ebisu@v3-release-candidate"`), here's the code and output:

```python
import ebisu  # get v3 via `python -m pip install "git+https://github.com/fasiha/ebisu@v3-release-candidate"`
import numpy as np
from datetime import datetime, timedelta

def stability2halflife(s):
    return s * np.log(0.5) / np.log(0.9) * 24

def halflife2stability(h):
    return h / 24 / np.log(0.5) * np.log(0.9)

r_history = [1, 1, 1, 1, 1, 0]
t_history = [1, 2, 4, 8, 16, 32]
date0 = datetime(2017, 4, 19, 22, 0, 0)
defaultModel = ebisu.initModel(stability2halflife(1), 10e3, initialAlphaBeta=3.0, firstWeight=0.5)
item = dict(factID=1, model=defaultModel, lastTest=date0)
oneHour = timedelta(hours=1)
print("Default model:", defaultModel)
now = date0
for r, t in zip(r_history, t_history):
    now += timedelta(days=t)
    p_recall = ebisu.predictRecall(item["model"], (now - item["lastTest"]) / oneHour)
    print("Fact #{} probability of recall: {:0.1f}%".format(item["factID"], p_recall * 100))
    newModel = ebisu.updateRecall(item["model"], r, 1, (now - item["lastTest"]) / oneHour)
    # print("New model for fact #1:", newModel)
    item["model"] = newModel
    item["lastTest"] = now
    meanHalflife = ebisu.modelToPercentileDecay(item["model"])
    print("Fact #{} has stability of ≈{:0.1f} days".format(item["factID"],
                                                            halflife2stability(meanHalflife)))
```

Output:
For Ebisu v3 I chose the same initial alpha, beta, and halflife as your example, but a new feature in v3 is that instead of a single Beta random variable there's an ensemble of them, by default 5 atoms, with halflives logarithmically spaced from your initial halflife out to much longer halflives. And as is expected for an ensemble, each atom has a weight: in the above example, the first atom has weight 0.5 and the rest of the atoms have logarithmically-decreasing weights (that all sum to 1). I expect v3 to be much more accurate.

Per your question—initial alpha and beta are totally trainable, but in my rough experiments with a small training set of a few tens of flashcards with a few months of data, the Bayesian framework is not very sensitive to the choice of initial alpha/beta. For example, with the v3 script above, changing the initial alpha/beta barely changes the predictions.
In v3, the model is just a list of atoms that you can inspect:

```python
for atom in defaultModel:
    print(f'halflife={atom.time:.0f} hours, weight={2**atom.log2weight:.02g}')
```

Output:
The default for the first weight is the more important choice: tuning it obviously has much more impact than the initial alpha/beta. What do you think? I know Ebisu v2 makes very pessimistic predictions (I used it for years just to relatively rank flashcards, without considering the absolute value of the predicted recall probability) but I'm hoping v3 will help make those absolute predictions much better. The challenge I think will be to gain confidence in what defaults to pick for the first weight (and potentially alpha/beta). Thanks again so so much for looking and sharing this code! I need to read more about what "stability" means here and what is "good" stability 😅 |
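A minimal sketch of probing that sensitivity, reusing the v3-rc1 `initModel` and `modelToPercentileDecay` calls from the scripts above (the candidate weights are arbitrary, just to see how much the initial halflife moves):

```python
import ebisu  # v3 release candidate, installed as above
import numpy as np

def stability2halflife(s):
    # same conversion as in the scripts above (stability in days -> half-life in hours)
    return s * np.log(0.5) / np.log(0.9) * 24

# See how the first atom's weight shifts the overall model's halflife before any quizzes.
for firstWeight in (0.3, 0.5, 0.7, 0.9):
    model = ebisu.initModel(stability2halflife(1), 10e3, initialAlphaBeta=3.0, firstWeight=firstWeight)
    print(f"firstWeight={firstWeight}: modelToPercentileDecay ≈ {ebisu.modelToPercentileDecay(model):.1f} hours")
```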
By the way, if you want to test Ebisu v2 and v3 at the same time, here's a little trick: install v3 via `python -m pip install "git+https://github.com/fasiha/ebisu@v3-release-candidate"` and then

```python
import ebisu                       # this is ebisu v3
import ebisu.ebisu2beta as ebisu2  # this is pretty much exactly the same code as v2 since v3 uses v2 to handle each of its atoms
```
|
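And a minimal sketch of calling the two side by side after that import trick, reusing the initial parameters from the scripts above (this assumes `ebisu2beta` keeps v2's signatures, as the comment says):

```python
import ebisu                       # v3 (release candidate)
import ebisu.ebisu2beta as ebisu2  # v2-style API bundled with v3

elapsedHours = 24.0

# v2-style model: (alpha, beta, half-life in hours)
model2 = (3.0, 3.0, 24.0)
p2 = ebisu2.predictRecall(model2, elapsedHours, exact=True)

# v3-style model: an ensemble of atoms
model3 = ebisu.initModel(24.0, 10e3, initialAlphaBeta=3.0, firstWeight=0.5)
p3 = ebisu.predictRecall(model3, elapsedHours)

print(f"After {elapsedHours:.0f} hours: v2 predicts {p2:.3f}, v3 predicts {p3:.3f}")
```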
Stability is the interval (in days) when probability of recall is 90%. You can read more about FSRS here: https://github.com/open-spaced-repetition/fsrs4anki/wiki/The-Algorithm |
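As a side note, that definition is exactly what the `stability2halflife`/`halflife2stability` helpers in the scripts above encode, under an assumed exponential forgetting curve $p(t) = 2^{-t/h}$ with half-life $h$: setting $p(S) = 0.9$ gives

$$
2^{-S/h} = 0.9
\;\Rightarrow\;
S = h \, \frac{\ln 0.9}{\ln 0.5} \approx 0.152\, h,
\qquad
h = S \, \frac{\ln 0.5}{\ln 0.9} \approx 6.58\, S,
$$

with an extra factor of 24 in the code to convert between hours and days. As noted further down in the thread, Ebisu's true time-to-90%-recall also depends on alpha and beta, so this is only a rule-of-thumb conversion.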
An error occurred when I used v3:

```python
import ebisu  # v3 release candidate
import numpy as np
from datetime import datetime, timedelta

def stability2halflife(s):
    return s * np.log(0.5) / np.log(0.9) * 24

def halflife2stability(h):
    return h / 24 / np.log(0.5) * np.log(0.9)

# [t, r] pairs: elapsed days since the previous review, and the binary result
history = [[1, 0],
           [1, 0],
           [1, 0],
           [1, 0],
           [1, 0],
           [1, 0],
           [1, 0],
           [1, 0],
           [1, 0],
           [27, 0],
           [2, 0],
           [1, 0],
           [1275, 0],
           [2, 0],
           [2, 1],
           [5, 0],
           [3, 1],
           [4, 0],
           [2, 0],
           [2, 0],
           [2, 0],
           [2, 1],
           [5, 0],
           [2, 1]]
date0 = datetime(2017, 4, 19, 22, 0, 0)
defaultModel = ebisu.initModel(stability2halflife(1), 10e3, initialAlphaBeta=3.0, firstWeight=0.5)
item = dict(factID=1, model=defaultModel, lastTest=date0)
oneHour = timedelta(hours=1)
print("Default model:", defaultModel)
now = date0
for t, r in history:
    now += timedelta(days=t)
    p_recall = ebisu.predictRecall(item["model"], (now - item["lastTest"]) / oneHour)
    print("Fact #{} probability of recall: {:0.1f}%".format(item["factID"], p_recall * 100))
    newModel = ebisu.updateRecall(item["model"], r, 1, (now - item["lastTest"]) / oneHour)
    # print("New model for fact #1:", newModel)
    item["model"] = newModel
    item["lastTest"] = now
    meanHalflife = ebisu.modelToPercentileDecay(item["model"])
    print("Fact #{} has stability of ≈{:0.1f} days".format(item["factID"],
                                                            halflife2stability(meanHalflife)))
```
|
I don't think it works this way. In Ebisu, you need alpha and beta to convert stabilities/halflives, and there is no closed-form expression for that; Ebisu uses a numerical approximation. |
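A minimal sketch of that numerical route, assuming Ebisu v2.1's `modelToPercentileDecay` still accepts a `percentile` argument (so you can ask for the time to 90% recall directly instead of converting from the half-life):

```python
import ebisu.ebisu2beta as ebisu2  # v2-style API (see the import trick above)

model = (3.0, 3.0, 24.0)  # alpha, beta, half-life in hours

halflife = ebisu2.modelToPercentileDecay(model)                   # hours until predicted recall hits 50%
stability = ebisu2.modelToPercentileDecay(model, percentile=0.9)  # hours until predicted recall hits 90%

print(f"half-life ≈ {halflife:.1f} h, time-to-90% ≈ {stability:.1f} h, ratio ≈ {stability / halflife:.3f}")
# The ratio depends on alpha and beta, which is why there is no closed-form conversion.
```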
Again see https://github.com/fasiha/ebisu.js/issues/23#issuecomment-1816160634: "The old fixed bracket is just unreliable so let's do a very quick log-10 to find a bracket before handing off to Scipy."
I see you're torturing my poor library @L-M-Sherlock 🤣 Thanks for this: it turns out the probability of recall for an atom after a gap like 1275 days is astronomically small, and I noticed that this pathological case was messing up the numerical halflife search. So now, if you reinstall from the v3 release candidate branch, the fix is included.
@Expertium is right in general, but in this initial situation where alpha=beta, the time parameter of the model is indeed the halflife and you can skip the numerical search (the mean of a Beta(a, a) random variable is exactly 0.5, so the predicted recall is 50% at exactly the model's time parameter).

Thank you both so much for this lively and productive discussion 🥹🙏 |
I benchmarked Ebisu v3 in this PR: open-spaced-repetition/srs-benchmark#11. The initial results are not good:

Weighted by ln(number of reviews)
Note: I map the four ratings into binary results in Ebisu, and I don't know how to utilize the first learning step. If we solve these problems, the performance will improve. Here is the code: https://github.com/open-spaced-repetition/fsrs-benchmark/blob/0da0a9adeecb4d354ce3f8933fc2d5ee300a9bb6/other.py#L234-L264 |
Regarding grade mapping, I have suggested using different values of q0 for each grade. |
Thanks for the benchmark @L-M-Sherlock! Is there a Python (or Rust or JavaScript or whatever) standalone library that I can use to call FSRS without using Anki? After reading 1 and 2 I'd like to try it on my quiz history! |
The python lib: https://github.com/open-spaced-repetition/fsrs-optimizer |
@L-M-Sherlock thanks! So I construct a CSV file of reviews, use this optimizer to analyze them, and feed the resulting weights into the equations in https://github.com/open-spaced-repetition/fsrs4anki/wiki/The-Algorithm? Is there a standalone implementation of the latter that I could use, or should I just implement all the equations myself? (If the latter, I'll try to share the script to help other non-Anki folks use FSRS with their flashcards.)

I mentioned this in fasiha/ebisu#64 and you kindly replied there with an interesting idea, and we can continue discussing that there, but here's an example of the issue with log-likelihood. This shows two different Ebisu v3 models for the same flashcard, with the log-likelihood of each quiz as well as a running total of the log-likelihood sum. The red one is worse than the blue one in terms of log-likelihood, but we can clearly see that the red one is just more conservative: it thinks the string of successes is unsustainable and assigns them low probability, which hurts its log-likelihood but is vindicated when there's a failure. Meanwhile the overconfident blue model doesn't take that much of a hit for mis-predicting the failure.

In the above example, I actually prefer the more pessimistic red model, but I'm not sure how to measure that. I'm really interested in trying to avoid this problem if we can! Because otherwise my only way of measuring models against each other is whether they output "reasonable" predictions and handle growth well for both highly error-prone and highly-successful cards. |
When we have a huge dataset, it wouldn't be a problem. When the predicted probability is close to the real probability, the cross entropy reaches its minimum. For example, if the real probability is 0.9, the sampling results could be: 1, 1, 1, 0, 1, 1, 1, 1, 1, 1 (0 - failure, 1 - success). For the prediction 0.9, the cross entropy is -(9 * ln(0.9) + 1 * ln(1-0.9)) = 3.2508297339. If the model is over-confident, the prediction may be 0.95, and the cross entropy is -(9 * ln(0.95) + 1 * ln(1-0.95)) = 3.457371923, which is larger than for the perfect prediction. If the model is conservative, the prediction may be 0.85, and the cross entropy is -(9 * ln(0.85) + 1 * ln(1-0.85)) = 3.3597903504, which is also larger than for the perfect prediction. @fasiha, what do you think of that? |
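A quick sketch reproducing that arithmetic, just to make the comparison concrete:

```python
import math

outcomes = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]  # 9 successes, 1 failure, drawn from Bernoulli(0.9)

def cross_entropy(prediction, outcomes):
    # Negative log-likelihood of the observed outcomes under a constant predicted recall probability.
    return -sum(math.log(prediction) if o == 1 else math.log(1 - prediction) for o in outcomes)

for p in (0.85, 0.90, 0.95):
    print(f"prediction {p:.2f}: cross entropy {cross_entropy(p, outcomes):.4f}")
# The minimum is at 0.90, the true probability; both the over-confident (0.95)
# and the conservative (0.85) predictions score worse.
```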
@L-M-Sherlock Thanks for that! Hmmm. Your illustrative example definitely works for cases where samples come from a known distribution (a fixed Bernoulli(0.9) in your example).

So I think we're in the same situation with quizzes? Continuing from the plot above, for example, the conservative red model is very competitive with the optimistic blue model for most examples, but for flashcards with a lot of failures (20% failures, left-most points), the optimistic blue model has better cross-entropy/sum-of-log-likelihood, because it is more willing to forgive and overlook early failures whereas the conservative red model isn't. The red model tends to produce predictions that I post hoc like more than the blue model's, because I know the failures are coming and the red model was right to demand more quizzes. But it's punished for that, which upsets me! I have not experimented at all with weighted cross-entropy (there are actually many ways to do this; e.g., https://datascience.stackexchange.com/a/40889 links to a Facebook paper that just does…). |
Yeah. I agree that we are thinking about this problem in different ways. I treat the distribution as Bernoulli(0.9) because I only consider the review history (the rating behavior of the user). For example: the forgetting curve above was generated from my Anki collection. I controlled for the condition that the first rating should be 4 (easy), then used the next reviews to draw the curve. If we don't know anything about the content of these cards, the recall event can be treated as a Bernoulli distribution. |
@L-M-Sherlock am I understanding correctly: does this mean that you take each of the four Anki ratings (fail, hard, good, easy) as indicating four different but unknown recall probabilities? So for example every time you see an "easy" rating, that implies that the probability of recall at that time was the same value (unknown but fixed) as every other time you saw an "easy" rating? And that's why you can use the straightforward unweighted cross-entropy to measure that?

(Ebisu is a bit different than this—Ebisu is a bit more like classic classification where your observations are (in the simple case) just 0/1, pass/fail, with a Bayesian model that sees those as random draws from a Bernoulli distribution whose underlying probability parameter is itself a Beta random variable: in probabilistic terms, the quiz result at time t is a draw from Bernoulli(p(t)), with p at the halflife distributed as Beta(alpha, beta).)

(Edit: of course I should mention that in Ebisu, there's a nonlinear transformation to move the Beta random variable at one elapsed time to any other elapsed time.) |
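In symbols, the simple-case Ebisu model just described (with alpha, beta, and half-life $h$ as in the scripts above) is

$$
p_h \sim \mathrm{Beta}(\alpha, \beta),
\qquad
p(t) = p_h^{\,t/h},
\qquad
\text{quiz result at time } t \sim \mathrm{Bernoulli}\bigl(p(t)\bigr),
$$

where the exponent $t/h$ is the nonlinear transformation that moves the Beta prior, which lives at the half-life, out to other elapsed times.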
Maybe my words are still unclear. You can read my explanation here: https://github.com/open-spaced-repetition/fsrs4anki/wiki/Spaced-Repetition-Algorithm:-A-Three%E2%80%90Day-Journey-from-Novice-to-Expert#day-3-the-latest-progress TL;DR: For items with the same review history (the ratings and intervals are consistent), I assume they have the same stability (or half-life). So they have the same probability of recall with the same elapsed days. |
I think Sherlock means that as long as delta_t and S are exactly the same, the probability of recall is also exactly the same. If a card with delta_t=x (last review was x days ago) and S=y (stability is y days) has a p% chance of being recalled, then another card with delta_t=x and S=y will also have a p% chance of being recalled. |
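A toy illustration of that claim (this uses a simple exponential curve calibrated so R = 90% at delta_t = S, matching the stability definition above; FSRS's actual retrievability formula differs by version and is spelled out in the wiki linked earlier):

```python
def retrievability(delta_t, stability):
    # Illustrative exponential forgetting curve only: equals 0.9 exactly when delta_t == stability.
    return 0.9 ** (delta_t / stability)

S = 10  # stability in days (made-up value)
for delta_t in (1, 5, 10, 20, 40):
    print(f"delta_t={delta_t:>2} days, S={S}: R = {retrievability(delta_t, S):.3f}")
# Any two cards with the same delta_t and S get exactly the same R;
# the card's content never enters the formula.
```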
Yeah. But the S could be an average of a group of cards. For more details, please see: https://github.com/open-spaced-repetition/heterogeneous-memory-research |
@L-M-Sherlock @fasiha I'm just reminding you in case you were busy and forgot about this issue. |
@fasiha LMSherlock and I are reworking RMSE, and I just wanted to ask whether you still want to participate in the benchmark. |
@Expertium thanks for reaching out! I'm happy to help in any way I can. I'm still experimenting with different versions of Ebisu—the latest, fasiha/ebisu#66, combines Ebisu v2 in a couple of different ways to achieve accuracy that seems as good as, if not better than, prior versions, while fixing a key problem. Would you be interested in benchmarking that? If so, let me know and I can provide you with a simple Python module that implements it. I'm sorry that I haven't made time to properly digest the explanations given above about how the benchmark works 😔 but again, I'm happy to provide code that implements the standard Ebisu API functions. |
Well, I'm not the code expert here, that's LMSherlock. |
https://github.com/open-spaced-repetition
FSRS uses the Difficulty, Stability, Retrievability model, just like SuperMemo algorithms. It is developed by @L-M-Sherlock and has been recently integrated into Anki. LMSherlock has benchmarked it against several other algorithms, see these repos:
https://github.com/open-spaced-repetition/fsrs-benchmark
https://github.com/open-spaced-repetition/fsrs-vs-sm17
I think it would be interesting to compare the accuracy of Ebisu to the accuracy of FSRS. In order to do that, all that's necessary is for you to run Ebisu on the same dataset that is used in the benchmark, and output a value of probability of recall for each review. Here's the dataset: https://huggingface.co/datasets/open-spaced-repetition/fsrs-dataset
If you are interested, please contact LMSherlock.
On a side note, I'm just a guy who is interested in spaced repetition. I'm asking just because I like crunching numbers, lol.