Would you be interested in benchmarking Ebisu against FSRS? #23

Open
Expertium opened this issue Nov 14, 2023 · 27 comments

@Expertium

https://github.com/open-spaced-repetition

FSRS uses the Difficulty, Stability, Retrievability model, just like SuperMemo algorithms. It is developed by @L-M-Sherlock and has been recently integrated into Anki. LMSherlock has benchmarked it against several other algorithms, see these repos:
https://github.com/open-spaced-repetition/fsrs-benchmark
https://github.com/open-spaced-repetition/fsrs-vs-sm17

I think it would be interesting to compare the accuracy of Ebisu to the accuracy of FSRS. To do that, you just need to run Ebisu on the same dataset used in the benchmark and output a predicted probability of recall for each review. Here's the dataset: https://huggingface.co/datasets/open-spaced-repetition/fsrs-dataset

If you are interested, please contact LMSherlock.

On a side note, I'm just a guy who is interested in spaced repetition. I'm asking just because I like crunching numbers, lol.

@Expertium
Author

Oops, sorry, I referenced the wrong issue. Don't pay attention to the "mentioned this issue" notice above.

@fasiha
Owner

fasiha commented Nov 15, 2023

Very nice, thanks so much for sharing! I will figure out how to use Hugging Face's benchmarking suite. I see the notes about the challenges converting Anki ratings (1-4) into something Ebisu handles 😅 I have a similar converter.

I know that, for example, Language Reactor (who make Language Learning With Netflix) and Satori Reader (and no doubt others) have a lot of data from passive reading, i.e., you're reading a story or paragraph in the language you're learning, and you can click on a word to read its definition if you want to. I'd looooove to get some of that data as well, so we can benchmark algorithms that handle cases like "the user didn't click this word when they read it but they failed the flashcard review the next day", or more generally that adjust review schedules based on whether the user clicks a word to review it while reading. The Anki model of flashcards with pass/fail is only somewhat useful in this context 😁

@Expertium
Author

converting Anki ratings (1-4) into something Ebisu handles

I'm not great at math, but how about just assigning different values of q0 for each button? In other words, each button would have its own false positive rate.
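
A minimal sketch of that idea, assuming Ebisu's noisy-binary quiz support (updateRecall accepts a fractional successes and a q0 keyword); the per-button numbers here are invented purely for illustration:

import ebisu

# Hypothetical mapping: each Anki button becomes a (successes, q0) pair, where
# q0 is the probability of observing that answer even if the fact was forgotten.
ANKI_TO_EBISU = {
    1: dict(successes=0.0, q0=0.10),  # Again
    2: dict(successes=0.7, q0=0.30),  # Hard: a shaky, noisy success
    3: dict(successes=1.0, q0=0.10),  # Good
    4: dict(successes=1.0, q0=0.01),  # Easy: very unlikely if forgotten
}

def update_from_anki_rating(model, rating, elapsedHours):
    kw = ANKI_TO_EBISU[rating]
    return ebisu.updateRecall(model, kw["successes"], 1, elapsedHours, q0=kw["q0"])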

@L-M-Sherlock

L-M-Sherlock commented Nov 17, 2023

I wrote a script to output stability from Ebisu:

import ebisu
import numpy as np
from datetime import datetime, timedelta

def stability2halflife(s):
    return s * np.log(0.5) / np.log(0.9) * 24


def halflife2stability(h):
    return h / 24 / np.log(0.5) * np.log(0.9)

r_history = [1, 1, 1, 1, 1, 0]
t_history = [1, 2, 4, 8, 16, 32]

date0 = datetime(2017, 4, 19, 22, 0, 0)
defaultModel = (3.0, 3.0, stability2halflife(1))  # alpha, beta, and half-life in hours
item = dict(factID=1, model=defaultModel, lastTest=date0)
oneHour = timedelta(hours=1)


print("Default model:", defaultModel)

total = 0
result = 0
now = date0

for r, t in zip(r_history, t_history):
    now += timedelta(days=t)
    total += 1
    result += 1 if r == 1 else 0
    p_recall = ebisu.predictRecall(
        item["model"], (now - item["lastTest"]) / oneHour, exact=True
    )
    print(
        "Fact #{} probability of recall: {:0.1f}%".format(
            item["factID"], p_recall * 100
        )
    )

    newModel = ebisu.updateRecall(
        item["model"], result, total, (now - item["lastTest"]) / oneHour
    )
    print("New model for fact #1:", newModel)
    item["model"] = newModel
    item["lastTest"] = now
    meanHalflife = ebisu.modelToPercentileDecay(item["model"])
    print(
        "Fact #{} has stability of ≈{:0.1f} days".format(
            item["factID"], halflife2stability(meanHalflife)
        )
    )

The output:

Default model: (3.0, 3.0, 157.89152349505406)
Fact #1 probability of recall: 89.0%
New model for fact #1: (3.00646253444218, 3.0064625344421816, 164.3683605066929)
Fact #1 has stability of ≈1.0 days
Fact #1 probability of recall: 80.2%
New model for fact #1: (3.0278398639446356, 3.0278398639446302, 190.11941757789788)
Fact #1 has stability of ≈1.2 days
Fact #1 probability of recall: 69.0%
New model for fact #1: (3.0702796304531583, 3.070279630453153, 266.3216348721204)
Fact #1 has stability of ≈1.7 days
Fact #1 probability of recall: 59.7%
New model for fact #1: (3.130193819780603, 3.1301938197806427, 465.2991397449263)
Fact #1 has stability of ≈2.9 days
Fact #1 probability of recall: 55.8%
New model for fact #1: (3.200153605581459, 3.2001536055814306, 950.5701887173561)
Fact #1 has stability of ≈6.0 days
Fact #1 probability of recall: 56.5%
New model for fact #1: (4.285306697312605, 4.285306697312728, 1468.8331445864046)
Fact #1 has stability of ≈9.3 days

I find that the last stability still increases even when the last recall fails. Is that intended, or is my implementation incorrect?

@L-M-Sherlock

By the way, are alpha and beta trainable?

@fasiha
Owner

fasiha commented Nov 17, 2023

Thanks @L-M-Sherlock for putting this together!!! A couple of fixes for your script: result and total are per quiz event, so in this example, with binary quizzes, you always want total=1. This should work better, with failures decreasing stability/halflife:

import ebisu
import numpy as np
from datetime import datetime, timedelta


def stability2halflife(s):
  return s * np.log(0.5) / np.log(0.9) * 24


def halflife2stability(h):
  return h / 24 / np.log(0.5) * np.log(0.9)


r_history = [1, 1, 1, 1, 1, 0]
t_history = [1, 2, 4, 8, 16, 32]

date0 = datetime(2017, 4, 19, 22, 0, 0)
defaultModel = (3.0, 3.0, stability2halflife(1))  # alpha, beta, and half-life in hours
item = dict(factID=1, model=defaultModel, lastTest=date0)
oneHour = timedelta(hours=1)

print("Default model:", defaultModel)

now = date0

for r, t in zip(r_history, t_history):
  now += timedelta(days=t)
  p_recall = ebisu.predictRecall(item["model"], (now - item["lastTest"]) / oneHour, exact=True)
  print("Fact #{} probability of recall: {:0.1f}%".format(item["factID"], p_recall * 100))

  newModel = ebisu.updateRecall(item["model"], r, 1, (now - item["lastTest"]) / oneHour)
  print("New model for fact #1:", newModel)
  item["model"] = newModel
  item["lastTest"] = now
  meanHalflife = ebisu.modelToPercentileDecay(item["model"])
  print("Fact #{} has stability of ≈{:0.1f} days".format(item["factID"],
                                                         halflife2stability(meanHalflife)))

Output:

Default model: (3.0, 3.0, 157.89152349505406)
Fact #1 probability of recall: 89.0%
New model for fact #1: (3.00646253444218, 3.0064625344421816, 164.3683605066929)
Fact #1 has stability of ≈1.0 days
Fact #1 probability of recall: 80.2%
New model for fact #1: (3.018273528208685, 3.018273528208677, 177.27467031468353)
Fact #1 has stability of ≈1.1 days
Fact #1 probability of recall: 67.3%
New model for fact #1: (3.0384192428683914, 3.0384192428684, 202.92238743744258)
Fact #1 has stability of ≈1.3 days
Fact #1 probability of recall: 51.7%
New model for fact #1: (3.069552535205376, 3.069552535205368, 253.69532845220152)
Fact #1 has stability of ≈1.6 days
Fact #1 probability of recall: 37.0%
New model for fact #1: (3.112427550543429, 3.1124275505425345, 353.7743133604174)
Fact #1 has stability of ≈2.2 days
Fact #1 probability of recall: 26.1%
New model for fact #1: (3.907923571583006, 3.907923571583026, 306.77808396348297)
Fact #1 has stability of ≈1.9 days

The above is with Ebisu v2.1.

For Ebisu v3-rc1, whose latest development version you can install via

python -m pip install "git+https://github.com/fasiha/ebisu@v3-release-candidate"

here's the code and output:

import ebisu # get v3 via `python -m pip install "git+https://github.com/fasiha/ebisu@v3-release-candidate"`
import numpy as np
from datetime import datetime, timedelta


def stability2halflife(s):
  return s * np.log(0.5) / np.log(0.9) * 24


def halflife2stability(h):
  return h / 24 / np.log(0.5) * np.log(0.9)


r_history = [1, 1, 1, 1, 1, 0]
t_history = [1, 2, 4, 8, 16, 32]

date0 = datetime(2017, 4, 19, 22, 0, 0)
defaultModel = ebisu.initModel(stability2halflife(1), 10e3, initialAlphaBeta=3.0, firstWeight=0.5)
item = dict(factID=1, model=defaultModel, lastTest=date0)
oneHour = timedelta(hours=1)

print("Default model:", defaultModel)

now = date0

for r, t in zip(r_history, t_history):
  now += timedelta(days=t)
  p_recall = ebisu.predictRecall(item["model"], (now - item["lastTest"]) / oneHour)
  print("Fact #{} probability of recall: {:0.1f}%".format(item["factID"], p_recall * 100))

  newModel = ebisu.updateRecall(item["model"], r, 1, (now - item["lastTest"]) / oneHour)
  # print("New model for fact #1:", newModel)
  item["model"] = newModel
  item["lastTest"] = now
  meanHalflife = ebisu.modelToPercentileDecay(item["model"])
  print("Fact #{} has stability of ≈{:0.1f} days".format(item["factID"],
                                                         halflife2stability(meanHalflife)))

Output:

Default model: [Atom(log2weight=-0.9999999943197082, alpha=3.0, beta=3.0, time=157.89152349505412), Atom(log2weight=-1.9467772475526095, alpha=3.0, beta=3.0, time=445.41938243913336), Atom(log2weight=-2.893554500785511, alpha=3.0, beta=3.0, time=1256.5489385418066), Atom(log2weight=-3.8403317540184125, alpha=3.0, beta=3.0, time=3544.783404584555), Atom(log2weight=-4.787109007251314, alpha=3.0, beta=3.0, time=10000.0)]
Fact #1 probability of recall: 93.2%
Fact #1 has stability of ≈2.3 days
Fact #1 probability of recall: 88.0%
Fact #1 has stability of ≈2.6 days
Fact #1 probability of recall: 80.6%
Fact #1 has stability of ≈3.3 days
Fact #1 probability of recall: 72.5%
Fact #1 has stability of ≈4.8 days
Fact #1 probability of recall: 65.8%
Fact #1 has stability of ≈7.9 days
Fact #1 probability of recall: 61.5%
Fact #1 has stability of ≈4.4 days

For Ebisu v3 I chose the same initial alpha, beta, and halflife as your example, but a new feature in v3 is that instead of a single Beta random variable there's an ensemble of them, by default 5 atoms, with halflives logarithmically spaced from your initial halflife out to much longer halflives. And as is expected for an ensemble, each atom has a weight: in the above example, the first atom has weight 0.5 and the rest have logarithmically decreasing weights (all summing to 1).

I expect v3 to be much more accurate when it comes to predicting halflife; the ensemble fixes a core modeling problem in older versions, which were grossly pessimistic.

Per your question: initial alpha and beta are totally trainable, but in my rough experiments with a small training set of a few tens of flashcards and a few months of data, the Bayesian framework is not very sensitive to the choice of initial alpha/beta. For example with the v3 script above (see the condensed sweep after this list),

  • alpha=beta=2 → final stability ≈4.9 days
  • alpha=beta=3 → ≈4.4 days
  • alpha=beta=5 → ≈4.0 days
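
Here's a minimal sketch reproducing those numbers by condensing the v3 script above into a sweep (same history and helper functions; only initialAlphaBeta changes):

import ebisu  # v3 release candidate
import numpy as np

def stability2halflife(s):
  return s * np.log(0.5) / np.log(0.9) * 24

def halflife2stability(h):
  return h / 24 / np.log(0.5) * np.log(0.9)

r_history = [1, 1, 1, 1, 1, 0]
t_history = [1, 2, 4, 8, 16, 32]  # days between successive quizzes

for ab in (2.0, 3.0, 5.0):
  model = ebisu.initModel(stability2halflife(1), 10e3, initialAlphaBeta=ab, firstWeight=0.5)
  for r, t in zip(r_history, t_history):
    model = ebisu.updateRecall(model, r, 1, t * 24.0)  # elapsed time in hours
  days = halflife2stability(ebisu.modelToPercentileDecay(model))
  print(f"alpha=beta={ab}: final stability ≈{days:.1f} days")

The same loop with initialAlphaBeta fixed at 3.0 and firstWeight swept over (0.5, 0.7, 0.9) reproduces the second sweep below.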

In v3, firstWeight is also important. 0.5 means that the longer-duration atoms have considerable weight and therefore can respond quite quickly to aggressive quiz schedules:

for atom in defaultModel:
  print(f'halflife={atom.time:.0f} hours, weight={2**atom.log2weight:.02g}')

Output:

halflife=158 hours, weight=0.5
halflife=445 hours, weight=0.26
halflife=1257 hours, weight=0.13
halflife=3545 hours, weight=0.07
halflife=10000 hours, weight=0.036

The default for firstWeight is actually 0.9, which makes the model quite conservative: it needs more quizzes than firstWeight=0.5 to coax the overall ensemble halflife up. So for example,

  • firstWeight=0.5 → final stability ≈4.4 days
  • firstWeight=0.7 → ≈3.2 days
  • firstWeight=0.9 → ≈2.3 days

Tuning this obviously has much more impact than the initial alpha/beta.

What do you think? I know Ebisu v2 makes very pessimistic predictions (I used it for years just to relatively rank flashcards, without considering the absolute value of the predicted recall probability), but I'm hoping v3 will make those absolute predictions much better. The challenge, I think, will be gaining confidence in what defaults to pick for the first weight (and potentially alpha/beta).

Thanks again so so much for looking and sharing this code! I need to read more about what "stability" means here and what is "good" stability 😅

@fasiha
Owner

fasiha commented Nov 17, 2023

By the way, if you want to test Ebisu v2 and v3 at the same time, here's a little trick: install v3 via

python -m pip install "git+https://github.com/fasiha/ebisu@v3-release-candidate"

and then

import ebisu # this is ebisu v3
import ebisu.ebisu2beta as ebisu2 # this is pretty much exactly the same code as v2

since v3 uses v2 to handle each of its atoms.

@L-M-Sherlock

Thanks again so so much for looking and sharing this code! I need to read more about what "stability" means here and what is "good" stability 😅

Stability is the interval (in days) at which the probability of recall drops to 90%. You can read more about FSRS here: https://github.com/open-spaced-repetition/fsrs4anki/wiki/The-Algorithm
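
For concreteness, a quick sketch of that definition, assuming the FSRS v4 power-law forgetting curve described in the linked wiki:

def fsrs_retrievability(t_days, stability_days):
    # FSRS v4 forgetting curve: retrievability is exactly 90% when t == stability
    return (1 + t_days / (9 * stability_days)) ** -1

assert abs(fsrs_retrievability(10, 10) - 0.9) < 1e-12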

@L-M-Sherlock

L-M-Sherlock commented Nov 17, 2023

An error occurred when I used v3:

import ebisu  # v3 release candidate
import numpy as np
from datetime import datetime, timedelta


def stability2halflife(s):
  return s * np.log(0.5) / np.log(0.9) * 24


def halflife2stability(h):
  return h / 24 / np.log(0.5) * np.log(0.9)


history = [[1, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [27, 0],
        [2, 0],
        [1, 0],
        [1275, 0],
        [2, 0],
        [2, 1],
        [5, 0],
        [3, 1],
        [4, 0],
        [2, 0],
        [2, 0],
        [2, 0],
        [2, 1],
        [5, 0],
        [2, 1]]

date0 = datetime(2017, 4, 19, 22, 0, 0)
defaultModel = ebisu.initModel(stability2halflife(1), 10e3, initialAlphaBeta=3.0, firstWeight=0.5)
item = dict(factID=1, model=defaultModel, lastTest=date0)
oneHour = timedelta(hours=1)

print("Default model:", defaultModel)

now = date0

for t, r in history:
  now += timedelta(days=t)
  p_recall = ebisu.predictRecall(item["model"], (now - item["lastTest"]) / oneHour)
  print("Fact #{} probability of recall: {:0.1f}%".format(item["factID"], p_recall * 100))

  newModel = ebisu.updateRecall(item["model"], r, 1, (now - item["lastTest"]) / oneHour)
  # print("New model for fact #1:", newModel)
  item["model"] = newModel
  item["lastTest"] = now
  meanHalflife = ebisu.modelToPercentileDecay(item["model"])
  print("Fact #{} has stability of ≈{:0.1f} days".format(item["factID"],
                                                         halflife2stability(meanHalflife)))
AssertionError                            Traceback (most recent call last)
/Users/jarrettye/Codes/open-spaced-repetition/ebisu/test.ipynb Cell 2, line 48
     45 p_recall = ebisu.predictRecall(item["model"], (now - item["lastTest"]) / oneHour)
     46 print("Fact #{} probability of recall: {:0.1f}%".format(item["factID"], p_recall * 100))
---> 48 newModel = ebisu.updateRecall(item["model"], r, 1, (now - item["lastTest"]) / oneHour)
     49 # print("New model for fact #1:", newModel)
     50 item["model"] = newModel

File /opt/homebrew/Caskroom/miniforge/base/lib/python3.11/site-packages/ebisu/ebisu.py:182, in updateRecall(model, successes, total, elapsedTime, q0, updateThreshold, weightThreshold)
    177 else:
    178   individualLogProbabilities = [
    179       _binomialLogProbability(int(successes), total=total, p=p) for p in pRecalls
    180   ]
--> 182 assert all(x < 0 for x in individualLogProbabilities
    183           ), f'{individualLogProbabilities=}, {model=}, {elapsedTime=}'
    185 newAtoms: BetaEnsemble = []
    186 for (
    187     oldAtom,
    188     updatedAtom,
   (...)
    195     _exceedsThresholdLeft([2**x.log2weight for x in model], weightThreshold),
    196 ):

AssertionError: individualLogProbabilities=[0.0, -0.0001561984253351121, -0.002639336235548395, -0.03040955091291429, -0.19115547804705163], model=[Atom(log2weight=-2.731432635627584e-09, alpha=13.589721116802567, beta=13.589721116803345, time=35.29059163499345), Atom(log2weight=-28.976466642518957, alpha=3.0, beta=3.0, time=445.41938243913336), Atom(log2weight=-47.04057705125492, alpha=3.0, beta=3.0, time=1256.5489385418066), Atom(log2weight=-65.60508073267627, alpha=3.0, beta=3.0, time=3544.783404584555), Atom(log2weight=-84.38072270751144, alpha=3.0, beta=3.0, time=10000.0)], elapsedTime=30600.0

@Expertium
Author

def halflife2stability(h):
  return h / 24 / np.log(0.5) * np.log(0.9)

I don't think it works this way. In Ebisu, you need alpha and beta to convert between stabilities and halflives, and there is no closed-form expression for that; Ebisu uses a numerical approximation.

fasiha referenced this issue in fasiha/ebisu Nov 18, 2023
Again see https://github.com/fasiha/ebisu.js/issues/23#issuecomment-1816160634

The old fixed bracket is just unreliable so let's do
a very quick log-10 to find a bracket before handing
off to Scipy
@fasiha
Owner

fasiha commented Nov 18, 2023

An error occurred when I use V3:

I see you're torturing my poor library @L-M-Sherlock 🤣 Thanks for this: it turns out the probability of recall for an atom with (13.6, 13.6, 35.3) parameters after 30600 time units (i.e., almost 900 times the halflife) is exactly 0% 🥲 expm1 can't help here, so I'm relaxing the assertion (commit).

Then I noticed that this pathological case was messing up the search in modelToPercentileDecay 😂 so I made that more robust (commit plus two more because I'm so error-prone).

SO. Now if you reinstall (python -m pip install "git+https://github.com/fasiha/ebisu@v3-release-candidate") and rerun your script, it should produce some sane output:

Default model: [Atom(log2weight=-0.9999999943197082, alpha=3.0, beta=3.0, time=157.89152349505412), Atom(log2weight=-1.9467772475526095, alpha=3.0, beta=3.0, time=445.41938243913336), Atom(log2weight=-2.893554500785511, alpha=3.0, beta=3.0, time=1256.5489385418066), Atom(log2weight=-3.8403317540184125, alpha=3.0, beta=3.0, time=3544.783404584555), Atom(log2weight=-4.787109007251314, alpha=3.0, beta=3.0, time=10000.0)]
Fact #1 probability of recall: 93.2%
Fact #1 has stability of ≈0.9 days, hl=145.6
Fact #1 probability of recall: 87.8%
Fact #1 has stability of ≈0.6 days, hl=98.1
Fact #1 probability of recall: 83.4%
Fact #1 has stability of ≈0.5 days, hl=77.7
Fact #1 probability of recall: 79.9%
Fact #1 has stability of ≈0.4 days, hl=65.9
Fact #1 probability of recall: 77.0%
Fact #1 has stability of ≈0.4 days, hl=57.8
Fact #1 probability of recall: 74.4%
Fact #1 has stability of ≈0.3 days, hl=51.7
Fact #1 probability of recall: 72.0%
Fact #1 has stability of ≈0.3 days, hl=46.9
Fact #1 probability of recall: 69.7%
Fact #1 has stability of ≈0.3 days, hl=43.0
Fact #1 probability of recall: 67.5%
Fact #1 has stability of ≈0.3 days, hl=39.8
Fact #1 probability of recall: 0.0%
Fact #1 has stability of ≈0.3 days, hl=39.8
Fact #1 probability of recall: 43.6%
Fact #1 has stability of ≈0.2 days, hl=37.6
Fact #1 probability of recall: 64.0%
Fact #1 has stability of ≈0.2 days, hl=35.3
Fact #1 probability of recall: 0.0%
Fact #1 has stability of ≈0.2 days, hl=35.3
Fact #1 probability of recall: 39.3%
Fact #1 has stability of ≈0.2 days, hl=33.7
Fact #1 probability of recall: 37.7%
Fact #1 has stability of ≈0.2 days, hl=36.2
Fact #1 probability of recall: 11.3%
Fact #1 has stability of ≈0.2 days, hl=35.5
Fact #1 probability of recall: 25.3%
Fact #1 has stability of ≈0.2 days, hl=39.0
Fact #1 probability of recall: 19.2%
Fact #1 has stability of ≈0.2 days, hl=38.0
Fact #1 probability of recall: 41.8%
Fact #1 has stability of ≈0.2 days, hl=36.5
Fact #1 probability of recall: 40.4%
Fact #1 has stability of ≈0.2 days, hl=35.2
Fact #1 probability of recall: 39.1%
Fact #1 has stability of ≈0.2 days, hl=34.0
Fact #1 probability of recall: 37.8%
Fact #1 has stability of ≈0.2 days, hl=35.8
Fact #1 probability of recall: 10.8%
Fact #1 has stability of ≈0.2 days, hl=35.3
Fact #1 probability of recall: 39.2%
Fact #1 has stability of ≈0.2 days, hl=37.1

In Ebisu, you need alpha and beta to convert between stabilities and halflives, and there is no closed-form expression for that; Ebisu uses a numerical approximation

@Expertium is right in general, but in this initial situation where alpha=beta, the time parameter of the model is indeed the halflife and you can skip the numerical search (the mean of a Beta(a, b) random variable is a / (a + b), which is 0.5 when a=b). I see in the latest script @L-M-Sherlock is taking care to call halflife2stability on the output of ebisu.modelToPercentileDecay 🙇 which I think is right.
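
A quick numerical check of that special case with Ebisu v2 (the model numbers are hypothetical):

import ebisu  # v2 API (or ebisu.ebisu2beta under the v3 release candidate)

model = (3.0, 3.0, 24.0)  # alpha, beta, halflife in hours
# At t == halflife the exact recall probability is the mean of Beta(3, 3), i.e. 0.5,
# so no numerical search is needed when alpha == beta:
print(ebisu.predictRecall(model, 24.0, exact=True))  # 0.5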

Thank you both so much for this lively and productive discussion 🥹🙏

@L-M-Sherlock

L-M-Sherlock commented Nov 18, 2023

I benchmarked Ebisu v3 in this PR: open-spaced-repetition/srs-benchmark#11

The initial results are not good:

Weighted by ln(number of reviews)

Algorithm        Log Loss  RMSE    RMSE (bins)
FSRS v4          0.3819    0.3311  0.0543
FSRS rs          0.3858    0.3326  0.0582
FSRS v3          0.5132    0.3670  0.1326
LSTM short-term  0.5917    0.3709  0.1302
LSTM             0.5934    0.3755  0.1382
SM2              0.8847    0.4131  0.2185
Ebisu v3         0.8680    0.4861  0.3354
HLR              2.5175    0.5574  0.4141

Note: I map the four ratings to binary results for Ebisu, and I don't know how to utilize the first learning step. If we solve these problems, the performance will improve.

Here is the code: https://github.com/open-spaced-repetition/fsrs-benchmark/blob/0da0a9adeecb4d354ce3f8933fc2d5ee300a9bb6/other.py#L234-L264

@Expertium
Author

Regarding grade mapping, I have suggested using different values of q0 for each grade.

@fasiha
Owner

fasiha commented Nov 18, 2023

Thanks for the benchmark @L-M-Sherlock! Is there a Python (or Rust or JavaScript or whatever) standalone library that I can use to call FSRS without using Anki?

After reading 1 and 2 I'd like to try it on my quiz history!

@L-M-Sherlock

Is there a Python (or Rust or JavaScript or whatever) standalone library that I can use to call FSRS without using Anki?

The Python lib: https://github.com/open-spaced-repetition/fsrs-optimizer

@fasiha
Owner

fasiha commented Nov 20, 2023

@L-M-Sherlock thanks! So I construct a CSV file of reviews, use this optimizer to analyze them, and feed the resulting weights into the equations in https://github.com/open-spaced-repetition/fsrs4anki/wiki/The-Algorithm? Is there a standalone implementation of the latter that I could use or should I just implement all the equations myself? (If the latter, I'll try to share the script to help other non-Anki folks use FSRS with their flashcards.)

I mentioned this in fasiha/ebisu#64 and you kindly replied there with an interesting idea (we can continue discussing that there), but here's an example of the issue with log-likelihood. This shows two different Ebisu v3 models for the same flashcard, with both the log-likelihood of each quiz and a running total of the log-likelihood sum. The red one is worse than the blue one in terms of log-likelihood, but we can clearly see that the red one is just more conservative: it thinks the string of successes is unsustainable and assigns them low probability, which hurts its log-likelihood but is vindicated when there's a failure. Meanwhile the overconfident blue model doesn't take that much of a hit for mis-predicting the failure:

[Figure: per-quiz log-likelihood and cumulative log-likelihood for the two models]

In the above example, I actually prefer the more pessimistic red model, but I'm not sure how to measure that. I'm really interested in trying to avoid this problem if we can! Otherwise my only way of measuring models against each other is whether they output "reasonable" predictions and show sensible halflife growth for highly error-prone cards and highly successful ones.

@L-M-Sherlock

L-M-Sherlock commented Nov 20, 2023

The red one is worse than the blue one in terms of log-likelihood, but we can clearly see that the red one is just more conservative: it thinks the string of successes is unsustainable and assigns them low probability, which hurts its log-likelihood but is vindicated when there's a failure.

When we have a huge dataset, it wouldn't be a problem: when the predicted probability is close to the real probability, the cross entropy reaches its minimum. For example, if the real probability is 0.9, the sampling results could be:

1, 1, 1, 0, 1, 1, 1, 1, 1, 1 (0 - failure, 1 - success)

For prediction 0.9, the cross entropy is - (9 * ln(0.9) + 1 * ln(1-0.9)) = 3.2508297339.

If the model is over-confident, the prediction may be 0.95, and the cross entropy is - (9 * ln(0.95) + 1 * ln(1-0.95)) = 3.457371923, which is larger than for the perfect prediction.

If the model is conservative, the prediction may be 0.85, and the cross entropy is - (9 * ln(0.85) + 1 * ln(1-0.85)) = 3.3597903504, which is also larger than for the perfect prediction.
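
A quick numerical check of those three cases:

import numpy as np

def cross_entropy(pred, outcomes):
    # Negative sum of log-likelihoods of binary outcomes under one constant prediction
    o = np.asarray(outcomes)
    return float(-np.sum(o * np.log(pred) + (1 - o) * np.log(1 - pred)))

outcomes = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]  # 9 successes, 1 failure
for pred in (0.9, 0.95, 0.85):
    print(f"prediction={pred}: cross entropy = {cross_entropy(pred, outcomes):.10f}")
# The prediction matching the true rate (0.9) attains the minimum.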

@fasiha, what do you think of that?

@fasiha
Owner

fasiha commented Nov 20, 2023

@L-M-Sherlock Thanks for that! Hmmm. Your illustrative example definitely works for cases where samples come from a known distribution (Bernoulli(0.9)), but of course our quizzes' true distribution is very complicated, right? So I don't think this is the whole story. Even in traditional binary classification, imbalanced classes are a big problem, e.g., https://www.tensorflow.org/tutorials/structured_data/imbalanced_data discusses various techniques like weighting the underrepresented class, resampling it, etc. Similar ideas appear in this link: https://datascience.stackexchange.com/a/31385. People also resort to these tricks when one class is "worse" than another, e.g., when you want to trade off false alarms versus missed detections (i.e., Type I versus Type II errors). In fact, reading https://stackoverflow.com/q/44560549 points to the TensorFlow function weighted_cross_entropy_with_logits, which takes not just observations and predicted probabilities but also weights to apply to each sample when computing cross-entropy.

So I think we're in the same situation with quizzes? Continuing from the plot above: the conservative red model is very competitive with the optimistic blue model for most examples, but for flashcards with a lot of failures (20% failures, left-most points), the optimistic blue model has better cross-entropy/sum-of-log-likelihood, because it is more willing to forgive and overlook early failures, whereas the conservative red model isn't. The red model tends to produce predictions that I post hoc like more than the blue model's, because I know the failures are coming and the red model was right to demand more quizzes. But it's punished for that, which upsets me!

[Figure: per-flashcard sum of log-likelihood for the two models]

I have not experimented at all with weighted cross-entropy (there are actually many ways to do this, e.g., https://datascience.stackexchange.com/a/40889 links to a Facebook paper that just does fixed (1-p)**gamma weighting; a sketch follows below). What do you think? Please forgive me if I'm still misunderstanding! I am extremely thankful for your thoughtful and helpful explanations 🙇!
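
Here's a minimal sketch of that (1-p)**gamma idea, i.e., focal-loss-style weighting (the function name and gamma value are illustrative, not any particular library's API):

import numpy as np

def focal_weighted_log_loss(preds, outcomes, gamma=2.0):
    # Cross entropy where each sample is down-weighted by (1 - p_true)**gamma, so
    # confidently-correct predictions contribute little and hard cases dominate.
    preds = np.asarray(preds, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    p_true = np.where(outcomes == 1, preds, 1 - preds)  # probability assigned to the observed class
    return float(-np.sum((1 - p_true) ** gamma * np.log(p_true)))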

@L-M-Sherlock

Your illustrative example definitely works for cases where samples come from a known distribution (Bernoulli(0.9)), but of course our quizzes' true distribution is very complicated, right? So I don't think this is the whole story.

Yeah, I agree that we are thinking about this problem in different ways. I treat the distribution as Bernoulli(0.9) because I only consider the review history (the rating behavior of the user). For example:

[Figure: forgetting curve fit from the author's Anki collection]

The forgetting curve above is generated from my Anki collection, restricted to cards whose first rating was 4 (easy); I then used the subsequent reviews to draw the curve. If we don't know anything about the content of these cards, the recall events can be treated as draws from a Bernoulli distribution.

@fasiha
Owner

fasiha commented Nov 21, 2023

I treat the distribution as Bernoulli(0.9) because I only consider the review history (the rating behavior of user)

@L-M-Sherlock am I understanding correctly: does this mean that you take each of the four Anki ratings (fail, hard, good, easy) as indicating four different but unknown recall probabilities? So, for example, every time you see an "easy" rating, that implies the probability of recall at that time was the same (unknown but fixed) value as every other time you saw an "easy" rating? And that's why you can use straightforward unweighted cross-entropy to measure it?

(Ebisu is a bit different. It's more like classic classification, where the observations are (in the simple case) just 0/1, pass/fail, with a Bayesian model that sees those as random draws from a Bernoulli distribution whose underlying probability parameter is itself a Beta random variable. In probabilistic terms, the quiz result at time t is x ~ Bernoulli(p) where p ~ Beta(alpha, beta), and we have to estimate alpha and beta from samples of x. So quiz results only tell you a little bit about the underlying probability of recall at the time of the quiz: there's no way each success was an i.i.d. draw from the same Bernoulli distribution, right? And this is also why I can't use cross-entropy to compare models, since the 0 vs 1 classes are so imbalanced, like the credit card fraud example from TensorFlow.)

(Edit: of course I should mention that in Ebisu there's also a nonlinear transformation to move the Beta random variable at time t to some other time, via either exponential decay in Ebisu v2 or the power-law decay of the model proposed in fasiha/ebisu#64.)
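
A tiny Monte Carlo sketch of that generative model, evaluated at the halflife (all numbers hypothetical):

import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 3.0, 3.0

p = rng.beta(alpha, beta, size=100_000)  # each card's recall probability at its halflife
x = rng.random(p.size) < p               # one Bernoulli quiz result per card

# Any single 0/1 result only weakly constrains its own p; only in aggregate does
# the success rate approach E[p] = alpha / (alpha + beta) = 0.5.
print(x.mean())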

@L-M-Sherlock

does this mean that you take each of the four Anki ratings (fail, hard, good, easy) as indicating four different but unknown recall probabilities? So, for example, every time you see an "easy" rating, that implies the probability of recall at that time was the same (unknown but fixed) value as every other time you saw an "easy" rating? And that's why you can use straightforward unweighted cross-entropy to measure it?

Maybe my words are still unclear. You can read my explanation here: https://github.com/open-spaced-repetition/fsrs4anki/wiki/Spaced-Repetition-Algorithm:-A-Three%E2%80%90Day-Journey-from-Novice-to-Expert#day-3-the-latest-progress

TL;DR: for items with the same review history (consistent ratings and intervals), I assume the same stability (or half-life), so they have the same probability of recall after the same number of elapsed days.

@Expertium
Author

I think Sherlock means that as long as delta_t and S are exactly the same, the probability of recall is also exactly the same. If a card with delta_t=x (last review was x days ago) and S=y (stability is y days) has a p% chance of being recalled, then another card with delta_t=x and S=y will also have a p% chance of being recalled.

@L-M-Sherlock

Yeah. But the S could be an average of a group of cards. For more details, please see: https://github.com/open-spaced-repetition/heterogeneous-memory-research

@Expertium
Author

@L-M-Sherlock @fasiha I'm just reminding you in case you were busy and forgot about this issue.

@Expertium
Author

@fasiha LMSherlock and I are reworking RMSE, and I just wanted to ask whether you still want to participate in the benchmark.

@fasiha
Owner

fasiha commented Feb 24, 2024

@Expertium thanks for reaching out! I'm happy to help in any way I can. I'm still experimenting with different versions of Ebisu; the latest, fasiha/ebisu#66, combines Ebisu v2 in a couple of different ways to achieve accuracy that seems as good as, if not better than, prior versions, while fixing a key problem. Would you be interested in benchmarking that? If so, let me know and I can provide a simple Python module that implements it.

I'm sorry that I haven't made time to properly digest the explanations given above about how the benchmark works 😔 but again I'm happy to provide code that implements the standard Ebisu API functions predictRecall and updateRecall and let you do your thing.

@Expertium
Author

Well, I'm not the code expert here; that's LMSherlock.
