
Benchmark Ebisu v2 #112

Merged · L-M-Sherlock merged 4 commits into main from benchmark-Ebisu-v2 on Sep 23, 2024
Conversation

L-M-Sherlock
Member

Close #85

TODO: add an introduction to Ebisu.

@L-M-Sherlock added the "enhancement" label on Aug 27, 2024
@L-M-Sherlock requested a review from @Expertium on Aug 27, 2024 02:08
@Expertium
Contributor

Why is it under "untrainable"?

@Expertium
Contributor

Also pinging @fasiha for feedback

@L-M-Sherlock
Member Author

Why is it under "untrainable"?

Is it trainable? Ebisu is optimized at the card level, like SM-2. I think it's untrainable.

@L-M-Sherlock
Member Author

I guess he is busy working on Ebisu v3. I will merge this PR when I come back from vacation if there's no feedback by then.

@fasiha

fasiha commented Aug 28, 2024

Sorry for the delay friends! Unfortunately I haven't made time to understand how the benchmark works so I can't offer any feedback, but I'm sure you've done a good job and I have no problem with you merging this PR 😁.

(I have an experimental single-file Python version of a proposed post-Ebisu2 model at https://github.com/fasiha/ebisu3split/blob/main/ebisu3split/ebisu3split.py. Same predictRecall and updateRecall API but different initModel. You run this as

```python
from ebisu3split import ebisu3split as ebisu3

model = ebisu3.initModel(
    1.25,
    24,  # hours
    w1=0.35,
    w2=0.35,
    scale2=5,
    hl3=365 * 24 * 10)  # hours

ankiResults = [3, 1, 3, 4, 3, 1, 3, 1, 4, 3, 3, 2, 1]
elapsedHours = [
    33.92992861114908, 113.95267277775565, 32.16110722220037,
    46.51897999999346, 170.2255811111536, 576.4576613888494, 95.66869305557339,
    1153.3546958333463, 67.68838499998674, 510.41189500002656,
    830.4334513888462, 975.9241575000342, 1370.9732519444078
]
assert len(ankiResults) == len(elapsedHours)

for x, t in zip(ankiResults, elapsedHours):
    p = ebisu3.predictRecall(model, t)
    successes = x > 1
    model = ebisu3.updateRecall(model, successes, 1, t)
    print(
        f"recall probability={p:0.2f}, new halflife={ebisu3.modelToPercentileDecay(model):0.1f} hours"
    )
```

This isn't an official Ebisu release so I don't want to bother you by asking you to benchmark this, but it's what I'm experimenting with right now and should be a lot more accurate with predicted recall probabilities than Ebisu v2. It's a stand-alone version of the algorithm described in fasiha/ebisu#66 and it just requires Ebisu v2 to be installed.)

I'm not sure what "untrainable" means either—I personally use a train/test split of historic data and cross-validation to find good values for Ebisu defaults (that's how I picked the values above). But maybe "untrainable" here means, once you pick the initial Bayesian parameters for your model, there's no further "training" step that happens after quizzes. That is, incorporating fresh data is an analytic update.
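(To make "analytic update" concrete, here's a tiny sketch using the Ebisu v2 tuple API; the numbers are illustrative only, not recommendations:)

```python
import ebisu

model = (2.0, 2.0, 24.0)  # (alpha, beta, halflife in hours) -- an illustrative prior
p = ebisu.predictRecall(model, tnow=12.0, exact=True)  # recall probability 12 h after last review
model = ebisu.updateRecall(model, successes=1, total=1, tnow=12.0)  # analytic posterior update, no training pass
print(p, ebisu.modelToPercentileDecay(model))  # updated halflife estimate, in hours
```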

@Expertium
Contributor

Expertium commented Aug 28, 2024

"Trainable" means that there are parameters that:

  1. Work at the collection-level rather than card-level. In other words, they are the same for each card
  2. We can run gradient descent on them

I don't know the inner workings of Ebisu, but these look trainable to me:

```python
model = ebisu3.initModel(
    1.25,
    24,  # hours
    w1=0.35,
    w2=0.35,
    scale2=5,
    hl3=365 * 24 * 10)  # hours
```

@L-M-Sherlock
Member Author

I don't know the inner workings of Ebisu, but these look trainable to me:

I don't know how to apply SGD to these parameters.

@Expertium
Contributor

Expertium commented Aug 30, 2024

@L-M-Sherlock @fasiha I get that you both are busy, but I'm sure we all can agree that a half-assed implementation where nobody really knows what's going on is not good, right?
Right now Ebisu is performing extremely poorly.
[image: benchmark results table, with AUC in the rightmost column]
It makes SM-2 look good. That's not supposed to happen. This means that either the current implementation is bugged or Ebisu is in dire need of improvement. If it's the latter, then Fasiha, I'm sure you would be curious to find ways to improve it, right? And we can benchmark Ebisu-v3, too.
So please find some time to work on this, both of you 🙏

@fasiha

fasiha commented Aug 30, 2024

I love the passion here @Expertium! One of my favorite quotes is from legendary mathematician Richard Hamming who said “the purpose of computing is insight, not numbers” (that’s the title of my blog). And we all now know thanks to late-stage capitalism that “make the number go up” is a bad mindset to get into because it can mean we optimize the wrong thing or we fail to understand why others might not want to optimize that thing.

In this situation, you are using the output of Ebisu’s predictRecall to compare it against other algorithms (I think! If I'm wrong, let me know and I'll rethink this whole reply 😅). Almost from the beginning we realized that Ebisu was overly pessimistic with that raw number. And yet I and many others prefer Ebisu because

  1. we don’t care about the absolute accuracy of predictRecall: we’re making quiz apps that run predictRecall on all facts and rank them to find the facts that need to be quizzed right now (instead of scheduling a review for each fact at some future date). But more importantly,
  2. the specific Bayesian model Ebisu uses allows us to make really interesting not-boring flash cards.

Specifically, Ebisu's statistical model allows us to handle cards like the following (a rough API sketch follows the list):

  • the student wrote a sentence that conjugated ir verbs' past tense twice but failed one of them (binomial quizzes)
  • the student passively read this passage and didn't click on this word for a definition: they might remember the meaning but there's an X% chance they forgot it and didn't bother clicking (noisy-binary quizzes).
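As a rough, non-authoritative sketch of what those two quiz types look like through Ebisu v2's updateRecall (the soft-binary q0 keyword assumes ebisu ≥ 2.1; the model values are illustrative):

```python
import ebisu

model = (2.0, 2.0, 24.0)  # (alpha, beta, halflife in hours)

# Binomial quiz: two ir-verb conjugations attempted, one failed -> 1 success out of 2.
model = ebisu.updateRecall(model, successes=1, total=2, tnow=36.0)

# Noisy-binary quiz: the student read past the word without clicking for a definition;
# treat it as a probable-but-not-certain success, with q0 = P(this observation | actually forgotten).
model = ebisu.updateRecall(model, successes=0.85, total=1, tnow=60.0, q0=0.1)
```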

Work on Ebisu v3 has been slow mostly because I am super-lazy but in part because fixing the accuracy of predictRecall (to make it competitive with the other algorithms being benchmarked here) is both hard and not a priority. It’s hard because if we tweak or redo the Bayesian model to improve this one behavior, other desiderata suffer. It’s not a priority because once again I don’t care about the absolute accuracy of recall predictions, only their relative ranking, which I think Ebisu handles just fine.

The experimental Ebisu 3-split algorithm I shared above is my current favorite post-Ebisu2 algorithm, but I personally need to do a bunch more live testing with it in real quiz apps before gaining confidence in it, even though it performs quite well in my benchmarks, because I really care about ensuring that predictRecall outputs numbers that can be ranked against each other under all the interesting varied quizzes that Ebisu supports. I hesitate to ask you to benchmark it since it’s provisional and I don’t know if you want to clutter your table with a bunch of transient algorithms, but you’re welcome to!

Finally, I’m quite confident Ebisu 2 isn’t buggy because we’ve tested the implementation analytically (Wolfram Alpha), with raw Monte Carlo, and with Markov-chain Monte Carlo (STAN). It’s a faithful implementation of the statistical model we’ve constructed, and all the above issues lie with the drawbacks of that model.

@Expertium
Contributor

Expertium commented Aug 30, 2024

only their relative ranking, which I think Ebisu handles just fine.

I'm afraid you are mistaken. RMSE and log-loss measure calibration aka how close predicted probability of recall is to the actual review outcome; but AUC (rightmost column in the image above) measures discrimination. It doesn't care about absolute values, only about ranking aka whether cards that have been forgotten had a lower value associated with them than cards that were successfully recalled. For example, if the value was 0.9 for a recalled card and 0.1 for a forgotten card, the AUC score will be the same as if the values were 0.999 and 0.001. BUT, if the value was 0.1 for a recalled card and 0.9 for a forgotten card, then the AUC score will be different.
And the AUC score of Ebisu is lower than that of HLR, which is completely unscientific and doesn't even care about the order of reviews. For HLR, [fail, pass, pass] is the same as [pass, pass, fail].
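(A toy illustration of that point, assuming scikit-learn is available: AUC is unchanged by rescaling the predictions but collapses when the ranking flips.)

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 0]  # 1 = recalled, 0 = forgotten
print(roc_auc_score(y_true, [0.9, 0.1]))      # 1.0
print(roc_auc_score(y_true, [0.999, 0.001]))  # 1.0 -- same ranking, same AUC
print(roc_auc_score(y_true, [0.1, 0.9]))      # 0.0 -- ranking flipped, AUC collapses
```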

Finally, I’m quite confident Ebisu 2 isn’t buggy

My bad, I meant that the way it's used in the benchmark isn't proper, perhaps. But again, I don't know. You and LMSherlock would have to double-check the code in this repo to find out.

@fasiha

fasiha commented Aug 30, 2024

Very very cool, thanks for explaining @Expertium!!! I love AUC but never thought to apply it here, let me see if I can figure out what's going on in the PR. Maybe Ebisu v2 is just terrible at both absolute probability and relative ranking 🤣

I don't know how to apply SGD to these parameters.

I'm not sure if this is what you mean but I sometimes run the algorithm for a grid of parameter values on the entire training split of the data to build charts like this one to visualize the behavior of the algorithm as two values vary. So maybe you could set up an objective function of all the initial parameters (for Ebisu 2, there's two: initial α=β > 1 and initial halflife > 0; for Ebisu 3split above, there's a few more) that outputs a number (sum-of-log-loss/focal loss, AUC) and optimize that via SGD?

It's true that some of these parameters (especially initial halflife) can and will be tweaked by apps since the student can tell you "I know this fact" but I think all Ebisu users just pick a fixed initial α=β and use it for all cards, so that's effectively one parameter. And I often treat even initial halflife as a parameter that I hold constant for all data when picking reasonable defaults.
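For illustration, here's a minimal sketch of that kind of grid search over an objective, assuming Ebisu v2's (alpha, beta, halflife) tuple API; the grid values and helper names are mine, and in practice you'd sum the loss over every card in the training split of one user's collection:

```python
import itertools
import math

import numpy as np
import ebisu


def card_log_loss(alpha, halflife_hours, reviews):
    """Replay one card's history; reviews is [(elapsed_hours, passed), ...] in order."""
    model = (alpha, alpha, halflife_hours)  # keep alpha = beta > 1, per the constraint above
    loss = 0.0
    for t, passed in reviews:
        p = ebisu.predictRecall(model, t, exact=True)
        p = min(max(p, 1e-12), 1 - 1e-12)
        loss -= math.log(p) if passed else math.log(1 - p)
        model = ebisu.updateRecall(model, int(passed), 1, t)
    return loss


def grid_search(all_cards):
    """all_cards: list of per-card review histories for one collection."""
    alphas = [1.25, 1.5, 2.0, 3.0, 4.0]
    halflives = np.geomspace(24, 24 * 512, 10)  # 1 day to 512 days, in hours
    return min(
        itertools.product(alphas, halflives),
        key=lambda ah: sum(card_log_loss(ah[0], ah[1], c) for c in all_cards),
    )
```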

Anyway, let me think about AUC for a bit and see how it works here. Thanks very much, I feel I'm always learning from you two!

@L-M-Sherlock
Member Author

L-M-Sherlock commented Aug 31, 2024

I manually picked the parameters to make Ebisu more optimistic. Admittedly, I haven't tried every possible combination of parameters for Ebisu. If you want, I will try more when I come back from vacation.

@fasiha

fasiha commented Aug 31, 2024

Very fun, I coded up ROC/AUC for my Ebisu v2 and 3-split models (discussed in fasiha/ebisu#66 (comment), there are some preliminary ROC curves there). I'm going to play with the Hugging Face dataset and see if I can get the same AUC for the same Ebisu v2 params (initial α=β=1, initial half-life of 7 days) that you found. Then I'm really curious to see what the ROC/AUC looks like for the 3-split algorithm, and build more intuition about this application of this technique 🎇🚀

@fasiha

fasiha commented Sep 3, 2024

Can we hold off on merging this? I'm digging into a few issues—

  1. I'm adding a delta_t_sec column to https://huggingface.co/datasets/open-spaced-repetition/FSRS-Anki-20k (patch below) because Ebisu breaks when you give it a quiz at 0 time elapsed since last quiz—as you discovered. Rather than giving Ebisu 0.001 days, we'll just give it a fractional day. No idea if this'll change anything but best to be sure.
  2. Can you upgrade to Ebisu 2.2? As I was running through the data, some models were throwing exceptions due to numerical instability, and I've fixed these in 2.2. I guess this benchmark silently swallows exceptions?
  3. (We should also initialize Ebisu's alpha and beta to >1. Even Ebisu 2.2 can't handle this—it needs a tiny bit of prior, but $\text{Beta}(1,1)$ is totally uninformative (uniform distribution): ebisu.updateRecall(prior=(1, 1, 24), successes=1, total=1, tnow=24) throws. But I'll suggest better parameters later.)
  4. (Note to self. When I run Ebisu on millions of cards, I run into memory issues because it's caching intermediate calls of the Beta function 🙄 you can clear the cache and recover memory with ebisu.ebisu._BETALNCACHE = {}.)

Here's the tentative patch adding delta_t_sec to the dataset:

```diff
diff --git a/revlogs2dataset.py b/revlogs2dataset.py
index 1fe711dec..8429e8ef8 100644
--- a/revlogs2dataset.py
+++ b/revlogs2dataset.py
@@ -54,7 +54,12 @@ def process_revlog(revlog):
         lambda x: int((x / 1000 - entries.next_day_at) / 86400)
     )
     df["delta_t"] = df["relative_day"].diff().fillna(0).astype("int64")
+    df["relative_day_frac"] = df["review_time"].apply(
+        lambda x: (x * 1e-3 - entries.next_day_at))
+    df["delta_t_sec"] = df["relative_day_frac"].diff().fillna(
+        0).round().astype("int64")
     df.loc[df["state"] == 0, "delta_t"] = -1
+    df.loc[df["state"] == 0, "delta_t_sec"] = -1
     df["card_id"] = pd.factorize(df["card_id"])[0]
     df["review_th"] = df["review_time"].rank(method="dense").astype("int64")
     df.drop(
@@ -65,11 +70,15 @@ def process_revlog(revlog):
             "last_learn_start",
             "mask",
             "relative_day",
+            "relative_day_frac",
             "i",
         ],
         inplace=True,
     )
-    df = df[["card_id", "review_th", "delta_t", "rating", "state", "duration"]]
+    df = df[[
+        "card_id", "review_th", "delta_t", "rating", "state", "duration",
+        "delta_t_sec"
+    ]]
 
     if df.empty:
         return 0
```

Here are the first few rows of the resultant 1/1.csv:

```
card_id,review_th,delta_t,rating,state,duration,delta_t_sec
0,1,-1,3,0,5737,-1
0,2,0,3,1,5700,6
0,3,4,3,2,60000,343518
0,163,6,4,2,60000,492674
0,237,1,2,4,36968,113476
0,380,11,4,2,14912,945071
1,4,-1,3,0,60000,-1
1,14,0,1,1,60000,855
1,16,0,1,1,60000,3325
```

@fasiha

fasiha commented Sep 3, 2024

No idea if this'll change anything but best to be sure.

Actually, I'm guessing that giving Ebisu a quiz time of 0.001 days when delta_t=0 will really mess it up in the case of failures: you're telling Ebisu that 1.4 minutes after a review, the student failed a quiz. Ebisu will often amend its model very aggressively, which might impact performance for the rest of that card's life.

Perhaps in the aggregate, over millions of quizzes/cards, the impact of this is minor? But it'll be good to just give Ebisu accurate times.

@L-M-Sherlock
Member Author

Can you upgrade to Ebisu 2.2

Have you uploaded it to PyPI? I installed Ebisu directly via pip.

@Expertium
Contributor

Expertium commented Sep 3, 2024

But it'll be good to just give Ebisu accurate times.

We can't. That's just how Anki works. You can't access the exact intervals of same-day reviews. Maybe you could add a new parameter, which can be interpreted as "average interval length of same-day reviews".
Btw, I think the biggest improvement will be figuring out how to run gradient descent on Ebisu's parameters, so that they can be adjusted for each collection aka for each user.

@L-M-Sherlock
Member Author

You can't access the exact intervals of same-day reviews.

We have access to the exact intervals in the review data (float delta_t). We just cannot get it in Anki's scheduler.

@Expertium
Contributor

Yeah. But then Ebisu would get an unfair advantage.

@L-M-Sherlock
Member Author

But then Ebisu would get an unfair advantage.

I think we can add a note for that. And GRU-P-short and FSRS-5 both got an unfair advantage.

@Expertium
Contributor

Expertium commented Sep 4, 2024

And GRU-P-short and FSRS-5 both got an unfair advantage.

No, because they don't have access to exact interval lengths. They only rely on grades, which is fair.
No same-day reviews = fair, because you can do that in Anki.
Same-day reviews, but only use grades = fair, because you can do that in Anki.
Same-day reviews with intervals expressed as fractions = unfair, you can't do that in Anki.

@user1823
Contributor

user1823 commented Sep 15, 2024

Same-day reviews with intervals expressed as fractions = unfair, you can't do that in Anki.

Well, I don't think that it's unfair. I think that the goal of this benchmark experiment should be to compare algorithms when they are giving their best performance. We can easily remove the technical limitations imposed by Anki if the benefits turn out to be significant.

@Expertium
Contributor

We can easily remove the technical limitations imposed by Anki if the benefits turn out to be significant.

According to Dae, not easily at all

@user1823
Contributor

Even if it is not easy to do in Anki, I think that we should still allow algorithms to access all the data they need.

I think that the goal of the benchmark is to find out the full potential of the algorithm, not limited by "artificial" constraints.

@L-M-Sherlock
Member Author

L-M-Sherlock commented Sep 22, 2024

Here are the results of some hyper-parameter search experiments:

| Initial Half-life (days) | Alpha | Beta | LogLoss | RMSE (bins) | AUC |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 1.1444±0.4159 | 0.4878±0.1032 | 0.6319±0.0978 |
| 7 | 1 | 1 | 0.7542±0.2644 | 0.3500±0.0899 | 0.6070±0.0881 |
| 7 | 2 | 2 | 0.8816±0.3470 | 0.3924±0.1046 | 0.5787±0.0953 |
| 14 | 1 | 1 | 0.6524±0.2373 | 0.3018±0.0868 | 0.5986±0.0898 |
| 28 | 1 | 1 | 0.5778±0.2251 | 0.2578±0.0854 | 0.5920±0.0925 |
| 56 | 1 | 1 | 0.5335±0.2329 | 0.2203±0.0856 | 0.5868±0.0940 |
| 112 | 1 | 1 | 0.5205±0.2638 | 0.1920±0.0877 | 0.5829±0.0952 |
| 224 | 1 | 1 | 0.5376±0.3148 | 0.1745±0.0919 | 0.5800±0.0955 |
| 448 | 1 | 1 | 0.5817±0.3792 | 0.1668±0.0992 | 0.5779±0.0957 |
| 448 | 0.5 | 0.5 | 0.5328±0.3218 | 0.1626±0.0908 | 0.5902±0.0920 |
| 512 | 0.5 | 0.5 | 0.5412±0.3331 | 0.1618±0.0925 | 0.5895±0.0919 |
| 448 | 0.2 | 0.2 | 0.4913±0.2626 | 0.1636±0.0791 | 0.6072±0.0871 |
| 512 | 0.2 | 0.2 | 0.4949±0.2701 | 0.1620±0.0808 | 0.6060±0.0876 |
| 448 | 0.1 | 0.1 | 0.4838±0.2390 | 0.1673±0.0707 | 0.6187±0.0832 |

For reference:

| Model | LogLoss | RMSE (bins) | AUC |
| --- | --- | --- | --- |
| HLR | 0.4912±0.2753 | 0.1386±0.0812 | 0.6351±0.0947 |
| SM-2 | 0.8693±0.8719 | 0.2174±0.1358 | 0.6022±0.0957 |

Some thoughts:

  1. Alpha and Beta should be smaller to make the model more aggressive.
  2. The initial half-life should be longer because Ebisu is too conservative about increasing the half-life on its own.

Good news: Ebisu-v2 could outperform SM-2.

Bad news: Ebisu-v2 cannot outperform HLR.

@Expertium
Contributor

Expertium commented Sep 22, 2024

@fasiha seems like we need Ebisu v3 to make the competition more exciting. But in all seriousness though, if initial half-life, alpha and beta cannot be fine-tuned for each user individually, there is no way Ebisu can end up close to the top by any metric. You really should figure out how to run gradient descent on them.

@L-M-Sherlock
Member Author

L-M-Sherlock commented Sep 22, 2024

If we cannot benchmark Ebisu v3 soon, which set of hyper-parameters do you want to use in the benchmark of Ebisu-v2?

@Expertium
Contributor

Expertium commented Sep 22, 2024

Whichever results in the lowest RMSE.

@L-M-Sherlock
Member Author

Initial Half-life = 448
Alpha = Beta = 0.5

@Expertium
Contributor

Expertium commented Sep 22, 2024

Actually, on second thought, use Initial Half-life = 448 and Alpha = Beta = 0.2. RMSE is almost the same, but log-loss and AUC are better

@L-M-Sherlock
Member Author

L-M-Sherlock commented Sep 22, 2024

After further tuning of the hyper-parameters, I find that Initial Half-life = 512, Alpha = Beta = 0.2 is better. I've decided to use this set of hyper-parameters.

@fasiha

fasiha commented Sep 22, 2024

A few points!

1. Using fractional time of quiz

I agree with @user1823:

That's just how Anki works. You can't access the exact intervals of same-day reviews

If that's the case, I recommend you publish two sets of tables: in addition to the benchmarks you already have, consider a second set, giving all algorithms the fractional time between reviews. Given this repo is called "SRS Benchmark" and not "Anki's unfortunate design decisions Benchmark" 😝 I think that would be interesting and useful for quiz authors to understand various algorithms' behavior without being limited by Anki's silliness.

Because, I've been experimenting with Ebisu v2 and v3 (split algorithm) models with fractional times in the FSRS dataset and there's a lot of really interesting behavior from an Ebisu perspective. As a small example, here's the ROC for four models for the first million flashcards in the FSRS dataset picking 10% of users:

[image: ROC curves for the four models listed below]

| Model | AUC (higher is better) | Focal loss (higher is better) |
| --- | --- | --- |
| a nice Ebisu 3-split model | 0.74 | -1e4 |
| a nice Ebisu v2 model | 0.73 | -5e4 |
| Your recently-proposed "best" v2 model | 0.70 | -2e4 |
| Your initial proposed v2 model | 0.71 | -2e4 |

The performance improvements we see here due to better data (AUC of 0.7+ vs the 0.6 in your table) are almost certainly going to be shared by all the other algorithms.

2. Data cleanup

Now. If you choose to consider training/benchmarking with this higher-quality data, you'll run into a problem. Very often you have data where a user will fail a quiz, and then fail it again right away. Random example is https://huggingface.co/datasets/open-spaced-repetition/FSRS-Anki-20k/blob/main/dataset/1/30.csv (line 6886):

| card_id | review_th | delta_t | rating | state | duration | delta_t_sec |
| --- | --- | --- | --- | --- | --- | --- |
| 462 | 2298 | 1 | 1 | 2 | 5514 | 84658 |
| 462 | 2305 | 0 | 1 | 3 | 7546 | 794 |
| 462 | 2330 | 0 | 1 | 3 | 5161 | 294 |
| 462 | 2341 | 0 | 2 | 3 | 3835 | 88 |

This is normal; my Anki data (going back to 2013) has the same behavior: sometimes you just hit "fail" accidentally, or you didn't want to review your failure right then so you came back to it a few minutes later in the session. Anki doesn't have an easy way to "undo" an accidental fail review once it has scheduled the next quiz, so there's no incentive to keep quiz history clean.

But this hurts Ebisu a lot because the algorithm is very surprised to see failures so quickly after a review. In fact, giving fractional elapsed data can yield worse Ebisu performance (for the same models!) because, instead of you mapping "0 days" to 0.01 days = 14 minutes in your code, Ebisu gets the "real" elapsed time of, for example, 88 seconds 😖, and aggressively updates the model, thereby ruining it for future quizzes.

I've always handled this by

  1. group together quizzes that happen in the same four hour block
  2. keep only the first failure in the block (ignoring all successes before the first failure)
  3. and keep any successes after the last failure in that block

So if you saw this sequence of quiz results:

  • A: 50 hours later, success
  • B: 5 minutes later, failure
  • C: 10 minutes later, success
  • D: 2 minutes later, failure
  • E: 15 minutes later, success
  • F: 24 hours later, success

you'd collapse this to just B, E, and F (keep only the first failure and all successes after the last failure in the 4 hour window). Intuitively we ignore A because this chunk is a failure: the algorithm will be poisoned if you tell it there was a success 50 hours after last review and then a failure 5 minutes later. (Of course you convert all data to absolute time and recalculate the deltas after you decide which quizzes to keep.)

I picked this approach because of Anki's daily approach to quizzes: grouping together quizzes that happened within four hours ensures that we're well within the one-day quantization that Anki uses. This is ad hoc but I didn't tailor this behavior to work with this benchmark, I promise! Here's the same four hour interval code from 2021.
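For concreteness, here's a rough, self-contained sketch of that grouping rule (anchoring each block to its first quiz is my assumption; the linked 2021 code is the authoritative version):

```python
def clean_quizzes(quizzes, window_hours=4.0):
    """quizzes: [(absolute_time_hours, passed_bool), ...] for one card, sorted by time."""
    cleaned, i = [], 0
    while i < len(quizzes):
        # Gather one block: everything within window_hours of the block's first quiz.
        j = i + 1
        while j < len(quizzes) and quizzes[j][0] - quizzes[i][0] < window_hours:
            j += 1
        block = quizzes[i:j]
        fails = [q for q in block if not q[1]]
        if fails:
            last_fail_time = fails[-1][0]
            cleaned.append(fails[0])  # keep only the first failure in the block
            cleaned.extend(q for q in block if q[1] and q[0] > last_fail_time)  # successes after the last failure
        else:
            cleaned.extend(block)  # no failures: keep everything
        i = j
    return cleaned  # recompute elapsed-time deltas from these absolute times afterwards
```

Running this on the A–F example above keeps exactly B, E, and F.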

The above ROC curve is with this "cleanup" of the data applied. Here's the fsrs_anki_20k_reader.py helper I wrote to apply this idea to the FSRS dataset with fractional intervals.

You can argue that it's up to algorithms to handle this kind of behavior, that users of Anki and Anki-like apps shouldn't need to modify their behavior to prioritize the integrity of their quiz history. That may be a reasonable stance. But I feel low-parameter models like Ebisu shouldn't be punished for being unable to deal with messy data. A 300-parameter neural network has plenty of free parameters to learn to deal with Anki-specific complications: such a neural network could easily learn to group closely-spaced failures, and hedge its predictions for quizzes very close to failures, and other elaborate coping mechanisms to deal with integer-day intervals. Ebisu has zero excess parameters to learn such behavior. It relies on the data being clean from the start.

If you agree to run the benchmarks with fractional interval data, I hope you'll also consider some kind of cleanup step, to avoid telling algorithms that 2+ independent failures happened in rapid succession, and somehow "group" these together. My ad hoc approach outlined above is unlikely to be the best!

If you decide you don't want to consider fractional interval data and/or you don't want to do any cleanup, then maybe there should be a footnote, "The data includes back-to-back failures that we report to Ebisu as failures with 0.01 day intervals" or something sad like this 😢

3. Unbalanced data and log loss

Totally unrelated question here. When you optimize other models, do you run stochastic gradient descent to minimize log loss (cross entropy)? We've talked about this before in other issues but I think this is wrong because optimizing log-loss produces results that are skewed by unbalanced data (far more successes than failures). That's why I prefer to use focal loss, a very simple tweak on cross entropy, introduced by Facebook researchers here.

Specifically, in your example of searching for better Ebisu initial parameters, you found that a very long initial halflife results in best log-loss. However, this only happens because the sum of Bernoulli log likelihoods (cross entropy) is improved by making the model overconfident on the larger set of quiz successes at the expense of ruining performance on the smaller set of quiz failures. This is absolutely not the behavior we want: if anything we want a model that prioritizes getting failures correct and is willing to sacrifice performance on successes.

If you instead optimize focal loss (or some other parameter that handles unbalanced training data), you'll get different optima. When I run a grid search of Ebisu v2 on focal loss for just my Anki quiz data (370 flashcards), I see that the best-performing Ebisu v2 model happens at halflife much smaller than 400 days (actually around 150 hours), because the metric isn't impressed by easily-classified successes.
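For reference, here's a minimal sketch of focal loss in the Lin et al. formulation (returned as a positive mean loss, so lower is better; the gamma and alpha defaults are illustrative, not tuned):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """p: predicted recall probabilities; y: 1 for a pass, 0 for a fail."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    y = np.asarray(y, dtype=float)
    pt = np.where(y == 1, p, 1 - p)          # probability assigned to the true outcome
    at = np.where(y == 1, alpha, 1 - alpha)  # class weight: passes (the majority here) get alpha
    return float(np.mean(-at * (1 - pt) ** gamma * np.log(pt)))
```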

[image: grid search of Ebisu v2 parameters scored by focal loss]

(The AUC for the above grid of Ebisu v2 parameters also peaks far away from your "400 days optimum". And I'm starting to think this is what I want to optimize rather than focal loss, since I really really don't care about how accurate predictRecall is, I care a lot more about the relative ranking of predictRecalls called on different cards, and AUC is "the likelihood that a randomly chosen successful quiz got a higher predictRecall than a randomly chosen failed quiz".)

[image: grid search of Ebisu v2 parameters scored by AUC]

Since AUC and focal loss both peak at close-ish parameters, I think that emphasizes the importance of dealing with training imbalance.

4. AUC vs "true positive rate at 20% false positive rate"

Overall, I'm not happy with AUC in the 70% range. In the ROC curves I shared above, the best-performing models achieved only 50% true positive rate (TPR) at a 20% false positive rate (FPR). That is, if we threshold the predicted recall probability such that 20% of the actually-failed quizzes are incorrectly predicted to succeed, then only 50% of the successful quizzes are predicted to have succeeded. This is a miserable classifier 😭 I would prefer to see 80% or 90% TPR at 20% FPR 😕.

I think it'd be really interesting to compare all the algorithms against not just AUC (area under the entire curve) but also the TPR at 20% FPR? If two algorithms both have the same AUC, their TPR at 20% FPR would be a great way to further compare their ranking capacity. The one with better TPR at 20% FPR would be better in that, assuming we were OK with a 20% FPR, that algorithm predicted more successful quizzes as successful.

(And I'm not saying 20% FPR is tolerable. Depending on the distribution of failed quizzes, maybe TPR at 10% FPR is what we should optimize. A 10% FPR means 10% of the failed quizzes were incorrectly predicted to be successful 😫 that's still so high it hurts my soul.)
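A small sketch of that summary statistic, assuming scikit-learn's roc_curve (the linear interpolation of the ROC is a simplification):

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, scores, target_fpr=0.2):
    """y_true: 1 = recalled, 0 = forgotten; scores: predicted recall probabilities."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    return float(np.interp(target_fpr, fpr, tpr))  # TPR of the ROC curve at the chosen FPR
```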

Thanks for reading, sorry for so many questions and ideas.

@L-M-Sherlock
Member Author

If that's the case, I recommend you publish two sets of tables: in addition to the benchmarks you already have, consider a second set, giving all algorithms the fractional time between reviews.

I'm considering it. In fact, I have benchmarked two models with fractional time.

But this hurts Ebisu a lot because the algorithm is very surprised to see failures so quickly after a review

I think it's a weakness of Ebisu rather than a problem with the data. You cannot prevent users from generating messy data in practice, so it should be handled by the algorithm.

That's why I prefer to use focal loss, a very simple tweak on cross entropy, introduced by Facebook researchers here.

I know of it, but it will skew the algorithm toward underestimating stability: focal loss tends to give more weight to failed reviews, which induces the algorithm to decrease the stability, so the user will end up with higher true retention than their desired retention.

@Expertium
Contributor

Expertium commented Sep 23, 2024

I don't like that focal loss has two hyper-parameters that have to be fine-tuned, but OK, I can benchmark it. I wonder whether, for some combination of hyper-parameters, minimizing focal loss instead of log-loss will result in lower RMSE; that would be a very interesting find.

And btw, I still think that the number one priority should be running gradient descent on initial half-life, alpha and beta for each user.

@L-M-Sherlock
Member Author

The discussion has become too verbose. I'd like to merge this PR first, and then open a new issue to discuss how to improve Ebisu.

@L-M-Sherlock merged commit 7a901db into main on Sep 23, 2024
@L-M-Sherlock deleted the benchmark-Ebisu-v2 branch on September 23, 2024 at 11:03