Skip to content

Fix key backup cached event ordering#5281

Open
gumadeiras wants to merge 3 commits intomatrix-org:developfrom
gumadeiras:codex/fix-backup-key-cache-event-order
Open

Fix key backup cached event ordering#5281
gumadeiras wants to merge 3 commits intomatrix-org:developfrom
gumadeiras:codex/fix-backup-key-cache-event-order

Conversation

@gumadeiras
Copy link
Copy Markdown

@gumadeiras gumadeiras commented Apr 16, 2026

Summary

Fixes a Rust crypto key-backup event ordering bug where creating a new backup could emit CryptoEvent.KeyBackupDecryptionKeyCached before local backup state was coherent for listeners.

Before this change, setupKeyBackup() stored the generated backup key through saveBackupDecryptionKey(), which immediately emitted the cached-key event. PerSessionKeyBackupDownloader listens to that event and may immediately query backup configuration, seeing stale or undefined backup info and logging Unsupported algorithm undefined.

This PR keeps generated-backup setup quiet, then has RustCrypto.resetKeyBackup() store the recovery key and use the successful backup creation response to cache server backup info, enable the matching local backup, and only then emit the cached-key event. Received backup secrets now use the same coherent activation path after validating that the secret matches the current server-side backup.

Testing

pnpm test spec/unit/rust-crypto/backup.spec.ts spec/unit/rust-crypto/PerSessionKeyBackupDownloader.spec.ts -- --runInBand
pnpm test spec/unit/rust-crypto/rust-crypto.spec.ts -- --runInBand
pnpm test spec/integ/crypto/verification.spec.ts -t "Should accept the backup decryption key gossip if valid" -- --runInBand
pnpm test spec/integ/crypto/megolm-backup.spec.ts -t "storeSessionBackupPrivateKey|loadSessionBackupPrivateKeyFromSecretStorage|checkKeyBackupAndEnable" -- --runInBand
pnpm lint:types
pnpm exec eslint src/rust-crypto/backup.ts src/rust-crypto/rust-crypto.ts spec/unit/rust-crypto/backup.spec.ts spec/unit/rust-crypto/rust-crypto.spec.ts
pnpm exec prettier --check src/rust-crypto/backup.ts src/rust-crypto/rust-crypto.ts spec/unit/rust-crypto/backup.spec.ts spec/unit/rust-crypto/rust-crypto.spec.ts

Notes

Notes: Fix key-backup cached event timing so listeners do not observe stale or undefined backup info immediately after creating a new backup.

Maintainer label suggestion: T-Defect. The Preview Changelog workflow currently requires one release-type label, and this PR fixes an observable key-backup event ordering defect.

Signed-off-by: Gustavo Madeira Santana fromweb1@gumadeiras.com

Checklist

  • Tests written for new code (and old code if feasible).
  • New or updated public/exported symbols have accurate TSDoc documentation.
  • Linter and other CI checks pass.
  • Sign-off given on the changes.

Delay the key-backup cached notification for newly-created backups until
resetKeyBackup has stored the recovery key and confirmed the server-side
backup is observable. Cache current backup info before emitting for
received backup secrets so listeners do not query stale backup state.

This prevents the per-session backup downloader from waking on a
freshly-created backup before backup metadata is available, while
preserving the cached-key event for resetKeyBackup after the backup check
completes.

Signed-off-by: Gustavo Madeira Santana <fromweb1@gumadeiras.com>
@gumadeiras gumadeiras requested a review from a team as a code owner April 16, 2026 23:33
@github-actions github-actions Bot added the Z-Community-PR Issue is solved by a community member's PR label Apr 16, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts Rust crypto key-backup event sequencing so CryptoEvent.KeyBackupDecryptionKeyCached is emitted only when listeners can immediately observe coherent, up-to-date server backup state (avoiding transient “Unsupported algorithm undefined” situations).

Changes:

  • Delay cached-key event emission during resetKeyBackup() until the new backup is confirmed observable server-side.
  • Split “store key” vs “store + emit event” in RustBackupManager, and ensure received secrets update cached backup info before emitting.
  • Add unit tests covering the new event ordering and state coherence.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/rust-crypto/rust-crypto.ts Defers cached-key event emission after reset until server backup visibility is confirmed.
src/rust-crypto/backup.ts Separates key storage from event emission; updates cached backup info before emitting on received secret.
spec/unit/rust-crypto/rust-crypto.spec.ts Adds test asserting reset emits cached-key only after backup is observable.
spec/unit/rust-crypto/backup.spec.ts Adds tests ensuring no event during backup creation and that received-secret emission has coherent state.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/rust-crypto/rust-crypto.ts Outdated
Use the successful key-backup creation response to make local backup state
coherent before emitting the cached-key event. This avoids racing a
follow-up backup-version GET against stale cached no-backup state.

Add regression coverage for resetKeyBackup after a stale no-backup check,
asserting listeners observe both the active backup version and server
backup info for the newly-created backup.

Signed-off-by: Gustavo Madeira Santana <fromweb1@gumadeiras.com>
@gumadeiras gumadeiras force-pushed the codex/fix-backup-key-cache-event-order branch from 9d49a3b to 3a076f4 Compare April 16, 2026 23:45
Reuse one helper for known usable backups so both backup creation and
received backup secrets cache server info, enable the matching backup, and
only then emit cached-key events. This keeps KeyBackupStatus and cached-key
listeners from seeing a stale server backup or null active backup.

Make the cached-key emitter private now that callers no longer need to
trigger it directly, and cover the status-event ordering in the backup unit
test.

Signed-off-by: Gustavo Madeira Santana <fromweb1@gumadeiras.com>
@andybalaam
Copy link
Copy Markdown
Member

Hi @gumadeiras thank you for the PR! I'm having a little trouble understanding the description - did you write it yourself or use AI? If AI, please could you explain in your own words? (I'm struggling to understand e.g. "Received backup secrets now use the same coherent activation path".)

These questions might help clarify:

  • What problem is this fixing? How would a user experience this problem?
  • What code paths have been changed? What user actions would trigger those code paths?

(Please don't copy and paste answers from an AI btw.)

@gumadeiras
Copy link
Copy Markdown
Author

gumadeiras commented Apr 17, 2026

Hi @gumadeiras thank you for the PR! I'm having a little trouble understanding the description - did you write it yourself or use AI? If AI, please could you explain in your own words? (I'm struggling to understand e.g. "Received backup secrets now use the same coherent activation path".)

These questions might help clarify:

  • What problem is this fixing? How would a user experience this problem?
  • What code paths have been changed? What user actions would trigger those code paths?

(Please don't copy and paste answers from an AI btw.)

hi Andy, sorry for the lack of clarity!

the problem is a race condition when creating a new encrypted-message backup
the SDK currently does two things: (1) saves a new backup key and (2) updates its local state so it knows which backup version on the server is active

when (1) finishes it immeadiately triggers "backup key is cached", before (2) finishes
so when we are setting up or doing a backup reset we were getting warnings from the SDK (from PerSessionKeyBackupDownloader)
logs were showing:
"Unsupported algorithm undefined"
"Got current backup version from server: undefined"

the fix i'm proposing is just delaying the "backup key cached" message to the end of the cycle to avoid the race condition

and this is really just fixing a warning that happens even if everything is successful

just adding code refs for what i said (how it works today):
(1) creating a new backup goes through saveBackupDecryptionKey() from setupKeyBackup() and triggers CryptoEvent.KeyBackupDecryptionKeyCached (this is all in backup.ts)
then PerSessionKeyBackupDownloader gets triggered because it listens for KeyBackupDecryptionKeyCached (these are in PerSessionKeyBackupDownloader.ts)

the code that changes here:

resetKeyBackup(): user sets up or resets backup. code path: CryptoApi.resetKeyBackup → RustCrypto.resetKeyBackup() → enableKeyBackupFromCreation()

setupKeyBackup(): internal part of creating the backup version. It now only stores the generated key at creation time (this is in backup.ts)

handleBackupSecretReceived(): user verifies/signs in another device and receives the backup key via secret sharing. API docs says this is one way to do backup restore (crypto-api/index.ts), and the changed code path for this part is handleBackupSecretReceived(); this path is also now fixed for that same race condition

let me know if i can provide any other details!

@andybalaam
Copy link
Copy Markdown
Member

@gumadeiras thank you for taking the time to help me understand. I will take another look and also hand over to @richvdh who said he had some thoughts about the organisation of this code.

Copy link
Copy Markdown
Member

@andybalaam andybalaam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to ask @richvdh to take over this review since he is more familiar with this code, but I'm still struggling to understand. Perhaps this change does multiple things at once, or maybe I'm just having a bad day...?

I understand that there's a race where we are emitting an event that results in other places checking our internal state before we've updated it, but I'm not understanding how that description fits the code yet.

}

// we can check and start async
this.checkKeyBackupAndEnable();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why it's OK to remove this call to checkKeyBackupAndEnable. I don't think I see equivalent code in enableKeyBackupFromCreation. The old code ends up in doCheckKeyBackup which does a bunch of stuff that presumably is useful?

Copy link
Copy Markdown
Author

@gumadeiras gumadeiras Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this removes it from the just created backup path in resetKeyBackup() only; it is still called through other paths

i dropped it because resetKeyBackup() has just created the backup itself so my assumption is that it doesn't need to be rechecked since it was just created? if the server accepted the create request and returned the new version, we can use that creation result as the current backup state instead of immediately doing a second check

the new helper still does the important local enable work: cache serverBackupInfo, mark checkedForBackup, call the existing enableKeyBackup(...), set activeBackupVersion, emit KeyBackupStatus, and start backupKeysLoop()

so the intent is not to remove backup enabling; only to avoid the extra check before triggering KeyBackupDecryptionKeyCached

if checkKeyBackupAndEnable() has another required function which i may have totally missed, then yeah i should rework this back in; let me know!

paths that still call RustCrypto.checkKeyBackupAndEnable:
startup calls RustCrypto.checkKeyBackupAndEnable()
public API calls RustCrypto.checkKeyBackupAndEnable()
own-user trust changes call RustCrypto.checkKeyBackupAndEnable()
getServerBackupInfo() calls RustBackupManager.checkKeyBackupAndEnable(false)
upload recovery calls RustBackupManager.checkKeyBackupAndEnable(true)

@gumadeiras
Copy link
Copy Markdown
Author

gumadeiras commented Apr 20, 2026

I'm going to ask @richvdh to take over this review since he is more familiar with this code, but I'm still struggling to understand. Perhaps this change does multiple things at once, or maybe I'm just having a bad day...?

I understand that there's a race where we are emitting an event that results in other places checking our internal state before we've updated it, but I'm not understanding how that description fits the code yet.

hopefully this helps a bit but let me know if not

just to recap; the race i am trying to fix is:

  1. setupKeyBackup() creates a new backup
  2. it saves the new backup key
  3. saving the key immediately emits KeyBackupDecryptionKeyCached
  4. PerSessionKeyBackupDownloader hears that event and checks backup state
  5. but resetKeyBackup() has not yet finished enabling/caching the newly-created backup locally

so the listener can see serverBackupInfo / activeBackupVersion before they match the backup that was just created. that is where the Unsupported algorithm undefined log came from

the fix i was trying to make is: for a backup we just created, first update local backup state from the creation result, then emit KeyBackupDecryptionKeyCached

file changes

src/rust-crypto/rust-crypto.ts
changes resetKeyBackup() to call enableKeyBackupFromCreation(backupInfo) instead of kicking off checkKeyBackupAndEnable() in the background
why: the backup was just created, so this uses the create response to enable/cache it immediately instead of doing a second async discovery check

src/rust-crypto/backup.ts: separates “store backup key” from “emit cached-key event”; makes new-backup creation store the key without emitting; adds the helper that caches/enables a known usable backup before emitting; applies the same ordering to received backup secrets. reason: both new backup creation and received backup secrets can emit KeyBackupDecryptionKeyCached, so both need backup state first

adds enableKeyBackupFromCreation() and enableKeyBackupFromInfo().
why: this is the replacement for the just-created-backup path. it sets serverBackupInfo, marks checkedForBackup, calls existing enableKeyBackup(), then emits KeyBackupDecryptionKeyCached

changes handleBackupSecretReceived() to use the same ordered path
why: received backup secrets emit the same cached-key event, so they should also only emit it after matching backup state is coherent (serverBackupInfo, activeBackupVersion, and the emitted event version all refer to the same backup)

spec/unit/rust-crypto/rust-crypto.spec.ts
adds a regression test for resetKeyBackup() after stale “no backup” state
why: proves the cached-key event sees active version, event version, and server version all matching

spec/unit/rust-crypto/backup.spec.ts: proves setupKeyBackup() does not emit early; proves received backup secrets emit after backup state is cached/enabled; verifies local backup enabling happens. reason: covers the event-ordering behavior changed in backup.ts

spec/unit/rust-crypto/backup.spec.ts
adds tests that setupKeyBackup() does not emit the cached-key event early, and that received backup secrets emit only after backup state is cached/enabled
why: covers the two event-ordering paths changed in backup.ts

edit: also if you believe this is the wrong direction or want anything reworked; please let me know i'd be happy to iterate as needed. i can also test any other solution in our production code to test other approaches

Copy link
Copy Markdown
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gumadeiras sorry for taking a while to get back to you on this.

Firstly, thanks very much for the contribution. I agree that the current code looks wrong, though I'm surprised we haven't heard more reports about it. Did you see any symptoms of it other than the log lines? Which application are you using (Element-Web, or something else?)

Indeed, it might be helpful if you could open an issue (against matrix-js-sdk or the application) describing exactly what symptoms you are seeing, without going into details about the underlying cause or potential fixes, so that we can be clear what we're trying to fix here. Including logs might also be helpful.

Now, regarding your fix. I'm rather worried that it makes an area of the code that is already a bit complicated even more complicated. It leads to weird scenarios where we cache the backup decryption key without emitting the KeyBackupDecryptionKeyCached event. It also increases coupling between RustCrypto and RustBackupManager, in that we now rely on RustBackupManager.enableKeyBackupFromCreation being called after RustBackupManager.setupKeyBackup to ensure that the KeyBackupDecryptionKeyCached event is emitted.

My main suggestion is, rather than having resetKeyBackup make separate calls to setupKeyBackup and enableKeyBackupFromCreation, we just combine the two -- in other words, move the call toenableKeyBackupFromCreation into setupKeyBackup. That would mean we could update serverBackupInfo before calling saveBackupDecryptionKey, and thus sidestep much of the issue.

Some other points:

  • One of the existing integration tests seems to be failing. Please take a look at it and see if you can figure out why.
  • It would be nice to see an integration test which demonstrates that the issue has been fixed.

I've also made a number of comments on the code, some of which might become somewhat irrelevant if you refactor it a bit.

Comment thread src/rust-crypto/backup.ts
Comment on lines +217 to +222
private async storeBackupDecryptionKey(
backupDecryptionKey: RustSdkCryptoJs.BackupDecryptionKey,
version: string,
): Promise<void> {
await this.olmMachine.saveBackupDecryptionKey(backupDecryptionKey, version);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any point in creating a dedicated function for this single line of code.

Comment thread src/rust-crypto/backup.ts
Comment on lines +234 to 236
private emitBackupDecryptionKeyCached(version: string): void {
this.emit(CryptoEvent.KeyBackupDecryptionKeyCached, version);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise: this.emit(CryptoEvent.KeyBackupDecryptionKeyCached, version); really isn't very much longer than this.emitBackupDecryptionKeyCached(version);

Comment thread src/rust-crypto/backup.ts
);

await this.saveBackupDecryptionKey(randomKey, res.version);
await this.storeBackupDecryptionKey(randomKey, res.version);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: this change is the main point of the PR. It avoids emitting a KeyBackupDecryptionKeyCached event, meaning that PerSessionKeyBackupDownloader will not immediately query for the new backup state.

Comment thread src/rust-crypto/backup.ts
`handleBackupSecretReceived: Valid decryption key for the current server-side backup version (${latestBackupInfo.version}) received`,
);
await this.saveBackupDecryptionKey(backupDecryptionKey, latestBackupInfo.version);
await this.saveBackupDecryptionKeyForCurrentBackup(backupDecryptionKey, latestBackupInfo);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really following why this needs to change. Your description only mentions a problem happening during setupKeyBackup, but this is unrelated to setupKeyBackup. Does a similar problem also affect receiving the backup decryption key from other devices? I'm not sure why it would, because we don't update RustBackupManager.serverBackupInfo here (as far as I can tell), so there is nothing to race against?

Comment thread src/rust-crypto/backup.ts
Comment on lines +283 to +286
private async saveBackupDecryptionKeyForCurrentBackup(
backupDecryptionKey: RustSdkCryptoJs.BackupDecryptionKey,
backupInfo: KeyBackupInfo,
): Promise<void> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could do with a doc comment. How does it differ from saveBackupDecryptionKey, and why would I want one rather than the other?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it seems to be a helper for handleBackupSecretReceived, I'd inline it, which means you can get rid of the check that backupInfo.version is defined.

Comment thread src/rust-crypto/backup.ts
this.emitBackupDecryptionKeyCached(backupInfo.version);
}

private async enableKeyBackupFromInfo(backupInfo: KeyBackupInfo): Promise<void> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could also do with a doc comment. How does it differ from enableKeyBackup (which also takes a backupInfo)?

Comment thread src/rust-crypto/backup.ts
Comment on lines +269 to +275
const activeVersion = await this.getActiveBackupVersion();
if (activeVersion === null) {
await this.enableKeyBackup(backupInfo);
} else if (activeVersion !== version) {
await this.disableKeyBackup();
await this.enableKeyBackup(backupInfo);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is (a subset of) the same logic as in doCheckKeyBackup, right? Can we factor it out and reuse it?

Comment thread src/rust-crypto/backup.ts
Comment on lines +259 to +261
if (!version) {
throw new Error("Cannot enable key backup without version");
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can enforce this via the type system: declare the backupInfo parameter to have the type KeyBackupInfo & { version: string }

Comment thread src/rust-crypto/backup.ts
await this.enableKeyBackup(backupInfo);
}
} catch (e) {
this.serverBackupInfo = previousServerBackupInfo;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I somewhat wonder if this is necessary, given we know that these are in fact the details of the backup on the server.

Comment thread src/rust-crypto/backup.ts
Comment on lines +269 to +275
const activeVersion = await this.getActiveBackupVersion();
if (activeVersion === null) {
await this.enableKeyBackup(backupInfo);
} else if (activeVersion !== version) {
await this.disableKeyBackup();
await this.enableKeyBackup(backupInfo);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happen if this races with an ongoing call to doCheckKeyBackup, or another call to doCheckKeyBackup happens while this is in progress? You might have to wait for keyBackupCheckInProgress to complete, or something.

@richvdh
Copy link
Copy Markdown
Member

richvdh commented Apr 22, 2026

@gumadeiras some other quick bits of feedback on your PR:

  1. I found this comment really hard to read. Some capital letters and full stops/periods would go a long way, as would using backticks for identifiers. The random line breaks don't help either.

  2. As a general rule, it's not the case that more words == better. I appreciate your efforts in trying to help us understand the change, but I think you could do with being a bit more succinct in your answers: I found I spent quite a long time reading your descriptions!

  3. On which note... I share @andybalaam's suspicion that a lot of this text, and code, has been generated by AI. I suspect that the reason you've given these long answers is because most of them have been generated by AI? Using AI isn't necessarily a problem, but you as the human need to take responsibility for your submission, and I'd much rather read a short explanation in your own words about the reason you feel a change is necessary, rather than the AI's explanation for its own change.

  4. In this comment you go into detail about different parts of the change. Unfortunately it's rather hard for me to correlate with the code. If you feel that sort of context is helpful, I'd recommend adding it as comments on the relevant parts of the diff (https://github.com/matrix-org/matrix-js-sdk/pull/5281/changes) rather than as a big block of text.

@gumadeiras
Copy link
Copy Markdown
Author

Hi @gumadeiras sorry for taking a while to get back to you on this.

Firstly, thanks very much for the contribution. I agree that the current code looks wrong, though I'm surprised we haven't heard more reports about it. Did you see any symptoms of it other than the log lines? Which application are you using (Element-Web, or something else?)

no worries; thanks for the feedback! I'll go through it as soon as possible

I primarily use Element Desktop and ElementX on iOS, and the matrix channel for openclaw (I'm the maintainer for Matrix)

i haven't seen any symptoms other than the logs; what i originally thought was a problem turned out to just be the noisy logs

@gumadeiras
Copy link
Copy Markdown
Author

@gumadeiras some other quick bits of feedback on your PR:

  1. I found this comment really hard to read. Some capital letters and full stops/periods would go a long way, as would using backticks for identifiers. The random line breaks don't help either.
  2. As a general rule, it's not the case that more words == better. I appreciate your efforts in trying to help us understand the change, but I think you could do with being a bit more succinct in your answers: I found I spent quite a long time reading your descriptions!
  3. On which note... I share @andybalaam's suspicion that a lot of this text, and code, has been generated by AI. I suspect that the reason you've given these long answers is because most of them have been generated by AI? Using AI isn't necessarily a problem, but you as the human need to take responsibility for your submission, and I'd much rather read a short explanation in your own words about the reason you feel a change is necessary, rather than the AI's explanation for its own change.
  4. In this comment you go into detail about different parts of the change. Unfortunately it's rather hard for me to correlate with the code. If you feel that sort of context is helpful, I'd recommend adding it as comments on the relevant parts of the diff (https://github.com/matrix-org/matrix-js-sdk/pull/5281/changes) rather than as a big block of text.

The original PR body was generated with AI and I reviewed/edited. All follow up messages were fully written by me. They weren't long because of AI (I wrote them), I was just trying to be more thorough with code refs to help understand the changes I made, since @andybalaam wanted to better understand why I made the changes I made. Similarly for the code changes, some parts were AI assisted but I fully reviewed/edited everything. I'll leave the next round of comments directly in the code to help with clarity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

T-Defect Z-Community-PR Issue is solved by a community member's PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants