Fix key backup cached event ordering#5281
Fix key backup cached event ordering#5281gumadeiras wants to merge 3 commits intomatrix-org:developfrom
Conversation
Delay the key-backup cached notification for newly-created backups until resetKeyBackup has stored the recovery key and confirmed the server-side backup is observable. Cache current backup info before emitting for received backup secrets so listeners do not query stale backup state. This prevents the per-session backup downloader from waking on a freshly-created backup before backup metadata is available, while preserving the cached-key event for resetKeyBackup after the backup check completes. Signed-off-by: Gustavo Madeira Santana <fromweb1@gumadeiras.com>
There was a problem hiding this comment.
Pull request overview
Adjusts Rust crypto key-backup event sequencing so CryptoEvent.KeyBackupDecryptionKeyCached is emitted only when listeners can immediately observe coherent, up-to-date server backup state (avoiding transient “Unsupported algorithm undefined” situations).
Changes:
- Delay cached-key event emission during
resetKeyBackup()until the new backup is confirmed observable server-side. - Split “store key” vs “store + emit event” in
RustBackupManager, and ensure received secrets update cached backup info before emitting. - Add unit tests covering the new event ordering and state coherence.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/rust-crypto/rust-crypto.ts | Defers cached-key event emission after reset until server backup visibility is confirmed. |
| src/rust-crypto/backup.ts | Separates key storage from event emission; updates cached backup info before emitting on received secret. |
| spec/unit/rust-crypto/rust-crypto.spec.ts | Adds test asserting reset emits cached-key only after backup is observable. |
| spec/unit/rust-crypto/backup.spec.ts | Adds tests ensuring no event during backup creation and that received-secret emission has coherent state. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Use the successful key-backup creation response to make local backup state coherent before emitting the cached-key event. This avoids racing a follow-up backup-version GET against stale cached no-backup state. Add regression coverage for resetKeyBackup after a stale no-backup check, asserting listeners observe both the active backup version and server backup info for the newly-created backup. Signed-off-by: Gustavo Madeira Santana <fromweb1@gumadeiras.com>
9d49a3b to
3a076f4
Compare
Reuse one helper for known usable backups so both backup creation and received backup secrets cache server info, enable the matching backup, and only then emit cached-key events. This keeps KeyBackupStatus and cached-key listeners from seeing a stale server backup or null active backup. Make the cached-key emitter private now that callers no longer need to trigger it directly, and cover the status-event ordering in the backup unit test. Signed-off-by: Gustavo Madeira Santana <fromweb1@gumadeiras.com>
|
Hi @gumadeiras thank you for the PR! I'm having a little trouble understanding the description - did you write it yourself or use AI? If AI, please could you explain in your own words? (I'm struggling to understand e.g. "Received backup secrets now use the same coherent activation path".) These questions might help clarify:
(Please don't copy and paste answers from an AI btw.) |
hi Andy, sorry for the lack of clarity! the problem is a race condition when creating a new encrypted-message backup when (1) finishes it immeadiately triggers "backup key is cached", before (2) finishes the fix i'm proposing is just delaying the "backup key cached" message to the end of the cycle to avoid the race condition and this is really just fixing a warning that happens even if everything is successful just adding code refs for what i said (how it works today): the code that changes here: resetKeyBackup(): user sets up or resets backup. code path: CryptoApi.resetKeyBackup → RustCrypto.resetKeyBackup() → enableKeyBackupFromCreation() setupKeyBackup(): internal part of creating the backup version. It now only stores the generated key at creation time (this is in backup.ts) handleBackupSecretReceived(): user verifies/signs in another device and receives the backup key via secret sharing. API docs says this is one way to do backup restore (crypto-api/index.ts), and the changed code path for this part is handleBackupSecretReceived(); this path is also now fixed for that same race condition let me know if i can provide any other details! |
|
@gumadeiras thank you for taking the time to help me understand. I will take another look and also hand over to @richvdh who said he had some thoughts about the organisation of this code. |
andybalaam
left a comment
There was a problem hiding this comment.
I'm going to ask @richvdh to take over this review since he is more familiar with this code, but I'm still struggling to understand. Perhaps this change does multiple things at once, or maybe I'm just having a bad day...?
I understand that there's a race where we are emitting an event that results in other places checking our internal state before we've updated it, but I'm not understanding how that description fits the code yet.
| } | ||
|
|
||
| // we can check and start async | ||
| this.checkKeyBackupAndEnable(); |
There was a problem hiding this comment.
I'm not sure why it's OK to remove this call to checkKeyBackupAndEnable. I don't think I see equivalent code in enableKeyBackupFromCreation. The old code ends up in doCheckKeyBackup which does a bunch of stuff that presumably is useful?
There was a problem hiding this comment.
this removes it from the just created backup path in resetKeyBackup() only; it is still called through other paths
i dropped it because resetKeyBackup() has just created the backup itself so my assumption is that it doesn't need to be rechecked since it was just created? if the server accepted the create request and returned the new version, we can use that creation result as the current backup state instead of immediately doing a second check
the new helper still does the important local enable work: cache serverBackupInfo, mark checkedForBackup, call the existing enableKeyBackup(...), set activeBackupVersion, emit KeyBackupStatus, and start backupKeysLoop()
so the intent is not to remove backup enabling; only to avoid the extra check before triggering KeyBackupDecryptionKeyCached
if checkKeyBackupAndEnable() has another required function which i may have totally missed, then yeah i should rework this back in; let me know!
paths that still call RustCrypto.checkKeyBackupAndEnable:
startup calls RustCrypto.checkKeyBackupAndEnable()
public API calls RustCrypto.checkKeyBackupAndEnable()
own-user trust changes call RustCrypto.checkKeyBackupAndEnable()
getServerBackupInfo() calls RustBackupManager.checkKeyBackupAndEnable(false)
upload recovery calls RustBackupManager.checkKeyBackupAndEnable(true)
hopefully this helps a bit but let me know if not just to recap; the race i am trying to fix is:
so the listener can see the fix i was trying to make is: for a backup we just created, first update local backup state from the creation result, then emit file changes
adds changes
edit: also if you believe this is the wrong direction or want anything reworked; please let me know i'd be happy to iterate as needed. i can also test any other solution in our production code to test other approaches |
There was a problem hiding this comment.
Hi @gumadeiras sorry for taking a while to get back to you on this.
Firstly, thanks very much for the contribution. I agree that the current code looks wrong, though I'm surprised we haven't heard more reports about it. Did you see any symptoms of it other than the log lines? Which application are you using (Element-Web, or something else?)
Indeed, it might be helpful if you could open an issue (against matrix-js-sdk or the application) describing exactly what symptoms you are seeing, without going into details about the underlying cause or potential fixes, so that we can be clear what we're trying to fix here. Including logs might also be helpful.
Now, regarding your fix. I'm rather worried that it makes an area of the code that is already a bit complicated even more complicated. It leads to weird scenarios where we cache the backup decryption key without emitting the KeyBackupDecryptionKeyCached event. It also increases coupling between RustCrypto and RustBackupManager, in that we now rely on RustBackupManager.enableKeyBackupFromCreation being called after RustBackupManager.setupKeyBackup to ensure that the KeyBackupDecryptionKeyCached event is emitted.
My main suggestion is, rather than having resetKeyBackup make separate calls to setupKeyBackup and enableKeyBackupFromCreation, we just combine the two -- in other words, move the call toenableKeyBackupFromCreation into setupKeyBackup. That would mean we could update serverBackupInfo before calling saveBackupDecryptionKey, and thus sidestep much of the issue.
Some other points:
- One of the existing integration tests seems to be failing. Please take a look at it and see if you can figure out why.
- It would be nice to see an integration test which demonstrates that the issue has been fixed.
I've also made a number of comments on the code, some of which might become somewhat irrelevant if you refactor it a bit.
| private async storeBackupDecryptionKey( | ||
| backupDecryptionKey: RustSdkCryptoJs.BackupDecryptionKey, | ||
| version: string, | ||
| ): Promise<void> { | ||
| await this.olmMachine.saveBackupDecryptionKey(backupDecryptionKey, version); | ||
| } |
There was a problem hiding this comment.
I don't think there's any point in creating a dedicated function for this single line of code.
| private emitBackupDecryptionKeyCached(version: string): void { | ||
| this.emit(CryptoEvent.KeyBackupDecryptionKeyCached, version); | ||
| } |
There was a problem hiding this comment.
likewise: this.emit(CryptoEvent.KeyBackupDecryptionKeyCached, version); really isn't very much longer than this.emitBackupDecryptionKeyCached(version);
| ); | ||
|
|
||
| await this.saveBackupDecryptionKey(randomKey, res.version); | ||
| await this.storeBackupDecryptionKey(randomKey, res.version); |
There was a problem hiding this comment.
Note to self: this change is the main point of the PR. It avoids emitting a KeyBackupDecryptionKeyCached event, meaning that PerSessionKeyBackupDownloader will not immediately query for the new backup state.
| `handleBackupSecretReceived: Valid decryption key for the current server-side backup version (${latestBackupInfo.version}) received`, | ||
| ); | ||
| await this.saveBackupDecryptionKey(backupDecryptionKey, latestBackupInfo.version); | ||
| await this.saveBackupDecryptionKeyForCurrentBackup(backupDecryptionKey, latestBackupInfo); |
There was a problem hiding this comment.
I'm not really following why this needs to change. Your description only mentions a problem happening during setupKeyBackup, but this is unrelated to setupKeyBackup. Does a similar problem also affect receiving the backup decryption key from other devices? I'm not sure why it would, because we don't update RustBackupManager.serverBackupInfo here (as far as I can tell), so there is nothing to race against?
| private async saveBackupDecryptionKeyForCurrentBackup( | ||
| backupDecryptionKey: RustSdkCryptoJs.BackupDecryptionKey, | ||
| backupInfo: KeyBackupInfo, | ||
| ): Promise<void> { |
There was a problem hiding this comment.
This could do with a doc comment. How does it differ from saveBackupDecryptionKey, and why would I want one rather than the other?
There was a problem hiding this comment.
Since it seems to be a helper for handleBackupSecretReceived, I'd inline it, which means you can get rid of the check that backupInfo.version is defined.
| this.emitBackupDecryptionKeyCached(backupInfo.version); | ||
| } | ||
|
|
||
| private async enableKeyBackupFromInfo(backupInfo: KeyBackupInfo): Promise<void> { |
There was a problem hiding this comment.
this could also do with a doc comment. How does it differ from enableKeyBackup (which also takes a backupInfo)?
| const activeVersion = await this.getActiveBackupVersion(); | ||
| if (activeVersion === null) { | ||
| await this.enableKeyBackup(backupInfo); | ||
| } else if (activeVersion !== version) { | ||
| await this.disableKeyBackup(); | ||
| await this.enableKeyBackup(backupInfo); | ||
| } |
There was a problem hiding this comment.
This is (a subset of) the same logic as in doCheckKeyBackup, right? Can we factor it out and reuse it?
| if (!version) { | ||
| throw new Error("Cannot enable key backup without version"); | ||
| } |
There was a problem hiding this comment.
I think you can enforce this via the type system: declare the backupInfo parameter to have the type KeyBackupInfo & { version: string }
| await this.enableKeyBackup(backupInfo); | ||
| } | ||
| } catch (e) { | ||
| this.serverBackupInfo = previousServerBackupInfo; |
There was a problem hiding this comment.
I somewhat wonder if this is necessary, given we know that these are in fact the details of the backup on the server.
| const activeVersion = await this.getActiveBackupVersion(); | ||
| if (activeVersion === null) { | ||
| await this.enableKeyBackup(backupInfo); | ||
| } else if (activeVersion !== version) { | ||
| await this.disableKeyBackup(); | ||
| await this.enableKeyBackup(backupInfo); | ||
| } |
There was a problem hiding this comment.
What will happen if this races with an ongoing call to doCheckKeyBackup, or another call to doCheckKeyBackup happens while this is in progress? You might have to wait for keyBackupCheckInProgress to complete, or something.
|
@gumadeiras some other quick bits of feedback on your PR:
|
no worries; thanks for the feedback! I'll go through it as soon as possible I primarily use Element Desktop and ElementX on iOS, and the matrix channel for openclaw (I'm the maintainer for Matrix) i haven't seen any symptoms other than the logs; what i originally thought was a problem turned out to just be the noisy logs |
The original PR body was generated with AI and I reviewed/edited. All follow up messages were fully written by me. They weren't long because of AI (I wrote them), I was just trying to be more thorough with code refs to help understand the changes I made, since @andybalaam wanted to better understand why I made the changes I made. Similarly for the code changes, some parts were AI assisted but I fully reviewed/edited everything. I'll leave the next round of comments directly in the code to help with clarity |
Summary
Fixes a Rust crypto key-backup event ordering bug where creating a new backup could emit
CryptoEvent.KeyBackupDecryptionKeyCachedbefore local backup state was coherent for listeners.Before this change,
setupKeyBackup()stored the generated backup key throughsaveBackupDecryptionKey(), which immediately emitted the cached-key event.PerSessionKeyBackupDownloaderlistens to that event and may immediately query backup configuration, seeing stale or undefined backup info and loggingUnsupported algorithm undefined.This PR keeps generated-backup setup quiet, then has
RustCrypto.resetKeyBackup()store the recovery key and use the successful backup creation response to cache server backup info, enable the matching local backup, and only then emit the cached-key event. Received backup secrets now use the same coherent activation path after validating that the secret matches the current server-side backup.Testing
Notes
Notes: Fix key-backup cached event timing so listeners do not observe stale or undefined backup info immediately after creating a new backup.
Maintainer label suggestion:
T-Defect. The Preview Changelog workflow currently requires one release-type label, and this PR fixes an observable key-backup event ordering defect.Signed-off-by: Gustavo Madeira Santana fromweb1@gumadeiras.com
Checklist
public/exportedsymbols have accurate TSDoc documentation.