*: move block admin events out of the WAL #916

Open
Tracked by #895
asubiotto opened this issue Jun 26, 2024 · 0 comments
Labels
bug Something isn't working


Currently, we record rotations and block creations in the WAL in order to determine which writes to discard because they have been persisted, which writes were included in a failed persistence attempt that needs to be retried, and which writes are still in the active memory block.

Most of the time this works correctly, but deterministic simulation testing (DST) has shown that, because these records are written asynchronously, there are cases where the protocol breaks down and can even lead to data loss. Consider the following scenario:

  • x is written to active block b1.
  • b1 is rotated and b2 becomes the active block before the rotation has
    been written to the WAL.
  • A snapshot of b2 is taken; the snapshot contains no data.
  • A crash happens. On restart, the snapshot loads b2, an empty block.

If NewTableBlock were written synchronously in the above case, b1 would not be replaced by b2 until b2's creation had been synchronously written to the WAL, where it could subsequently be observed during recovery. There could still be an issue if the snapshot truncates the WAL, since we could lose this "admin" event.

I attempted to write a commit that writes NewTableBlock synchronously, but it is not easy to shoehorn synchronous writes into a fundamentally asynchronous WAL, and DST found some more failure scenarios and deadlocks that do not give me confidence in this approach.

We have always considered that we should move these types of admin events out of the WAL for performance reasons, and it is clear now that we should also do this for correctness reasons. The only bytes that should be written to the WAL are those related to writes.

On recovery, the "true" state of the database is currently reconstructed by observing the admin records and the order in which they occur. We should explore whether this "true" state can instead be reconstructed by simply recording the txn of the highest write in each persisted block. In that case, we could easily tell which transactions were actually persisted for a given table, and discard any in-memory writes with lower transactions. This seems like the best solution, since we would reconstruct the true state of the database from the blocks that actually show up as persisted.
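
To make the proposal concrete, here is a minimal Go sketch of that recovery filter. Everything in it is hypothetical and for illustration only: the `Write` and `PersistedBlock` types, the `MaxTxn` field, and the function names are not taken from the existing code. It also assumes that blocks are persisted in transaction order, so a block's highest txn covers every earlier write to the same table, and it treats a write at exactly that txn as persisted too.

```go
package main

import "fmt"

// Write is a hypothetical WAL write record: the table it targets and the
// transaction that produced it.
type Write struct {
	Table string
	Txn   uint64
}

// PersistedBlock is a hypothetical descriptor of a block found in durable
// storage. MaxTxn is the txn of the highest write contained in the block.
type PersistedBlock struct {
	Table  string
	MaxTxn uint64
}

// highestPersistedTxn returns, per table, the highest MaxTxn seen across
// all persisted blocks. Tables with no persisted block are absent.
func highestPersistedTxn(blocks []PersistedBlock) map[string]uint64 {
	highest := make(map[string]uint64)
	for _, b := range blocks {
		if cur, ok := highest[b.Table]; !ok || b.MaxTxn > cur {
			highest[b.Table] = b.MaxTxn
		}
	}
	return highest
}

// writesToReplay drops every write already covered by a persisted block
// (assumption: txn <= highest persisted txn for that table) and keeps the
// rest, preserving WAL order.
func writesToReplay(writes []Write, blocks []PersistedBlock) []Write {
	highest := highestPersistedTxn(blocks)
	var replay []Write
	for _, w := range writes {
		if maxTxn, ok := highest[w.Table]; ok && w.Txn <= maxTxn {
			continue // already persisted, discard
		}
		replay = append(replay, w)
	}
	return replay
}

func main() {
	writes := []Write{
		{Table: "metrics", Txn: 1},
		{Table: "metrics", Txn: 2},
		{Table: "metrics", Txn: 3},
		{Table: "logs", Txn: 1},
	}
	blocks := []PersistedBlock{{Table: "metrics", MaxTxn: 2}}

	for _, w := range writesToReplay(writes, blocks) {
		fmt.Println(w.Table, w.Txn)
	}
}
```

With this shape, recovery would need no rotation or block-creation records at all: the set of persisted blocks alone decides which WAL writes are replayed into the active block. Whether a failed persistence attempt can leave a partially visible block is not addressed by this sketch.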

asubiotto added the bug label Jun 26, 2024