Currently, we record rotations and block creations in the WAL in order to determine which writes to discard because they have been persisted, which writes were included in a failed persistence attempt that needs to be retried, and which writes are still in the active memory block.
This works most of the time, but deterministic simulation testing (DST) has shown that, because these records are written asynchronously, there are cases where the protocol fails and even loses data. Consider the following scenario:
1. x is written to the active block b1.
2. b1 is rotated and b2 becomes the active block before the rotation has been written to the WAL.
3. A snapshot of b2 is taken; it contains no data.
4. A crash happens. On restart, the snapshot loads b2, an empty block.
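The ordering problem in the scenario above can be sketched with a toy model. This is only an illustration under stated assumptions, not the actual WAL code: the `record` and `asyncWAL` types, the `Log`, `Flush`, and `Crash` methods, and the assumption that the write of x was flushed before the crash are all hypothetical.

```go
package main

import "fmt"

// record is a hypothetical WAL entry: either a data write or an "admin" event.
type record struct {
	kind  string // "write" or "new_table_block"
	block string
	value string
}

// asyncWAL is a toy model of an asynchronous log: Log only queues a record,
// and nothing is visible after a crash until Flush has run.
type asyncWAL struct {
	pending []record
	durable []record
}

// Log queues a record without waiting for it to reach disk.
func (w *asyncWAL) Log(r record) { w.pending = append(w.pending, r) }

// Flush moves all queued records to the durable part of the log.
func (w *asyncWAL) Flush() {
	w.durable = append(w.durable, w.pending...)
	w.pending = nil
}

// Crash drops everything that was queued but not flushed and returns
// what a restarted process would find in the log.
func (w *asyncWAL) Crash() []record {
	w.pending = nil
	return w.durable
}

// runScenario replays the steps of the scenario and returns the records
// that recovery would be able to observe.
func runScenario() []record {
	w := &asyncWAL{}

	// Step 1: x is written to b1. Assumption: this write was flushed.
	w.Log(record{kind: "write", block: "b1", value: "x"})
	w.Flush()

	// Step 2: b1 is rotated and b2 becomes active in memory, but the
	// admin record is only queued, not yet flushed.
	w.Log(record{kind: "new_table_block", block: "b2"})

	// Steps 3 and 4: a snapshot of the empty b2 is taken, then the
	// process crashes before the next flush.
	return w.Crash()
}

func main() {
	for _, r := range runScenario() {
		fmt.Printf("%s block=%s value=%q\n", r.kind, r.block, r.value)
	}
}
```

In this model, recovery sees the write to b1 but no record that b2 was ever created, so it cannot reconstruct the rotation from the log alone.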
If the `NewTableBlock` record had been written synchronously in the scenario above, b1 would not have been replaced by b2 until b2's creation had been synchronously written to the WAL, where recovery can subsequently observe it. There could still be an issue if the snapshot truncates the WAL, since we could lose this "admin" event.
I attempted to write a commit that writes `NewTableBlock` synchronously, but it is not easy to shoehorn synchronous writes into a fundamentally asynchronous WAL, and DST found further failure scenarios and deadlocks, so I do not have confidence in this approach.
We have always considered moving these types of admin events out of the WAL for performance reasons, and it is now clear that we should also do so for correctness reasons. The only bytes that should be written to the WAL are those related to writes.
On recovery, the "true" state of the database is currently reconstructed by observing the admin records and the order in which they occur. We should explore whether this "true" state can instead be reconstructed by simply storing the txn of the highest write in each persisted block. In that case, we could easily tell which transactions were actually persisted for a given table, and discard any in-memory writes with lower transactions. This seems like the best solution, since we would reconstruct the true state of the database from the blocks that actually show up as persisted.
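A minimal sketch of what such a recovery filter might look like follows. Everything in it is hypothetical: the `persistedBlock` and `walWrite` types, the `MaxTxn` field, and both function names are invented for illustration. It also assumes that blocks are persisted in txn order, so that every write at or below a table's highest persisted txn is covered by some persisted block; whether that holds is part of what would need exploring.

```go
package main

import "fmt"

// persistedBlock is a hypothetical view of a block found in durable storage.
// MaxTxn models the proposed metadata: the txn of the highest write in the block.
type persistedBlock struct {
	Table  string
	MaxTxn uint64
}

// walWrite is a hypothetical WAL record. Under the proposal the WAL
// contains only writes, with no admin events.
type walWrite struct {
	Table string
	Txn   uint64
	Data  string
}

// highestPersistedTxn returns, per table, the largest MaxTxn across all
// persisted blocks. Tables with no persisted block are absent from the map.
func highestPersistedTxn(blocks []persistedBlock) map[string]uint64 {
	high := make(map[string]uint64)
	for _, b := range blocks {
		if cur, ok := high[b.Table]; !ok || b.MaxTxn > cur {
			high[b.Table] = b.MaxTxn
		}
	}
	return high
}

// writesToReplay keeps only the WAL writes that are not covered by a
// persisted block. Assumption: a write whose txn is at or below the table's
// highest persisted txn has already been persisted.
func writesToReplay(wal []walWrite, blocks []persistedBlock) []walWrite {
	high := highestPersistedTxn(blocks)
	var replay []walWrite
	for _, w := range wal {
		if h, ok := high[w.Table]; !ok || w.Txn > h {
			replay = append(replay, w)
		}
	}
	return replay
}

func main() {
	blocks := []persistedBlock{{Table: "t", MaxTxn: 3}, {Table: "t", MaxTxn: 7}}
	wal := []walWrite{
		{Table: "t", Txn: 5, Data: "a"},
		{Table: "t", Txn: 9, Data: "b"},
	}
	for _, w := range writesToReplay(wal, blocks) {
		fmt.Printf("replay table=%s txn=%d data=%s\n", w.Table, w.Txn, w.Data)
	}
}
```

The appeal of this shape is that recovery depends only on what is observably persisted, not on the ordering of asynchronously written admin records.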