Redo Logs

Hongzheng Shi edited this page Dec 3, 2018 · 4 revisions

Overview

Redo logs are used by AresDB to recover after a server shutdown. They are implemented at the table level instead of at the database level, based on the following considerations.

  • Transactional support is no higher than row level, requiring no cross-table mutation atomicity.
  • Each table (fact or dimension) has its own archiving/snapshot schedule, so redo logs must be purged with different delays.
  • Using separate redo logs makes table deletion easier.

Upsert batches are appended to the redo logs. A new log file is created periodically for each table (every 2 hours for instance, according to the archiving interval). The file is named by the arrival time of its first upsert (which needs to be synchronized in a distributed setup). Purging is achieved by simply deleting redo log files that have been completely archived/snapshotted.

Upsert Batch

An upsert batch contains multiple upsert mutations to a table. The upserts are applied to a subset of the columns. Primary key columns and time columns for fact tables must be specified with non-NULL values. Unrecognized columns (deleted for instance) are ignored. The data (values and nulls) are represented as uncompressed columnar vectors.

An upsert batch is serialized into the following format:

[uint32] magic_number
[uint32] buffer_size

<begin of buffer>
[uint32] version_number
[int32] num_of_rows
[uint16] num_of_columns
<reserved 14 bytes>
[uint32] arrival_time

[uint32] column_offset_0 ... [uint32] column_offset_x
[uint32] column_reserved_field1_0 ... [uint32] column_reserved_field1_x
[uint32] column_reserved_field2_0 ... [uint32] column_reserved_field2_x
[uint32] column_data_type_0 ... [uint32] column_data_type_x
[uint16] column_id_0 ... [uint16] column_id_x
[uint8] column_mode_0 ... [uint8] column_mode_x

(optional) null_vector_0
(optional) [padding to 4 byte alignment] offset_vector_0
[padding for 8 byte alignment] value_vector_0
...

[padding for 8 byte alignment]
<end of buffer>

This format is used for both client-server communication and redo logging. All serialized numbers are written in little-endian byte order. NumRows (batch size) should be reasonably large (>= 256 for instance) and preferably a multiple of 64 for this format to be efficient.
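The padding entries in the layout above amount to rounding an offset up to the next multiple of 4 or 8 bytes. A minimal sketch of that arithmetic follows; the helper name alignUp is hypothetical and not taken from the AresDB code base, and it assumes the alignment is a power of two.

```go
package main

// alignUp rounds offset up to the next multiple of alignment.
// Hypothetical helper for illustration; alignment must be a power of two
// (4 and 8 are the only values the format above uses).
func alignUp(offset, alignment uint32) uint32 {
	return (offset + alignment - 1) &^ (alignment - 1)
}
```

For example, a value vector that would otherwise begin at byte 13 starts at alignUp(13, 8), which is 16, leaving 3 bytes of padding.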

Field             Description
magic_number      Verification header with value 0xADDAFEED.
buffer_size       The size of the buffer which starts from the num_of_rows field till the end of buffer, including any trailing padding.
version_number    Upsert batch version number, 0xFEED0001.
num_of_rows       The number of rows in the upsert batch.
num_of_columns    The total number of columns in the upsert batch.
arrival_time      The arrival time of the upsert batch.
column_offset     The offsets (from the beginning of buffer) to the beginning of the data section of each column. The total size of the offset vector is num_of_columns + 1, where the last element points to the end of the last column data section.
column_data_type  The data type of each column. See details below.
column_id         The logical id of each column.
column_mode       The encoding mode of each column. See details below.
null vector       If present, the validity vector of each value in a column.
offset vector     If present, the offset (from 0) to each value in the value vector. The total size of the offset vector is num_of_rows + 1, where the last element points to the end of the last value. This is needed for variable length values (arrays).
value vector      The value buffer for a column.
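As an illustration of the fixed-size part of the format, the following Go sketch decodes the fields up to and including arrival_time. It is not the AresDB implementation: the names BatchHeader and ParseHeader are hypothetical, and it assumes the fields are packed back to back in the order listed, with no padding other than the 14 reserved bytes. It does not interpret buffer_size beyond reading it.

```go
package main

import (
	"encoding/binary"
	"errors"
)

const (
	magicNumber   = 0xADDAFEED // value given for magic_number
	versionNumber = 0xFEED0001 // value given for version_number
	// 4 magic + 4 buffer_size + 4 version + 4 rows + 2 columns
	// + 14 reserved + 4 arrival_time (assumed packed layout).
	fixedHeaderLen = 36
)

// BatchHeader is a hypothetical container for the fixed-size fields.
type BatchHeader struct {
	BufferSize  uint32
	Version     uint32
	NumRows     int32
	NumColumns  uint16
	ArrivalTime uint32
}

// ParseHeader decodes the fixed-size fields of a serialized upsert batch.
// All numbers are little-endian, as the format requires.
func ParseHeader(b []byte) (BatchHeader, error) {
	var h BatchHeader
	if len(b) < fixedHeaderLen {
		return h, errors.New("upsert batch: input shorter than fixed header")
	}
	le := binary.LittleEndian
	if le.Uint32(b[0:4]) != magicNumber {
		return h, errors.New("upsert batch: bad magic number")
	}
	h.BufferSize = le.Uint32(b[4:8])
	h.Version = le.Uint32(b[8:12])
	if h.Version != versionNumber {
		return h, errors.New("upsert batch: unsupported version")
	}
	h.NumRows = int32(le.Uint32(b[12:16]))
	h.NumColumns = le.Uint16(b[16:18])
	// b[18:32] holds the 14 reserved bytes and is skipped.
	h.ArrivalTime = le.Uint32(b[32:36])
	return h, nil
}
```

The per-column arrays (column_offset, column_data_type, column_id, column_mode) follow this header and can be read the same way once num_of_columns is known.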

column_data_type

column_data_type is a 4-byte integer that stores the type info of a column. It consists of 3 parts:

  • column_data_type & 0x0000FFFF: The width of the data type in bits.
  • (column_data_type & 0x00FF0000) >> 16: The base type, as one of the enum values below.
  • (column_data_type & 0xFF000000) >> 24: Reserved for supporting variable length values (arrays).

The type enum values and their widths:

Enum Value  Name        Width in bits
0           bool        1
1           int8        8
2           uint8       8
3           int16       16
4           uint16      16
5           int32       32
6           uint32      32
7           float32     32
8           small_enum  8
9           big_enum    16
10          uuid        128
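The three parts of column_data_type can be extracted with masks and shifts. The sketch below is illustrative only; ColumnType, DecodeColumnDataType and BaseTypeName are hypothetical names, and the name table simply mirrors the enum values listed above.

```go
package main

// ColumnType holds the three parts of a column_data_type value.
// Hypothetical type for illustration.
type ColumnType struct {
	WidthBits uint32 // low 16 bits: width of the data type in bits
	BaseType  uint32 // bits 16-23: base type enum value
	Reserved  uint32 // bits 24-31: reserved for variable length values
}

// DecodeColumnDataType splits a column_data_type value into its parts.
// The mask is applied before the shift, so the parentheses matter.
func DecodeColumnDataType(v uint32) ColumnType {
	return ColumnType{
		WidthBits: v & 0x0000FFFF,
		BaseType:  (v & 0x00FF0000) >> 16,
		Reserved:  (v & 0xFF000000) >> 24,
	}
}

// baseTypeNames is indexed by the enum value from the table above.
var baseTypeNames = []string{
	"bool", "int8", "uint8", "int16", "uint16", "int32",
	"uint32", "float32", "small_enum", "big_enum", "uuid",
}

// BaseTypeName returns the name for a base type enum value,
// or "unknown" if the value is not in the table.
func BaseTypeName(base uint32) string {
	if int(base) < len(baseTypeNames) {
		return baseTypeNames[base]
	}
	return "unknown"
}
```

Under this layout, an int32 column (enum value 5, width 32) would be encoded as 0x00050020, and DecodeColumnDataType(0x00050020) yields BaseType 5 and WidthBits 32.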

column_mode

column_mode is a 1-byte integer that consists of three parts:

  • The lowest 3 bits (column_mode & 0x07) select the data encoding, which is one of the following values:

    • 0 means all values are null and the null vector for the column is omitted.
    • 1 means all values are valid and the null vector for the column is omitted.
    • 2 means the null vector is present and there may be values of null in the column.
  • The middle 3 bits ((column_mode >> 3) & 0x07) select the update operation. The following operators are currently supported:

    • 0 (default): overwrite the existing value if the new value is not null; otherwise skip the update.
    • 1: overwrite the existing value even when the new value is null.
    • 2: addition; add the incoming value to the existing value.
    • 3: min; take the minimum of the existing and incoming values.
    • 4: max; take the maximum of the existing and incoming values.
  • The highest 2 bits are reserved.
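The bit fields of column_mode can be unpacked as shown in this sketch. The function and variable names are hypothetical and chosen for illustration; only the bit positions and the operator numbering come from the description above.

```go
package main

// DecodeColumnMode splits a column_mode byte into its three parts:
// the data encoding (low 3 bits), the update operation (middle 3 bits),
// and the reserved high 2 bits. Hypothetical helper for illustration.
func DecodeColumnMode(m uint8) (encoding, updateOp, reserved uint8) {
	encoding = m & 0x07
	updateOp = (m >> 3) & 0x07
	reserved = m >> 6
	return encoding, updateOp, reserved
}

// updateOpNames is indexed by the update operation value listed above.
// The names themselves are descriptive labels, not AresDB identifiers.
var updateOpNames = []string{
	"overwrite-if-not-null", "force-overwrite", "addition", "min", "max",
}

// UpdateOpName returns a label for an update operation value,
// or "unknown" for values the description does not define.
func UpdateOpName(op uint8) string {
	if int(op) < len(updateOpNames) {
		return updateOpNames[op]
	}
	return "unknown"
}
```

For example, a column whose null vector is present (encoding 2) and which uses the min operator (3) has the mode byte 2 | 3<<3, which is 26.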

Recovery

During recovery for fact tables, upsert batches are replayed to populate the primary key hash as well as the live batches of the in-memory vector store, just as when upserts arrive from clients. (This may be out of date: after the replays, archiving needs to be triggered immediately to handle late arrivals in the redo log, in order to avoid overcounting.)

For dimension tables, the last snapshot is replayed first to populate primary key and the vector store; then the redo logs are replayed to apply patches. This assumes upserts are idempotent.