Dimension Vector
Jeremy H. Shi edited this page Jul 19, 2019 · 5 revisions
This document describes the format of the dimension vector.
- The dimension vector holds all dimension values and their validity flags before the sort and reduction phases.
- A dimension value is 1 (int8), 2 (int16), or 4 (int32, float) bytes wide.
- For each row, the dimension values will be sorted in descending order of data width
- Each row stores its validity bytes after its dimension values: 1 byte per dimension value, representing true (valid) or false (null)
- Each row is padded to a multiple of 4 bytes
| dim1 | dim2 | dim3 | validity1 | validity2 | validity3|
For example:
city_id (uint16) | status (uint16) | vvid (uint32) |
---|---|---|
1 | 0 | 1 |
2 | 1 | null |
1 | null | 2 |
null | null | 3 |
will be packed into
vvid (4 bytes) | status (2 bytes) | city_id (2 bytes) | vvid validity (1 byte) | status validity (1 byte) | city_id validity (1 byte) | padding |
---|---|---|---|---|---|---|
1 | 0 | 1 | true | true | true | padding |
0 | 1 | 2 | false | true | true | padding |
2 | 0 | 1 | true | false | true | padding |
3 | 0 | 0 | true | false | false | padding |
Each row in the example uses 4 + 2 + 2 + 1 * 3 = 11 bytes of data plus 5 bytes of padding, for 16 bytes in total.
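
The row layout described above can be sketched in a few lines of Python. This is an illustration only, not the actual implementation: the names `DIMS`, `FMT`, and `pack_row` are hypothetical, and the sketch assumes unsigned little-endian values, a zero stored for null values, and declared order among dimensions of equal width. Padding is left out, so the result is the 11 data bytes of one row.

```python
import struct

# Hypothetical description of the example's dimensions: (name, width in bytes).
DIMS = [("city_id", 2), ("status", 2), ("vvid", 4)]

# struct format codes for unsigned values of each width (byte order is an assumption).
FMT = {1: "B", 2: "H", 4: "I"}


def pack_row(row, dims=DIMS):
    """Pack one row as values (widest first) followed by one validity byte each.

    `row` maps a dimension name to an int, or to None for null.
    Padding is not added here.
    """
    # sorted() is stable, so equal-width dimensions keep their declared order.
    ordered = sorted(dims, key=lambda d: d[1], reverse=True)
    values = b"".join(
        struct.pack("<" + FMT[width], row[name] if row[name] is not None else 0)
        for name, width in ordered
    )
    # One validity byte per dimension: 1 for a valid value, 0 for null.
    validity = bytes(1 if row[name] is not None else 0 for name, _ in ordered)
    return values + validity


packed = pack_row({"city_id": 2, "status": 1, "vvid": None})
print(len(packed), packed.hex())
```

Called with the second example row (city_id 2, status 1, vvid null), the function returns the value bytes for vvid, city_id, and status followed by the three validity bytes.
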
To optimize read performance, we rewrote the dimension vector in a column-oriented format.

What has not changed:
- dimension value size (1, 2, or 4 bytes)
- validity bytes: 1 byte per value, indicating whether the value is valid or null

What has changed:
- instead of a row-oriented layout, all dimension values are written first, one dimension after another (dimensions are sorted in descending order of data width)
- the validity bytes are written next, in the same dimension order
For example:
city_id (uint16) | status (uint16) | vvid (uint32) |
---|---|---|
1 | 0 | 1 |
2 | 1 | null |
1 | null | 2 |
null | null | 3 |
One valid packing is:
vvid values (uint32) | city_id values (uint16) | status values (uint16) | vvid validity bytes (byte) | city_id validity bytes (byte) | status validity bytes (byte) |
---|---|---|---|---|---|
4*4 | 2*4 | 2*4 | 1*4 | 1*4 | 1*4 |
The total number of bytes needed is 16 + 8 + 8 + 4 * 3 = 44.
The number of bytes needed per row is 4 + 2 + 2 + 3 = 11.
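
To make the column-oriented arithmetic concrete, the following Python sketch computes where each section would start and how many bytes the whole vector takes. It is a sketch under stated assumptions, not the actual implementation: `DIMS` and `column_layout` are hypothetical names, and dimensions of equal width are assumed to keep their declared order (any order among them is a valid packing).

```python
# Hypothetical dimension list for the example: (name, width in bytes).
DIMS = [("city_id", 2), ("status", 2), ("vvid", 4)]


def column_layout(dims, num_rows):
    """Return ({section name: byte offset}, total bytes) for a column-oriented vector."""
    # Widest dimensions first; sorted() is stable for equal widths.
    ordered = sorted(dims, key=lambda d: d[1], reverse=True)
    offsets = {}
    cursor = 0
    # Value sections come first, one dimension after another.
    for name, width in ordered:
        offsets[name + "_values"] = cursor
        cursor += width * num_rows
    # Validity sections follow, one byte per row, in the same dimension order.
    for name, _ in ordered:
        offsets[name + "_validity"] = cursor
        cursor += num_rows
    return offsets, cursor


offsets, total = column_layout(DIMS, num_rows=4)
print(total)  # 44
print(offsets)
```

For the four example rows, the vvid values start at offset 0, the city_id values at 16, the status values at 24, and the three validity sections at 32, 36, and 40.
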