Schema
An AresDB instance organizes data as tables. This is similar to most relational databases with the following differences:
- There is no database namespace; tables are managed directly by the AresDB instance.
- Tables are explicitly defined as either fact tables or dimension tables.
- Each table must have a primary key (over 1+ columns). The primary key is implemented as an unsorted index.
In addition, each table is configured with a batch size.
Each table consists of a list of columns. Each column is identified by a user-specified string name and by an auto-assigned ID (ranging from 0 to 65535). Renaming a column or changing its data type is not supported. A data type change can be achieved by converting data from the old column into a new column, and deleting the old column if possible. New columns can be added at any time. An existing column can be deleted only if it is not part of the primary key and is not the designated time column of a fact table. The primary key definition cannot change for any table.
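The column rules above can be sketched as a small in-memory model. This is an illustration only, not AresDB's implementation: the `Table` and `Column` classes, the data type strings, and the choice to keep deleted columns as tombstones so that IDs are never reused are all assumptions made for the example. It also assumes the time column of a fact table has ID 0.

```python
from dataclasses import dataclass, field
from typing import List

MAX_COLUMN_ID = 65535  # column IDs range from 0 to 65535


@dataclass
class Column:
    name: str
    data_type: str  # illustrative type name, e.g. "uint32"
    deleted: bool = False


@dataclass
class Table:
    name: str
    is_fact_table: bool
    primary_key: List[str]
    # The list index doubles as the auto-assigned column ID (assumption).
    columns: List[Column] = field(default_factory=list)

    def add_column(self, name: str, data_type: str) -> int:
        """Append a column and return its auto-assigned ID."""
        if any(c.name == name and not c.deleted for c in self.columns):
            raise ValueError(f"column {name!r} already exists")
        if len(self.columns) > MAX_COLUMN_ID:
            raise ValueError("column ID space exhausted")
        self.columns.append(Column(name, data_type))
        return len(self.columns) - 1

    def delete_column(self, name: str) -> None:
        """Mark a column as deleted, enforcing the deletion rules."""
        for col_id, col in enumerate(self.columns):
            if col.name == name and not col.deleted:
                break
        else:
            raise KeyError(name)
        if name in self.primary_key:
            raise ValueError("cannot delete a primary key column")
        if self.is_fact_table and col_id == 0:
            raise ValueError("cannot delete the time column of a fact table")
        col.deleted = True  # keep a tombstone so the ID is not reused
```

In this sketch a data type change would be modelled as `add_column` with the new type, a data conversion step outside the model, and then `delete_column` on the old column.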
Users can specify a default value for a column when creating the table or when adding a new column. The default value is specified as a string and is parsed according to the column's data type. Enum values are translated from string to uint8 or uint16, and new enum cases are added to the metastore while the default value is being set. Adding or changing the default value of an existing column is not allowed. If no default value is specified, this is equivalent to a default value of null. For that reason, explicitly specifying null as the default value is not allowed.
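As a rough sketch of how a string default might be parsed according to the column's data type, consider the function below. The function name, the type names (`"enum8"` and `"enum16"` stand in for enums backed by uint8 and uint16), the accepted boolean spellings, and the plain list that stands in for the metastore's enum cases are all hypothetical.

```python
from typing import List, Optional, Union

INT_RANGES = {
    "uint8": (0, 2**8 - 1),
    "uint16": (0, 2**16 - 1),
    "uint32": (0, 2**32 - 1),
    "int32": (-(2**31), 2**31 - 1),
}
# Maximum number of enum cases each backing integer type can encode.
ENUM_CAPACITY = {"enum8": 2**8, "enum16": 2**16}


def parse_default(
    raw: Optional[str],
    data_type: str,
    enum_cases: Optional[List[str]] = None,
) -> Union[None, bool, int, float]:
    """Parse a default value given as a string; None means "not specified"."""
    if raw is None:
        return None  # an unspecified default is equivalent to null
    if data_type == "bool":
        if raw not in ("true", "false"):
            raise ValueError(f"invalid bool default: {raw!r}")
        return raw == "true"
    if data_type in INT_RANGES:
        value = int(raw)
        low, high = INT_RANGES[data_type]
        if not low <= value <= high:
            raise ValueError(f"{value} is out of range for {data_type}")
        return value
    if data_type == "float32":
        return float(raw)
    if data_type in ENUM_CAPACITY:
        if enum_cases is None:
            raise ValueError("enum columns need a list of enum cases")
        if raw not in enum_cases:
            if len(enum_cases) >= ENUM_CAPACITY[data_type]:
                raise ValueError(f"too many enum cases for {data_type}")
            enum_cases.append(raw)  # stands in for a metastore update
        return enum_cases.index(raw)
    raise ValueError(f"unsupported data type: {data_type}")
```

For example, `parse_default("completed", "enum8", ["pending"])` appends the new case to the list and returns its integer code.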
A fact table (for time series events) has the following additional specifications in its schema:
- The first column (ID: 0) is always the designated time column, with a data type of uint32 (seconds since the epoch). This column cannot be changed in any way. Values for this column must be clean and non-null, should not be older than `now - archiving_delay` (until backfill is supported, see next), and should not be updated to a different value, especially when the old and new values belong to different UTC days.
- Data archiving delay: the amount of time to wait until data becomes stable (typically 24-48 hours) for archiving.
- Data archiving interval: how often data archiving should run.
- Archiving sort columns: a list of columns that specifies how the records in an archive batch should be sorted before run-length compression. The list does not have to cover all columns. For Uber's business this typically starts with city_id, since it is also used for filtering a sub-range of data for processing. When a column is deleted, it remains in this sort order as a placeholder. Sub-range filtering must match from the beginning of this sort order and terminates at any such placeholder. The list cannot be modified for now.
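Two of the rules in the list above lend themselves to a short sketch: the freshness check on time column values, and the way sub-range filtering follows the archiving sort order. The function names are hypothetical, and representing a deleted column's placeholder as `None` is an assumption made for the example.

```python
import time
from typing import List, Optional, Sequence, Set

UINT32_MAX = 2**32 - 1


def is_acceptable_event_time(
    timestamp: Optional[int],
    archiving_delay_seconds: int,
    now: Optional[int] = None,
) -> bool:
    """Check a time column value: non-null, uint32, not older than now - delay."""
    if timestamp is None:
        return False
    if not 0 <= timestamp <= UINT32_MAX:
        return False
    if now is None:
        now = int(time.time())
    return timestamp >= now - archiving_delay_seconds


def usable_sort_prefix(
    sort_columns: Sequence[Optional[str]],
    filtered_columns: Set[str],
) -> List[str]:
    """Return the leading sort columns usable for sub-range filtering.

    Matching starts at the beginning of the sort order and stops at the
    first column without a filter or at the first placeholder (None).
    """
    prefix: List[str] = []
    for column in sort_columns:
        if column is None or column not in filtered_columns:
            break
        prefix.append(column)
    return prefix
```

With a sort order of `["city_id", None, "status"]`, filters on both `city_id` and `status` can only use `city_id` for sub-range filtering, because matching stops at the placeholder.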
The primary key of a fact table is also periodically archived to disk and purged from memory. As a result, deduplication support is limited to a time window.
For dimension tables, a snapshot threshold can be specified as the number of upserts to accumulate before a snapshot is triggered.
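A minimal sketch of a snapshot threshold follows. The class name is hypothetical, and resetting the counter after each snapshot is an assumption rather than documented behaviour.

```python
class SnapshotTrigger:
    """Counts upserts and signals when a snapshot is due."""

    def __init__(self, threshold: int) -> None:
        if threshold <= 0:
            raise ValueError("threshold must be positive")
        self.threshold = threshold
        self.pending_upserts = 0
        self.snapshots_taken = 0

    def record_upsert(self) -> bool:
        """Record one upsert; return True if it triggers a snapshot."""
        self.pending_upserts += 1
        if self.pending_upserts >= self.threshold:
            self.pending_upserts = 0  # assumed: the count restarts after a snapshot
            self.snapshots_taken += 1
            return True
        return False
```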
The AresDB schema can be stored internally as one file per table, alongside the data files. This storage serves as metadata for the stored records, so it makes sense to keep it together with the data. Schema integration is a little out of scope for the core AresDB project, but it is worth discussing the surrounding landscape to better understand the bigger picture.
First, AQL validation (initial schema validation, bad query prevention), TSQ (time series query) transformation, and caching should be handled by Pythia, outside the scope of AresDB. Queries received by AresDB should be TSQ-compatible and validated against cached and bad queries. Additional schema validation is always needed since there could be schema discrepancies.
An external schema self-serving service allows users to create and modify AresDB tables, as well as to set up streaming/backfill ingestion against the schema. The schema is stored externally as the desired state. It can be pushed to or pulled by ingestion, Pythia, and AresDB separately. Upon receiving a new version of the schema, AresDB needs to diff it against its internal copy, make the corresponding data changes (e.g., deleting a column), and update its internal copy to the new one. This procedure can fail when incompatible changes are requested (e.g., deleting a primary key column).
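The diff step described above might look roughly like the sketch below. It assumes a simplified, hypothetical schema representation (`TableSchema`) keyed by column name, and it classifies a change as incompatible using only the rules stated on this page: the primary key cannot change, primary key and time columns cannot be deleted, and data types cannot change.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class TableSchema(NamedTuple):
    columns: Dict[str, str]  # column name -> data type (illustrative names)
    primary_key: Tuple[str, ...]
    time_column: Optional[str] = None  # set for fact tables only


class IncompatibleSchemaChange(Exception):
    """Raised when the desired schema cannot be applied."""


def diff_schema(
    current: TableSchema, desired: TableSchema
) -> Tuple[List[str], List[str]]:
    """Return (columns to add, columns to delete), or raise if incompatible."""
    if current.primary_key != desired.primary_key:
        raise IncompatibleSchemaChange("primary key definition cannot change")
    added = [name for name in desired.columns if name not in current.columns]
    deleted = [name for name in current.columns if name not in desired.columns]
    for name in deleted:
        if name in current.primary_key:
            raise IncompatibleSchemaChange(f"cannot delete primary key column {name!r}")
        if name == current.time_column:
            raise IncompatibleSchemaChange(f"cannot delete time column {name!r}")
    for name, data_type in desired.columns.items():
        if name in current.columns and current.columns[name] != data_type:
            raise IncompatibleSchemaChange(f"cannot change data type of {name!r}")
    return added, deleted
```

Because this sketch compares columns by name, a rename shows up as one deletion plus one addition, which matches the statement that column renaming is not supported.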
Path | GET | POST | PUT | DELETE |
---|---|---|---|---|
/schema/tables | ListTables | AddTable | - | - |
/schema/tables/{table} | GetTable | - | UpdateTable | DeleteTable |
/schema/tables/{table}/columns | - | AddColumn | - | - |
/schema/tables/{table}/columns/{column} | - | - | UpdateColumn | DeleteColumn |
/schema/tables/{table}/columns/{column}/enum-cases | ListEnumCases | AddEnumCase | - | - |
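
To make the table above concrete, the sketch below resolves an HTTP method and path to the operation name listed in the table. It models only the routing table; the `resolve` helper is hypothetical, and request and response payloads are not covered because they are not described here.

```python
import re
from typing import Dict, List, Tuple

# (path template, {HTTP method: operation}) pairs copied from the table above.
ROUTES: List[Tuple[str, Dict[str, str]]] = [
    ("/schema/tables", {"GET": "ListTables", "POST": "AddTable"}),
    ("/schema/tables/{table}",
     {"GET": "GetTable", "PUT": "UpdateTable", "DELETE": "DeleteTable"}),
    ("/schema/tables/{table}/columns", {"POST": "AddColumn"}),
    ("/schema/tables/{table}/columns/{column}",
     {"PUT": "UpdateColumn", "DELETE": "DeleteColumn"}),
    ("/schema/tables/{table}/columns/{column}/enum-cases",
     {"GET": "ListEnumCases", "POST": "AddEnumCase"}),
]


def resolve(method: str, path: str) -> Tuple[str, Dict[str, str]]:
    """Return (operation name, path parameters) for a request."""
    for template, operations in ROUTES:
        # Turn "{table}" into a named group that matches one path segment.
        pattern = "^" + re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", template) + "$"
        match = re.match(pattern, path)
        if match:
            operation = operations.get(method.upper())
            if operation is None:
                raise LookupError(f"{method} is not supported on {template}")
            return operation, match.groupdict()
    raise LookupError(f"no route matches {path}")
```

For instance, `resolve("DELETE", "/schema/tables/trips/columns/fare")` yields `DeleteColumn` together with the extracted `table` and `column` parameters, while a `PUT` on `/schema/tables` raises `LookupError`.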