Cassandra database design #6
The database schema in Marco Argenti's thesis (figure 5.1) is a good place to start. In that, we have the following tables and columns:
These can be easily written in Cassandra as column families. However, Cassandra is a schemaless, distributed database, which raises a few problems. Let's assume that we want to look for environmental data with a certain trial ID. If we were to query Cassandra with something like SELECT * FROM EnvironmentData WHERE TrialID = ***, we may end up having to read from several nodes of our Cassandra cluster, since the rows matching that trial ID can be spread across them, and this has a significant impact on performance.
Attached is a quick sketch showing Marco's database schema, plus the additional indexes on the side. |
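The additional indexes mentioned above could be maintained by hand as lookup tables. The following is only a minimal sketch of that idea: the table name environment_data_ids_by_trial, its columns, and the value 'trial-1' are assumptions made for illustration and do not come from Marco's schema.

```cql
-- Hypothetical lookup ("inverted index") table: one partition per trial ID,
-- so all IDs for a trial live together on the same replicas.
CREATE TABLE environment_data_ids_by_trial (
    trial_id text,
    environment_data_id uuid,
    PRIMARY KEY (trial_id, environment_data_id)
);

-- The application has to write one index row for every environment data row.
-- In practice the ID is the one already written to the main table;
-- uuid() is used here only to generate a placeholder value.
INSERT INTO environment_data_ids_by_trial (trial_id, environment_data_id)
VALUES ('trial-1', uuid());

-- Step 1: single-partition read that returns the IDs belonging to the trial.
SELECT environment_data_id
FROM environment_data_ids_by_trial
WHERE trial_id = 'trial-1';

-- Step 2 (not shown): fetch each row from the main table by its ID.
```

The cost of this approach is that the application must keep the lookup table consistent with the main table on every write.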
@Cerfoglg thank you for the description. Some notes and TODOs:
We should also consider when it makes sense to add Materialized Views to access data in a more performant way: http://www.datastax.com/dev/blog/new-in-cassandra-3-0-materialized-views |
Actually, looking at materialised views, if I'm understanding it correctly, we may just want to use those instead of creating explicit inverted indexes, as they help solve the same issue. Opinions? |
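As a rough illustration of the materialized view option, the sketch below assumes Cassandra 3.0 or later. The abbreviated base table, the view name environment_data_by_trial, and the value 'trial-1' are assumptions made for this example, not an agreed design.

```cql
-- Abbreviated, hypothetical base table: rows are partitioned by
-- (trial_id, experiment_id), so a query on trial_id alone is not allowed.
CREATE TABLE environment_data (
    trial_id text,
    experiment_id text,
    environment_data_id uuid,
    cpu_total_usage bigint,
    PRIMARY KEY ((trial_id, experiment_id), environment_data_id)
);

-- View partitioned by trial_id alone. Every primary key column of the base
-- table must appear in the view's primary key and be restricted with
-- IS NOT NULL in the WHERE clause.
CREATE MATERIALIZED VIEW environment_data_by_trial AS
    SELECT * FROM environment_data
    WHERE trial_id IS NOT NULL
      AND experiment_id IS NOT NULL
      AND environment_data_id IS NOT NULL
    PRIMARY KEY (trial_id, experiment_id, environment_data_id);

-- The lookup by trial ID now reads a single partition of the view.
SELECT * FROM environment_data_by_trial WHERE trial_id = 'trial-1';
```

Cassandra updates the view on every write to the base table, so the cheaper reads come at the cost of extra work per write; that trade-off belongs in the comparison with hand-maintained indexes.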
@Cerfoglg it seems so. Take a look at the following three references in order to come up with the right approach according to our requirements:
Write down the PROs and CONs of both approaches. First, clearly write down the requirements we have for writing and reading data. |
Attaching here the code to create the Cassandra database. To use this, run cqlsh with option -f to pass this file.
CREATE KEYSPACE benchflow
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
USE benchflow;
CREATE TABLE environment_data (
environment_data_id uuid,
container_properties_id uuid,
read_time text,
cpu_percpu_usage list<bigint>,
cpu_total_usage bigint,
cpu_percent_usage float,
cpu_throttling_data map<text, bigint>,
memory_usage bigint,
memory_max_usage bigint,
network_interfaces map<text, uuid>,
experiment_id text,
trial_id text,
PRIMARY KEY ((trial_id, experiment_id), environment_data_id, container_properties_id)
);
CREATE TABLE network_interface_data (
network_interface_data_id uuid,
network_rx_dropped bigint,
network_rx_bytes bigint,
network_rx_errors bigint,
network_tx_packets bigint,
network_tx_dropped bigint,
network_rx_packets bigint,
network_tx_errors bigint,
network_tx_bytes bigint,
PRIMARY KEY (network_interface_data_id)
);
CREATE TABLE experiment (
experiment_id text,
replication_num int,
trial_id text,
PRIMARY KEY ((trial_id, experiment_id))
);
CREATE TABLE process (
process_instance_id uuid,
source_process_instance_id text,
process_definition_id text,
start_time timestamp,
duration bigint,
end_time timestamp,
experiment_id text,
trial_id text,
PRIMARY KEY ((trial_id, experiment_id), process_instance_id)
);
CREATE TABLE construct (
construct_instance_id uuid,
source_construct_instance_id text,
construct_type text,
construct_name text,
start_time timestamp,
duration bigint,
end_time timestamp,
process_instance_id uuid,
experiment_id text,
trial_id text,
PRIMARY KEY ((trial_id, experiment_id), construct_instance_id, process_instance_id)
);
CREATE TABLE container_properties (
container_properties_id uuid,
container_id text,
experiment_id text,
trial_id text,
host_id uuid,
environment map<text, text>,
image text,
labels list<text>,
links list<text>,
log_driver text,
u_limits map<text, int>,
volume_driver text,
volumes_from list<text>,
cpu_shares int,
cpu_set_cpus text,
cpu_set_mems text,
cpu_quota int,
cpu_period int,
blkio_weight int,
mem_limit text,
mem_swap_limit text,
mem_reservation_limit text,
mem_kernel_limit text,
memory_swappiness int,
oom_kill_disable boolean,
privileged boolean,
read_only boolean,
restart text,
user text,
name text,
network text,
restart_policy text,
PRIMARY KEY ((trial_id, experiment_id), container_properties_id, host_id)
);
CREATE TABLE host_properties (
host_id uuid,
cpu_cfs_period boolean,
cpu_cfs_quota boolean,
debug boolean,
discovery_backend text,
docker_root_dir text,
driver text,
driver_status list<text>,
execution_driver text,
experimental_build boolean,
http_proxy text,
https_proxy text,
ipv4_forwarding boolean,
index_server_address text,
init_path text,
init_sha1 text,
kernel_version text,
labels list<text>,
mem_total bigint,
memory_limit boolean,
n_cpu int,
n_events_listener int,
n_fd int,
n_goroutines int,
name text,
no_proxy text,
oom_kill_disable boolean,
operating_system text,
swap_limit boolean,
system_time text,
server_version text,
docker_version text,
docker_os text,
docker_kernel_version text,
docker_go_version text,
docker_git_commit text,
docker_arch text,
docker_api_version text,
docker_experimental boolean,
alias text,
external_ip text,
internal_ip text,
hostname text,
purpose text,
PRIMARY KEY(host_id)
);
Definitions:
Setup:
|
8.1
...
"networks": {
  "eth0": {
    "rx_bytes": 5338,
    "rx_dropped": 0,
    "rx_errors": 0,
    "rx_packets": 36,
    "tx_bytes": 648,
    "tx_dropped": 0,
    "tx_errors": 0,
    "tx_packets": 8
  },
  "eth5": {
    "rx_bytes": 4641,
    "rx_dropped": 0,
    "rx_errors": 0,
    "rx_packets": 26,
    "tx_bytes": 690,
    "tx_dropped": 0,
    "tx_errors": 0,
    "tx_packets": 9
  }
}
...
...
"cpu_stats" : {
  "cpu_usage" : {
    "percpu_usage" : [
      16970827,
      1839451,
      7107380,
      10571290
    ],
    ...
    "total_usage" : 36488948,
    ...
  },
  "system_cpu_usage" : 20091722000000000,
  "throttling_data" : {}
}
... |
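To make the link between these stats samples and the tables posted earlier concrete, here is one possible way the sample values could be stored. It is only a sketch: it assumes the benchflow keyspace and the environment_data and network_interface_data tables from the script above already exist, and the UUID literals, 'trial-1', and 'experiment-1' are placeholder values invented for the example.

```cql
USE benchflow;

-- One row per network interface (values taken from the "eth0" sample).
INSERT INTO network_interface_data (
    network_interface_data_id,
    network_rx_bytes, network_rx_dropped, network_rx_errors, network_rx_packets,
    network_tx_bytes, network_tx_dropped, network_tx_errors, network_tx_packets
) VALUES (
    11111111-1111-1111-1111-111111111111,
    5338, 0, 0, 36,
    648, 0, 0, 8
);

-- One row per network interface (values taken from the "eth5" sample).
INSERT INTO network_interface_data (
    network_interface_data_id,
    network_rx_bytes, network_rx_dropped, network_rx_errors, network_rx_packets,
    network_tx_bytes, network_tx_dropped, network_tx_errors, network_tx_packets
) VALUES (
    22222222-2222-2222-2222-222222222222,
    4641, 0, 0, 26,
    690, 0, 0, 9
);

-- The environment data row points at the interface rows through the
-- network_interfaces map (interface name -> network_interface_data_id).
-- The empty "throttling_data" object is left unset, because Cassandra
-- stores an empty collection as null anyway.
INSERT INTO environment_data (
    trial_id, experiment_id, environment_data_id, container_properties_id,
    cpu_percpu_usage, cpu_total_usage, network_interfaces
) VALUES (
    'trial-1', 'experiment-1', uuid(), 33333333-3333-3333-3333-333333333333,
    [16970827, 1839451, 7107380, 10571290], 36488948,
    {'eth0': 11111111-1111-1111-1111-111111111111,
     'eth5': 22222222-2222-2222-2222-222222222222}
);
```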
8.2 |
@Cerfoglg thank you for the update. Some remaining points:
|
@Cerfoglg thank you for the effort. I made the following changes:
If you need to change something, please do it on: https://gist.github.com/VincenzoFerme/c3f142935cc0f89a99c9 (the gist has been removed because it is out of date) |
@Cerfoglg something we should evaluate to enhance the performance of our schema definition:
If you need to change something, please do it on: https://gist.github.com/VincenzoFerme/c3f142935cc0f89a99c9 (the gist has been removed because it is out of date) |
Discuss the design of the Cassandra database, similar to what has been done for Minio in #4