
[Q] Should benchmarks be added? #225

Open
ablearthy opened this issue Jun 9, 2021 · 0 comments

Comments

@ablearthy

The performance is really impressive. Thanks a lot!

I wrote a simple benchmark:

from clickhouse_driver import Client

import random
import string
import time
# import datetime

def get_metric_name():
    return ''.join(random.choices(string.ascii_letters, k=random.randint(4, 16)))

def get_fake_data(metrics_count: int, duration: int):
    start_time = random.randint(1, int(time.time()))
    datetimes = [start_time + i for i in range(duration)]
    metric_names = [get_metric_name() for _ in range(metrics_count)]

    data = []
    for t in datetimes:
        for m in metric_names:
            data.append(('fake_string', t, m, random.uniform(-1e4, 1e4)))
            # data.append(('fake_string', datetime.datetime.fromtimestamp(t), m, random.uniform(-1e4, 1e4)))
    
    return data

def main():
    deltas = []
    client = Client('localhost')

    for i in range(100):
        data = get_fake_data(200, 12 * 60)

        start_time = time.perf_counter()

        client.execute("INSERT INTO bench.insert (ClusterName, Time, Metric, Value) VALUES", data)

        elapsed = time.perf_counter() - start_time
        print(f"#{i} took {elapsed * 1000: .3f} ms")
        deltas.append(elapsed)

    client.disconnect()

    print(f"Average is {sum(deltas) / len(deltas) * 1000: .4f} ms")

Table:

CREATE TABLE bench.insert
(
    ClusterName String,
    Metric String,
    Time DateTime,
    Value Float32
)
ENGINE = MergeTree
PARTITION BY (ClusterName, toYYYYMM(Time))
ORDER BY (ClusterName, Metric, Time)
The average time needed to insert the data is 211 ms. Compared to C++ (clickhouse-cpp), where the same operation takes around 114 ms, that's really impressive.

However, passing datetime.datetime objects instead of int timestamps increases the average time to 267 ms.

I also tried to insert the data by columns, but that script isn't finished yet (data = {'ClusterName': [...], 'Metric': [...], ...}; ...; client.execute("INSERT INTO bench.insert (ClusterName, Time, Metric, Value) VALUES", data, columnar=True)); roughly what I mean is sketched below.
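For reference, a minimal sketch of what the columnar variant might look like, assuming execute() with columnar=True accepts the data as a list of columns in the same order as the column list (table and helper names are the ones from the benchmark above; untested):

# Hypothetical columnar variant of the same insert (untested sketch).
# Assumes columnar=True expects a list of columns in column-list order.
rows = get_fake_data(200, 12 * 60)
cluster_names, times, metrics, values = map(list, zip(*rows))  # transpose rows into columns

client.execute(
    "INSERT INTO bench.insert (ClusterName, Time, Metric, Value) VALUES",
    [cluster_names, times, metrics, values],
    columnar=True,
)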

Additionally, I tried inserting a dataframe. The result is 3.4 s (df = pd.DataFrame(get_fake_data(200, 720), columns=['ClusterName', 'Time', 'Metric', 'Value']); ...; client.insert_dataframe("INSERT INTO bench.insert (ClusterName, Time, Metric, Value) VALUES", df)); see the sketch below.
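For reference, a rough sketch of the dataframe path, assuming insert_dataframe requires numpy support to be enabled on the client via the use_numpy setting (column dtypes, e.g. for the DateTime column, may need adjusting; untested):

import pandas as pd

from clickhouse_driver import Client

# insert_dataframe needs numpy support enabled on the client (assumption: use_numpy setting)
client = Client('localhost', settings={'use_numpy': True})

# reuses get_fake_data from the benchmark script above
df = pd.DataFrame(
    get_fake_data(200, 720),
    columns=['ClusterName', 'Time', 'Metric', 'Value'],
)

client.insert_dataframe(
    "INSERT INTO bench.insert (ClusterName, Time, Metric, Value) VALUES",
    df,
)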

I'm not an expert on ClickHouse benchmarks, but maybe we should add them (this article is not enough)? We could probably also compare it with other engines.


A little bit about performance

I read there that

For most data types driver uses binary pack() / unpack() for serialization / deserialization.

The problem is that struct.pack / struct.unpack is slow compared to array.array.tobytes:

python -m timeit -s "import array, random; arr = array.array('f', [random.uniform(-1e4, 1e4) for _ in range(1000000)])" -n 100 "arr.tobytes()"
# 100 loops, best of 5: 1.48 msec per loop
python -m timeit -s "import struct, random; lst = [random.uniform(-1e4, 1e4) for _ in range(1000000)]" -n 100 "struct.pack(f'<{len(lst)}f', *lst)"
# 100 loops, best of 5: 28.6 msec per loop
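As a quick sanity check (my own sketch, not from the driver), both approaches should produce the same little-endian float32 bytes, so switching to array.array is purely a speed win:

import array
import random
import struct

values = [random.uniform(-1e4, 1e4) for _ in range(1000)]

packed_struct = struct.pack(f'<{len(values)}f', *values)  # slow path
packed_array = array.array('f', values).tobytes()         # fast path

# identical byte strings on little-endian machines
assert packed_struct == packed_array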

As far as I know, struct.pack/unpack isn't used anymore, so this isn't a problem today.
