
[Q] Should benchmarks be added? #225

Open
ablearthy opened this issue Jun 9, 2021 · 0 comments

Comments

@ablearthy

The performance is really impressive. Thanks a lot!

I wrote a simple benchmark:

from clickhouse_driver import Client

import random
import string
import time
# import datetime

def get_metric_name():
    return ''.join(random.choices(string.ascii_letters, k=random.randint(4, 16)))

def get_fake_data(metrics_count: int, duration: int):
    start_time = random.randint(1, int(time.time()))
    datetimes = [start_time + i for i in range(duration)]
    metric_names = [get_metric_name() for _ in range(metrics_count)]

    data = []
    for t in datetimes:
        for m in metric_names:
            data.append(('fake_string', t, m, random.uniform(-1e4, 1e4)))
            # data.append(('fake_string', datetime.datetime.fromtimestamp(t), m, random.uniform(-1e4, 1e4)))
    
    return data

def main():
    deltas = []
    client = Client('localhost')

    for i in range(100):
        data = get_fake_data(200, 12 * 60)

        start_time = time.perf_counter()

        client.execute("INSERT INTO bench.insert (ClusterName, Time, Metric, Value) VALUES", data)

        elapsed = time.perf_counter() - start_time
        print(f"#{i} took {elapsed * 1000: .3f} ms")
        deltas.append(elapsed)

    client.disconnect()

    print(f"Average is {sum(deltas) / len(deltas) * 1000: .4f} ms")

Table:

CREATE TABLE bench.insert
(
    ClusterName String,
    Metric String,
    Time DateTime,
    Value Float32
)
ENGINE = MergeTree
PARTITION BY (ClusterName, toYYYYMM(Time))
ORDER BY (ClusterName, Metric, Time)
The average time needed to insert the data is 211 ms. Compared to C++ (clickhouse-cpp), where the same operation takes around 114 ms, that's really impressive.

However, passing datetime.datetime objects instead of int timestamps increases the average time to 267 ms.

I also tried to insert the data by columns, but that script isn't finished yet (data = {'ClusterName': [...], 'Metric': [...], ...}; ...; client.execute("INSERT INTO bench.insert (ClusterName, Time, Metric, Value) VALUES", data, columnar=True)); roughly what I mean is sketched below.
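For reference, a minimal sketch of what the columnar variant might look like, assuming execute() with columnar=True accepts the data as a list of columns in the same order as the column list (table and helper names are the ones from the benchmark above; untested):

# Hypothetical columnar variant of the same insert (untested sketch).
# Assumes columnar=True expects a list of columns in column-list order.
rows = get_fake_data(200, 12 * 60)
cluster_names, times, metrics, values = map(list, zip(*rows))  # transpose rows into columns

client.execute(
    "INSERT INTO bench.insert (ClusterName, Time, Metric, Value) VALUES",
    [cluster_names, times, metrics, values],
    columnar=True,
)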

Additionally, I tried inserting a dataframe. The result is 3.4 s (df = pd.DataFrame(get_fake_data(200, 720), columns=['ClusterName', 'Time', 'Metric', 'Value']); ...; client.insert_dataframe("INSERT INTO bench.insert (ClusterName, Time, Metric, Value) VALUES", df)); see the sketch below.
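For reference, a rough sketch of the dataframe path, assuming insert_dataframe requires numpy support to be enabled on the client via the use_numpy setting (column dtypes, e.g. for the DateTime column, may need adjusting; untested):

import pandas as pd

from clickhouse_driver import Client

# insert_dataframe needs numpy support enabled on the client (assumption: use_numpy setting)
client = Client('localhost', settings={'use_numpy': True})

# reuses get_fake_data from the benchmark script above
df = pd.DataFrame(
    get_fake_data(200, 720),
    columns=['ClusterName', 'Time', 'Metric', 'Value'],
)

client.insert_dataframe(
    "INSERT INTO bench.insert (ClusterName, Time, Metric, Value) VALUES",
    df,
)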

I'm not an expert on ClickHouse benchmarks, but maybe we should add them (this article is not enough)? We could probably also compare it with other engines.


A little bit about performance

I read there that

For most data types driver uses binary pack() / unpack() for serialization / deserialization.

The problem is that struct.pack / struct.unpack is slow compared to array.array.tobytes:

python -m timeit -s "import array, random; arr = array.array('f', [random.uniform(-1e4, 1e4) for _ in range(1000000)])" -n 100 "arr.tobytes()"
# 100 loops, best of 5: 1.48 msec per loop
python -m timeit -s "import struct, random; lst = [random.uniform(-1e4, 1e4) for _ in range(1000000)]" -n 100 "struct.pack(f'<{len(lst)}f', *lst)"
# 100 loops, best of 5: 28.6 msec per loop
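As a quick sanity check (my own sketch, not from the driver), both approaches should produce the same little-endian float32 bytes, so switching to array.array is purely a speed win:

import array
import random
import struct

values = [random.uniform(-1e4, 1e4) for _ in range(1000)]

packed_struct = struct.pack(f'<{len(values)}f', *values)  # slow path
packed_array = array.array('f', values).tobytes()         # fast path

# identical byte strings on little-endian machines
assert packed_struct == packed_array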

As far as I know, struct.pack/unpack isn't used anymore, so this isn't a problem today.
