Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(performance): change loader type #9484

Merged
merged 1 commit into from
Feb 20, 2025

Conversation

juliayakovlev
Copy link
Contributor

@juliayakovlev juliayakovlev commented Dec 4, 2024

Run the elasticity / operations / upgrade tests with c-s image v3.17.3.

Testing

  • [ ]

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@juliayakovlev juliayakovlev force-pushed the elasticity_loader_change branch 3 times, most recently from 3f7c381 to ed5cc0c Compare December 4, 2024 16:13
@@ -32,8 +32,10 @@ print_kernel_callstack: true

store_perf_results: true
email_recipients: ["[email protected]"]
use_prepared_loaders: true
#use_prepared_loaders: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we are putting this in a comment ?

@fruch
Copy link
Contributor

fruch commented Dec 11, 2024

@juliayakovlev

can you point to the results with this one ?
i.e. did it helped in anything it was supposed to help with ?

@roydahan
Copy link
Contributor

@juliayakovlev

can you point to the results with this one ? i.e. did it helped in anything it was supposed to help with ?

For the elasticity issue it solved the imbalance issue and got better latencies.
It was shown in my testing here

fruch
fruch previously approved these changes Dec 16, 2024
Copy link
Contributor

@fruch fruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@juliayakovlev just lets get rid of the commented out configurations

@juliayakovlev
Copy link
Contributor Author

LGTM

@juliayakovlev just lets get rid of the commented out configurations

done

use_hdr_cs_histogram: true
use_placement_group: true
use_capacity_reservation: true
email_subject_postfix: 'latency during operations'
stress_image:
cassandra-stress: 'scylladb/cassandra-stress:3.17.0'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you compare results?

stress_image:
cassandra-stress: 'scylladb/cassandra-stress:3.13.0'
cassandra-stress: 'scylladb/cassandra-stress:3.17.0'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juliayakovlev IIUC the results with this client are lower than original, right?
If so, we what are we doing with it?

Copy link
Contributor Author

@juliayakovlev juliayakovlev Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roydahan

With 2024.3.0, read load and tablets enabled we received better result.
In all other cases max throughput is less and latency is higher

NOTE: Every throughput number below is link to this result in excel

Mixed load test

  1. Scylla version 2024.1.0, tablets disabled:
c-s version Max throughput P99 read
3.13.0 561309 482.61
3.17.0 553260 515.38
  1. Scylla version 2024.3.0~dev-20241106, tablets enabled:
c-s version Max throughput P99 read
3.13.0 505753 370.15
3.17.0 483284 495.19

Read load test

  1. Scylla version 2024.1.0, tablets disabled:
c-s version Max throughput P99 read
3.13.0 877512 46.73
3.17.0 853437 47.78
  1. Scylla version 2024.3.0~dev-20241106, tablets enabled:
c-s version Max throughput P99 read
3.13.0 687654 41.32
3.17.0 762589 39.29

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's nothing in this change that is supposed to affect vnodes case.

We see here ~24k ops, in both directions, which seems a bit contradicting...

Can you share exact links to all those runs, so we can cross check it ?

Sound like we'll need to gather more results to understand what's going on

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fruch @roydahan
I re-ran the test with Scylla version 2024.3.0~dev-20241106, tablets enabled and c-s version 3.17.0.

  1. Max throughput in mixed load:
Max throughput P99 read link to run
541625 340.26 https://argus.scylladb.com/tests/scylla-cluster-tests/8217747c-4120-436c-aac7-40c3c659f6c6
525913 916.98 https://argus.scylladb.com/tests/scylla-cluster-tests/12342e92-d839-4aab-aeb5-fdf323453b4e
559173 355.99 https://argus.scylladb.com/tests/scylla-cluster-tests/e1ddb22d-7ad0-4080-9343-8558a6ba5772

In last 3 runs the max throughput is improved in mixed load + tablets enabled + c-s 3.17.0.
Max throughput in run with c-s 3.13.0 is 505753 - see #9484 (comment)

  1. Max throughput in read load:
Max throughput P99 read link to run
677463 42.83 https://argus.scylladb.com/tests/scylla-cluster-tests/b6cf3667-39ec-4662-9aee-22b3f3d2c8d5
782369 39.98 https://argus.scylladb.com/tests/scylla-cluster-tests/5747dfe0-61c9-450a-a357-097a13c3186c

The results with c-s 3.17.0 are not consistent

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The results with 3.13.0 are proved to be consistent, right?
Can you link please summarize them side by side and open a new issue to our cassandra-stress repo?
We can't switch to this release till we find out why it's not consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I suggest 3 (or more consecutive runs on of unthrottled) on the same cluster to see if it's consistent on the same cluster.
(Best would be 3 runs with 3.13.0 and 3 runs with 3.17.0 on the same cluster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roydahan will do when back

@juliayakovlev
Copy link
Contributor Author

@roydahan @fruch

Scylla version 2024.3.0~dev-20241106
Every test ran 3 untrottled cycles on the same cluster

Tablets disabled

  1. Max throughput in mixed load:
c-s version cycle 1 cycle 2 cycle 3 link
3.13.0 561599 561712 557011 https://argus.scylladb.com/tests/scylla-cluster-tests/516bfdc7-215b-4bdd-bd1c-be795c389134
3.17.0 533610 550041 553213 https://argus.scylladb.com/tests/scylla-cluster-tests/f667b4e3-bb37-43cf-aa40-edb6930480aa
  1. Max throughput in read load:
c-s version cycle 1 cycle 2 cycle 3 link
3.13.0 899256 885138 881635 https://argus.scylladb.com/tests/scylla-cluster-tests/0d7b9ac8-c847-456a-ae47-1e8e312cdac8
3.17.0 873856 854535 875792 https://argus.scylladb.com/tests/scylla-cluster-tests/f1e28be1-e166-423e-bde5-86852f5a3f02

Tablets enabled

  1. Max throughput in mixed load:
c-s version cycle 1 cycle 2 cycle 3 link
3.13.0 543966 542706 543258 https://argus.scylladb.com/tests/scylla-cluster-tests/aacd976a-21b8-4c95-a00c-cdeea51d7ba8
3.17.0 525962 523390 525102 https://argus.scylladb.com/tests/scylla-cluster-tests/b449b191-8816-445f-8f48-86ff6451835f
  1. Max throughput in read load:
c-s version cycle 1 cycle 2 cycle 3 link
3.13.0 780703 786261 782665 https://argus.scylladb.com/tests/scylla-cluster-tests/076805d0-3cd8-4b12-8697-49599cafc43d
3.17.0 644307 639954 640790 https://argus.scylladb.com/tests/scylla-cluster-tests/54f0e5d0-af63-46cf-ba97-4b71057a1f16

Max throughput is higher (better) when the test runs with c-s 3.13.0 than with c-s 3.17.0

I'll open the issue in c-s repo

@roydahan
Copy link
Contributor

An easier (and cheaper) approach to prove it would be to run the test on the same cluster (once with 3.13 and then to switch to 3.17).

Anyway, you've made enough runs to prove that it can't be a coincidence and the issue is with 3.17

@juliayakovlev
Copy link
Contributor Author

juliayakovlev commented Dec 30, 2024

@juliayakovlev
Copy link
Contributor Author

There is new driver version with fix. The test is running

@juliayakovlev
Copy link
Contributor Author

juliayakovlev commented Feb 19, 2025

This is new driver test results scylladb/cassandra-stress#38 (comment)
@fruch @roydahan do we want to merge this?

@roydahan
Copy link
Contributor

In all the tests that we don't measure max throughput (e.g. elasticity, operations, upgrades) we can change the loader version to 3.17.3.

@juliayakovlev juliayakovlev force-pushed the elasticity_loader_change branch from c2b075a to 4472fa5 Compare February 20, 2025 12:47
@juliayakovlev juliayakovlev changed the title fix(elasticity): change loader type fix(performance): change loader type Feb 20, 2025
Run the elasticity / operations / upgrade tests with c-s image v3.17.3.
@juliayakovlev juliayakovlev force-pushed the elasticity_loader_change branch from 4472fa5 to 60e8133 Compare February 20, 2025 12:52
@juliayakovlev
Copy link
Contributor Author

@roydahan @fruch please, review

Copy link
Contributor

@fruch fruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fruch fruch merged commit 3cc432e into scylladb:master Feb 20, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants