Test cluster initial state transfer after Infinispan 14.0.13 upgrade #438

Closed
ahus1 opened this issue Jul 21, 2023 · 3 comments


ahus1 commented Jul 21, 2023

With Infinispan 14.0.10, we've seen Pods fail to start up because the initial state transfer did not complete. Some fixes have been added in 14.0.13, so it would be worth having an extended run that repeatedly kills Pods.

Details such as how much load to apply, how many Pods to run, and how many active sessions to hold still need to be defined as part of a refinement.

Possible pre-requisite: #440

Suggestion:

  • Succeed with the state transfer with 100 chaos-killed pods under reasonable load (for example 200 logins per second)
  • Check if metrics can be collected on healthy nodes while one node restarts (this had problems / timeouts in earlier versions)
  • Verify that no new bug came up with the migration to Aurora
kami619 commented Jul 28, 2023

Succeed with the state transfer with 100 chaos-killed pods under reasonable load (for example 200 logins per second)

We performed a run with the following setup:

Cluster Config:

  • External Aurora configured
  • Sticky sessions disabled
  • 15 DB Connections in the pool for each Keycloak pod
  • 5 CPU request/limit for each Keycloak pod
  • 2 GB memory request and 3 GB memory limit set for each Keycloak pod
  • OTEL disabled
  • 6 Keycloak pods
  • Legacy store
  • eu-west-1 AWS region used for both Keycloak ROSA cluster as well as Aurora PostgreSQL Cluster

Keycloak config:

  • Default password hashing algorithm and iteration count
  • 100,000 users and clients created in realm-0
  • 100,000 user sessions created before the load run

Load run config:

  • Extended load run with 200 users per second, running for 4,600 seconds
  • kc-chaos.sh script ran for about 1.2 hours, killing one pod at a time over 100 iterations
  • Logs collected at the end of the run
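For illustration, a minimal sketch of what such a chaos loop could look like. This is not the actual kc-chaos.sh; the pod label, namespace, and kill interval are assumptions, and the sketch defaults to a dry run so nothing is deleted unless DRY_RUN=0 is set explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical chaos loop: force-delete one random Keycloak pod per iteration.
ITERATIONS="${ITERATIONS:-100}"   # 100 kills, as in the run described above
INTERVAL="${INTERVAL:-45}"        # seconds between kills (~1.2 h total)
NAMESPACE="${NAMESPACE:-keycloak}"
DRY_RUN="${DRY_RUN:-1}"           # defaults to dry run; set to 0 for a real cluster

for i in $(seq 1 "$ITERATIONS"); do
  if [ "$DRY_RUN" = "1" ]; then
    echo "iteration $i: would force-delete one random Keycloak pod"
    continue
  fi
  # Pick one Keycloak pod at random (label selector is an assumption).
  pod=$(kubectl get pods -n "$NAMESPACE" -l app=keycloak \
        -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | shuf -n 1)
  echo "iteration $i: force-deleting pod $pod"
  # grace-period=0 --force terminates the pod immediately, simulating a crash.
  kubectl delete pod -n "$NAMESPACE" "$pod" --grace-period=0 --force
  sleep "$INTERVAL"
done
echo "completed $ITERATIONS iterations"
```

The forced deletion (rather than a graceful shutdown) is what exercises the initial state transfer on the replacement pod.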

Observations:

Gatling result: (screenshot in the original issue)

Pod recovery time: (screenshot in the original issue)

Check if metrics can be collected on healthy nodes while one node restarts (this had problems / timeouts in earlier versions)

We could collect metrics from all nodes during the restarts, and we didn't observe any data loss on the other pods when one pod in the cluster was terminated forcefully.
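As a rough illustration of such a check, not the actual tooling used in the run: the loop below scrapes each pod's metrics endpoint and counts failures. The pod hostnames, port 8080, and the /metrics path are assumptions; Keycloak only serves /metrics when started with --metrics-enabled=true.

```shell
#!/usr/bin/env bash
# Hypothetical metrics-availability check while one pod restarts.
PODS="${PODS:-keycloak-0 keycloak-1 keycloak-2}"   # assumed pod hostnames
SCRAPE="${SCRAPE:-curl -sf --max-time 2}"          # override (e.g. SCRAPE=true) to dry-run
failures=0
for pod in $PODS; do
  # A healthy node should answer the scrape within the timeout.
  if $SCRAPE "http://$pod:8080/metrics" > /dev/null 2>&1; then
    echo "scrape ok: $pod"
  else
    failures=$((failures + 1))
    echo "scrape failed: $pod"
  fi
done
echo "failed scrapes: $failures"
```

Running this repeatedly during a chaos kill would surface the scrape timeouts seen in earlier versions.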

To verify no new bug came up with the migration to Aurora

No issues observed here; performance was comparable to that of the internally hosted PostgreSQL pod.

kami619 commented Jul 28, 2023

@ahus1 Based on this information, I think we can close this ticket: both the Infinispan fixes and the externalized Aurora DB have been tested.


ahus1 commented Jul 28, 2023

Agreed. Thanks!
