Test cluster initial state transfer after Infinispan 14.0.13 upgrade #438

Closed
ahus1 opened this issue Jul 21, 2023 · 3 comments


ahus1 commented Jul 21, 2023

With Infinispan 14.0.10, we've seen Pods fail to start up because the initial state transfer did not complete. Some fixes have been added in 14.0.13, so it would be worth having an extended run that repeatedly kills Pods.

Details such as how much load to apply, how many Pods to run, and how many active sessions to hold still need to be defined as part of a refinement.

Possible pre-requisite: #440

Suggestion:

  • Succeed with the state transfer with 100 chaos-killed pods under reasonable load (for example 200 logins per second)
  • Check if metrics can be collected on healthy nodes while one node restarts (this had problems / timeouts in earlier versions)
  • Verify that no new bug came up with the migration to Aurora
kami619 commented Jul 28, 2023

Succeed with the state transfer with 100 chaos-killed pods under reasonable load (for example 200 logins per second)

We performed a run with the following setup:

Cluster Config:

  • External Aurora configured
  • Sticky sessions disabled
  • 15 DB Connections in the pool for each Keycloak pod
  • 5 CPU request/limit for each Keycloak pod
  • 2 GB memory request and 3 GB memory limit set for each Keycloak pod
  • OTEL disabled
  • 6 Keycloak pods
  • Legacy store
  • eu-west-1 AWS region used for both Keycloak ROSA cluster as well as Aurora PostgreSQL Cluster

Keycloak config:

  • Default password hashing algorithm and iteration count
  • 100,000 users and clients created in realm-0
  • 100,000 user sessions created before the load run

Load run config:

  • Extended load run with 200 users per second, running for 4,600 seconds
  • kc-chaos.sh script ran for about 1.2 hours, killing one pod at a time over 100 iterations
  • Logs collected at the end of the run
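For illustration, a minimal sketch of what such a chaos loop could look like. This is not the actual kc-chaos.sh; the pod label, namespace, and kill interval are assumptions, and the sketch defaults to a dry run so nothing is deleted unless DRY_RUN=0 is set explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical chaos loop: force-delete one random Keycloak pod per iteration.
ITERATIONS="${ITERATIONS:-100}"   # 100 kills, as in the run described above
INTERVAL="${INTERVAL:-45}"        # seconds between kills (~1.2 h total)
NAMESPACE="${NAMESPACE:-keycloak}"
DRY_RUN="${DRY_RUN:-1}"           # defaults to dry run; set to 0 for a real cluster

for i in $(seq 1 "$ITERATIONS"); do
  if [ "$DRY_RUN" = "1" ]; then
    echo "iteration $i: would force-delete one random Keycloak pod"
    continue
  fi
  # Pick one Keycloak pod at random (label selector is an assumption).
  pod=$(kubectl get pods -n "$NAMESPACE" -l app=keycloak \
        -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | shuf -n 1)
  echo "iteration $i: force-deleting pod $pod"
  # grace-period=0 --force terminates the pod immediately, simulating a crash.
  kubectl delete pod -n "$NAMESPACE" "$pod" --grace-period=0 --force
  sleep "$INTERVAL"
done
echo "completed $ITERATIONS iterations"
```

The forced deletion (rather than a graceful shutdown) is what exercises the initial state transfer on the replacement pod.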

Observations:

Gatling result: (screenshot in the original issue)

Pod recovery time: (screenshot in the original issue)

Check if metrics can be collected on healthy nodes while one node restarts (this had problems / timeouts in earlier versions)

We could collect metrics from all nodes during the restarts, and we didn't observe any data loss on the other pods when one pod in the cluster was terminated forcefully.
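As a rough illustration of such a check, not the actual tooling used in the run: the loop below scrapes each pod's metrics endpoint and counts failures. The pod hostnames, port 8080, and the /metrics path are assumptions; Keycloak only serves /metrics when started with --metrics-enabled=true.

```shell
#!/usr/bin/env bash
# Hypothetical metrics-availability check while one pod restarts.
PODS="${PODS:-keycloak-0 keycloak-1 keycloak-2}"   # assumed pod hostnames
SCRAPE="${SCRAPE:-curl -sf --max-time 2}"          # override (e.g. SCRAPE=true) to dry-run
failures=0
for pod in $PODS; do
  # A healthy node should answer the scrape within the timeout.
  if $SCRAPE "http://$pod:8080/metrics" > /dev/null 2>&1; then
    echo "scrape ok: $pod"
  else
    failures=$((failures + 1))
    echo "scrape failed: $pod"
  fi
done
echo "failed scrapes: $failures"
```

Running this repeatedly during a chaos kill would surface the scrape timeouts seen in earlier versions.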

To verify no new bug came up with the migration to Aurora

No issues observed here; performance was comparable to that of the internally hosted PostgreSQL pod.

kami619 commented Jul 28, 2023

@ahus1 Based on this information, I think we can close this ticket: both the Infinispan fixes and the externalized Aurora DB have been tested.


ahus1 commented Jul 28, 2023

Agreed. Thanks!
