[release-4.16] OCPBUGS-36489: parallelize member health checks #1290

Commits on Jul 3, 2024

  1. OCPBUGS-36301: parallelize member health checks

    https://issues.redhat.com/browse/OCPBUGS-36301
    
    Currently,
    member health is checked in serial with a 30s timeout per member.
    3 out of 4 GetMemberHealth callers had their own default 30s timeout as well for the entire process.
    Because of this,
    a slow check on one member could exhaust the timeout for the entire GetMemberHealth function,
    and thus cause later-checked members
    to report as unhealthy even though they were fine.
    
    With this commit,
    I am dropping the internal 30s timeout from GetMemberHealth,
    and instead letting the caller set the timeout.
    Also, the code now checks the health of all members in parallel.
    This will prevent a single slow member
    from affecting the health reporting of other members.
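
A minimal sketch of the pattern described above, not the code from this commit: every member is probed in its own goroutine, and the only deadline is the one carried by the caller's context. The names `member`, `healthCheck`, `probe`, `getMemberHealth`, and `checkDemoCluster` are hypothetical stand-ins, and the probe only simulates latency instead of talking to etcd.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// member and healthCheck are hypothetical types for illustration only;
// they are not the operator's real types.
type member struct {
	Name  string
	Delay time.Duration // simulated response latency
}

type healthCheck struct {
	Member  member
	Healthy bool
	Err     error
}

// probe simulates one health request and returns early if ctx is done.
func probe(ctx context.Context, m member) error {
	select {
	case <-time.After(m.Delay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// getMemberHealth probes all members concurrently. It sets no timeout of
// its own; the caller's ctx bounds the whole operation.
func getMemberHealth(ctx context.Context, members []member) []healthCheck {
	results := make([]healthCheck, len(members))
	var wg sync.WaitGroup
	for i, m := range members {
		wg.Add(1)
		go func(i int, m member) {
			defer wg.Done()
			err := probe(ctx, m)
			// Each goroutine writes only its own slot, so no lock is needed.
			results[i] = healthCheck{Member: m, Healthy: err == nil, Err: err}
		}(i, m)
	}
	wg.Wait()
	return results
}

// checkDemoCluster plays the role of a caller that picks the timeout.
// The 200ms value is arbitrary and chosen to keep the demo fast.
func checkDemoCluster() []healthCheck {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	members := []member{
		{Name: "etcd-0", Delay: 10 * time.Millisecond},
		{Name: "etcd-1", Delay: 2 * time.Second}, // slow member
		{Name: "etcd-2", Delay: 10 * time.Millisecond},
	}
	return getMemberHealth(ctx, members)
}

func main() {
	for _, r := range checkDemoCluster() {
		fmt.Printf("%s healthy=%t err=%v\n", r.Member.Name, r.Healthy, r.Err)
	}
}
```

In this sketch the slow member hits the caller's deadline and is reported unhealthy, while the two fast members finish well inside it. With a serial loop, the member checked after the slow one would find the shared deadline already expired.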
    
    I also added a timeout to the context used in IsMemberHealthy,
    which calls GetMemberHealth.
    Neither Trevor nor I was sure why a default timeout wasn't present there,
    though one was present at all other call sites.
    AlexVulaj authored and hasbro17 committed Jul 3, 2024
    94d3821