Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SearchBackPressure Policy/Decider Generic Framework #2

Open
wants to merge 77 commits into
base: main
Choose a base branch
from

Conversation

CoderJeffrey
Copy link
Owner

@CoderJeffrey CoderJeffrey commented Jul 13, 2023

Is your feature request related to a problem? Please provide an existing Issue # , or describe.

Previous work related to this PR:
For Search Back Pressure Service, previous PR (opensearch-project/performance-analyzer#483) (opensearch-project#427) (opensearch-project#437) have established a pipeline using Collector to read SearchBackPressure Cancellation Stats, using Reader to consume the metrics and populate to the sqlitedb, using Cluster/Node level RCA to consume stats in sqlitedb and compare stats with thresholds to generate Healthy/Unhealthy Flow Unit.

Problem:
We nee a Policy/Decider framework to suggest actions based on cluster level RCAs to tune the thresholds for SearchBackPressure Service.

Describe the solution you are proposing
This PR added the Policy/Decider Framework that consumed the cluster level RCA and suggestion actions to downstream classes (e.g. Users).

Task Breakdown:

  • Policy Class to determine when we need to suggest actions to user to auto-tune the thresholds for Search Back Pressure Service.
  • Decider Class to consume the cluster-level RCA and use the Policy to send the actions to downstream classes.
  • PolicyConfig file to read key thresholds/settings from Config files to improve extensibility for SearchBackPressureDecider.
  • (Generic) Action Class for actions that will be sent to downstream classes to suggest user to tune the thresholds.

Describe alternatives you've considered
N/A

Additional context
Currently, the framework only suggest actions to tune heap related thresholds for Search Back Pressure Service.
Final policy logic to determine when to send out actions to users have not been decided yet. This PR mainly offers an extensible Decider/Policy framework.

Check List

  • New functionality includes testing.
  • All tests pass
  • New functionality has been documented.
  • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…e level (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>
}

public static class Constants {
public static final String INCREASE_STR = "increase";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think keeping the var name as INCREASE should be sufficient.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to INCREASE

* serch-back-pressure-policy should be enabled "hour-breach-threshold": 30, // threshold for hourly
* received unhealthy cluster level rca flow units, if above, then the below thresholds should be
* modified "threshold_count": 2, // how many thresholds to be changed, in this case
* search-heap-threshold and search-task-heap-threshold "search_task_heap_stepsize_in_percentage":

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need these percentages individually for all the heap related settings. We can keep a single value which implies the percentage increase in all of the heap related SBP settings.

Copy link
Owner Author

@CoderJeffrey CoderJeffrey Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    private static final String SEARCHBP_HEAP_STEPSIZE_IN_PERCENTAGE =
            "searchbp-heap-stepsize-in-percentage";

is now the value to imply the percentage increase in all of heap related SBP settings. Comments (javadoc) alsol changed accordingly.

"hour-monitor-window-size-minutes";
private static final String HOUR_MONITOR_BUCKET_SIZE_MINUTES =
"hour-monitor-bucket-size-minutes";
private static final String SEARCHBP_HEAP_STEPSIZE_IN_PERCENTAGE =

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be redundant.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed and now getMonitorWindowSizeMinutes() and getHourMonitorBucketSizeMinutes() return default values.

public int getHourMonitorWindowSizeMinutes() {
        return DEFAULT_HOUR_MONITOR_WINDOW_SIZE_MINUTES;
    }

    public int getHourMonitorBucketSizeMinutes() {
        return DEFAULT_HOUR_MONITOR_BUCKET_SIZE_MINUTES;
    }

SearchBackPressurePolicy searchBackPressurePolicy;

private int currentIteration = 0;
private SearchBackPressureClusterRCA searchBackPressureClusterRCA;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This information is redundant since we are not using it anywhere, I would suggest to remove this member variable from the class.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The currentIteration makes sure new decision will be emitted only per period

private void record(HotResourceSummary issue) {
LOG.info("SearchBackPressurePolicy#record()");
if (HEAP_SEARCHBP_SHARD_SIGNALS.contains(issue.getResource())) {
LOG.info("Recording shard-level issue");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can refactor the nested if blocks into meaningful methods here. e,g; recordShardTaskIssue

}

initialize();
LOG.info(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we logging this here ?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

private long coolOffPeriodInMillis;
private String thresholdName;
private String dimension;
private String direction;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an enum to limit the possible values for this var.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean an enum like this? but in the constructor, the field should still be a String?

public enum ThresholdName {
     HEAP_THRESHOLD("searchbp-heap-threshold"),
     // ...

     private final String value;
     ThresholdName(final String value) {
         this.value = value;
     }
     @Override
     public String toString() {
         return value;
     }
 }


recordIssues();

if (shardHeapThresholdIsTooSmall()) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shard and SearchTask related breach should be done separately to avoid a potential bug when both of them needs to be adjusted.

@Override
public void next(SlidingWindowData e) {
boolean pollFirstCondition;
if (isMinSlidingWindow) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic can be refactored into a lambda function and have that lambda as a member variable.

* threshold for search back pressure service stats
* currently, only consider the percentage of JVM Usage cancellation count compared to the total cancellation count
*/
this.heapShardCancellationIncreaseMaxThreshold =

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets discuss about these

Signed-off-by: CoderJeffrey <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants