Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SearchBackPressure Service Node/Cluster RCA #437

Merged
merged 40 commits into from
Jul 20, 2023

Conversation

CoderJeffrey
Copy link
Contributor

@CoderJeffrey CoderJeffrey commented Jul 8, 2023

Is your feature request related to a problem? Please provide an existing Issue # , or describe.
After Reader Component of RCA digest Search Back Pressure service related stats and populate the stats in sqlitedb, we need Cluster/Node level RCA to analyze the stats and send node/cluster level Resource Flow Units to deciders. Then, deciders can use the resource flow units to suggest actions for downstream classes.

Describe the solution you are proposing
Create Cluster/Node Level RCA to generate Resource FlowUnit by comparing stats of SearchBackPressure Service and threshold, and added corresponding UTs.

Search Back Pressure RCA analyze the stats of Searchbp_Stats Table in sqlitedb from previous PR of Reader Component for RCA (#427). Example stats (e.g heap_cancellation_count with threshold, generate healthy/unhealthy resource flow units) are used to determine the health status of a node.

Describe alternatives you've considered
N/A

Additional context
The logic to decide whether the node level rca healthiness is not decided yet, we will talk with other teams (e.g. SearchBackPressure team) to discuss the final threshold for RCA, and the logic will likely change in near future.

Check List

  • New functionality includes testing.
  • All tests pass
  • New functionality has been documented.
  • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…e level (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>
build.gradle Outdated
@@ -371,7 +371,7 @@ dependencies {
implementation("org.mockito:mockito-core") {
version {
strictly "2.23.0"
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: avoid trailing space

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed


// cancellationCount due to heap is less than 30% of all task cancellations in task level
public static final int DEFAULT_TASK_MIN_HEAP_CANCELLATION_THRESHOLD = 30;
private Integer minTaskHeapCancellationPercentageThreshold;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will the values of these thresholds be modified during runtime? If not we can make them final as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, these values like minTaskHeapCancellationPercentageThreshold will read values from config file in runtime thus we cannot mark them as final

minHeapDecreasePercentageThreshold = conf.readRcaConfig( CONFIG_NAME, RCA_CONF_KEY_CONSTANTS.MAX_HEAP_USAGE_DECREASE_FIELD, DEFAULT_MIN_HEAP_DECREASE_THRESHOLD, (s) -> s >= 0 && s <= 100, Integer.class);

}

// name for the configuration field
public static class RCA_CONF_KEY_CONSTANTS {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to use enum instead of public static class here to define a list of related constants.
In general, enums offer better readability, type safety, and if we need to iterating over those constants we can simply use the values() method. This is more straightforward than manually iterating over constants in a public static final class.
you can do something like:

    public enum RCA_CONF_KEY_CONSTANTS {
        MAX_HEAP_USAGE_INCREASE_FIELD("max-heap-usage-increase"),
        MAX_SHARD_HEAP_CANCELLATION_PERCENTAGE_FIELD("max-shard-heap-cancellation-percentage");
        // ...

        private final String value;
        RCA_CONF_KEY_CONSTANTS(final String value) {
            this.value = value;
        }
        @Override
        public String toString() {
            return value;
        }
    }

Copy link
Contributor Author

@CoderJeffrey CoderJeffrey Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for the suggestion. I have changed the design from static class to Enum.

RcaConsts.RcaTagConstants.TAG_AGGREGATE_UPSTREAM,
RcaConsts.RcaTagConstants.LOCUS_DATA_NODE);

// To Do SearchBackPressure Decider
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use TODO instead of To Do for get better IDE support

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -250,4 +250,61 @@ public double readMin() {
return Double.NaN;
}
}

/**
* Sliding window to check the max/min olg gen usage within a given time frame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: olg -> old

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


@Override
public void next(SlidingWindowData e) {
boolean pollFirstCondition;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable pollFirstCondition is not used. We can remove it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

}

/*
* operate() is used for local build
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

operate() is used for local build

is this related to this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the operate() line is a dead comment for that function, removed.


updateAllSlidingWindows(searchBackPressureRCAMetric, currentTimeMillis);

LOG.info("SearchBackPressureRCA currentIterationNumber is {}", currentIterationNumber);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can change some of the LOG.info messages to LOG.debug. LOG.info should only contains essential information to store in the service log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change unnecessary log message level to debug

@khushbr khushbr self-requested a review July 20, 2023 22:05
@ansjcy ansjcy merged commit 2b970e3 into opensearch-project:main Jul 20, 2023
4 of 7 checks passed
@khushbr khushbr added v2.11.0 Issues targeting release v2.11.0 backport 2.x labels Oct 5, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 5, 2023
* Remove log files and add DCO (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Remove extra files (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Remove styling difference (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Remove unnecessary file changes (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add RCA_Decider (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Extract Heap Usage from SQlitedb (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Extract required searchbp metrics for deciders (signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add SearchBackPressureRCA Metric (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Use SearchBackPressureRCAMetrics to aggregate metrics (signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add the conf file extracted part for SearchBackPressureRcaConfig.java (signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add MinMaxSlidingWindow in OldGenRca (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Rename SearchBackPressureClusterRCA and add it to AnalysisGraph (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add basic UTs for SearchBackPressureRCA cluster/node level (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add unhealthy/healthy stats UTs for SearchBackPressureRCA cluster/node level (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add healthy resource unit UT (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add UT s both shard/task level (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add a new SearchBp Resource Unit (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add UTs to test shard/task level resource include-ness (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Remove styling changes for Version.java (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add metadata to resourceSummary (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Update to more general framework (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Refactor the MinMaxSlidingWindow and bug fix (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Refactor Heap Stats Metrics Getter(Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Refactor HeapUsed and HeapMax Getters (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Refactor operate() (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Refactor operate() and remove dead comments (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Merged Main (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Merged Main (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* remove trailing space in build.gradle (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* nit javadoc update (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* nit javadoc updates (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* nit javadoc updates (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Remove dead comments  (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* update javadoc  (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* LOG Level Change  (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Change from static class to enum (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

---------

Signed-off-by: CoderJeffrey <[email protected]>
(cherry picked from commit 2b970e3)
khushbr pushed a commit that referenced this pull request Oct 5, 2023
* Remove log files and add DCO (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Remove extra files (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Remove styling difference (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Remove unnecessary file changes (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add RCA_Decider (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Extract Heap Usage from SQlitedb (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Extract required searchbp metrics for deciders (signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add SearchBackPressureRCA Metric (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Use SearchBackPressureRCAMetrics to aggregate metrics (signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add the conf file extracted part for SearchBackPressureRcaConfig.java (signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add MinMaxSlidingWindow in OldGenRca (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Rename SearchBackPressureClusterRCA and add it to AnalysisGraph (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add basic UTs for SearchBackPressureRCA cluster/node level (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add unhealthy/healthy stats UTs for SearchBackPressureRCA cluster/node level (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add healthy resource unit UT (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add UT s both shard/task level (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add a new SearchBp Resource Unit (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add UTs to test shard/task level resource include-ness (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Remove styling changes for Version.java (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Add metadata to resourceSummary (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Update to more general framework (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Refactor the MinMaxSlidingWindow and bug fix (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Refactor Heap Stats Metrics Getter(Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Refactor HeapUsed and HeapMax Getters (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Refactor operate() (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Refactor operate() and remove dead comments (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Merged Main (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Merged Main (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* remove trailing space in build.gradle (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* nit javadoc update (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* nit javadoc updates (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* nit javadoc updates (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Remove dead comments  (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* update javadoc  (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* LOG Level Change  (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

* Change from static class to enum (Signed-off-by: Jeffrey Liu [email protected])

Signed-off-by: CoderJeffrey <[email protected]>

---------

Signed-off-by: CoderJeffrey <[email protected]>
(cherry picked from commit 2b970e3)

Co-authored-by: Jeffrey Liu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x v2.11.0 Issues targeting release v2.11.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants