Add Regression Indicators #186
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
zscore = regression.ZScoreIndicator(
    entry,
    lts_entries,
    settings_filter={"image": "tensorflow:2023.1"},
can you try to make this dynamic? `common.Matrix.settings["image"]` should contain all the available images
could you try to do the filtering outside of the `regression.ZScoreIndicator` function? 🤔 I think it would be clearer if you could pass only the relevant `lts_entries`
Sure, I'll add the filtering outside first. It should be as simple as removing the `settings_filter` option. What do you mean by dynamic, though? As in, we want to run a regression analysis for each unique "image"?
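For what it's worth, a minimal sketch of both ideas combined: filtering outside the indicator and looping over the available images. This assumes each entry exposes its settings as a dict and that `common.Matrix.settings["image"]` is an iterable of image names; none of this is the actual implementation.

```python
# Sketch only: filter the LTS entries outside of ZScoreIndicator and run one
# analysis per available image instead of hard-coding "tensorflow:2023.1".
for image in common.Matrix.settings["image"]:  # assumed: lists all available images
    relevant_lts_entries = [
        lts_entry for lts_entry in lts_entries
        if lts_entry.settings.get("image") == image  # assumed: settings is a dict
    ]
    zscore = regression.ZScoreIndicator(entry, relevant_lts_entries)
    results = zscore.analyze()
```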
    entry,
    lts_entries,
    settings_filter={"image": "tensorflow:2023.1"},
    combine_funcs={"notebook_performance_benchmark_time": np.mean}
can you use `notebook_performance_benchmark_min_time`, which already provides an "aggregated" value? I think `min()` is more relevant than `mean()`, but that's open to discussion. If we ever choose to use `mean`, then we should expose it as a KPI instead of `_min` (the idea is to expose as a KPI the value we're using for the regression analyses)
I should have explained this better. The `combine_funcs={"notebook_performance_benchmark_time": np.mean}` is just an optional way of handling KPIs that provide either a `list[int]` or a `list[float]`. The `combine_funcs` option can be omitted, and any KPIs that use a list of values will then be skipped. All the other KPIs, such as `notebook_performance_benchmark_min_time`, are analyzed as normal.
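For illustration, a minimal sketch of that behavior (not the actual indicator code; the helper name is made up):

```python
import numpy as np

def combine_kpi_value(kpi_name, value, combine_funcs=None):
    """Collapse list-valued KPIs with their combine function; pass scalars through."""
    if isinstance(value, list):
        if not combine_funcs or kpi_name not in combine_funcs:
            return None  # no combine function provided: this list-valued KPI is skipped
        return combine_funcs[kpi_name](value)
    return value

# Example: collapses the list of timings into a single mean value.
combine_kpi_value("notebook_performance_benchmark_time", [4.1, 3.9, 4.0],
                  {"notebook_performance_benchmark_time": np.mean})
```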
for lts_entry in common.LTS_Matrix.all_records(): pass
lts_entries_raw = list(common.LTS_Matrix.all_records())
# Temporarily skip the gathered results
lts_entries = list(filter(lambda entry: type(entry.results) is not list, lts_entries_raw))
can you try to take the first value of the gathered entries? In theory, all the entries in `entry.results` should contain the same result +/- noise, as they have been executed under the same conditions
Sure, I'll just do that. I previously tried flattening them, but I wasn't able to deepcopy the MatrixEntry objects.
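A minimal sketch of the suggested approach, assuming a gathered entry stores its sub-results as a list in `entry.results` (the helper name is hypothetical):

```python
# Sketch only: instead of dropping gathered entries, keep them and read their first
# sub-result, since all sub-results were produced under the same conditions.
def first_result(lts_entry):
    if isinstance(lts_entry.results, list):
        return lts_entry.results[0]
    return lts_entry.results

lts_entries = list(common.LTS_Matrix.all_records())
lts_results = [first_result(lts_entry) for lts_entry in lts_entries]
```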
logging.info(f"Received {common.LTS_Matrix.count_records()} historic LTS entries") | ||
for lts_entry in common.LTS_Matrix.all_records(): pass | ||
lts_entries_raw = list(common.LTS_Matrix.all_records()) |
you don't really need the `list()` here. Without it, `lts_entries_raw` is a generator that is processed (only once) at the line below. Removing `list()` saves memory and processing time, especially if the list gets large
Fair point!
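A sketch of the generator-based version the comment describes, assuming the entries are only iterated once further down:

```python
# all_records() stays a lazy generator; filter() wraps it lazily as well, so entries
# are only materialized when the regression analysis iterates over them.
lts_entries_raw = common.LTS_Matrix.all_records()
lts_entries = filter(lambda entry: not isinstance(entry.results, list), lts_entries_raw)
```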
/test rhoai-plot notebooks https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift-psap_topsail/184/pull-ci-openshift-psap-topsail-main-rhoai-light/1750426742431944704/artifacts/light/test/artifacts/002__sutest_notebooks__benchmark_performance/
@drewrip something went wrong with build (matrixbenchmarking commit wasn't available)
/test rhoai-plot notebooks https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift-psap_topsail/184/pull-ci-openshift-psap-topsail-main-rhoai-light/1750426742431944704/artifacts/light/test/artifacts/002__sutest_notebooks__benchmark_performance/
Looks like the analysis took place as expected. No regressions were found, but there were a number of issues regarding payload validation. Obviously we need to make sure the payload schemas are consistent, but I'm wondering why the matrix entry location was
good :)
We'll need to find a way to save the
it should appear in the HTML document
/test rhoai-plot notebooks https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift-psap_topsail/184/pull-ci-openshift-psap-topsail-main-rhoai-light/1750426742431944704/artifacts/light/test/artifacts/002__sutest_notebooks__benchmark_performance/
regression += [html.H2("regression analysis")]
regression += [html.Code(yaml.dump(entry.results.lts.regression_results, default_flow_style=False), style={"white-space": "pre-wrap"})]
Suggested change:
- regression += [html.H2("regression analysis")]
- regression += [html.Code(yaml.dump(entry.results.lts.regression_results, default_flow_style=False), style={"white-space": "pre-wrap"})]
+ regression += [html.H1("regression analysis")]
+ regression += [html.Code(yaml.dump(lts.regression_results, default_flow_style=False), style={"white-space": "pre-wrap"})]
this adjusts the title size, to make it top level, and takes the `regression_results` field from `lts`, which is the focus of this file
@drewrip, the regression results weren't included (or were empty) in this document, and we'll need to define a model describing the format of the regression results
    regression: NotebookPerformanceRegression

NotebookPerformanceKPIs = matbench_models.getKPIsModel("NotebookPerformanceKPIs", __name__, kpi.KPIs, kpi.NotebookPerformanceKPI)
NotebookPerformanceRegression = matbench_models.RegressionResult
this isn't necessary; if you don't modify `matbench_models.RegressionResult`, there's no need to rename it here
class Payload(matbench_models.ExclusiveModel):
    schema_name: matbench_models.create_schema_field("rhods-notebooks-perf") = Field(alias="$schema")
    metadata: Metadata
    results: Results
    kpis: Optional[NotebookPerformanceKPIs]
    regression_results: Optional[Any]
can you keep this here, in the `Payload`? In MatrixBenchmarking, overall, the `results` are workload specific, but the payload/metadata/kpis have a predefined structure (that can be extended, but the core is always the same), so the regression results belong there. (I need to browse these entries to upload them as KPIs)
Gotcha, I'll move them under the `Payload`.
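A hedged sketch of what the `Payload` could look like once a dedicated model exists for the regression results. The list-of-`RegressionResult` typing is an assumption (and `List` would need importing from `typing`), not the final format:

```python
class Payload(matbench_models.ExclusiveModel):
    schema_name: matbench_models.create_schema_field("rhods-notebooks-perf") = Field(alias="$schema")
    metadata: Metadata
    results: Results
    kpis: Optional[NotebookPerformanceKPIs]
    # assumption: one RegressionResult per setting that was checked
    regression_results: Optional[List[matbench_models.RegressionResult]]
```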
)
zscore_ind = zscore.ZScoreIndicator(entry, controlled_lts_entries)
regression_results[check_setting] = zscore_ind.analyze()
number_of_failures += sum(map(lambda x: x["regression"]["status"], regression_results[check_setting]))
I think in Python, list comprehensions are preferred over functional style: `sum(x["regression"]["status"] for x in regression_results[check_setting])`
and it would be nice to use the model here, instead of `x["regression"]["status"]` --> `x.regression.status` (the model or `types.SimpleNamespace`)
and, I'm not sure you can `sum` the statuses; if you have `[-1, 1]`, you end up with 0 failures, no?
I need to start moving from functional to Pythonic with `dict` and `list` comprehensions, I'll convert that.

> and, I'm not sure you can sum the statuses, if you have [-1, 1], you end up with 0 failures, no?

The way it's currently implemented, this is fine. The statuses are really just boolean flags that are set to 1 if some anomaly is detected. A simple `sum` wouldn't be possible if we wanted a more complex status, but more detail about what the regression looks like, and why it is being reported, is contained in the `RegressionResult` structure. If for some reason we want to roll up that info into a more complex "return code" status, then the sum likely wouldn't work, but I suppose we'd want to discuss how we want to report regressions.
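A sketch combining both suggestions, assuming each raw result dict is wrapped in a model or `types.SimpleNamespace` so the status becomes attribute access:

```python
import types

# Hypothetical: wrap the raw dicts so the status can be read as x.regression.status.
results = [
    types.SimpleNamespace(regression=types.SimpleNamespace(**x["regression"]))
    for x in regression_results[check_setting]
]
number_of_failures += sum(r.regression.status for r in results)
```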
logging.info(f"Received {common.Matrix.count_records()} new entries") | ||
|
||
number_of_failures = 0 | ||
settings_to_check = ["version", "repeat"] |
I'm not sure I understand exactly what's being checked here, i.e. how you build the reference data used for the regression checking. We'll need to clarify that; it's important to have that clear
also, I don't see clearly where you define the KPI being used for regression. Do you test all of them? 🤔
They won't necessarily have the same direction "of failure", e.g. a `latency` going higher is a failure, but a `bandwidth` going lower is a failure, so you need to define the regression properties for each of them individually, I guess
> I'm not sure I understand exactly what's being checked here, i.e. how you build the reference data used for the regression checking. We'll need to clarify that; it's important to have that clear

I can try to make that clear, or we can add a note once we finalize it. At the moment, it considers all of the LTS datapoints with the same settings (minus the image version) to be the historical data. `settings_to_check` is just the list of settings that we want to check for regression against; we iterate over them to perform the regression testing against each (see the sketch below).

> also, I don't see clearly where you define the KPI being used for regression. Do you test all of them? 🤔
> they won't necessarily have the same direction "of failure", e.g. a latency going higher is a failure, but a bandwidth going lower is a failure, so you need to define the regression properties for each of them individually, I guess

Each indicator has an optional parameter called `kpi`. There you can specify a list of KPIs that you'd like to check. I suppose this difference in direction could be implemented in each `RegressionIndicator` we create. However, in the basic `ZScoreIndicator` there is no distinction made, for two reasons in this simple instance:
- I figured statistically significant variation in either direction was notable. Having a new image version run significantly faster seems like it would be suspicious? It wouldn't necessarily be wrong or bad, but it would be good to make sure we know about it.
- We also track the direction within the `RegressionResults`, so we can still recover that information if we'd like by filtering out the entries that exhibit regression in the correct direction.
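To make the reference-data construction concrete, a rough sketch of the idea described above (the helper name is hypothetical, and it assumes settings are plain dicts):

```python
# Sketch only: the historical reference set is every LTS entry whose settings
# match the new entry's settings, ignoring the image version.
def same_settings_except_image(new_settings, lts_settings, ignored=("image",)):
    keys = (set(new_settings) | set(lts_settings)) - set(ignored)
    return all(new_settings.get(key) == lts_settings.get(key) for key in keys)

controlled_lts_entries = [
    lts_entry for lts_entry in lts_entries
    if same_settings_except_image(entry.settings, lts_entry.settings)
]
```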
> Having a new image version run significantly faster seems like it would be suspicious? It wouldn't necessarily be wrong or bad, but it would be good to make sure we know about it.

This is linked to the *must remain stable* vs *should remain stable* distinction that I mentioned here and there.
- If *must remain stable*, any positive or negative variation should be flagged. The binary is the same (the hardware also), so the performance shouldn't change significantly.
- If *should remain stable*, the binary will change, like version bumps, so the new version can be faster without needing to raise a red flag for it. But it can be a flashy green flag :D
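A possible way an indicator could encode that distinction (purely illustrative; the flag name and threshold are made up):

```python
def flag_deviation(zscore, must_remain_stable=True, threshold=3.0):
    """Return True when a deviation should be flagged as a regression."""
    if must_remain_stable:
        # same binary, same hardware: any significant deviation is suspicious
        return abs(zscore) > threshold
    # should remain stable: only flag significant changes in the "bad" direction
    # (assuming here that a higher value is worse, e.g. a latency-style KPI)
    return zscore > threshold
```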
2f6c1a9 to b527dfa
/test rhoai-light kserve quick
/test rhoai-plot notebooks https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift-psap_topsail/195/pull-ci-openshift-psap-topsail-main-rhoai-light/1753099883524395008/artifacts/light/test/artifacts/002__sutest_notebooks__benchmark_performance/
The regression workflow is working now -> the results are appearing in the LTS Documentation report. We needed another
/test rhoai-plot notebooks https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift-psap_topsail/195/pull-ci-openshift-psap-topsail-main-rhoai-light/1753099883524395008/artifacts/light/test/artifacts/002__sutest_notebooks__benchmark_performance/
An example of the regression results can be seen here
This sets up the notebooks performance test to create and use a RegressionIndicator from this matrix-benchmarking PR.
The idea is that we can create a number of different indicators from the base RegressionIndicator, which provides a generic and flexible result that we can save and process upstream. I created a simple ZScoreIndicator to demonstrate this: it triggers when the new payload's KPI values are more than 3 standard deviations away from the mean value for that KPI among the LTS results. Hopefully we can use this to create a simple wrapper for datastax/Hunter.
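As a rough illustration of that z-score rule (a sketch, not the actual indicator from the matrix-benchmarking PR):

```python
import statistics

def is_zscore_regression(new_value, historical_values, threshold=3.0):
    """Flag the new KPI value if it lies more than `threshold` standard deviations
    away from the mean of the historical LTS values."""
    mean = statistics.mean(historical_values)
    stdev = statistics.stdev(historical_values)  # needs at least two historical points
    if stdev == 0:
        return new_value != mean
    return abs(new_value - mean) / stdev > threshold
```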
There are still two work-in-progress issues to resolve with this PR:
- Missing Settings: Some of the settings for an individual MatrixEntry are only present in the Matrix itself. In our testing payloads, an example would be `version`. I'm not sure why these settings aren't present in the MatrixEntry itself, but as currently written we can only restrict on the settings that are present in the MatrixEntry.
- Gathered Results: As discussed, some of the LTS payloads have "gathered" results. I wasn't sure how to turn this off and couldn't flatten the `Payload`s since I couldn't deepcopy them.