-
Notifications
You must be signed in to change notification settings - Fork 1.5k
[dbm] add optional execution_indicator to rule out false positive query metrics move #20037
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
datadog_checks_base/datadog_checks/base/utils/db/statement_metrics.py
Outdated
Show resolved
Hide resolved
datadog_checks_base/datadog_checks/base/utils/db/statement_metrics.py
Outdated
Show resolved
Hide resolved
# 3. Execution indicators: If execution_indicators is specified, only consider a query as changed if at | ||
# least one of the execution indicator metrics has changed. This helps filter out cases where an old or | ||
# less frequently executed normalized query was evicted due to the stats table being full, and then | ||
# re-inserted to the stats table with a small call count and slight duration change. In this case, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we detect the case when the query was re-inserted as opposed to the case where the metrics increase, but executions did not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that was already taken care of. any negative metrics diff are discarded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
integrations-core/datadog_checks_base/datadog_checks/base/utils/db/statement_metrics.py
Lines 93 to 98 in fba1d66
# Check for negative values, but only in the columns used for metrics | |
if any(diffed_row[k] < 0 for k in metric_columns): | |
# A "break" might be expected here instead of "continue," but there are cases where a subset of rows | |
# are removed. To avoid situations where all results are discarded every check run, we err on the side | |
# of potentially including truncated rows that exceed previous run counts. | |
continue |
# If execution_indicators is specified, check if any of the execution indicator metrics have changed | ||
if execution_indicators: | ||
indicator_columns = execution_indicators & metric_columns | ||
if not any(diffed_row[k] > 0 for k in indicator_columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this fully solves the problem. Example execution order:
queryid: -12345
count: 1
duration: 1500
-->
query eviction for -12345 occurs
-->
queryid: -12345
count: 3
duration: 4000
In this case, the count diff is greater than 0 even though an eviction has occurred, so the metric will be reported, but it will have incorrect values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we chat about this offline. when the count diff is greater than 0 after a normalized query eviction and re-insertion, we have no reliable to tell if a previous eviction is happened. in this case, the metrics for this normalized will be incorrect (mostly inflation) for the first check run after eviction.
What does this PR do?
This PR added execution indicators to
StatementMetrics
.execution_indicators
parameter toStatementMetrics.compute_derivative_rows
execution_indicators
is specified, only consider a query as executed if at least one of the specified metrics has changed.execution_indicators
behaves like the original implementationFor example, in PostgreSQL we can use 'calls' as the execution indicator
Motivation
https://datadoghq.atlassian.net/browse/DBMON-4170
In database query metrics monitoring, we've observed cases where duration metrics (like
total_time
) change slightly between check runs while execution counts remain the same. These small changes are often due to the prior old or less frequently used normalized query being evicted from the stats table (i.e. pg_stat_statements) then re-inserted with the same call count (usually 1) and slightly different duration. In this case, the newly inserted normalized query should be treated as the baseline metric for future diffs.Below is an example of a normalized query being evicted and re-inserted.
The query has same
queryid:-6478076666730767487
because they are structurally the same. We can tell it's an eviction + re-insertion by the differenttraceparent
value in the comment. When this happens, we will see a call count = 0 but slightly different execution duration change due to the diff between1015.0188
ms (after) and1014.6086999999999
ms (before).Review checklist (to be filled by reviewers)
qa/skip-qa
label if the PR doesn't need to be tested during QA.backport/<branch-name>
label to the PR and it will automatically open a backport PR once this one is merged