[dbm] add optional execution_indicator to rule out false positive query metrics move #20037

lu-zhengda · 2025-04-08T19:21:28Z

What does this PR do?

This PR adds execution indicators to StatementMetrics.

Adds an optional execution_indicators parameter to StatementMetrics.compute_derivative_rows.
When execution_indicators is specified, a query is considered executed only if at least one of the specified metrics has increased.
Backward compatibility: an empty execution_indicators behaves like the original implementation.

For example, in PostgreSQL we can use 'calls' as the execution indicator:

statement_metrics.compute_derivative_rows(
    rows=rows,
    metrics=['calls', 'total_time', 'rows'],
    key=key_func,
    execution_indicators=['calls']  # Only consider queries as executed if call count changed
)
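
To make the rule above concrete, here is a minimal, simplified sketch of the filtering. It is not the StatementMetrics implementation: the function name derive_rows, the in-memory previous dict, and the "drop rows with no change" fallback are assumptions made for illustration only.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Optional


def derive_rows(
    previous: Dict[Hashable, dict],
    rows: Iterable[dict],
    metrics: List[str],
    key: Callable[[dict], Hashable],
    execution_indicators: Optional[List[str]] = None,
) -> List[dict]:
    """Return per-row metric deltas against `previous` (hypothetical helper)."""
    # Only indicators that are also tracked metrics are meaningful.
    indicators = set(execution_indicators or []) & set(metrics)
    result = []
    for row in rows:
        prev = previous.get(key(row))
        if prev is None:
            continue  # first sighting: baseline only, nothing to diff
        diffed = dict(row)
        for m in metrics:
            diffed[m] = row[m] - prev[m]
        if any(diffed[m] < 0 for m in metrics):
            continue  # a counter went backwards; skip this row
        if indicators:
            # Count the query as executed only if an indicator increased.
            if not any(diffed[m] > 0 for m in indicators):
                continue
        elif all(diffed[m] == 0 for m in metrics):
            continue  # assumed fallback: drop rows with no change at all
        result.append(diffed)
    return result


# Duration moved slightly, call count did not.
previous = {"q1": {"id": "q1", "calls": 1, "total_time": 1014.6}}
rows = [{"id": "q1", "calls": 1, "total_time": 1015.0}]
without_indicator = derive_rows(previous, rows, ["calls", "total_time"], key=lambda r: r["id"])
with_indicator = derive_rows(
    previous, rows, ["calls", "total_time"], key=lambda r: r["id"], execution_indicators=["calls"]
)
```

In this sketch the row is kept when no indicator is given and dropped when 'calls' is the indicator, which is the false positive this PR targets.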

Motivation

https://datadoghq.atlassian.net/browse/DBMON-4170
In database query metrics monitoring, we've observed cases where duration metrics (like total_time) change slightly between check runs while execution counts remain the same. These small changes usually occur when an old or infrequently used normalized query is evicted from the stats table (e.g. pg_stat_statements) and then re-inserted with the same call count (usually 1) and a slightly different duration. In this case, the newly inserted row should be treated as the baseline for future diffs rather than reported as an execution.
Below is an example of a normalized query being evicted and re-inserted.

postgres | 16672 | -6478076666730767487 | 1 | 1014.6086999999999 | /*dddbs='orders-app',ddps='orders-app',traceparent='00-00000000000000005d46df6263f055fa-5d46df6263f055fa-00'*/ INSERT INTO inventory_views (view_date) SELECT generate_series($1::timestamp, $2::timestamp, $3::interval)

postgres | 16672 | -6478076666730767487 | 1 | 1015.0188 | /*dddbs='orders-app',ddps='orders-app',traceparent='00-0000000000000000788bd2d62ee202f9-788bd2d62ee202f9-00'*/ INSERT INTO inventory_views (view_date) SELECT generate_series($1::timestamp, $2::timestamp, $3::interval)

Both rows have the same queryid (-6478076666730767487) because the statements are structurally identical. The different traceparent values in the leading comment show that this is an eviction followed by a re-insertion. When this happens, the call count diff is 0, but the duration diff is small and nonzero: 1015.0188 ms (after) minus 1014.6086999999999 ms (before), about 0.41 ms.
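
The arithmetic for the two rows above can be checked with a few lines of Python. The dict keys calls and total_time are labels chosen here for the two numeric columns in the output; they are an assumption, since the column headers are not shown.

```python
import math

# Values copied from the two rows above (before and after re-insertion).
before = {"calls": 1, "total_time": 1014.6086999999999}
after = {"calls": 1, "total_time": 1015.0188}

# Per-metric difference between consecutive check runs.
diff = {name: after[name] - before[name] for name in before}

# With 'calls' as the execution indicator, the row counts as executed
# only when the call count increased.
executed = diff["calls"] > 0

print(diff["calls"], round(diff["total_time"], 4), executed)
```

The call count diff is zero while the duration diff is roughly 0.41 ms, so an indicator on 'calls' treats the row as not executed.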

Review checklist (to be filled by reviewers)

Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

codecov · 2025-04-08T19:26:49Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.74%. Comparing base (183a411) to head (252713b).
Report is 28 commits behind head on master.

Additional details and impacted files

Flag	Coverage Δ
active_directory	`93.18% <ø> (ø)`
activemq	`52.80% <ø> (ø)`
activemq_xml	`82.20% <ø> (ø)`
amazon_msk	`89.50% <ø> (ø)`
ambari	`85.75% <ø> (ø)`
apache	`95.08% <ø> (ø)`
appgate_sdp	`93.93% <ø> (ø)`
arangodb	`98.23% <ø> (ø)`
argo_rollouts	`90.00% <ø> (ø)`
argo_workflows	`89.61% <ø> (ø)`
argocd	`87.23% <ø> (ø)`
aspdotnet	`100.00% <ø> (ø)`
avi_vantage	`93.83% <ø> (ø)`
aws_neuron	`92.42% <ø> (ø)`
azure_iot_edge	`82.08% <ø> (ø)`
boundary	`100.00% <ø> (ø)`
btrfs	`83.33% <ø> (ø)`
cacti	`87.90% <ø> (ø)`
calico	`84.61% <ø> (ø)`
cassandra	`66.66% <ø> (ø)`
celery	`95.45% <ø> (ø)`
cert_manager	`77.41% <ø> (ø)`
cisco_aci	`89.52% <ø> (ø)`
citrix_hypervisor	`87.45% <ø> (ø)`
cloud_foundry_api	`96.11% <ø> (ø)`
cloudera	`99.51% <ø> (ø)`
cockroachdb	`92.98% <ø> (ø)`
consul	`91.92% <ø> (ø)`
coredns	`95.65% <ø> (ø)`
crio	`89.79% <ø> (ø)`
datadog_checks_base	`89.07% <100.00%> (-0.05%)`	⬇️
datadog_checks_dev	`77.58% <ø> (ø)`
datadog_checks_downloader	`81.37% <ø> (+3.22%)`	⬆️
datadog_cluster_agent	`90.19% <ø> (ø)`
dcgm	`93.54% <ø> (ø)`
ddev	`87.17% <ø> (+0.01%)`	⬆️
directory	`96.88% <ø> (ø)`
disk	`85.68% <ø> (ø)`
dns_check	`93.84% <ø> (ø)`
druid	`97.70% <ø> (ø)`
duckdb	`84.53% <ø> (ø)`
ecs_fargate	`83.71% <ø> (ø)`
eks_fargate	`94.05% <ø> (ø)`
esxi	`93.98% <ø> (+0.05%)`	⬆️
etcd	`95.56% <ø> (ø)`
external_dns	`89.28% <ø> (ø)`
fluentd	`84.21% <ø> (ø)`
fluxcd	`88.31% <ø> (ø)`
fly_io	`97.13% <ø> (ø)`
foundationdb	`82.64% <ø> (ø)`
gitlab_runner	`92.76% <ø> (ø)`
glusterfs	`80.00% <ø> (ø)`
go_expvar	`92.66% <ø> (ø)`
gunicorn	`92.91% <ø> (+0.74%)`	⬆️
hazelcast	`92.30% <ø> (ø)`
hdfs_datanode	`89.63% <ø> (ø)`
hdfs_namenode	`86.60% <ø> (ø)`
hive	`51.42% <ø> (ø)`
hivemq	`61.90% <ø> (ø)`
http_check	`94.26% <ø> (ø)`
hudi	`73.91% <ø> (?)`
ibm_db2	`86.29% <ø> (ø)`
ibm_i	`82.36% <ø> (ø)`
ignite	`46.66% <ø> (ø)`
impala	`97.97% <ø> (ø)`
infiniband	`93.71% <ø> (ø)`
istio	`77.86% <ø> (ø)`
jboss_wildfly	`47.36% <ø> (ø)`
kafka	`64.70% <ø> (ø)`
karpenter	`95.06% <ø> (ø)`
keda	`88.05% <ø> (ø)`
kube_apiserver_metrics	`97.75% <ø> (ø)`
kube_controller_manager	`97.88% <ø> (ø)`
kube_dns	`95.94% <ø> (ø)`
kube_metrics_server	`94.87% <ø> (ø)`
kube_proxy	`96.80% <ø> (ø)`
kube_scheduler	`97.92% <ø> (ø)`
kubeflow	`93.22% <ø> (ø)`
kubelet	`91.09% <ø> (ø)`
kubernetes_cluster_autoscaler	`93.22% <ø> (ø)`
kubernetes_state	`89.49% <ø> (ø)`
kubevirt_api	`82.75% <ø> (ø)`
kubevirt_controller	`85.36% <ø> (ø)`
kubevirt_handler	`91.32% <ø> (ø)`
kyototycoon	`85.96% <ø> (ø)`
lighttpd	`83.64% <ø> (ø)`
linkerd	`84.70% <ø> (ø)`
linux_proc_extras	`96.20% <ø> (ø)`
mapr	`82.70% <ø> (ø)`
mapreduce	`81.99% <ø> (ø)`
marathon	`83.06% <ø> (ø)`
mcache	`93.99% <ø> (ø)`
mesos_master	`89.71% <ø> (ø)`
milvus	`92.30% <ø> (ø)`
nagios	`89.01% <ø> (ø)`
network	`93.89% <ø> (ø)`
nfsstat	`95.20% <ø> (ø)`
nginx_ingress_controller	`98.55% <ø> (ø)`
nvidia_nim	`93.10% <ø> (ø)`
nvidia_triton	`88.52% <ø> (ø)`
octopus_deploy	`99.25% <ø> (ø)`
openldap	`96.33% <ø> (ø)`
openmetrics	`98.05% <ø> (ø)`
openstack	`55.11% <ø> (ø)`
php_fpm	`90.45% <ø> (ø)`
postfix	`88.04% <ø> (ø)`
powerdns_recursor	`96.65% <ø> (ø)`
presto	`59.09% <ø> (ø)`
process	`85.99% <ø> (ø)`
prometheus	`94.17% <ø> (ø)`
proxysql	`98.97% <ø> (ø)`
pulsar	`100.00% <ø> (ø)`
quarkus	`100.00% <ø> (ø)`
rethinkdb	`98.27% <ø> (ø)`
riak	`99.21% <ø> (ø)`
riakcs	`88.82% <ø> (ø)`
silk	`93.91% <ø> (ø)`
silverstripe_cms	`76.00% <ø> (ø)`
singlestore	`90.81% <ø> (ø)`
slurm	`90.59% <ø> (ø)`
snowflake	`96.27% <ø> (ø)`
solr	`56.25% <ø> (ø)`
sonatype_nexus	`81.88% <ø> (ø)`
squid	`100.00% <ø> (ø)`
ssh_check	`92.20% <ø> (ø)`
statsd	`87.36% <ø> (ø)`
strimzi	`89.78% <ø> (ø)`
supabase	`93.97% <ø> (ø)`
supervisord	`90.14% <ø> (ø)`
system_core	`92.52% <ø> (ø)`
system_swap	`98.30% <ø> (ø)`
tcp_check	`90.72% <ø> (ø)`
tekton	`82.45% <ø> (?)`
teleport	`98.16% <ø> (ø)`
temporal	`100.00% <ø> (ø)`
teradata	`94.27% <ø> (ø)`
tibco_ems	`91.98% <ø> (ø)`
tls	`90.26% <ø> (ø)`
torchserve	`97.32% <ø> (ø)`
traefik_mesh	`76.75% <ø> (ø)`
traffic_server	`96.13% <ø> (ø)`
twemproxy	`79.45% <ø> (ø)`
twistlock	`80.41% <ø> (ø)`
varnish	`84.22% <ø> (ø)`
velero	`85.00% <ø> (ø)`
vllm	`94.44% <ø> (ø)`
weaviate	`76.27% <ø> (?)`
win32_event_log	`86.54% <ø> (ø)`
wmi_check	`92.91% <ø> (ø)`
yarn	`89.93% <ø> (ø)`

nenadnoveljic · 2025-04-09T16:14:51Z

datadog_checks_base/datadog_checks/base/utils/db/statement_metrics.py

+            # 3. Execution indicators: If execution_indicators is specified, only consider a query as changed if at
+            #    least one of the execution indicator metrics has changed. This helps filter out cases where an old or
+            #    less frequently executed normalized query was evicted due to the stats table being full, and then
+            #    re-inserted to the stats table with a small call count and slight duration change. In this case,


Can we detect the case when the query was re-inserted as opposed to the case where the metrics increase, but executions did not?

That is already taken care of: any row with a negative metric diff is discarded.

integrations-core/datadog_checks_base/datadog_checks/base/utils/db/statement_metrics.py

Lines 93 to 98 in fba1d66

# Check for negative values, but only in the columns used for metrics
if any(diffed_row[k] < 0 for k in metric_columns):
    # A "break" might be expected here instead of "continue," but there are cases where a subset of rows
    # are removed. To avoid situations where all results are discarded every check run, we err on the side
    # of potentially including truncated rows that exceed previous run counts.
    continue
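
As a rough illustration of why the snippet above uses continue rather than break, the sketch below skips only the row whose counters went backwards and still reports the others. The helper name positive_diffs and the dict-based inputs are hypothetical and are not part of statement_metrics.py.

```python
def positive_diffs(previous, current, metric_columns):
    """Diff each row against its previous values, skipping rows with any negative diff."""
    kept = []
    for key, row in current.items():
        prev = previous.get(key)
        if prev is None:
            continue  # no baseline yet for this row
        diffed = {k: row[k] - prev[k] for k in metric_columns}
        if any(diffed[k] < 0 for k in metric_columns):
            # Skip this row only; a "break" would also drop every later row.
            continue
        kept.append((key, diffed))
    return kept


previous = {
    "a": {"calls": 10, "total_time": 100},
    "b": {"calls": 5, "total_time": 50},
}
current = {
    "a": {"calls": 2, "total_time": 20},  # counters dropped: evicted and re-inserted
    "b": {"calls": 7, "total_time": 80},  # normal increase
}
kept = positive_diffs(previous, current, ["calls", "total_time"])
```

Here row "a" is discarded because its diffs are negative, while row "b" is still reported.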

amw-zero · 2025-04-09T21:49:41Z

datadog_checks_base/datadog_checks/base/utils/db/statement_metrics.py

+            # If execution_indicators is specified, check if any of the execution indicator metrics have changed
+            if execution_indicators:
+                indicator_columns = execution_indicators & metric_columns
+                if not any(diffed_row[k] > 0 for k in indicator_columns):


I don't think this fully solves the problem. Example execution order:

queryid: -12345 count: 1 duration: 1500 --> query eviction for -12345 occurs --> queryid: -12345 count: 3 duration: 4000

In this case, the count diff is greater than 0 even though an eviction has occurred, so the metric will be reported, but it will have incorrect values.

We chatted about this offline. When the count diff is greater than 0 after a normalized query is evicted and re-inserted, we have no reliable way to tell whether an eviction happened. In this case, the metrics for that normalized query will be incorrect (mostly inflated) for the first check run after the eviction.
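
The scenario from this thread can be reproduced with the numbers given above. This is only a sketch: the variable names are illustrative, and it assumes the counters restart from zero when the query is re-inserted.

```python
# Values seen by two consecutive check runs; an eviction happened in between.
previous = {"calls": 1, "duration": 1500}
current = {"calls": 3, "duration": 4000}

# What a plain diff between the two check runs reports.
reported = {name: current[name] - previous[name] for name in current}

# Assumption: counters restarted at zero on re-insertion, so the current
# values are the activity recorded since the re-insertion.
since_reinsert = dict(current)

# The indicator check passes because the call count diff is positive.
passes_indicator = reported["calls"] > 0
mismatch = reported != since_reinsert

print(reported, since_reinsert, passes_indicator, mismatch)
```

The row passes the 'calls' indicator, yet the reported diff does not match the activity recorded since re-insertion, which is the limitation acknowledged in the reply above.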

add optional execution_indicator to rule out false positive query metrics move

661807c

temporal-github-worker-1 bot added agent/review-requested ecosystems/review-requested product/review-requested labels Apr 8, 2025

datadog-agent-integrations-bot bot added the base_package label Apr 8, 2025

add changelog

67e4788

lu-zhengda added the qa/skip-qa Automatically skip this PR for the next QA label Apr 8, 2025

lu-zhengda added 2 commits April 9, 2025 13:29

update comment

4a391be

update changelog

fba1d66

lu-zhengda marked this pull request as ready for review April 9, 2025 15:50

lu-zhengda requested review from a team as code owners April 9, 2025 15:50

datadog-agent-integrations-bot bot added team/agent-integrations team/database-monitoring-agent labels Apr 9, 2025

nenadnoveljic reviewed Apr 9, 2025

remove oracle and db2 from comments

252713b

amw-zero reviewed Apr 9, 2025

lu-zhengda requested a review from nenadnoveljic April 10, 2025 13:40

nenadnoveljic approved these changes Apr 10, 2025

dkirov-dd approved these changes Apr 14, 2025

temporal-github-worker-1 bot added agent/approved and removed agent/review-requested labels Apr 14, 2025
