[dbm] add optional execution_indicator to rule out false positive query metrics move #20037

Open
wants to merge 5 commits into
base: master
from

Conversation

lu-zhengda
Contributor

@lu-zhengda lu-zhengda commented Apr 8, 2025

What does this PR do?

This PR adds execution indicators to StatementMetrics.

  • Adds an optional execution_indicators parameter to StatementMetrics.compute_derivative_rows.
  • When execution_indicators is specified, a query is only considered executed if at least one of the specified metrics has changed.
  • Backward compatible: an empty execution_indicators behaves like the original implementation.

For example, in PostgreSQL we can use 'calls' as the execution indicator:

statement_metrics.compute_derivative_rows(
    rows=rows,
    metrics=['calls', 'total_time', 'rows'],
    key=key_func,
    execution_indicators=['calls']  # Only consider queries as executed if call count changed
)
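
The filtering rule described above can be sketched as a standalone function. This is a simplified illustration, not the code in this PR: `derive_rows` and its arguments are hypothetical names, it works on plain dicts instead of StatementMetrics state, and the "skip rows where nothing changed" step is an assumption about the original behavior.

```python
def derive_rows(previous, current, metrics, execution_indicators=None):
    """Return per-key metric deltas between two snapshots (hypothetical helper).

    previous/current map a query key to a dict of cumulative metric values.
    When execution_indicators is non-empty, a row is kept only if at least
    one of those metrics increased.
    """
    metric_columns = set(metrics)
    indicators = set(execution_indicators or ())
    derived = {}
    for key, row in current.items():
        prev = previous.get(key)
        if prev is None:
            continue  # first sighting only establishes a baseline
        diff = {name: row[name] - prev[name] for name in metrics}
        if any(value < 0 for value in diff.values()):
            continue  # negative diffs are discarded
        if all(value == 0 for value in diff.values()):
            continue  # assumption: nothing changed, nothing to report
        if indicators and not any(diff[name] > 0 for name in indicators & metric_columns):
            continue  # duration moved but no execution indicator did
        derived[key] = diff
    return derived


previous = {"q1": {"calls": 1, "total_time": 100}}
current = {"q1": {"calls": 1, "total_time": 105}}

print(derive_rows(previous, current, ["calls", "total_time"]))
# {'q1': {'calls': 0, 'total_time': 5}}
print(derive_rows(previous, current, ["calls", "total_time"], ["calls"]))
# {}
```

With no indicators the row is reported even though `calls` did not move; with `calls` as the indicator it is dropped.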

Motivation

https://datadoghq.atlassian.net/browse/DBMON-4170
In database query metrics monitoring, we've observed cases where duration metrics (like total_time) change slightly between check runs while execution counts stay the same. These small changes usually occur when an old or infrequently used normalized query is evicted from the stats table (e.g. pg_stat_statements) and then re-inserted with the same call count (usually 1) and a slightly different duration. In this case, the newly inserted row should be treated as the baseline metric for future diffs.
Below is an example of a normalized query being evicted and re-inserted.

 postgres | 16672 | -6478076666730767487 |     1 | 1014.6086999999999 | /*dddbs='orders-app',ddps='orders-app',traceparent='00-00000000000000005d46df6263f055fa-5d46df6263f055fa-00'*/ INSERT INTO inventory_views (view_date) +
          |       |                      |       |                    |         SELECT generate_series( +
          |       |                      |       |                    |                 $1::timestamp, +
          |       |                      |       |                    |                 $2::timestamp, +
          |       |                      |       |                    |                 $3::interval +
          |       |                      |       |                    |         )

 postgres | 16672 | -6478076666730767487 |     1 |          1015.0188 | /*dddbs='orders-app',ddps='orders-app',traceparent='00-0000000000000000788bd2d62ee202f9-788bd2d62ee202f9-00'*/ INSERT INTO inventory_views (view_date) +
          |       |                      |       |                    |         SELECT generate_series( +
          |       |                      |       |                    |                 $1::timestamp, +
          |       |                      |       |                    |                 $2::timestamp, +
          |       |                      |       |                    |                 $3::interval +
          |       |                      |       |                    |         )

Both rows have the same queryid (-6478076666730767487) because the queries are structurally identical. We can tell this is an eviction followed by a re-insertion from the different traceparent values in the comments. When this happens, the call count diff is 0, but the execution duration still shows a small change: the diff between 1015.0188 ms (after) and 1014.6086999999999 ms (before).
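
As a worked version of the arithmetic above, the snippet below diffs the two rows. The key names `calls` and `total_time` are assumptions chosen for illustration, since the pasted output has no column headers.

```python
# Values copied from the two rows shown above; the key names are assumed.
before = {"calls": 1, "total_time": 1014.6086999999999}
after = {"calls": 1, "total_time": 1015.0188}

diff = {name: after[name] - before[name] for name in before}

# The call count did not move, so 'calls' as an execution indicator
# would classify this row as not executed.
calls_changed = diff["calls"] > 0

print(diff["calls"])                 # 0
print(round(diff["total_time"], 4))  # 0.4101
print(calls_changed)                 # False
```

Without an execution indicator, the roughly 0.41 ms duration diff would be reported even though no new execution was observed.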

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

codecov bot commented Apr 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.74%. Comparing base (183a411) to head (252713b).
Report is 28 commits behind head on master.

Additional details and impacted files
Flag Coverage Δ
active_directory 93.18% <ø> (ø)
activemq 52.80% <ø> (ø)
activemq_xml 82.20% <ø> (ø)
amazon_msk 89.50% <ø> (ø)
ambari 85.75% <ø> (ø)
apache 95.08% <ø> (ø)
appgate_sdp 93.93% <ø> (ø)
arangodb 98.23% <ø> (ø)
argo_rollouts 90.00% <ø> (ø)
argo_workflows 89.61% <ø> (ø)
argocd 87.23% <ø> (ø)
aspdotnet 100.00% <ø> (ø)
avi_vantage 93.83% <ø> (ø)
aws_neuron 92.42% <ø> (ø)
azure_iot_edge 82.08% <ø> (ø)
boundary 100.00% <ø> (ø)
btrfs 83.33% <ø> (ø)
cacti 87.90% <ø> (ø)
calico 84.61% <ø> (ø)
cassandra 66.66% <ø> (ø)
celery 95.45% <ø> (ø)
cert_manager 77.41% <ø> (ø)
cisco_aci 89.52% <ø> (ø)
citrix_hypervisor 87.45% <ø> (ø)
cloud_foundry_api 96.11% <ø> (ø)
cloudera 99.51% <ø> (ø)
cockroachdb 92.98% <ø> (ø)
consul 91.92% <ø> (ø)
coredns 95.65% <ø> (ø)
crio 89.79% <ø> (ø)
datadog_checks_base 89.07% <100.00%> (-0.05%) ⬇️
datadog_checks_dev 77.58% <ø> (ø)
datadog_checks_downloader 81.37% <ø> (+3.22%) ⬆️
datadog_cluster_agent 90.19% <ø> (ø)
dcgm 93.54% <ø> (ø)
ddev 87.17% <ø> (+0.01%) ⬆️
directory 96.88% <ø> (ø)
disk 85.68% <ø> (ø)
dns_check 93.84% <ø> (ø)
druid 97.70% <ø> (ø)
duckdb 84.53% <ø> (ø)
ecs_fargate 83.71% <ø> (ø)
eks_fargate 94.05% <ø> (ø)
esxi 93.98% <ø> (+0.05%) ⬆️
etcd 95.56% <ø> (ø)
external_dns 89.28% <ø> (ø)
fluentd 84.21% <ø> (ø)
fluxcd 88.31% <ø> (ø)
fly_io 97.13% <ø> (ø)
foundationdb 82.64% <ø> (ø)
gitlab_runner 92.76% <ø> (ø)
glusterfs 80.00% <ø> (ø)
go_expvar 92.66% <ø> (ø)
gunicorn 92.91% <ø> (+0.74%) ⬆️
hazelcast 92.30% <ø> (ø)
hdfs_datanode 89.63% <ø> (ø)
hdfs_namenode 86.60% <ø> (ø)
hive 51.42% <ø> (ø)
hivemq 61.90% <ø> (ø)
http_check 94.26% <ø> (ø)
hudi 73.91% <ø> (?)
ibm_db2 86.29% <ø> (ø)
ibm_i 82.36% <ø> (ø)
ignite 46.66% <ø> (ø)
impala 97.97% <ø> (ø)
infiniband 93.71% <ø> (ø)
istio 77.86% <ø> (ø)
jboss_wildfly 47.36% <ø> (ø)
kafka 64.70% <ø> (ø)
karpenter 95.06% <ø> (ø)
keda 88.05% <ø> (ø)
kube_apiserver_metrics 97.75% <ø> (ø)
kube_controller_manager 97.88% <ø> (ø)
kube_dns 95.94% <ø> (ø)
kube_metrics_server 94.87% <ø> (ø)
kube_proxy 96.80% <ø> (ø)
kube_scheduler 97.92% <ø> (ø)
kubeflow 93.22% <ø> (ø)
kubelet 91.09% <ø> (ø)
kubernetes_cluster_autoscaler 93.22% <ø> (ø)
kubernetes_state 89.49% <ø> (ø)
kubevirt_api 82.75% <ø> (ø)
kubevirt_controller 85.36% <ø> (ø)
kubevirt_handler 91.32% <ø> (ø)
kyototycoon 85.96% <ø> (ø)
lighttpd 83.64% <ø> (ø)
linkerd 84.70% <ø> (ø)
linux_proc_extras 96.20% <ø> (ø)
mapr 82.70% <ø> (ø)
mapreduce 81.99% <ø> (ø)
marathon 83.06% <ø> (ø)
mcache 93.99% <ø> (ø)
mesos_master 89.71% <ø> (ø)
milvus 92.30% <ø> (ø)
nagios 89.01% <ø> (ø)
network 93.89% <ø> (ø)
nfsstat 95.20% <ø> (ø)
nginx_ingress_controller 98.55% <ø> (ø)
nvidia_nim 93.10% <ø> (ø)
nvidia_triton 88.52% <ø> (ø)
octopus_deploy 99.25% <ø> (ø)
openldap 96.33% <ø> (ø)
openmetrics 98.05% <ø> (ø)
openstack 55.11% <ø> (ø)
php_fpm 90.45% <ø> (ø)
postfix 88.04% <ø> (ø)
powerdns_recursor 96.65% <ø> (ø)
presto 59.09% <ø> (ø)
process 85.99% <ø> (ø)
prometheus 94.17% <ø> (ø)
proxysql 98.97% <ø> (ø)
pulsar 100.00% <ø> (ø)
quarkus 100.00% <ø> (ø)
rethinkdb 98.27% <ø> (ø)
riak 99.21% <ø> (ø)
riakcs 88.82% <ø> (ø)
silk 93.91% <ø> (ø)
silverstripe_cms 76.00% <ø> (ø)
singlestore 90.81% <ø> (ø)
slurm 90.59% <ø> (ø)
snowflake 96.27% <ø> (ø)
solr 56.25% <ø> (ø)
sonatype_nexus 81.88% <ø> (ø)
squid 100.00% <ø> (ø)
ssh_check 92.20% <ø> (ø)
statsd 87.36% <ø> (ø)
strimzi 89.78% <ø> (ø)
supabase 93.97% <ø> (ø)
supervisord 90.14% <ø> (ø)
system_core 92.52% <ø> (ø)
system_swap 98.30% <ø> (ø)
tcp_check 90.72% <ø> (ø)
tekton 82.45% <ø> (?)
teleport 98.16% <ø> (ø)
temporal 100.00% <ø> (ø)
teradata 94.27% <ø> (ø)
tibco_ems 91.98% <ø> (ø)
tls 90.26% <ø> (ø)
torchserve 97.32% <ø> (ø)
traefik_mesh 76.75% <ø> (ø)
traffic_server 96.13% <ø> (ø)
twemproxy 79.45% <ø> (ø)
twistlock 80.41% <ø> (ø)
varnish 84.22% <ø> (ø)
velero 85.00% <ø> (ø)
vllm 94.44% <ø> (ø)
weaviate 76.27% <ø> (?)
win32_event_log 86.54% <ø> (ø)
wmi_check 92.91% <ø> (ø)
yarn 89.93% <ø> (ø)

Flags with carried forward coverage won't be shown.

@lu-zhengda lu-zhengda added the qa/skip-qa Automatically skip this PR for the next QA label Apr 8, 2025
# 3. Execution indicators: If execution_indicators is specified, only consider a query as changed if at
# least one of the execution indicator metrics has changed. This helps filter out cases where an old or
# less frequently executed normalized query was evicted due to the stats table being full, and then
# re-inserted to the stats table with a small call count and slight duration change. In this case,
Contributor

Can we detect the case when the query was re-inserted as opposed to the case where the metrics increase, but executions did not?

Contributor Author

That is already taken care of: any negative metric diffs are discarded.

Contributor Author

# Check for negative values, but only in the columns used for metrics
if any(diffed_row[k] < 0 for k in metric_columns):
    # A "break" might be expected here instead of "continue," but there are cases where a subset of rows
    # are removed. To avoid situations where all results are discarded every check run, we err on the side
    # of potentially including truncated rows that exceed previous run counts.
    continue

# If execution_indicators is specified, check if any of the execution indicator metrics have changed
if execution_indicators:
    indicator_columns = execution_indicators & metric_columns
    if not any(diffed_row[k] > 0 for k in indicator_columns):
Contributor

I don't think this fully solves the problem. Example execution order:

queryid: -12345
count:     1
duration: 1500

-->

query eviction for -12345 occurs

-->

queryid: -12345
count: 3
duration: 4000

In this case, the count diff is greater than 0 even though an eviction has occurred, so the metric will be reported, but it will have incorrect values.

Contributor Author

We chatted about this offline. When the count diff is greater than 0 after a normalized query is evicted and re-inserted, we have no reliable way to tell that an eviction happened. In this case, the metrics for this normalized query will be incorrect (mostly inflated) for the first check run after the eviction.
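
The limitation discussed in this thread can be illustrated with the numbers from the example. This is only a sketch; the key names are hypothetical, and it shows why the indicator check cannot distinguish this case, not how the PR handles it.

```python
# Snapshot taken before the eviction, and the row seen after re-insertion.
before_eviction = {"count": 1, "duration": 1500}
after_reinsertion = {"count": 3, "duration": 4000}

# The diff is computed against the stale pre-eviction baseline.
diff = {name: after_reinsertion[name] - before_eviction[name] for name in before_eviction}

# The execution indicator (count) increased, so the row passes the filter
# even though the counters were reset in between.
passes_indicator = diff["count"] > 0

print(diff)              # {'count': 2, 'duration': 2500}
print(passes_indicator)  # True
```

Nothing in the two snapshots signals that the counters restarted, so the reported values for that check run are based on a baseline that no longer applies.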

4 participants