-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhancement/allow custom metric buckets #781
base: main
Are you sure you want to change the base?
Enhancement/allow custom metric buckets #781
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, minor suggestion
poetry.lock
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to merge/rebase with main
to fix conflicts
@@ -181,3 +183,54 @@ async def has_log() -> bool: | |||
assert record.levelno == logging.WARNING | |||
assert record.name == f"{logger.name}-sdk_core::temporal_sdk_core::worker::workflow" | |||
assert record.temporal_log.fields["run_id"] == handle.result_run_id # type: ignore | |||
|
|||
|
|||
async def test_prometheus_histogram_bucket_overrides(client: Client): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for completeness sake, can you also add a check for custom metric? Basically just make a histogram override for your custom metric too, just assign Runtime
to a var, and in addition to all that you're doing below, use runtime.metric_meter()
to create/record a custom histogram metric value and confirm it too gets the histogram override.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a custom_histogram
and verified the buckets are updated.
This work is actually NOT accomplishing what I want in the ability to control the binning of temporal_workflow_endtoend_latency_bucket
and temporal_activity_execution_latency_[milliseconds]_bucket
Are you able to tell if I will be able to do this in a PR on this repository or if it will require an update to sdk-core
for that functionality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may have to remove the temporal_
prefix. If that is indeed the case, may be a good thing to document where the override attr is defined.
f656462
to
af83eea
Compare
histogram_overrides = { | ||
"temporal_long_request_latency": [special_value / 2, special_value], | ||
"custom_histogram": [special_value / 2, special_value], | ||
# "temporal_workflow_endtoend_latency": [special_value / 2, special_value], # This still does not work :( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drawing attention here. If i include this in the check it will fail. temporal_workflow_endtoend_latency
still appears in the metrics endpoint but the binning is not updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tried "workflow_endtoend_latency": [special_value / 2, special_value]
? And does temporal_long_request_latency
work as expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
temporal_long_request_latency
does work as expected. This test passes as is.
workflow_endtoend_latency
does not update the buckets. In the metrics endpoint I see
# TYPE temporal_workflow_endtoend_latency histogram
temporal_workflow_endtoend_latency_bucket{namespace="default",service_name="temporal-core-sdk",task_queue="task-queue-5af26b55-afbc-4f20-ad5d-2d900f7fe453",workflow_type="HelloWorkflow",le="100"} 1
I have tried severl variations ... temporal_workflow_endtoend_latency
, workflow_endtoend_latency
. I see in the sdk-core
that the binning is defined as as function and am concerned that the histogram override does not apply in those but am not familiar enough with rust to know definitively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see the problem. You are not passing your client_with_overrides
to the worker, you're passing the client
that comes from the test session that does not use this Runtime
. Change the first parameter of the Worker
to be client_with_overrides
. Also use client_with_overrides
as the one to execute_workflow
instead of client
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the local definition of client
within run_workflow
is already set to client_with_overrides
.
I did try it though but still do not seem to be able to effect binning on the workflow end to end histogram
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this seem related to temporalio/sdk-core#873?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this seem related to temporalio/sdk-core#873?
Seems similar that custom binning is applied to some histograms but not others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know of a reason that the application of custom binning would work for some histograms but not the workflow or activity related ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it's not immediately clear to me. I will need to look into that bug when I have a moment, which hopefully would be next week sometime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TheCodeWrangler - now that we have upgraded Core with this fix, there should be no histograms missing these buckets. I have merged main
back into this branch. Want to uncomment and see if your test now passes? If so, we can merge.
@TheCodeWrangler - may also need to run |
Working to allow for custom histogram binning to be applied to prometheus metrics. Ideally this would be able to be applied to activities as well as workflows. The current implementation appears to not be applied to workflow end-to-end metrics.
What was changed
Added parameter for "histogram_bucket_overrides" to
PrometheusConfig
as well as the bridge to the sdk-core.Added a test case for checking that custom binning is applied. Confirmed that the metrics endpoint for custom metrics were updated but not updated for workflow end to end latencies.
Why?
I was facing a limitation on viewing long running applications due to the default maximum activity bin being 60 seconds.
Checklist
Closes 777
How was this tested: