Enhancement/allow custom metric buckets #781

Open
wants to merge 11 commits into base: main

TheCodeWrangler
Contributor

@TheCodeWrangler TheCodeWrangler commented Mar 3, 2025

This PR works toward allowing custom histogram buckets to be applied to Prometheus metrics. Ideally the overrides would apply to activity metrics as well as workflow metrics. The current implementation does not appear to apply to the workflow end-to-end metrics.

What was changed

Added a "histogram_bucket_overrides" parameter to PrometheusConfig, as well as to the bridge to sdk-core.

Added a test case that checks custom binning is applied. Confirmed that the buckets on the metrics endpoint were updated for custom metrics, but not for workflow end-to-end latencies.

Why?

I was limited in observing long-running applications because the default maximum activity bucket is 60 seconds.
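
To illustrate the limitation: Prometheus histograms count each observation into cumulative `le` buckets, so any duration above the largest finite bound only shows up in `+Inf`. The following is a minimal sketch using only the standard library. The function name `cumulative_buckets`, the bounds, and the durations are all made up for illustration; apart from the 60-second maximum mentioned above, they are not the SDK's actual defaults.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence


def cumulative_buckets(
    bounds: Sequence[float], observations: Sequence[float]
) -> Dict[str, int]:
    """Count observations into cumulative Prometheus-style `le` buckets."""
    ordered: List[float] = sorted(bounds)
    per_bucket = [0] * (len(ordered) + 1)  # the last slot is the +Inf bucket
    for value in observations:
        # bisect_left returns the index of the first bound >= value,
        # which is the smallest bucket satisfying value <= le.
        per_bucket[bisect_left(ordered, value)] += 1

    counts: Dict[str, int] = {}
    running = 0
    for bound, count in zip(ordered, per_bucket):
        running += count
        counts[str(bound)] = running
    counts["+Inf"] = running + per_bucket[-1]
    return counts


# Hypothetical activity durations in seconds.
durations = [5, 90, 600, 3600]

# Largest finite bound is 60 s: three of the four durations land only in +Inf.
print(cumulative_buckets([1, 10, 60], durations))
# {'1': 0, '10': 1, '60': 1, '+Inf': 4}

# Hypothetical custom bounds that resolve the long-running durations.
print(cumulative_buckets([60, 600, 3600], durations))
# {'60': 1, '600': 3, '3600': 4, '+Inf': 4}
```

With the first set of bounds, everything longer than a minute is indistinguishable; with the second, each long-running duration falls into its own bucket.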

Checklist

  1. Closes 777

  2. How was this tested:

  3. Any docs updates needed?

Member

@cretz cretz left a comment

Looks great, minor suggestion

poetry.lock Outdated
Member

Need to merge/rebase with main to fix conflicts

@@ -181,3 +183,54 @@ async def has_log() -> bool:
    assert record.levelno == logging.WARNING
    assert record.name == f"{logger.name}-sdk_core::temporal_sdk_core::worker::workflow"
    assert record.temporal_log.fields["run_id"] == handle.result_run_id  # type: ignore


async def test_prometheus_histogram_bucket_overrides(client: Client):
Member

Just for completeness' sake, can you also add a check for a custom metric? Basically, make a histogram override for your custom metric too: assign the Runtime to a variable and, in addition to everything you're doing below, use runtime.metric_meter() to create and record a custom histogram metric value, then confirm that it also gets the histogram override.

Contributor Author

I have added a custom_histogram and verified the buckets are updated.

However, this work does NOT yet accomplish what I actually want: the ability to control the binning of temporal_workflow_endtoend_latency_bucket and temporal_activity_execution_latency_[milliseconds]_bucket.

Can you tell whether I can do this in a PR on this repository, or whether that functionality will require an update to sdk-core?

Member

You may have to remove the temporal_ prefix. If that is indeed the case, it may be a good thing to document where the override attribute is defined.

@TheCodeWrangler TheCodeWrangler marked this pull request as draft March 7, 2025 13:31
@TheCodeWrangler TheCodeWrangler force-pushed the enhancement/allow-custom-metric-buckets branch from f656462 to af83eea Compare March 7, 2025 13:34
@TheCodeWrangler TheCodeWrangler marked this pull request as ready for review March 7, 2025 14:12
histogram_overrides = {
    "temporal_long_request_latency": [special_value / 2, special_value],
    "custom_histogram": [special_value / 2, special_value],
    # "temporal_workflow_endtoend_latency": [special_value / 2, special_value],  # This still does not work :(
Contributor Author

Drawing attention here: if I include this in the check, it will fail. temporal_workflow_endtoend_latency still appears in the metrics endpoint, but the binning is not updated.

Member

Have you tried "workflow_endtoend_latency": [special_value / 2, special_value]? And does temporal_long_request_latency work as expected?

Contributor Author

temporal_long_request_latency does work as expected. This test passes as is.

workflow_endtoend_latency does not update the buckets. In the metrics endpoint I see:

# TYPE temporal_workflow_endtoend_latency histogram
temporal_workflow_endtoend_latency_bucket{namespace="default",service_name="temporal-core-sdk",task_queue="task-queue-5af26b55-afbc-4f20-ad5d-2d900f7fe453",workflow_type="HelloWorkflow",le="100"} 1

I have tried several variations (temporal_workflow_endtoend_latency, workflow_endtoend_latency). I see in sdk-core that the binning is defined as a function, and I am concerned that the histogram override does not apply to those, but I am not familiar enough with Rust to know definitively.
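
One way to check which bucket boundaries a histogram actually exposes is to read the `le` labels from the Prometheus text output. The following is a minimal sketch using only the standard library. The function name `bucket_bounds` is hypothetical (it is not part of the SDK or of this PR's test), and the sample text is illustrative rather than real endpoint output.

```python
import re
from typing import List

# Matches one histogram bucket sample line, e.g.
#   some_metric_bucket{label="x",le="100"} 1
_BUCKET_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)_bucket\{(?P<labels>.*)\}\s+\S+'
)
_LE_LABEL = re.compile(r'(?:^|,)le="(?P<le>[^"]+)"')


def bucket_bounds(metrics_text: str, histogram_name: str) -> List[float]:
    """Return the sorted, de-duplicated `le` bounds reported for one histogram."""
    bounds = set()
    for line in metrics_text.splitlines():
        match = _BUCKET_LINE.match(line.strip())
        if not match or match.group("name") != histogram_name:
            continue
        le = _LE_LABEL.search(match.group("labels"))
        if le:
            # float() accepts "+Inf" and returns positive infinity.
            bounds.add(float(le.group("le")))
    return sorted(bounds)


# Illustrative sample only; the bounds here are invented for the example.
sample = """# TYPE temporal_workflow_endtoend_latency histogram
temporal_workflow_endtoend_latency_bucket{namespace="default",workflow_type="HelloWorkflow",le="100"} 1
temporal_workflow_endtoend_latency_bucket{namespace="default",workflow_type="HelloWorkflow",le="500"} 1
temporal_workflow_endtoend_latency_bucket{namespace="default",workflow_type="HelloWorkflow",le="+Inf"} 1
"""

print(bucket_bounds(sample, "temporal_workflow_endtoend_latency"))
# [100.0, 500.0, inf]
```

A test could then assert that every overridden boundary appears in the returned list, which makes it obvious when a histogram is still using its default buckets.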

Member

@cretz cretz Mar 7, 2025

Oh, I see the problem. You are not passing your client_with_overrides to the worker, you're passing the client that comes from the test session that does not use this Runtime. Change the first parameter of the Worker to be client_with_overrides. Also use client_with_overrides as the one to execute_workflow instead of client.

Contributor Author

I think that the local definition of client within run_workflow is already set to client_with_overrides.

I did try it, though, and I still do not seem to be able to affect the binning of the workflow end-to-end histogram.

Contributor

Does this seem related to temporalio/sdk-core#873?

Contributor Author

@TheCodeWrangler TheCodeWrangler Mar 7, 2025

Does this seem related to temporalio/sdk-core#873?

It seems similar, in that custom binning is applied to some histograms but not others.

Contributor Author

@Sushisource

Do you know of a reason why custom binning would be applied to some histograms but not to the workflow- or activity-related ones?

Member

No, it's not immediately clear to me. I will need to look into that bug when I have a moment, which will hopefully be sometime next week.

Member

@TheCodeWrangler - now that we have upgraded Core with this fix, there should be no histograms missing these buckets. I have merged main back into this branch. Want to uncomment and see if your test now passes? If so, we can merge.

Member

cretz commented Apr 3, 2025

@TheCodeWrangler - may also need to run poe format on the source

4 participants