Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to specify custom rules to set event.outcome #482

Open
michaelhyatt opened this issue Aug 11, 2021 · 12 comments
Open

Ability to specify custom rules to set event.outcome #482

michaelhyatt opened this issue Aug 11, 2021 · 12 comments

Comments

@michaelhyatt
Copy link

Is your feature request related to a problem? Please describe.
A transaction representing a service invocation success or failure can't always be defined by a return status. Sometimes, HTTP response code 2xx is not a sufficient indication of success and there is a need to explore additional attributes to be able to confidently set event.outcome correctly. The same applies for designated a transaction as a failure.

Describe the solution you'd like
As a service developer, I want to be able to define more complex rules to set event.outcome to success or failure based on the additional information either available as transaction attributes, or through an API.

Describe alternatives you've considered
Using public APIs is possible, but negates the value of autodetection completely.

Additional context
Without this ability, the error rate indication in the APM UI is not showing the correct picture of failures.

@eyalkoren
Copy link
Contributor

eyalkoren commented Aug 11, 2021

Our tracing specs, both for spans and transactions, already encourage (but not mandate) agents to expose such an API.

The Java agent, for example, offers the setOutcome, which I believe is what you are looking for. If this is required for a specific agent that doesn't offer such API, it's better to open an issue in the corresponding agent repo.

Using public APIs is possible, but negates the value of autodetection completely.

This is right, but since you can do that wherever you want in your code, most likely you have the relevant info (e.g. status code), right?

@michaelhyatt
Copy link
Author

Thanks @eyalkoren

This and other use cases can be solved by allowing callback functions to be executed at different stages of a Transaction and Span lifecycles. This feature should help enable some additional useful use cases such as defining event.outcome and collecting additional custom metadata (attributes/labels) while not changing the original application code relying on auto instrumentation.

@eyalkoren
Copy link
Contributor

OK, I read it as if a public API for setting the outcome is a valid alternative that you may not be aware of.

@sorenlouv
Copy link
Member

sorenlouv commented Aug 17, 2021

I just created #489 and then stumbled upon this which sounds like exactly the same. I was recently talking to @felixbarny, @axw and @Mpdreamz about the ability to customize the logic for setting event.outcome=failure without requiring manual instrumentation.

My proposal is to have a config option for configuring this. This could be in the form of a regex that matches against http.response.status_code. Eg ^([4-9]\d{2})$

@felixbarny
Copy link
Member

allowing callback functions to be executed at different stages of a Transaction and Span lifecycles

@michaelhyatt several agents have the concept of filters or processors which sounds like that may solve the use case.

Could you describe some of the use cases for that? Why is the built-in logic not good enough in some cases? What are some concrete custom rules you'd want to add?

@sqren while closely related, the use case you're mentioning is a little different in that it's not about adding arbitrarily complex rules. Instead, it's about whether or not client-side errors should be added to the error rate calculation. A config for HTTP status codes is a bit too limiting, I think because it wouldn't work for protocols like gRPC, for example. But we could have a boolean flag that determines whether client-side errors should be considered a success or failure.
However, for that use case, it may be enough to re-introduce the throughput by result (HTTP 2xx, HTTP 4xx, HTTP 5xx).

@sorenlouv
Copy link
Member

sorenlouv commented Aug 17, 2021

we could have a boolean flag that determines whether client-side errors should be considered a success or failure.

I think this would be a great start.

@michaelhyatt
Copy link
Author

I guess my PoV is really rooted in seeing a lot of the implementations that show error rates as blanks despite having errors. So, I am thinking about fixing the way we capture event.outcome in a declarative and easy way with good sensible defaults is the first step. Then, we can probably enhance it with more complex logic to determine whether the event is an error and determining the type of the error.

Examples:
image
image

@felixbarny
Copy link
Member

The empty error rate chart could mean that you're using an old version of the agent that doesn't set the outcome yet. I assume these services are talking over gRPC?

More generally, in some cases, it can be a valid situation that the agent captures errors but that the transaction error rate is 0. In other words, even if an error is captured within a transaction, the whole transaction may still be successful. If the server just logs an error and takes compensating actions, the transaction is not necessarily a failure. For example, if you're requesting a product page in an online shop and the call to the recommendation service fails, the page may still render fine with a 200.

@michaelhyatt
Copy link
Author

@felixbarny yes, some of them are gRPC and the agents can be a bit old.

@felixbarny
Copy link
Member

Assuming that updating the agents fix the issue, would you still need custom rules to set the outcome? Do you have specific use cases in mind? Do the processor/filter APIs and the set outcome APIs fulfill your needs? If not, why and what would we need to change in your view?

@andymjames
Copy link

andymjames commented Aug 20, 2021

Hi folks! @sqren dropped me a link to this because Dream Machine is also interested in having additional response codes be marked as failures, and he and Casper were so kind as to hop on a Zoom with me recently to watch me stumble around APM, teach me a few things, and collect feedback/observations, and this was something that came up.

I think if I wanted to do this on my own, I'd be looking at writing up some new Express middleware function that I could find a way get inserted near the end of the request chain (after the work is done, but before APM sends the transaction data) so that I could examine the response and decide to manually override the event outcome.

I could see Soren's regex concept being a great bandage to get some form of this capability out there, but I think if you can let the user define their own evaluation function that APM coordinates running at an appropriate timing during the request chain, then it gets really nice.

I could see it working by optionally defining a evaluateTransactionOutcome or finalizeTransaction function that I can pass in as a configuration field when I call apm.start() so that instead of me having to plumb this into some middleware, APM provides the hook for me to just define that middleware function, and it takes care of when to run that middleware.

The power of letting the user have a function instead of just defining it by codes, is that the developer can now consider the status code AND something else about the state of the response in deciding success or failure. That said, for now my use case feels strictly based on the code value, but I could imagine having a more complex decision tree at some point.

@felixbarny
Copy link
Member

so that I could examine the response and decide to manually override the event outcome.

Would you need the web framework's response object for that or are the properties that APM collects about the response enough?
If you need the response object, I think writing a custom Express middleware function is actually the way to go. If you just need the APM transaction, you could either also use a middleware function or a transaction filter.

That said, for now my use case feels strictly based on the code value

For you current use case, how would you like the outcome to be mapped? Is it just that 4xx should count as an error for the server-side transaction (not just 5xx)?
Just being curious, what's the specific cause of the errors in your app? Are you using the RUM agent? If so, are the RUM transactions marked as failure?

One of the reasons why we didn't map 4xx to failure is that it's not the server's "fault" and a rogue client could just artificially increase the error rate by calling a non-existing API (404). Also, the transaction rate graph used to show the throughput grouped by HTTP 2xx/4xx/5xx which lets you also analyze 4xx specifically.
If the client app is traced, it would probably show an increased error rate, unless it takes a compensating action. For example, first tries to call a new version of an endpoint, such as /orders/v2 and fall back to /orders/v1 if v2 is not yet available.

I'm not trying to dismiss your use case, just trying to understand it better, explain the reasoning of the current state, and see if the existing APIs are already covering your use case 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants