
Performance: filter spans based on duration #852

Closed
cdalexndr opened this issue Sep 18, 2019 · 5 comments

@cdalexndr

cdalexndr commented Sep 18, 2019

Using the APM, my span index is full of db spans (span.type:db) containing lots of fast queries that are of little interest.
A way to filter these spans based on duration would help ignore fast queries and trace only the important long-running ones.
Sampling is not a good fit, because it can also drop important long queries.

I suggest a dynamic configuration option based on span type and subtype:
span_min_duration_[type]_[subtype]=[ms]

Example:

  • span_min_duration_db=500ms will record only spans of type db (any subtype) that are longer than 500ms
  • span_min_duration_app_maintenance=100ms will record only spans of type app and subtype maintenance that are longer than 100ms
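A minimal sketch of how an agent might interpret such keys, assuming a hypothetical option-name scheme and a simplified span model (this is not the actual Elastic APM agent API, just an illustration of the proposal):

```python
# Sketch: resolve per-type/subtype minimum-duration thresholds for spans.
# Option names and the span model are hypothetical, matching the proposal above.

def parse_thresholds(config):
    """Turn keys like span_min_duration_db or span_min_duration_app_maintenance
    into a lookup of (type, subtype) -> threshold in milliseconds."""
    prefix = "span_min_duration_"
    thresholds = {}
    for key, value in config.items():
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].split("_", 1)  # type, then optional subtype
        span_type = parts[0]
        subtype = parts[1] if len(parts) > 1 else None
        thresholds[(span_type, subtype)] = float(value.removesuffix("ms"))
    return thresholds

def should_record(span_type, subtype, duration_ms, thresholds):
    # Most specific rule wins: (type, subtype) before (type, any subtype).
    for key in ((span_type, subtype), (span_type, None)):
        if key in thresholds:
            return duration_ms >= thresholds[key]
    return True  # no rule configured: record everything

config = {"span_min_duration_db": "500ms",
          "span_min_duration_app_maintenance": "100ms"}
rules = parse_thresholds(config)
```

With this lookup order, a subtype-specific rule overrides a type-wide one, and spans with no matching rule are always recorded.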

Related: #440

@cdalexndr
Author

In my APM span index, out of 150,000 db spans, only 200 are above 10ms and 50 are above 50ms.
Filtering at 50ms would eliminate about 99.97% of db spans.
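The arithmetic behind that figure (keeping 50 of 150,000 spans):

```python
total, kept = 150_000, 50
eliminated = (total - kept) / total * 100
print(f"{eliminated:.4f}% of db spans eliminated")  # 99.9667%
```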

@eyalkoren
Contributor

This is doable, but there are some pitfalls to be aware of (see discussion in #440):

  • We can't discard spans that transfer execution context, whether internal (async) or external (distributed tracing)
  • We can't discard spans leading to non-discarded spans, otherwise we will end up with orphan spans
  • We should still collect metrics for discarded spans (e.g. breakdown metrics)
  • This will lead to inconsistent captures: traces of exactly the same execution will look different. You may have transactions doing lots of DB queries, all short, with no indication of that at all. This seems a valid concern for the numbers described above.
  • Once we add the service map feature, this will introduce the risk of missing a DB (or other service) entirely if all queries to it in the queried time range are below the threshold. This depends on the service map implementation, but according to the current plan it is a valid risk, and again seems a valid concern for the numbers described above.

In addition, I'm not sure this dynamic naming of config options is the right way to go.
Maybe adding a single threshold, regardless of span type, is good enough as a start.
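A toy sketch of a discard check that respects the first two constraints above (context carriers and parents of kept spans must survive); the span model and field names are made up for illustration, not taken from any agent:

```python
# Hypothetical span record: these field names are illustrative only.
class Span:
    def __init__(self, duration_ms, propagates_context=False, child_spans=()):
        self.duration_ms = duration_ms
        self.propagates_context = propagates_context  # async handoff / distributed tracing
        self.child_spans = list(child_spans)

def discardable(span, threshold_ms):
    if span.duration_ms >= threshold_ms:
        return False  # slow enough to keep
    if span.propagates_context:
        return False  # context carriers must survive
    # A span leading to any kept descendant must be kept, or the child is orphaned.
    return all(discardable(child, threshold_ms) for child in span.child_spans)
```

Note the recursion: a fast parent of a slow child is kept, which is exactly the "no orphan spans" rule.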

@cdalexndr
Author

My example scales: in a 2GB span index with 11.5M docs, 95% of spans are db type, and more than 99% of db spans have a duration below 4ms. Such a waste of disk space...

Until this issue is resolved, I'm thinking of disabling db spans and using the Postgres feature that logs queries above a duration threshold, then parsing the log with Logstash and manually cross-referencing log timestamps with transactions to find the source.
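For reference, the Postgres setting for this is `log_min_duration_statement`, and the lines it emits look roughly like `LOG:  duration: 152.443 ms  statement: SELECT ...`. A rough parsing sketch of that workaround (the prefix before `LOG:` depends on `log_line_prefix`, so treat the regex as an assumption about your log format):

```python
import re

# Matches the "duration: ... ms  statement: ..." part that Postgres emits when
# log_min_duration_statement is set; we anchor only on that portion because the
# text before "LOG:" varies with log_line_prefix.
SLOW_QUERY = re.compile(r"duration: (?P<ms>[\d.]+) ms\s+statement: (?P<sql>.*)")

def parse_slow_query(line):
    """Return (duration_ms, sql) for a slow-query log line, else None."""
    m = SLOW_QUERY.search(line)
    if not m:
        return None
    return float(m.group("ms")), m.group("sql").strip()
```

In a Logstash pipeline the equivalent would be a grok/dissect filter, but the matching logic is the same.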

@SylvainJuge
Member

Hi @cdalexndr, we have an upcoming feature called "compressed spans" that should cover your use case here.

Here are the related issues:

So for now I would suggest that you subscribe to those issues, and I'll close this one.
Feel free to comment/re-open if relevant, though.
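For context, the idea behind span compression is to merge runs of similar short exit spans into a single composite span, so 150,000 fast queries don't become 150,000 documents. A toy sketch of the concept, compressing consecutive spans with the same name (the real feature's matching rules and data model differ; this only illustrates the idea):

```python
def compress(spans):
    """Merge consecutive spans with the same name into one composite entry.
    Each span is a (name, duration_ms) tuple; the output adds a count."""
    out = []
    for name, duration in spans:
        if out and out[-1][0] == name:
            prev_name, prev_dur, count = out[-1]
            out[-1] = (name, prev_dur + duration, count + 1)
        else:
            out.append((name, duration, 1))
    return out
```

Unlike a duration filter, this keeps an indication that the fast queries happened (count and total duration) while still collapsing the storage cost.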

@SylvainJuge
Member

Also related: there is another upcoming feature to handle fast exit spans (which include DB spans):
