Index span.db.statement as text #6586
Comments
@matschaffer thanks for opening. Originally this field was intentionally not indexed, to avoid incurring high storage cost. I would like to revisit this decision, ideally with some data to show the cost is worthwhile. Would |
For the purpose I had above, definitely. Though I could imagine cases where the ordering of words in the statement is important. For example, if it were a SQL query, you might be interested in whether the table appeared before or after the join. Just conjecture on that, though.
As for the cost, have we gotten hard numbers with and without the mapping? Seems like maybe we could add the mapping to a large index and check before/after. I'm sure ESS internal clusters have lots of large indices we could borrow for such a purpose.
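To make the word-order point concrete, here is a minimal sketch of two Elasticsearch query bodies built in Python, assuming `span.db.statement` is mapped as `text`. The helper names and the sample search string are hypothetical; `match` and `match_phrase` are the standard query types. A `match` query with `operator: and` ignores term order, while `match_phrase` with `slop` 0 requires the terms to appear adjacent and in the given order.

```python
import json

FIELD = "span.db.statement"  # field discussed in this issue


def any_order_query(text):
    """All terms must be present, in any order (hypothetical helper)."""
    return {"query": {"match": {FIELD: {"query": text, "operator": "and"}}}}


def in_order_query(text, slop=0):
    """Terms must appear as a phrase; slop=0 means adjacent and in order."""
    return {"query": {"match_phrase": {FIELD: {"query": text, "slop": slop}}}}


if __name__ == "__main__":
    # Hypothetical search string; the bodies would be sent to <index>/_search.
    print(json.dumps(any_order_query("join orders"), indent=2))
    print(json.dumps(in_order_query("join orders"), indent=2))
```

Whether phrase matching is precise enough for "table before or after the join" depends on how the analyzer tokenizes the statement, so treat this as a starting point only.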
No, that's really the only thing holding us back at the moment. |
Cool. I was able to simply add the mapping to get the functionality. I didn't check on-disk size before/after, but I could. Is there a particular disk number you think would be most useful? Is the |
I think |
I did a little poking at this today. I think we can probably get the ratio with:
And the store sizing info with:
|
Here's one from us-east-1 staging metrics cluster Before:
|
Wonder if it might make sense to use @dgieselaar's apm synthtrace to just generate a bunch of random db statements as part of the span data. Not as nice as "real", but it looks like our biggest "real" data source doesn't record db statements at all today.
@matschaffer oh, bummer. Thanks for digging into it anyway.
++ I'd like to use synthtrace (some kind of generator) to generate APM events, pass them through apm-server, and then use the resulting docs in Rally (#6115). Then we can measure the impact of mapping changes like, while keeping the results in sync with the changes in how apm-server structures docs. |
Btw. from 7.15 you can get the actual disk usage per field, which is really useful to test mapping changes. https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-disk-usage.html |
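As a rough illustration of how the per-field numbers could be read, here is a small Python sketch. It assumes the response shape described in the linked docs (an entry per index with `store_size_in_bytes` and a `fields` object whose entries carry `total_in_bytes`); the helper names are hypothetical, and the sample response uses made-up numbers, not measurements from any cluster.

```python
def disk_usage_request(index):
    """Return the HTTP method and path for the analyze-index-disk-usage API."""
    return "POST", f"/{index}/_disk_usage?run_expensive_tasks=true"


def field_share(response, index, field):
    """Return (bytes used by field, fraction of the index store size)."""
    stats = response[index]
    field_bytes = stats["fields"][field]["total_in_bytes"]
    return field_bytes, field_bytes / stats["store_size_in_bytes"]


if __name__ == "__main__":
    # Made-up sample response, trimmed to the keys this sketch reads.
    sample = {
        "my-span-index": {
            "store_size_in_bytes": 1000,
            "fields": {"span.db.statement": {"total_in_bytes": 250}},
        }
    }
    print(disk_usage_request("my-span-index"))
    print(field_share(sample, "my-span-index", "span.db.statement"))
```

Running the same calculation before and after a mapping change, on the same documents, gives a direct view of what the field costs on disk.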
Also wanted to note, |
I wanted to note that |
@tobiasstadler we already store |
No, I don't think so. Sorry, I didn't know (or forgot) that the value is already stored.
No worries! Thanks for raising your concern anyway. |
I did the following in our test cluster:
and
Now
results in
apm-7.15.0-span-000001 has 13610386 docs with I hope this helps. |
Thanks @tobiasstadler! Those numbers suggest we have nothing to be concerned about, storage wise. It would be helpful to know what the indexing overhead is (I expect it's also fine). |
Is there a way to measure the overhead?
@tobiasstadler using the Elasticsearch |
Sorry, but I do not have any "before" data. |
No worries, we'll run some experiments. Thanks all the same :) |
This will kinda be handled as a side-effect of #11528, and the general move to dynamic mapping. I say "kinda" because the dynamic mapping rules will map this field as a |
See also #12098 |
The db statement is really useful, but it would be even more amazingly useful if we could search on it.
So by adding this mapping:
We should be able to search for spans which have a certain db query structure.
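The mapping originally posted here did not survive in this copy of the issue. As a stand-in, the following Python sketch builds one plausible mappings fragment that indexes `span.db.statement` as analyzed `text`. The exact shape is an assumption, not necessarily what was posted, and the index name is only an example taken from later in the thread.

```python
import json

# Example index name only; adjust to the index or template being changed.
INDEX = "apm-7.15.0-span-000001"


def db_statement_text_mapping():
    """Build a mappings fragment with span.db.statement as a text field.

    Assumption: a plain "text" type with the default analyzer.
    """
    return {
        "properties": {
            "span": {
                "properties": {
                    "db": {
                        "properties": {
                            "statement": {"type": "text"}
                        }
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    print(f"PUT /{INDEX}/_mapping")
    print(json.dumps(db_statement_text_mapping(), indent=2))
```

Note that Elasticsearch generally does not allow changing the type of a field that is already mapped, so in practice a fragment like this would more likely go into an index template and take effect on the next index that is created.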
One gotcha I seem to be hitting in this test is that I can't search for `jvm` or `heap`. Seems like maybe I'm catching an "ignore_above", but I didn't think they were supposed to kick in on text fields.

Either way, I imagine it'd be quite useful to do things like highlight traces that involved specific query structures (indices, tables, etc.).