Index span.db.statement as text #6586

matschaffer · 2021-11-11T05:07:50Z

The db statement is really useful, but it would be even more amazingly useful if we could search on it.

So by adding this mapping:

We should be able to search for spans which have a certain db query structure.

One gotcha I seem to be hitting in this test is that I can't search for jvm or heap . Seems like maybe I'm catching an "ignore_above" but I didn't think they were supposed to kick in on text fields.

Either way, I imagine it'd be quite useful to do things like highlight traces that involved specific query structures (indices, tables, etc).

The text was updated successfully, but these errors were encountered:

axw · 2021-11-11T07:21:39Z

@matschaffer thanks for opening.

Originally this field was intentionally not indexed, to avoid incurring high storage cost. I would like to revisit this decision, ideally with some data to show the cost is worthwhile.

Would match_only_text be reasonable for your needs?

matschaffer · 2021-11-15T02:21:46Z

For the purpose I had above, definitely.

Though I could imagine cases where ordering of word in the statement is important. For example if it were a sql query, you might be interested in if the table appeared before or after the join.

Just conjecture on that though.

matschaffer · 2021-11-15T02:22:55Z

As for the cost, have we gotten hard numbers on with/without? Seems like maybe we could add the mapping to a large index and check before/after. I'm sure ESS internal clusters have lots of large indices we could borrow for such a purpose.

axw · 2021-11-15T02:27:34Z

As for the cost, have we gotten hard numbers on with/without?

No, that's really the only thing holding us back at the moment.

matschaffer · 2021-11-15T02:30:01Z

Cool. I was able to simply add the mapping to get the functionality. I didn't check on-disk size before/after, but I could. Is there a particular disk number you think would be most useful? Is the store size on https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html sufficient you think?

axw · 2021-11-15T02:36:27Z

I think indexing and store are most interesting. It would also be helpful to know the proportion of docs that have a span.db.statement field.

matschaffer · 2021-11-17T06:32:54Z

I did a little poking at this today. I think we can probably get the ratio with:

GET apm-7.15.1-span-000001/_search
{
  "size": 0,
  "aggs": {
    "wrapper": {
      "terms": {
        "field": "_index",
        "size": 10
      },
      "aggs": {
        "with_db_statement": {
          "filter": {
            "exists": {
              "field": "span.db.statement"
            }
          }
        },
        "total": {
          "filter": {
            "match": {
              "_index": "*"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "numerator": "with_db_statement>_count",
              "denominator": "total>_count"
            },
            "script": "params.numerator / params.denominator"
          }
        }
      }
    }
  }
}

And the store sizing info with:

GET apm-7.15.1-span-000001/_stats/indexing,store

matschaffer · 2021-11-17T07:11:39Z

Here's one from us-east-1 staging metrics cluster

Before:

GET apm-7.15.0-span-000048/_stats/indexing,store

{
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "store" : {
        "size_in_bytes" : 3665198005,
        "total_data_set_size_in_bytes" : 3665198005,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    },
    "total" : {
      "store" : {
        "size_in_bytes" : 7330877145,
        "total_data_set_size_in_bytes" : 7330877145,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    }
  },
  "indices" : {
    "apm-7.15.0-span-000048" : {
      "uuid" : "KMHvD2_mQ9OYg2Eli4cq9w",
      "primaries" : {
        "store" : {
          "size_in_bytes" : 3665198005,
          "total_data_set_size_in_bytes" : 3665198005,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      },
      "total" : {
        "store" : {
          "size_in_bytes" : 7330877145,
          "total_data_set_size_in_bytes" : 7330877145,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      }
    }
  }
}

Mapping:

PUT apm-7.15.0-span-000048/_mapping

{
  "properties": {
    "span": {
      "properties": {
        "db": {
          "properties": {
            "statement": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}

After:

GET apm-7.15.0-span-000048/_stats/indexing,store

{
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "store" : {
        "size_in_bytes" : 3665198005,
        "total_data_set_size_in_bytes" : 3665198005,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    },
    "total" : {
      "store" : {
        "size_in_bytes" : 7330877145,
        "total_data_set_size_in_bytes" : 7330877145,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    }
  },
  "indices" : {
    "apm-7.15.0-span-000048" : {
      "uuid" : "KMHvD2_mQ9OYg2Eli4cq9w",
      "primaries" : {
        "store" : {
          "size_in_bytes" : 3665198005,
          "total_data_set_size_in_bytes" : 3665198005,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      },
      "total" : {
        "store" : {
          "size_in_bytes" : 7330877145,
          "total_data_set_size_in_bytes" : 7330877145,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      }
    }
  }
}

Ratio:

GET apm-7.15.0-span-000048/_search

Request:

{
  "size": 0,
  "aggs": {
    "wrapper": {
      "terms": {
        "field": "_index",
        "size": 10
      },
      "aggs": {
        "with_db_statement": {
          "filter": {
            "exists": {
              "field": "span.db.statement"
            }
          }
        },
        "total": {
          "filter": {
            "match": {
              "_index": "*"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "numerator": "with_db_statement>_count",
              "denominator": "total>_count"
            },
            "script": "params.numerator / params.denominator"
          }
        }
      }
    }
  }
}

Response:

{
  "size": 0,
  "aggs": {
    "wrapper": {
      "terms": {
        "field": "_index",
        "size": 10
      },
      "aggs": {
        "with_db_statement": {
          "filter": {
            "exists": {
              "field": "span.db.statement"
            }
          }
        },
        "total": {
          "filter": {
            "match": {
              "_index": "*"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "numerator": "with_db_statement>_count",
              "denominator": "total>_count"
            },
            "script": "params.numerator / params.denominator"
          }
        }
      }
    }
  }
}

Sadly it looks like this index had no db statement spans 😆

I'll see if I can dig around for a sample apm span index with more db statements around.

matschaffer · 2021-11-17T07:15:01Z

Wonder if it might make sense to use @dgieselaar 's apm synthtrace to just generate a bunch of random db statements as part of the span data. Not as nice as "real" but looks like our biggest "real" data source doesn't record db statements at all today.

axw · 2021-11-17T07:42:52Z

Sadly it looks like this index had no db statement spans laughing

@matschaffer oh, bummer. Thanks for digging into it anyway.

Wonder if it might make sense to use @dgieselaar 's apm synthtrace to just generate a bunch of random db statements as part of the span data. Not as nice as "real" but looks like our biggest "real" data source doesn't record db statements at all today.

++

I'd like to use synthtrace (some kind of generator) to generate APM events, pass them through apm-server, and then use the resulting docs in Rally (#6115). Then we can measure the impact of mapping changes like, while keeping the results in sync with the changes in how apm-server structures docs.

henrikno · 2021-11-18T01:51:47Z

Btw. from 7.15 you can get the actual disk usage per field, which is really useful to test mapping changes. https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-disk-usage.html

henrikno · 2021-11-18T02:26:51Z

Also wanted to note, match_only_text does support phrase queries, it'll just be slower since it needs to look at _source to verify ordering.

tobiasstadler · 2021-11-18T09:20:11Z

I wanted to note that span.db.statement may contain sensitive information.

axw · 2021-11-18T09:43:52Z

@tobiasstadler we already store span.db.statement in the Elasticsearch document, it's just that we don't index it for fast searching. It's already possible to search the field using runtime fields by defining a runtime field in the search request, so I think the only difference would be the speed of the search. Is there something else about indexing it that increases security risk?

tobiasstadler · 2021-11-18T09:50:05Z

No, I don't thin so. Sorry I didn't know/forgot that the value is stored already.

axw · 2021-11-18T09:56:37Z

No worries! Thanks for raising your concern anyway.

tobiasstadler · 2021-11-18T14:13:12Z

I did the following in our test cluster:

PUT /apm-7.15.0-span-000001/_mapping
{
  "properties": {
    "span": {
      "properties": {
        "db": {
          "properties": {
            "statement": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}

and

 POST /apm-7.15.0-span-000001/_update_by_query

Now

POST /apm-7.15.0-span-000001/_disk_usage?run_expensive_tasks=true

results in

...

"span.db.statement" : {
        "total" : "49.9mb",
        "total_in_bytes" : 52365481,
        "inverted_index" : {
          "total" : "35mb",
          "total_in_bytes" : 36720077
        },
        "stored_fields" : "0b",
        "stored_fields_in_bytes" : 0,
        "doc_values" : "0b",
        "doc_values_in_bytes" : 0,
        "points" : "0b",
        "points_in_bytes" : 0,
        "norms" : "14.9mb",
        "norms_in_bytes" : 15645404,
        "term_vectors" : "0b",
        "term_vectors_in_bytes" : 0
      },
...

apm-7.15.0-span-000001 has 13610386 docs with span.db.statement.

I hope this helps.

axw · 2021-11-18T23:05:19Z

Thanks @tobiasstadler! Those numbers suggest we have nothing to be concerned about, storage wise. It would be helpful to know what the indexing overhead is (I expect it's also fine).

tobiasstadler · 2021-11-19T06:42:02Z

Is there a way to measure to overhead?

axw · 2021-11-19T07:25:30Z

@tobiasstadler using the Elasticsearch _stats API like in #6586 (comment) (comparing before & after). However, I just saw that Mat's results in #6586 (comment) and the indexing stats are all zero, so I'm not exactly sure what steps are necessary here.

tobiasstadler · 2021-11-19T07:42:35Z

Sorry, but I do not have any "before" data.

axw · 2021-11-19T07:44:47Z

No worries, we'll run some experiments. Thanks all the same :)

axw · 2023-11-23T07:42:19Z

This will kinda be handled as a side-effect of #11528, and the general move to dynamic mapping. I say "kinda" because the dynamic mapping rules will map this field as a keyword field, and not text. Still, it will be searchable (albeit more slowly), and folks will still be able to override this with a custom component template.

axw · 2023-11-23T07:42:30Z

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Index span.db.statement as text #6586

Index span.db.statement as text #6586

matschaffer commented Nov 11, 2021

axw commented Nov 11, 2021

matschaffer commented Nov 15, 2021

matschaffer commented Nov 15, 2021

axw commented Nov 15, 2021

matschaffer commented Nov 15, 2021

axw commented Nov 15, 2021

matschaffer commented Nov 17, 2021

matschaffer commented Nov 17, 2021

matschaffer commented Nov 17, 2021

axw commented Nov 17, 2021

henrikno commented Nov 18, 2021

henrikno commented Nov 18, 2021

tobiasstadler commented Nov 18, 2021

axw commented Nov 18, 2021

tobiasstadler commented Nov 18, 2021

axw commented Nov 18, 2021

tobiasstadler commented Nov 18, 2021

axw commented Nov 18, 2021

tobiasstadler commented Nov 19, 2021

axw commented Nov 19, 2021

tobiasstadler commented Nov 19, 2021

axw commented Nov 19, 2021

axw commented Nov 23, 2023

axw commented Nov 23, 2023

Index span.db.statement as text #6586

Index span.db.statement as text #6586

Comments

matschaffer commented Nov 11, 2021

axw commented Nov 11, 2021

matschaffer commented Nov 15, 2021

matschaffer commented Nov 15, 2021

axw commented Nov 15, 2021

matschaffer commented Nov 15, 2021

axw commented Nov 15, 2021

matschaffer commented Nov 17, 2021

matschaffer commented Nov 17, 2021

matschaffer commented Nov 17, 2021

axw commented Nov 17, 2021

henrikno commented Nov 18, 2021

henrikno commented Nov 18, 2021

tobiasstadler commented Nov 18, 2021

axw commented Nov 18, 2021

tobiasstadler commented Nov 18, 2021

axw commented Nov 18, 2021

tobiasstadler commented Nov 18, 2021

axw commented Nov 18, 2021

tobiasstadler commented Nov 19, 2021

axw commented Nov 19, 2021

tobiasstadler commented Nov 19, 2021

axw commented Nov 19, 2021

axw commented Nov 23, 2023

axw commented Nov 23, 2023