Ways to create unique IDs in an Elasticsearch index

I'm looking for ways to auto-generate unique IDs in Elasticsearch indices. Several approaches are possible, each with its own benefits and drawbacks.

Possible characteristics:

IDs may be numeric or not
They may be ordered somehow or not
Uniqueness may need to be ensured outside of Elasticsearch or not
The approach may involve only Elasticsearch queries or additional tools (e.g. Logstash, Python, etc.)

Note: Read this post before attempting to overwrite the native elasticsearch _id field. Specifically:

When Elasticsearch is allowed to assign the document identifier at indexing time, it can perform optimizations as it knows the generated identifier can not already exist in the index. This improves indexing performance. For identifiers generated externally and passed in with the document, Elasticsearch must treat this as a potential update and check whether the document identifier already exists in existing index segments, which requires additional work and therefore is slower.
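As an illustration of the "additional tools" route that still respects the note above, here is a minimal Python sketch that adds a client-generated `uuid` field to each document and leaves `_id` for Elasticsearch to assign. The function name `build_bulk_body` is hypothetical, and the sketch only builds the NDJSON request body; sending it to `my-id-field-test-index/_bulk` is left to whatever HTTP client you use.

```python
import json
import uuid


def build_bulk_body(docs):
    """Return an NDJSON bulk body in which every document gets a 'uuid' field.

    The action line stays {"index": {}} (no "_id"), so Elasticsearch
    still auto-generates its own _id for each document.
    """
    lines = []
    for doc in docs:
        enriched = dict(doc)  # copy, so the caller's dict is not mutated
        enriched["uuid"] = str(uuid.uuid4())  # version 4 (random) UUID
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(enriched))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = [
        {"fruit": "Strawberry", "amount": 1},
        {"fruit": "Pineapple", "amount": 4},
    ]
    print(build_bulk_body(docs), end="")
```

Uniqueness here is ensured outside of Elasticsearch, and only statistically, exactly as with the UUID-based approaches further down.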

Update query

The update approach always means that the id generation needs to be performed in a separate step, i.e. not on indexing of a document. This can be done e.g. using _reindex or _update_by_query.

Example using `script` query that copies Elasticsearch `_id` field

This approach copies the auto-generated `_id` metadata field into the `_source`: for every document that does not yet have an `id` field, it adds one with the same value as `_id`.

POST  my-id-field-test-index/_update_by_query?conflicts=proceed&wait_for_completion=false&refresh=true
{
  "script": {
    "source": "ctx._source.id = ctx._id",
    "lang": "painless"
  },
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "id"
        }
      }
    }
  }
}

Pros

the uniqueness of the generated id is guaranteed by Elasticsearch
internal _id and id in source are identical

Cons

this approach cannot be used while initially indexing a document: the internal _id field is only generated at the very end of indexing, so it is not available in ingest pipelines
since it can only be used in a _reindex or _update_by_query fashion, an extra step is required to generate an id for existing documents

Example using `script` query that generates id using `java.util.UUID.randomUUID()`

This approach updates the _source with a new field built based on UUID generation. It adds a new uuid field in the source with a generated version 4 (i.e. random type) uuid.

POST my-id-field-test-index/_update_by_query?conflicts=proceed
{
  "script": {
    "source": "ctx._source.uuid = java.util.UUID.randomUUID().toString()",
    "lang": "painless"
  },
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "uuid"
        }
      }
    }
  }
}

Pros

the chance of collisions is statistically very low, so the uniqueness of the generated uuid is likely
this approach can (slightly modified) also be used in ingestion pipelines, see below

Cons

uniqueness is not guaranteed, only statistically likely
when used in an _update_by_query fashion, this is an extra step that is required to generate a uuid for existing documents
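To put "statistically very low" into numbers, the following Python sketch uses the standard birthday approximation. It assumes the 122 non-fixed bits of a version 4 UUID are uniformly random; the function name `collision_probability` is made up for this example.

```python
import math

# A version 4 UUID has 128 bits; 6 of them are fixed (version and variant),
# which leaves 122 random bits.
RANDOM_BITS = 122


def collision_probability(n):
    """Approximate probability of at least one collision among n random UUIDs.

    Birthday approximation: p ~= 1 - exp(-n * (n - 1) / (2 * N)),
    where N = 2 ** RANDOM_BITS is the number of possible values.
    """
    space = 2 ** RANDOM_BITS
    # expm1 keeps precision for the very small exponents involved here.
    return -math.expm1(-n * (n - 1) / (2 * space))


if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12):
        print(f"{n:>20,d} documents: p ~ {collision_probability(n):.3e}")
```

The probability grows roughly with the square of the document count, so it stays negligible for typical index sizes but is never exactly zero.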

ID generation in ingest pipelines

Generating the ID during ingestion has the advantage that no additional step is required. However, it limits the available approaches to what can be done within Elasticsearch itself, i.e. with ingest processors and Painless scripts.

Example using a painless `script` that generates ids using `java.util.UUID.randomUUID()`

The following approach defines a pipeline that uses a similar script as above to generate a uuid field with a version 4 (i.e. random type) uuid and attach it to incoming documents.

Simulate pipeline

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "description": "Generate (version 4, i.e. random) UUID and store in 'uuid' field.",
          "lang": "painless",
          "source": """
            ctx['uuid'] = java.util.UUID.randomUUID().toString()
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "fruit": "banana",
        "amount": 10
      }
    }
  ]
}
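If you want to check the `uuid` value that the simulation returns, a small Python helper like the one below can confirm that a string is a canonical version 4 UUID. This is only a sketch: `is_version4_uuid` is a hypothetical helper, and the document in the demo is a locally built stand-in, not actual Elasticsearch output.

```python
import uuid


def is_version4_uuid(value):
    """Return True if value is a canonical, hyphenated version 4 UUID string."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts braces, URN prefixes and missing hyphens,
    # so compare against the canonical form to be strict.
    return parsed.version == 4 and str(parsed) == value.lower()


if __name__ == "__main__":
    # Stand-in for the _source of a simulated document (assumption, built locally).
    source = {"fruit": "banana", "amount": 10, "uuid": str(uuid.uuid4())}
    print(is_version4_uuid(source["uuid"]))
```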

Actually create pipeline and post data using it

PUT _ingest/pipeline/my-pipeline
{
  "description": "Generate UUID.",
  "processors": [
    {
      "script": {
        "description": "Generate (version 4, i.e. random) UUID and store in 'uuid' field.",
        "lang": "painless",
        "source": """
          ctx['uuid'] = java.util.UUID.randomUUID().toString()
        """
      }
    }
  ]
}

POST my-id-field-test-index/_bulk?pipeline=my-pipeline
{ "index": {}}
{ "fruit": "Whole new fruit" , "amount": 31}
{ "index": {}}
{ "fruit": "Never seen before fruit" , "amount": 67}

Pros

the chance of collisions is statistically very low, so the uniqueness of the generated uuid is likely
the uuid is generated on indexing, no extra step is required

Cons

uniqueness is not guaranteed, only statistically likely

Example data

POST my-id-field-test-index/_bulk
{ "index": {}}
{ "fruit": "Strawberry", "amount": 1}
{ "index": {}}
{ "fruit": "Pineapple", "amount": 4}

POST my-id-field-test-index/_bulk
{ "index": {}}
{ "fruit": "Dragonfruit", "amount": 42}
{ "index": {}}
{ "amount": 3, "fruit": "Egg fruit" }

POST my-id-field-test-index/_bulk
{ "index": {}}
{ "fruit": "Grapefruit", "amount": 11}
{ "index": {}}
{ "fruit": "Jackfruit" , "amount": 22}

POST my-id-field-test-index/_bulk
{ "index": {}}
{ "fruit": "Whatever fruit" , "amount": 17}
{ "index": {}}
{ "fruit": "New fruit" , "amount": 122}

GET my-id-field-test-index/_search
