Merge branch 'main' into cjk-width-token-filter
kolchfa-aws authored Sep 13, 2024
2 parents e373b7b + 27c02f9 commit 148f3f1
Showing 31 changed files with 351 additions and 8 deletions.
81 changes: 81 additions & 0 deletions _api-reference/document-apis/bulk-streaming.md
@@ -0,0 +1,81 @@
---
layout: default
title: Streaming bulk
parent: Document APIs
nav_order: 25
redirect_from:
- /opensearch/rest-api/document-apis/bulk/streaming/
---

# Streaming bulk
**Introduced 2.17.0**
{: .label .label-purple }

This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/9065).
{: .warning}

The streaming bulk operation lets you add, update, or delete multiple documents by streaming the request and getting the results as a streaming response. In comparison to the traditional [Bulk API]({{site.url}}{{site.baseurl}}/api-reference/document-apis/bulk/), streaming ingestion eliminates the need to estimate the batch size (which is affected by the cluster operational state at any given time) and naturally applies backpressure between many clients and the cluster. The streaming works over HTTP/2 or HTTP/1.1 (using chunked transfer encoding), depending on the capabilities of the clients and the cluster.

The default HTTP transport method does not support streaming. You must install the [`transport-reactor-netty4`]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/network-settings/#selecting-the-transport) HTTP transport plugin and use it as the default HTTP transport layer. Both the `transport-reactor-netty4` plugin and the Streaming Bulk API are experimental.
{: .note}

## Path and HTTP methods

```json
POST _bulk/stream
POST <index>/_bulk/stream
```

If you specify the index in the path, then you don't need to include it in the [request body chunks]({{site.url}}{{site.baseurl}}/api-reference/document-apis/bulk/#request-body).

OpenSearch also accepts PUT requests to the `_bulk/stream` path, but we highly recommend using POST. The accepted usage of PUT---adding or replacing a single resource on a given path---doesn't make sense for streaming bulk requests.
{: .note }


## Query parameters

The following table lists the available query parameters. All query parameters are optional.

Parameter | Data type | Description
:--- | :--- | :---
`pipeline` | String | The pipeline ID for preprocessing documents.
`refresh` | Enum | Whether to refresh the affected shards after performing the indexing operations. Default is `false`. `true` causes the changes to show up in search results immediately but degrades cluster performance. `wait_for` waits for a refresh. Requests take longer to return, but cluster performance isn't degraded.
`require_alias` | Boolean | Set to `true` to require that all actions target an index alias rather than an index. Default is `false`.
`routing` | String | Routes the request to the specified shard.
`timeout` | Time | How long to wait for the request to return. Default is `1m`.
`type` | String | (Deprecated) The default document type for documents that don't specify a type. Default is `_doc`. We highly recommend ignoring this parameter and using the `_doc` type for all indexes.
`wait_for_active_shards` | String | Specifies the number of active shards that must be available before OpenSearch processes the bulk request. Default is `1` (only the primary shard). Set to `all` or a positive integer. Values greater than 1 require replicas. For example, if you specify a value of 3, the index must have 2 replicas distributed across 2 additional nodes in order for the request to succeed.
`batch_interval` | Time | Specifies for how long bulk operations should be accumulated into a batch before sending the batch to data nodes.
`batch_size` | Integer | Specifies how many bulk operations should be accumulated into a batch before sending the batch to data nodes. Default is `1`.
{% comment %}_source | List | asdf
`_source_excludes` | List | asdf
`_source_includes` | List | asdf{% endcomment %}
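
As a rough illustration of how these parameters attach to the request path, the following Python sketch assembles a streaming bulk URL. The helper name `streaming_bulk_url`, the local host, and the parameter values are assumptions chosen for illustration, not recommended settings:

```python
from urllib.parse import urlencode


def streaming_bulk_url(base, index=None, **params):
    """Build a _bulk/stream URL (hypothetical helper, not part of any client)."""
    path = f"/{index}/_bulk/stream" if index else "/_bulk/stream"
    # All query parameters are optional, so drop any that are unset.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base.rstrip('/')}{path}" + (f"?{query}" if query else "")


# Assumed local cluster and illustrative batching values.
url = streaming_bulk_url(
    "http://localhost:9200", index="movies", batch_size=100, batch_interval="5s"
)
print(url)
```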

## Request body

The Streaming Bulk API request body is fully compatible with the [Bulk API request body]({{site.url}}{{site.baseurl}}/api-reference/document-apis/bulk/#request-body), where each bulk operation (create/index/update/delete) is sent as a separate chunk.

## Example request

```bash
curl -X POST "http://localhost:9200/_bulk/stream" -H "Transfer-Encoding: chunked" -H "Content-Type: application/json" -d'
{ "delete": { "_index": "movies", "_id": "tt2229499" } }
{ "index": { "_index": "movies", "_id": "tt1979320" } }
{ "title": "Rush", "year": 2013 }
{ "create": { "_index": "movies", "_id": "tt1392214" } }
{ "title": "Prisoners", "year": 2013 }
{ "update": { "_index": "movies", "_id": "tt0816711" } }
{ "doc" : { "title": "World War Z" } }
'
```
{% include copy.html %}
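
To make the "one operation per chunk" idea concrete, the following minimal Python sketch turns a list of bulk operations into request body chunks. The helper name `bulk_chunks` and the tuple layout are assumptions made for this illustration; only the line format follows the Bulk API request body:

```python
import json


def bulk_chunks(operations):
    """Yield one request body chunk per bulk operation (hypothetical helper).

    `operations` is an iterable of (action, metadata, document) tuples.
    `document` is None for delete operations, which have no source line.
    """
    for action, metadata, document in operations:
        lines = [json.dumps({action: metadata})]
        if document is not None:
            lines.append(json.dumps(document))
        # Every line, including the last one, ends with a newline.
        yield ("\n".join(lines) + "\n").encode("utf-8")


operations = [
    ("delete", {"_index": "movies", "_id": "tt2229499"}, None),
    ("index", {"_index": "movies", "_id": "tt1979320"}, {"title": "Rush", "year": 2013}),
    ("update", {"_index": "movies", "_id": "tt0816711"}, {"doc": {"title": "World War Z"}}),
]

chunks = list(bulk_chunks(operations))
for chunk in chunks:
    print(chunk.decode("utf-8"), end="")
```

An HTTP client that accepts an iterator as the request body could consume this generator directly. For example, the Python `requests` library sends a generator passed as `data` using chunked transfer encoding.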

## Example response

Depending on the batch settings, each streamed response chunk may report the results of one or many (batch) bulk operations. For example, for the preceding request with no batching (default), the streaming response may appear as follows:

```json
{"took": 11, "errors": false, "items": [ { "index": {"_index": "movies", "_id": "tt1979320", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "_seq_no": 1, "_primary_term": 1, "status": 201 } } ] }
{"took": 2, "errors": true, "items": [ { "create": { "_index": "movies", "_id": "tt1392214", "status": 409, "error": { "type": "version_conflict_engine_exception", "reason": "[tt1392214]: version conflict, document already exists (current version [1])", "index": "movies", "shard": "0", "index_uuid": "yhizhusbSWmP0G7OJnmcLg" } } } ] }
{"took": 4, "errors": true, "items": [ { "update": { "_index": "movies", "_id": "tt0816711", "status": 404, "error": { "type": "document_missing_exception", "reason": "[_doc][tt0816711]: document missing", "index": "movies", "shard": "0", "index_uuid": "yhizhusbSWmP0G7OJnmcLg" } } } ] }
```
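
Because each response chunk carries its own `errors` flag, a client can check for failures as chunks arrive instead of waiting for the whole stream. The following Python sketch shows one way to do that. It assumes each chunk arrives as a single JSON object on its own line, as in the preceding example; the helper name `failed_items` is hypothetical:

```python
import json


def failed_items(response_lines):
    """Collect (action, _id, status, error type) for every failed item."""
    failures = []
    for line in response_lines:
        line = line.strip()
        if not line:
            continue
        chunk = json.loads(line)
        # Skip chunks in which every operation succeeded.
        if not chunk.get("errors"):
            continue
        for item in chunk["items"]:
            # Each item has a single key that names the action.
            for action, result in item.items():
                if "error" in result:
                    failures.append(
                        (action, result["_id"], result["status"], result["error"]["type"])
                    )
    return failures


# Abbreviated chunks modeled on the preceding example response.
sample = [
    '{"took": 11, "errors": false, "items": [{"index": {"_id": "tt1979320", "status": 201}}]}',
    '{"took": 2, "errors": true, "items": [{"create": {"_id": "tt1392214", "status": 409, '
    '"error": {"type": "version_conflict_engine_exception"}}}]}',
]
print(failed_items(sample))
```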
12 changes: 6 additions & 6 deletions _api-reference/document-apis/bulk.md
@@ -53,16 +53,16 @@ All bulk URL parameters are optional.
Parameter | Type | Description
:--- | :--- | :---
pipeline | String | The pipeline ID for preprocessing documents.
refresh | Enum | Whether to refresh the affected shards after performing the indexing operations. Default is `false`. `true` makes the changes show up in search results immediately, but hurts cluster performance. `wait_for` waits for a refresh. Requests take longer to return, but cluster performance doesn't suffer.
refresh | Enum | Whether to refresh the affected shards after performing the indexing operations. Default is `false`. `true` causes the changes to show up in search results immediately but degrades cluster performance. `wait_for` waits for a refresh. Requests take longer to return, but cluster performance isn't degraded.
require_alias | Boolean | Set to `true` to require that all actions target an index alias rather than an index. Default is `false`.
routing | String | Routes the request to the specified shard.
timeout | Time | How long to wait for the request to return. Default `1m`.
type | String | (Deprecated) The default document type for documents that don't specify a type. Default is `_doc`. We highly recommend ignoring this parameter and using a type of `_doc` for all indexes.
wait_for_active_shards | String | Specifies the number of active shards that must be available before OpenSearch processes the bulk request. Default is 1 (only the primary shard). Set to `all` or a positive integer. Values greater than 1 require replicas. For example, if you specify a value of 3, the index must have two replicas distributed across two additional nodes for the request to succeed.
timeout | Time | How long to wait for the request to return. Default is `1m`.
type | String | (Deprecated) The default document type for documents that don't specify a type. Default is `_doc`. We highly recommend ignoring this parameter and using the `_doc` type for all indexes.
wait_for_active_shards | String | Specifies the number of active shards that must be available before OpenSearch processes the bulk request. Default is `1` (only the primary shard). Set to `all` or a positive integer. Values greater than 1 require replicas. For example, if you specify a value of 3, the index must have 2 replicas distributed across 2 additional nodes in order for the request to succeed.
batch_size | Integer | **(Deprecated)** Specifies the number of documents to be batched and sent to an ingest pipeline to be processed together. Default is `2147483647` (documents are ingested by an ingest pipeline all at once). If the bulk request doesn't explicitly specify an ingest pipeline or the index doesn't have a default ingest pipeline, then this parameter is ignored. Only documents with `create`, `index`, or `update` actions can be grouped into batches.
{% comment %}_source | List | asdf
_source_excludes | list | asdf
_source_includes | list | asdf{% endcomment %}
_source_excludes | List | asdf
_source_includes | List | asdf{% endcomment %}


## Request body
2 changes: 2 additions & 0 deletions _field-types/supported-field-types/alias.md
@@ -10,6 +10,8 @@ redirect_from:
---

# Alias field type
**Introduced 1.0**
{: .label .label-purple }

An alias field type creates another name for an existing field. You can use aliases in the [search](#using-aliases-in-search-api-operations) and [field capabilities](#using-aliases-in-field-capabilities-api-operations) API operations, with some [exceptions](#exceptions). To set up an [alias](#alias-field), you need to specify the [original field](#original-field) name in the `path` parameter.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/binary.md
@@ -10,6 +10,8 @@ redirect_from:
---

# Binary field type
**Introduced 1.0**
{: .label .label-purple }

A binary field type contains a binary value in [Base64](https://en.wikipedia.org/wiki/Base64) encoding that is not searchable.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/boolean.md
@@ -10,6 +10,8 @@ redirect_from:
---

# Boolean field type
**Introduced 1.0**
{: .label .label-purple }

A Boolean field type takes `true` or `false` values, or `"true"` or `"false"` strings. You can also pass an empty string (`""`) in place of a `false` value.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/completion.md
@@ -11,6 +11,8 @@ redirect_from:
---

# Completion field type
**Introduced 1.0**
{: .label .label-purple }

A completion field type provides autocomplete functionality through a completion suggester. The completion suggester is a prefix suggester, so it matches the beginning of text only. A completion suggester creates an in-memory data structure, which provides faster lookups but leads to increased memory usage. You need to upload a list of all possible completions into the index before using this feature.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/constant-keyword.md
@@ -8,6 +8,8 @@ grand_parent: Supported field types
---

# Constant keyword field type
**Introduced 2.14**
{: .label .label-purple }

A constant keyword field uses the same value for all documents in the index.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/date-nanos.md
@@ -8,6 +8,8 @@ grand_parent: Supported field types
---

# Date nanoseconds field type
**Introduced 1.0**
{: .label .label-purple }

The `date_nanos` field type is similar to the [`date`]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/date/) field type in that it holds a date. However, `date` stores the date in millisecond resolution, while `date_nanos` stores the date in nanosecond resolution. Dates are stored as `long` values that correspond to nanoseconds since the epoch. Therefore, the range of supported dates is approximately 1970--2262.
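
The upper end of that range follows from the size of a signed 64-bit `long`. The following Python sketch works through the arithmetic; it is an illustration of the limit only and does not call OpenSearch:

```python
from datetime import datetime, timedelta, timezone

MAX_LONG = 2**63 - 1  # largest signed 64-bit value, interpreted as nanoseconds
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Python datetimes have microsecond resolution, so this sketch
# truncates the value to whole seconds before adding it to the epoch.
latest = epoch + timedelta(seconds=MAX_LONG // 1_000_000_000)
print(latest.isoformat())
```

The printed date falls in the year 2262, which matches the upper bound stated above.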

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/date.md
@@ -11,6 +11,8 @@ redirect_from:
---

# Date field type
**Introduced 1.0**
{: .label .label-purple }

A date in OpenSearch can be represented as one of the following:

78 changes: 76 additions & 2 deletions _field-types/supported-field-types/derived.md
@@ -28,11 +28,11 @@ Despite the potential performance impact of query-time computations, the flexibi

Currently, derived fields have the following limitations:

- **Aggregation, scoring, and sorting**: Not yet supported.
- **Scoring and sorting**: Not yet supported.
- **Aggregations**: Starting with OpenSearch 2.17, derived fields support most aggregation types. The following aggregations are not supported: geographic (geodistance, geohash grid, geohex grid, geotile grid, geobounds, geocentroid), significant terms, significant text, and scripted metric.
- **Dashboard support**: These fields are not displayed in the list of available fields in OpenSearch Dashboards. However, you can still use them for filtering if you know the derived field name.
- **Chained derived fields**: One derived field cannot be used to define another derived field.
- **Join field type**: Derived fields are not supported for the [join field type]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/join/).
- **Concurrent segment search**: Derived fields are not supported for [concurrent segment search]({{site.url}}{{site.baseurl}}/search-plugins/concurrent-segment-search/).

We are planning to address these limitations in future versions.

@@ -541,6 +541,80 @@ The response specifies highlighting in the `url` field:
```
</details>

## Aggregations

Starting with OpenSearch 2.17, derived fields support most aggregation types.

Geographic, significant terms, significant text, and scripted metric aggregations are not supported.
{: .note}

For example, the following request creates a simple `terms` aggregation on the `method` derived field:

```json
POST /logs/_search
{
"size": 0,
"aggs": {
"methods": {
"terms": {
"field": "method"
}
}
}
}
```
{% include copy-curl.html %}

The response contains the following buckets:

<details markdown="block">
<summary>
Response
</summary>
{: .text-delta}

```json
{
"took" : 12,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"methods" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "GET",
"doc_count" : 2
},
{
"key" : "POST",
"doc_count" : 2
},
{
"key" : "DELETE",
"doc_count" : 1
}
]
}
}
}
```
</details>
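
On the client side, the buckets can be reduced to a simple mapping of key to document count. The following Python sketch uses an abbreviated copy of the preceding response; the helper name `bucket_counts` is an assumption made for illustration:

```python
# Abbreviated copy of the "aggregations" section of the preceding response.
response = {
    "aggregations": {
        "methods": {
            "buckets": [
                {"key": "GET", "doc_count": 2},
                {"key": "POST", "doc_count": 2},
                {"key": "DELETE", "doc_count": 1},
            ]
        }
    }
}


def bucket_counts(response, agg_name):
    """Map each bucket key of a terms aggregation to its document count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


counts = bucket_counts(response, "methods")
print(counts)
```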

## Performance

Derived fields are not indexed but are computed dynamically by retrieving values from the `_source` field or doc values. Thus, they run more slowly. To improve performance, try the following:
2 changes: 2 additions & 0 deletions _field-types/supported-field-types/flat-object.md
@@ -10,6 +10,8 @@ redirect_from:
---

# Flat object field type
**Introduced 2.7**
{: .label .label-purple }

In OpenSearch, you don't have to specify a mapping before indexing documents. If you don't specify a mapping, OpenSearch uses [dynamic mapping]({{site.url}}{{site.baseurl}}/field-types/index#dynamic-mapping) to map every field and its subfields in the document automatically. When you ingest documents such as logs, you may not know every field's subfield name and type in advance. In this case, dynamically mapping all new subfields can quickly lead to a "mapping explosion," where the growing number of fields may degrade the performance of your cluster.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/geo-point.md
@@ -11,6 +11,8 @@ redirect_from:
---

# Geopoint field type
**Introduced 1.0**
{: .label .label-purple }

A geopoint field type contains a geographic point specified by latitude and longitude.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/geo-shape.md
@@ -11,6 +11,8 @@ redirect_from:
---

# Geoshape field type
**Introduced 1.0**
{: .label .label-purple }

A geoshape field type contains a geographic shape, such as a polygon or a collection of geographic points. To index a geoshape, OpenSearch tessellates the shape into a triangular mesh and stores each triangle in a BKD tree. This provides a 10<sup>-7</sup> decimal degree of precision, which represents near-perfect spatial resolution. Performance of this process is mostly impacted by the number of vertices in a polygon you are indexing.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/ip.md
@@ -10,6 +10,8 @@ redirect_from:
---

# IP address field type
**Introduced 1.0**
{: .label .label-purple }

An ip field type contains an IP address in IPv4 or IPv6 format.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/join.md
@@ -11,6 +11,8 @@ redirect_from:
---

# Join field type
**Introduced 1.0**
{: .label .label-purple }

A join field type establishes a parent/child relationship between documents in the same index.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/keyword.md
@@ -11,6 +11,8 @@ redirect_from:
---

# Keyword field type
**Introduced 1.0**
{: .label .label-purple }

A keyword field type contains a string that is not analyzed. It allows only exact, case-sensitive matches.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/knn-vector.md
@@ -8,6 +8,8 @@ has_math: true
---

# k-NN vector field type
**Introduced 1.0**
{: .label .label-purple }

The [k-NN plugin]({{site.url}}{{site.baseurl}}/search-plugins/knn/index/) introduces a custom data type, the `knn_vector`, that allows users to ingest their k-NN vectors into an OpenSearch index and perform different kinds of k-NN search. The `knn_vector` field is highly configurable and can serve many different k-NN workloads. In general, a `knn_vector` field can be built either by providing a method definition or specifying a model ID.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/match-only-text.md
@@ -8,6 +8,8 @@ grand_parent: Supported field types
---

# Match-only text field type
**Introduced 2.12**
{: .label .label-purple }

A `match_only_text` field is a variant of a `text` field designed for full-text search when scoring and positional information of terms within a document are not critical.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/nested.md
@@ -11,6 +11,8 @@ redirect_from:
---

# Nested field type
**Introduced 1.0**
{: .label .label-purple }

A nested field type is a special type of [object field type]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/object/).

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/object.md
@@ -11,6 +11,8 @@ redirect_from:
---

# Object field type
**Introduced 1.0**
{: .label .label-purple }

An object field type contains a JSON object (a set of name/value pairs). A value in a JSON object may be another JSON object. It is not necessary to specify `object` as the type when mapping object fields because `object` is the default type.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/percolator.md
@@ -10,6 +10,8 @@ redirect_from:
---

# Percolator field type
**Introduced 1.0**
{: .label .label-purple }

A percolator field type specifies that the field is treated as a query. Any JSON object field can be marked as a percolator field. Normally, documents are indexed and searches are run against them. When you use a percolator field, you store a search, and later the percolate query matches documents to that search.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/range.md
@@ -10,6 +10,8 @@ redirect_from:
---

# Range field types
**Introduced 1.0**
{: .label .label-purple }

The following table lists all range field types that OpenSearch supports.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/rank.md
@@ -10,6 +10,8 @@ redirect_from:
---

# Rank field types
**Introduced 1.0**
{: .label .label-purple }

The following table lists all rank field types that OpenSearch supports.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/search-as-you-type.md
@@ -11,6 +11,8 @@ redirect_from:
---

# Search-as-you-type field type
**Introduced 1.0**
{: .label .label-purple }

A search-as-you-type field type provides search-as-you-type functionality using both prefix and infix completion.

2 changes: 2 additions & 0 deletions _field-types/supported-field-types/text.md
@@ -11,6 +11,8 @@ redirect_from:
---

# Text field type
**Introduced 1.0**
{: .label .label-purple }

A `text` field type contains a string that is analyzed. It is used for full-text search because it allows partial matches. Searches for multiple terms can match some but not all of them. Depending on the analyzer, results can be case insensitive, stemmed, have stopwords removed, have synonyms applied, and so on.
