Add documentation for ingest-attachment plugin (#7891)
* add ingest-attachment plugin doc

Signed-off-by: Ricky Lippmann <[email protected]>

* extend ingest-attachment with information how to limit content

Signed-off-by: Ricky Lippmann <[email protected]>

* Added target_bulk_bytes to the docs for logstash-output plugin (#7869)

* Added target_bulk_bytes

Signed-off-by: Sander van de Geijn <[email protected]>

* Update _tools/logstash/ship-to-opensearch.md

Nice

Co-authored-by: Naarcha-AWS <[email protected]>
Signed-off-by: Sander van de Geijn <[email protected]>

* Update _tools/logstash/ship-to-opensearch.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update ship-to-opensearch.md

* Remove "we"

* Update ship-to-opensearch.md

* Update ship-to-opensearch.md

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <[email protected]>

---------

Signed-off-by: Sander van de Geijn <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
Co-authored-by: Naarcha-AWS <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Ricky Lippmann <[email protected]>

* Add doc for binary format support in k-NN (#7840)

* Add doc for binary format support in k-NN

Signed-off-by: Junqiu Lei <[email protected]>

* Resolve tech feedback

Signed-off-by: Junqiu Lei <[email protected]>

* Doc review

Signed-off-by: Fanit Kolchina <[email protected]>

* Add newline

Signed-off-by: Fanit Kolchina <[email protected]>

* Formatting

Signed-off-by: Fanit Kolchina <[email protected]>

* Link fix

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

* Add query results to examples

Signed-off-by: Junqiu Lei <[email protected]>

* Rephrased sentences and changed vector field name

Signed-off-by: Fanit Kolchina <[email protected]>

* Editorial review

Signed-off-by: Fanit Kolchina <[email protected]>

* Remove details from one of the requests

Signed-off-by: Fanit Kolchina <[email protected]>

---------

Signed-off-by: Junqiu Lei <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Ricky Lippmann <[email protected]>

* Edit for redundant information and sections across Data Prepper (#7127)

* Edit for redundant information and sections across Data Prepper

Signed-off-by: Melissa Vagi <[email protected]>

* Edit for redundant information and sections across Data Prepper

Signed-off-by: Melissa Vagi <[email protected]>

* Rewrite expression syntax and reorganize doc structure for readability

Signed-off-by: Melissa Vagi <[email protected]>

* Rewrite expression syntax and reorganize doc structure for readability

Signed-off-by: Melissa Vagi <[email protected]>

* Rewrite expression syntax and reorganize doc structure for readability

Signed-off-by: Melissa Vagi <[email protected]>

* Rewrite expression syntax and reorganize doc structure for readability

Signed-off-by: Melissa Vagi <[email protected]>

* Rewrite expression syntax and reorganize doc structure for readability

Signed-off-by: Melissa Vagi <[email protected]>

* Update _data-prepper/index.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update configuring-data-prepper.md

Signed-off-by: Melissa Vagi <[email protected]>

Signed-off-by: Melissa Vagi <[email protected]>

* Update _data-prepper/pipelines/expression-syntax.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _data-prepper/pipelines/expression-syntax.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _data-prepper/pipelines/pipelines.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update expression-syntax.md

Signed-off-by: Melissa Vagi <[email protected]>

* Create Functions subpages

Signed-off-by: Melissa Vagi <[email protected]>

* Create functions subpages

Signed-off-by: Melissa Vagi <[email protected]>

* Copy edit

Signed-off-by: Melissa Vagi <[email protected]>

* add remaining subpages

Signed-off-by: Melissa Vagi <[email protected]>

* Update _data-prepper/index.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Heather Halter <[email protected]>

* Apply suggestions from code review

Accepted editorial suggestions.

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Heather Halter <[email protected]>

* Apply suggestions from code review

Accepted more editorial suggestions that were hidden.

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Heather Halter <[email protected]>

* Apply suggestions from code review

Co-authored-by: Heather Halter <[email protected]>
Signed-off-by: David Venable <[email protected]>

* removed-line

Signed-off-by: Heather Halter <[email protected]>

* Fixed broken link to pipelines

Signed-off-by: Heather Halter <[email protected]>

* Fixed broken links on Update add-entries.md

Signed-off-by: Heather Halter <[email protected]>

* Fixed broken link in Update dynamo-db.md

Signed-off-by: Heather Halter <[email protected]>

* Fixed link syntax in Update index.md

Signed-off-by: Heather Halter <[email protected]>

---------

Signed-off-by: Melissa Vagi <[email protected]>
Signed-off-by: Heather Halter <[email protected]>
Signed-off-by: David Venable <[email protected]>
Signed-off-by: Heather Halter <[email protected]>
Co-authored-by: Heather Halter <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Co-authored-by: David Venable <[email protected]>
Signed-off-by: Ricky Lippmann <[email protected]>

* Update index.md (#7893)

fixed typo

Signed-off-by: Philipp Dünnebeil <[email protected]>
Signed-off-by: Ricky Lippmann <[email protected]>

* Fix typo and make left nav heading uniform for neural sparse processor (#7895)

Signed-off-by: kolchfa-aws <[email protected]>
Signed-off-by: Ricky Lippmann <[email protected]>

* Add custom JSON lexer and highlighting color scheme (#7892)

* Add custom JSON lexer and highlighting color scheme

Signed-off-by: Fanit Kolchina <[email protected]>

* Update _getting-started/quickstart.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Ricky Lippmann <[email protected]>

* Add model names to Vale (#7901)

Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Ricky Lippmann <[email protected]>

* Renamed data prepper files to have dashes for consistency (#7790)

* Renamed data prepper files to have dashes for consistency

Signed-off-by: Fanit Kolchina <[email protected]>

* More files

Signed-off-by: Fanit Kolchina <[email protected]>

---------

Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Ricky Lippmann <[email protected]>

* Add documentation for ml inference search request processor/ search response processor (#7852)

* draft ml inference search request processor

Signed-off-by: Mingshi Liu <[email protected]>

* add doc

Signed-off-by: Mingshi Liu <[email protected]>

* add doc

Signed-off-by: Mingshi Liu <[email protected]>

* Doc review

Signed-off-by: Fanit Kolchina <[email protected]>

* Fixed links

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

* Unify processor docs

Signed-off-by: Fanit Kolchina <[email protected]>

* Update _query-dsl/geo-and-xy/xy.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

* Remove note

Signed-off-by: Fanit Kolchina <[email protected]>

* Fix link

Signed-off-by: Fanit Kolchina <[email protected]>

---------

Signed-off-by: Mingshi Liu <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Ricky Lippmann <[email protected]>

* Refactor k-NN documentation (#7890)

* Refactor k-NN documentation

Signed-off-by: Fanit Kolchina <[email protected]>

* Change field name for cohesiveness

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Co-authored-by: Heather Halter <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Heather Halter <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Ricky Lippmann <[email protected]>

* Ml commons batch inference (#7899)

* add batch inference API

Signed-off-by: Xun Zhang <[email protected]>

* add more links and mark the api as experimental

Signed-off-by: Xun Zhang <[email protected]>

* use openAI as the blueprint example details

Signed-off-by: Xun Zhang <[email protected]>

* address comments

Signed-off-by: Xun Zhang <[email protected]>

* Doc review

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: Xun Zhang <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Ricky Lippmann <[email protected]>

* Remove repeated sentence in distributed tracing doc (#7906)

Signed-off-by: Peter Alfonsi <[email protected]>
Co-authored-by: Peter Alfonsi <[email protected]>
Signed-off-by: Ricky Lippmann <[email protected]>

* Add apostrophe token filter page #7871 (#7884)

* adding apostrophe token filter page #7871

Signed-off-by: AntonEliatra <[email protected]>

* fixing vale error

Signed-off-by: AntonEliatra <[email protected]>

* Update apostrophe-token-filter.md

Signed-off-by: AntonEliatra <[email protected]>

* updating the naming

Signed-off-by: AntonEliatra <[email protected]>

* updating as per the review comments

Signed-off-by: AntonEliatra <[email protected]>

* updating the heading to Apostrophe token filter

Signed-off-by: AntonEliatra <[email protected]>

* updating as per PR comments

Signed-off-by: AntonEliatra <[email protected]>

* Apply suggestions from code review

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: AntonEliatra <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: AntonEliatra <[email protected]>

---------

Signed-off-by: AntonEliatra <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Ricky Lippmann <[email protected]>

* removed unnecessary backslash

Signed-off-by: Ricky Lippmann <[email protected]>

* fix:add missing whitespace in table

Signed-off-by: Ricky Lippmann <[email protected]>

* docs: add link to tika supported file formats

Signed-off-by: Ricky Lippmann <[email protected]>

* Update ingest-attachment-plugin.md

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <[email protected]>

* adjust to keep technical specific information with improved wording

Signed-off-by: Ricky Lippmann <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <[email protected]>

---------

Signed-off-by: Ricky Lippmann <[email protected]>
Signed-off-by: Sander van de Geijn <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
Signed-off-by: Junqiu Lei <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Signed-off-by: Heather Halter <[email protected]>
Signed-off-by: David Venable <[email protected]>
Signed-off-by: Heather Halter <[email protected]>
Signed-off-by: Philipp Dünnebeil <[email protected]>
Signed-off-by: Mingshi Liu <[email protected]>
Signed-off-by: Xun Zhang <[email protected]>
Signed-off-by: Peter Alfonsi <[email protected]>
Signed-off-by: AntonEliatra <[email protected]>
Co-authored-by: Sander van de Geijn <[email protected]>
Co-authored-by: Naarcha-AWS <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Co-authored-by: Junqiu Lei <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Heather Halter <[email protected]>
Co-authored-by: David Venable <[email protected]>
Co-authored-by: Philipp Dünnebeil <[email protected]>
Co-authored-by: Mingshi Liu <[email protected]>
Co-authored-by: Xun Zhang <[email protected]>
Co-authored-by: Peter Alfonsi <[email protected]>
Co-authored-by: Peter Alfonsi <[email protected]>
Co-authored-by: AntonEliatra <[email protected]>
16 people authored Aug 7, 2024
1 parent 106009c commit 8b731c5
Showing 2 changed files with 231 additions and 3 deletions.
6 changes: 3 additions & 3 deletions _install-and-configure/additional-plugins/index.md
@@ -9,7 +9,6 @@ nav_order: 10

There are many more plugins available in addition to those provided by the standard distribution of OpenSearch. These additional plugins have been built by OpenSearch developers or members of the OpenSearch community. While it isn't possible to provide an exhaustive list (because many plugins are not maintained in an OpenSearch GitHub repository), the following plugins, available in the [OpenSearch/plugins](https://github.com/opensearch-project/OpenSearch/tree/main/plugins) directory on GitHub, are some of the plugins that can be installed using one of the installation options, for example, using the command `bin/opensearch-plugin install <plugin-name>`.


| Plugin name | Earliest available version |
| :--- | :--- |
| analysis-icu | 1.0.0 |
@@ -22,7 +21,7 @@ There are many more plugins available in addition to those provided by the stand
| discovery-azure-classic | 1.0.0 |
| discovery-ec2 | 1.0.0 |
| discovery-gce | 1.0.0 |
| ingest-attachment | 1.0.0 |
| [`ingest-attachment`]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/ingest-attachment-plugin/) | 1.0.0 |
| mapper-annotated-text | 1.0.0 |
| mapper-murmur3 | 1.0.0 |
| [`mapper-size`]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/mapper-size-plugin/) | 1.0.0 |
@@ -34,7 +33,8 @@ There are many more plugins available in addition to those provided by the stand
| store-smb | 1.0.0 |
| transport-nio | 1.0.0 |


## Related articles

[Installing plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/)
[`ingest-attachment` plugin]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/ingest-attachment-plugin/)
[`mapper-size` plugin]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/mapper-size-plugin/)
228 changes: 228 additions & 0 deletions _install-and-configure/additional-plugins/ingest-attachment-plugin.md
@@ -0,0 +1,228 @@
---
layout: default
title: Ingest-attachment plugin
parent: Installing plugins
nav_order: 20

---

# Ingest-attachment plugin

The `ingest-attachment` plugin enables OpenSearch to extract content and other information from files using the Apache text extraction library [Tika](https://tika.apache.org/).
Supported document formats include PPT, PDF, RTF, ODF, and many more. For the full list, see the Tika [Supported Document Formats](https://tika.apache.org/2.9.2/formats.html) page.

The input field must be a base64-encoded binary.

## Installing the plugin

Install the `ingest-attachment` plugin using the following command:

```sh
./bin/opensearch-plugin install ingest-attachment
```

## Attachment processor options

| Name | Required | Default | Description |
| :--- | :--- | :--- | :--- |
| `field` | Yes | N/A | The field from which to get the base64-encoded binary. |
| `target_field` | No | `attachment` | The field that stores the attachment information. |
| `properties` | No | All properties | An array of properties that should be stored. Can be `content`, `language`, `date`, `title`, `author`, `keywords`, `content_type`, or `content_length`. |
| `indexed_chars` | No | `100_000` | The number of characters used for extraction to prevent fields from becoming too large. Use `-1` for no limit. |
| `indexed_chars_field` | No | `null` | The name of a document field whose value overrides the number of characters used for extraction, for example, `max_chars`. |
| `ignore_missing` | No | `false` | When `true`, the processor exits without modifying the document when the specified field doesn't exist. |

## Example

The following steps show you how to get started with the `ingest-attachment` plugin.

### Step 1: Create an index for storing your attachments

The following command creates an index for storing your attachments:

```json
PUT /example-attachment-index
{
  "mappings": {
    "properties": {}
  }
}
```

### Step 2: Create a pipeline

The following command creates a pipeline containing the attachment processor:

```json
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
```
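
The pipeline body above can also be assembled in code. The following is a minimal sketch, not part of the plugin: the `AttachmentOptions` type and the `buildAttachmentPipeline` helper are hypothetical names introduced here, and only the option names are taken from the processor options table.

```typescript
// Hypothetical helper types; only the option names come from the options table.
type AttachmentProperty =
  | "content" | "language" | "date" | "title"
  | "author" | "keywords" | "content_type" | "content_length";

interface AttachmentOptions {
  field: string;                      // required: field holding the base64-encoded binary
  target_field?: string;              // where the extracted information is stored
  properties?: AttachmentProperty[];  // subset of properties to store
  indexed_chars?: number;             // -1 disables the character limit
  indexed_chars_field?: string;       // document field that overrides indexed_chars
  ignore_missing?: boolean;
}

// Builds the JSON body for a `PUT _ingest/pipeline/<name>` request.
function buildAttachmentPipeline(
  options: AttachmentOptions,
  description: string = "Extract attachment information",
) {
  if (options.field.trim() === "") {
    throw new Error("The `field` option is required");
  }
  return {
    description,
    processors: [{ attachment: { ...options } }],
  };
}

const pipelineBody = buildAttachmentPipeline({ field: "data" });
console.log(JSON.stringify(pipelineBody, null, 2));
```

Typing the options catches misspelled option or property names at compile time, before the request reaches the cluster.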

### Step 3: Store an attachment

Convert the attachment to a base64 string to pass it as `data`.
In this example, the `base64` command converts the file `lorem.rtf`:

```sh
base64 lorem.rtf
```

Alternatively, you can use Node.js to read the file as a base64 string, as shown in the following code:

```typescript
import * as fs from "node:fs/promises";
import path from "node:path";

const filePath = path.join(import.meta.dirname, "lorem.rtf");
const base64File = await fs.readFile(filePath, { encoding: "base64" });

console.log(base64File);
```

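The `.rtf` file contains the text `Lorem ipsum dolor sit amet`, which is encoded as the following base64 string:

`e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=`.

Pass this string as `data` when indexing the document through the `attachment` pipeline: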

```json
PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
```
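
Before sending the request, it can help to confirm that the base64 string really is the RTF file. The following sketch decodes the string from the request above and encodes it again. It assumes a runtime that provides the global `atob` and `btoa` functions (modern browsers and recent Node.js versions) and relies on the sample file containing only ASCII characters.

```typescript
// The base64 string used as `data` in the request above.
const encoded = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=";

// atob returns one character per decoded byte, which is safe here
// because the sample RTF file is pure ASCII.
const decoded: string = atob(encoded);

// JSON.stringify makes the RTF control words and line breaks visible.
console.log(JSON.stringify(decoded));

// Encoding the decoded bytes again should reproduce the original string.
const roundTrip: string = btoa(decoded);
console.log(roundTrip === encoded);
```

The decoded value is a small RTF document (`{\rtf1\ansi ... \par }`) that wraps the sentence `Lorem ipsum dolor sit amet`. For files that contain non-ASCII bytes, reading the file with the `base64` encoding, as in the Node.js example, is the safer route.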

### Query results

With the attachment processed, you can now search through the data using search queries, as shown in the following example:

```json
POST example-attachment-index/_search
{
  "query": {
    "match": {
      "attachment.content": "ipsum"
    }
  }
}
```

OpenSearch responds with the following:

```json
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.1724279,
    "hits": [
      {
        "_index": "example-attachment-index",
        "_id": "lorem_rtf",
        "_score": 1.1724279,
        "_source": {
          "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
          "attachment": {
            "content_type": "application/rtf",
            "language": "pt",
            "content": "Lorem ipsum dolor sit amet",
            "content_length": 28
          }
        }
      }
    ]
  }
}
```
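
A client usually needs only the extracted fields from such a response. The following sketch shows one way to pull them out. The interfaces and the `summarizeHits` function are hypothetical and cover only the parts of the response used here, and the sample object is a trimmed copy of the response above.

```typescript
// Hypothetical, minimal types covering only the response fields used below.
interface AttachmentInfo {
  content?: string;
  content_type?: string;
  language?: string;
  content_length?: number;
}

interface SearchHit {
  _id: string;
  _score: number;
  _source: { attachment?: AttachmentInfo };
}

interface SearchResponse {
  hits: { hits: SearchHit[] };
}

interface HitSummary {
  id: string;
  contentType: string;
  content: string;
}

// Returns one summary per hit that carries extracted content.
function summarizeHits(response: SearchResponse): HitSummary[] {
  const summaries: HitSummary[] = [];
  for (const hit of response.hits.hits) {
    const attachment = hit._source.attachment;
    if (attachment === undefined || attachment.content === undefined) {
      continue; // Skip documents that were not processed by the pipeline.
    }
    summaries.push({
      id: hit._id,
      contentType: attachment.content_type ?? "unknown",
      content: attachment.content,
    });
  }
  return summaries;
}

// Trimmed copy of the search response shown above.
const sample: SearchResponse = {
  hits: {
    hits: [
      {
        _id: "lorem_rtf",
        _score: 1.1724279,
        _source: {
          attachment: {
            content_type: "application/rtf",
            language: "pt",
            content: "Lorem ipsum dolor sit amet",
            content_length: 28,
          },
        },
      },
    ],
  },
};

console.log(summarizeHits(sample));
```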

## Extracted information

The following fields can be extracted using the plugin:

- `content`
- `language`
- `date`
- `title`
- `author`
- `keywords`
- `content_type`
- `content_length`

To extract only a subset of these fields, define them in the `properties` of the
pipeline processor, as shown in the following example:

```json
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": ["content", "title", "author"]
      }
    }
  ]
}
```

## Limit the extracted content

To avoid extracting too many characters and overloading the node's memory, extraction is limited to `100_000` characters by default.
You can change this value using the `indexed_chars` setting. For example, you can use `-1` for unlimited characters, but you need to make sure that your OpenSearch node has enough heap space to extract the content of large documents.

You can also define this limit per document using the `indexed_chars_field` setting, which names a document field that holds the limit.
If a document contains the field named in `indexed_chars_field`, its value overrides the `indexed_chars` setting, as shown in the following example:

```json
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : 10,
        "indexed_chars_field" : "max_chars"
      }
    }
  ]
}
```

With the attachment pipeline configured, you can extract the default `10` characters without specifying `max_chars` in the request, as shown in the following example:

```json
PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
```

Alternatively, you can set `max_chars` per document in order to extract up to `15` characters, as shown in the following example:

```json
PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "max_chars": 15
}
```
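
The way the two settings interact can be sketched in a few lines. The following is an illustration of the precedence described above, not the plugin's implementation: `effectiveCharLimit` and `applyLimit` are hypothetical helpers, and the truncation is approximated with a plain string slice, so the text stored by the plugin may differ in detail.

```typescript
// Default limit taken from the processor options table.
const DEFAULT_INDEXED_CHARS = 100_000;

interface LimitSettings {
  indexed_chars?: number;
  indexed_chars_field?: string;
}

// Picks the per-document limit when the document provides one,
// otherwise falls back to the processor-level setting or the default.
function effectiveCharLimit(
  settings: LimitSettings,
  doc: Record<string, unknown>,
): number {
  const fallback = settings.indexed_chars ?? DEFAULT_INDEXED_CHARS;
  if (settings.indexed_chars_field === undefined) {
    return fallback;
  }
  const perDocument = doc[settings.indexed_chars_field];
  return typeof perDocument === "number" ? Math.trunc(perDocument) : fallback;
}

// Approximates the limit on already extracted text; -1 means "no limit".
function applyLimit(text: string, limit: number): string {
  return limit < 0 ? text : text.slice(0, limit);
}

const settings: LimitSettings = { indexed_chars: 10, indexed_chars_field: "max_chars" };
const text = "Lorem ipsum dolor sit amet";

// No max_chars in the document: the pipeline value of 10 applies.
console.log(applyLimit(text, effectiveCharLimit(settings, {}))); // prints: Lorem ipsu

// max_chars set in the document: 15 overrides the pipeline value.
console.log(applyLimit(text, effectiveCharLimit(settings, { max_chars: 15 }))); // prints: Lorem ipsum dol
```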
