Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add text embedding processor #304

Merged

Conversation

miguel-vila
Copy link
Contributor

Signed-off-by: miguel-vila [email protected]

Description

Adds a definition for the text embedding processor: https://opensearch.org/docs/latest/ingest-pipelines/processors/text-embedding/

Please confirm whether the field_map definition makes sense

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: miguel-vila <[email protected]>
@dblock
Copy link
Member

dblock commented May 31, 2024

Thank you!

We just merged a test framework in #299. Want to try to add a test for this API? Check out https://github.com/opensearch-project/opensearch-api-specification/blob/main/tests/index_lifecycle.yaml for an example. You can run it with npm run test:spec against a live local instance of OpenSearch (documentation incoming).

Copy link

github-actions bot commented May 31, 2024

Changes Analysis

Commit SHA: 32f992a
Comparing To SHA: 6da4a7c

API Changes

Summary

└─┬Components
  ├──[➕] schemas (43686:7)
  ├─┬ingest._common:ProcessorContainer
  │ └──[➕] properties (43570:9)
  └─┬ingest._common:Pipeline
    └──[➖] required (43464:11)❌ 

Document Element Total Changes Breaking Changes
components 3 1
  • BREAKING Changes: 1 out of 3
  • Removals: 1
  • Additions: 2
  • Breaking Removals: 1

Report

The full API changes report is available at: https://github.com/opensearch-project/opensearch-api-specification/actions/runs/9356073256/artifacts/1564049761

API Coverage

Before After Δ
Covered (%) 476 (46.62 %) 476 (46.62 %) 0 (0 %)
Uncovered (%) 545 (53.38 %) 545 (53.38 %) 0 (0 %)
Unknown 24 24 0

dblock
dblock previously approved these changes May 31, 2024
@dblock
Copy link
Member

dblock commented May 31, 2024

@nhtruong @Xtansia spec looks good to you?

@miguel-vila
Copy link
Contributor Author

@dblock I have added a test at ee776b4 but it fails locally because the response doesn't seem to include the _meta field. Might look into it later but not sure if it's because there's some real misalignment between the spec and the implementation.

@dblock
Copy link
Member

dblock commented May 31, 2024

_meta

So likely a bug! That's why we need tests ;)

@nhtruong
Copy link
Collaborator

nhtruong commented Jun 3, 2024

@dblock @miguel-vila Small discrepancies between the spec and the actual implementation of OS are to be expected right now. Much of spec of the core features that we have is inherited from ElasticSearch. Changes to OS since have not been reflected in the spec. We will need help from OS core team to review the spec of the core features. For now, @miguel-vila, you can remove the _meta as a required property.

Signed-off-by: miguel-vila <[email protected]>
@miguel-vila miguel-vila force-pushed the add-text-embedding-processor branch from 3f3c584 to 32f992a Compare June 3, 2024 19:24
Copy link
Member

@dblock dblock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're going to have many ingest pipeline tests. How about we organize things in folders that match the schema?

So this test should probably go into something like tests/ingest/ingest_with_text_embedding_processor.yaml?

It would be great if the story included a GET of the pipeline, maybe even used the processor.

skip: false
description: |
This test story checks that we can create an ingest pipeline with a text
embedding processor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's sweat some small stuff since this is new. Add a period. Maybe shorten, "Create and use an ingest pipeline with a text embedding processor."

method: DELETE
status: [200, 404]
chapters:
- synopsis: Create ingest pipeline for text embedding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create an ingest pipeline ... + add a period.

@nhtruong
Copy link
Collaborator

nhtruong commented Jun 5, 2024

I think this can be merge now. We address those minor wording issues later in the test.

@nhtruong nhtruong merged commit 7acd0fc into opensearch-project:main Jun 5, 2024
6 checks passed
@miguel-vila
Copy link
Contributor Author

I could address some of the bigger changes in a separate PR.

In particular, I wanted to test a whole flow of creating a model and using search against it (I think the search was failing due to the model not existing, which makes sense because I set some random "text-embedding-model" as model id) but then I noticed your test framework would need some extensions:

  • allow remembering some ids (e.g. the model id) and using them in later requests
  • allow retrying some requests (we need to wait for the model to be deployed)

I have some work in progress in that direction but these might be big/controversial changes so would be good to know your thoughts, might push my WIP PR later.

@miguel-vila
Copy link
Contributor Author

Created #315 to show you the possible changes, it's still in a very rough shape.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants