55 changes: 55 additions & 0 deletions community/rfcs/24-10-20-OPEA-001-Haystack-Integration.md
# 24-10-20-OPEA-001-Haystack-Integration

## Author

[gadmarkovits](https://github.com/gadmarkovits)

## Status

Under Review

## Objective

Create a Haystack integration for OPEA that will enable the use of OPEA components within a Haystack pipeline.

## Motivation

Haystack is a production-ready, open-source AI framework used by many AI practitioners. It has more than 70 integrations with GenAI components such as document stores, model providers, and evaluation frameworks from companies including Amazon, Microsoft, NVIDIA, and others. Creating an integration for OPEA will allow Haystack users to use OPEA components in their pipelines. This RFC presents a high-level overview of the Haystack integration.

## Design Proposal

The idea is to create thin wrappers for OPEA components that communicate with them through the existing REST API. The wrappers will match Haystack's component API so that they can be used within Haystack pipelines, allowing developers to seamlessly use OPEA components alongside other Haystack components.

The integration will be implemented as a Python package (similar to other Haystack integrations). The source code will be hosted in OPEA's GenAIComps repo under a new directory called Integrations. The package itself will be uploaded to [PyPI](https://pypi.org/) to allow for easy installation.

Following a discussion with Haystack's technical team, it was agreed that a ChatQnA example using this OPEA integration would be a good way to showcase its capabilities. To support this, several component wrappers need to be implemented in the first version of the integration (other wrappers will be added gradually); a sketch of what such a wrapper could look like follows the list:

1. OPEA Document Embedder

This component will receive a Haystack Document and embed it using an OPEA embedding microservice.

2. OPEA Text Embedder

This component will receive text input and embed it using an OPEA embedding microservice.

3. OPEA Generator

This component will receive a text prompt and generate a response using an OPEA LLM microservice.

4. OPEA Retriever

This component will receive an embedding and retrieve documents with similar embeddings using an OPEA retrieval microservice.
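
To make the wrapper idea concrete, below is a minimal sketch of what the OPEA Text Embedder could look like as a custom Haystack 2.x component. The class name, endpoint URL, request payload, and response fields are illustrative assumptions, not the finalized API of the integration.

```
# Minimal sketch of an OPEA Text Embedder wrapper as a Haystack 2.x component.
# The endpoint URL and the request/response field names are assumptions.
from typing import List

import requests
from haystack import component


@component
class OPEATextEmbedder:
    """Thin wrapper that sends text to an OPEA embedding microservice over REST."""

    def __init__(self, api_url: str = "http://localhost:6000/v1/embeddings"):
        self.api_url = api_url

    @component.output_types(embedding=List[float])
    def run(self, text: str):
        # Forward the text to the OPEA microservice and return the embedding
        # in the shape downstream Haystack components expect.
        response = requests.post(self.api_url, json={"text": text}, timeout=60)
        response.raise_for_status()
        return {"embedding": response.json()["embedding"]}
```

A component like this could then be connected to an OPEA Retriever inside a regular Haystack `Pipeline`, just like any built-in embedder.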

## Alternatives Considered

n/a

## Compatibility

n/a

## Miscs

Once implemented, the Haystack team will list the OPEA integration on their [integrations page](https://haystack.deepset.ai/integrations), which will allow for easier discovery. Haystack, in collaboration with Intel, will also publish a technical blog post showcasing a ChatQnA example using this integration (similar to this [NVIDIA NIM post](https://haystack.deepset.ai/blog/haystack-nvidia-nim-rag-guide)).


139 changes: 139 additions & 0 deletions community/rfcs/25-01-10-OPEA-Benchmark.md
# Purpose

This RFC describes the behavior of a unified benchmark script for GenAIExamples users.

In v1.1, the benchmark scripts are maintained per example, which causes a lot of duplicated code and a poor user experience.

This motivates improving the tooling so that there is a single, unified entry point for performance benchmarking.

## Original benchmark script layout

```
GenAIExamples/
├── ChatQnA/
│ ├── benchmark/
│ │ ├── benchmark.sh # each example has its own script
│ │ └── deploy.py
│ ├── kubernetes/
│ │ ├── charts.yaml
│ │ └── ...
│ ├── docker-compose/
│ │ └── compose.yaml
│ └── chatqna.py
└── ...
```

## Proposed benchmark script layout

```
GenAIExamples/
├── deploy_and_benchmark.py # main entry of GenAIExamples
├── ChatQnA/
│ ├── chatqna.yaml # default deploy and benchmark config for deploy_and_benchmark.py
│ ├── kubernetes/
│ │ ├── charts.yaml
│ │ └── ...
│ ├── docker-compose/
│ │ └── compose.yaml
│ └── chatqna.py
└── ...
```


# Design

The pseudo code of deploy_and_benchmark.py is listed below for reference.

```
# deploy_and_benchmark.py
# Pseudo code to demonstrate its behavior; the helper functions are to be implemented.


def main(yaml_file):
    # Extract all deployment combinations from the chatqna.yaml deploy section.
    # For example:
    #   deploy_traverse_list = [{'node': 2, 'device': 'gaudi', 'cards_per_node': 8, ...},
    #                           {'node': 4, 'device': 'gaudi', 'cards_per_node': 8, ...},
    #                           ...]
    deploy_traverse_list = extract_deploy_cfg(yaml_file)

    # Extract all benchmark combinations from the benchmark section.
    # For example:
    #   benchmark_traverse_list = [{'concurrency': 128, 'total_query_num': 4096, ...},
    #                              {'concurrency': 128, 'total_query_num': 4096, ...},
    #                              ...]
    benchmark_traverse_list = extract_benchmark_cfg(yaml_file)

    for deploy_cfg in deploy_traverse_list:
        start_k8s_service(deploy_cfg)
        for benchmark_cfg in benchmark_traverse_list:
            if service_ready():
                ingest_dataset(benchmark_cfg["dataset"])
                send_http_request(benchmark_cfg)  # will call stresscli.py in GenAIEval
```
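
As one possible realization of `extract_deploy_cfg`, the sketch below shows how list-valued fields in the YAML could be expanded into the Cartesian product of deployment combinations. The function name and field handling are illustrative assumptions; nested component sections (e.g. `embedding`, `llm`) would need an additional recursive pass.

```
# Hypothetical sketch: expand list-valued fields into all deployment combinations.
from itertools import product


def expand_combinations(section):
    """Return every combination of the list-valued fields in a config section."""
    keys = list(section.keys())
    # Wrap scalar values in single-element lists so product() treats all fields uniformly.
    value_lists = [v if isinstance(v, list) else [v] for v in section.values()]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


# Example: {'device': ['xeon', 'gaudi'], 'node': [1, 2, 4], 'cards_per_node': 8}
# expands into 2 * 3 = 6 deployment configurations.
```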

Taking ChatQnA as an example, the configurable fields are listed below:

```
# chatqna.yaml
#
# usage:
# 1) deploy_and_benchmark.py --workload chatqna [overridden parameters]
# 2) or deploy_and_benchmark.py ./chatqna/benchmark/chatqna.yaml [overridden parameters]
#
# for example, deploy_and_benchmark.py ./chatqna/benchmark/chatqna.yaml --node=2
#
deploy:
  # hardware related config
  device: [xeon, gaudi, ...]  # AMD and other hardware could be extended here
  node: [1, 2, 4]
  cards_per_node: [4, 8]

  # component related config; by default these values describe the out-of-box (OOB) setup,
  # if overridden they describe the tuned version
  embedding:
    model_id: bge_large_v1.5
    instance_num: [2, 4, 8]
    cores_per_instance: 4
    memory_capacity: 20  # unit: G
  retrieval:
    instance_num: [2, 4, 8]
    cores_per_instance: 4
    memory_capacity: 20  # unit: G
  rerank:
    enable: True
    model_id: bge_rerank_v1.5
    instance_num: 1
    cards_per_instance: 1  # if cpu is specified, this field is ignored and cores_per_instance is checked instead
  llm:
    model_id: llama2-7b
    instance_num: 7
    cards_per_instance: 1  # if cpu is specified, this field is ignored and cores_per_instance is checked instead
    # serving related config, dynamic batching
    max_batch_size: [1, 2, 8, 16, 32]  # the number of queries combined into a single batch in serving
    max_latency: 20  # time to wait before combining incoming requests into a batch, unit: milliseconds

benchmark:
  # http request behavior related fields
  concurrency: [1, 2, 4]
  total_query_num: [2048, 4096]
  duration: [5, 10]  # unit: minutes
  query_num_per_concurrency: [4, 8, 16]
  poisson: True
  poisson_arrival_rate: 1.0
  warmup_iterations: 10
  seed: 1024

  # dataset related fields
  dataset: [dummy_english, dummy_chinese, pub_med100, ...]  # predefined keywords for supported datasets
  user_query: [dummy_english_qlist, dummy_chinese_qlist, pub_med100_qlist, ...]
  query_token_size: 128  # if specified, a fixed query token size will be sent out
  data_ratio: [10%, 20%, ..., 100%]  # optional, ratio taken from the query dataset

  # advanced settings in each component which will impact perf
  data_prep:  # not targeted this time
    chunk_size: [1024]
    chunk_overlap: [1000]
  retriever:  # not targeted this time
    algo: IVF
    fetch_k: 2
    k: 1
  rerank:
    top_n: 2
  llm:
    max_token_size: 1024  # specify the output token size
```
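
The overridden parameters mentioned in the usage comments could be merged over the parsed YAML before the combinations are expanded. The sketch below assumes PyYAML and uses a `--node` override as an example; the argument names and config layout are illustrative only.

```
# Hypothetical sketch: apply a command-line override such as --node=2
# on top of the parsed chatqna.yaml before expanding combinations.
import argparse

import yaml


def load_config_with_overrides():
    parser = argparse.ArgumentParser()
    parser.add_argument("config", help="path to chatqna.yaml")
    parser.add_argument("--node", type=int, default=None)
    args = parser.parse_args()

    with open(args.config) as f:
        cfg = yaml.safe_load(f)

    # An explicit override narrows the list of candidate values to a single value.
    if args.node is not None:
        cfg["deploy"]["node"] = [args.node]
    return cfg
```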