4 changes: 4 additions & 0 deletions comps/dataprep/README.md
@@ -60,3 +60,7 @@ For details, please refer to this [readme](src/README_neo4j_llamaindex.md)
## Dataprep Microservice for financial domain data

For details, please refer to this [readme](src/README_finance.md)

## Dataprep Microservice with MariaDB Vector

For details, please refer to this [readme](src/README_mariadb.md)
24 changes: 24 additions & 0 deletions comps/dataprep/deployment/docker_compose/compose.yaml
Expand Up @@ -15,6 +15,7 @@ include:
- ../../../third_parties/tei/deployment/docker_compose/compose.yaml
- ../../../third_parties/vllm/deployment/docker_compose/compose.yaml
- ../../../third_parties/arangodb/deployment/docker_compose/compose.yaml
- ../../../third_parties/mariadb/deployment/docker_compose/compose.yaml

services:

@@ -414,6 +415,29 @@ services:
      retries: 10
    restart: unless-stopped

  dataprep-mariadb-vector:
    image: ${REGISTRY:-opea}/dataprep:${TAG:-latest}
    container_name: dataprep-mariadb-vector
    ports:
      - "${DATAPREP_PORT:-5000}:5000"
    depends_on:
      mariadb-server:
        condition: service_healthy
    ipc: host
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      DATAPREP_COMPONENT_NAME: "OPEA_DATAPREP_MARIADBVECTOR"
      MARIADB_CONNECTION_URL: ${MARIADB_CONNECTION_URL:-mariadb+mariadbconnector://dbuser:password@mariadb-server:3306/vectordb}
      LOGFLAG: ${LOGFLAG}
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:5000/v1/health_check || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 10
    restart: unless-stopped

networks:
  default:
    driver: bridge
1 change: 1 addition & 0 deletions comps/dataprep/src/Dockerfile
@@ -13,6 +13,7 @@ RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missin
libcairo2 \
libgl1-mesa-glx \
libjemalloc-dev \
libmariadb-dev \
libpq-dev \
libreoffice \
poppler-utils \
100 changes: 100 additions & 0 deletions comps/dataprep/src/README_mariadb.md
@@ -0,0 +1,100 @@
# Dataprep Microservice with MariaDB Vector

## 🚀1. Start Microservice with Docker

### 1.1 Build Docker Image

```bash
cd GenAIComps
docker build -t opea/dataprep:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/src/Dockerfile .
```

### 1.2 Run Docker with CLI (Option A)

#### 1.2.1 Start MariaDB Server

Please refer to this [readme](../../third_parties/mariadb/src/README.md).

#### 1.2.2 Start the data preparation service

```bash

export HOST_IP=$(hostname -I | awk '{print $1}')
# If you've configured the server with the default env values then:
export MARIADB_CONNECTION_URL="mariadb+mariadbconnector://dbuser:password@${HOST_IP}:3306/vectordb"

docker run -d --rm --name="dataprep-mariadb-vector" -p 5000:5000 --ipc=host -e MARIADB_CONNECTION_URL=$MARIADB_CONNECTION_URL -e DATAPREP_COMPONENT_NAME="OPEA_DATAPREP_MARIADBVECTOR" opea/dataprep:latest
```
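
The container may take a moment to accept requests after `docker run` returns. The sketch below polls until a command succeeds, using the `/v1/health_check` endpoint from the Compose health check as the probe. The helper name `wait_until` and its argument order are hypothetical and not part of the microservice.

```bash
# wait_until RETRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0 or RETRIES attempts are used up.
wait_until() {
  retries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    # Discard the probe's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only when another attempt will follow.
    if [ "$attempt" -lt "$retries" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Example probe, assuming the service is published on port 5000 as above:
# wait_until 10 10 curl -f http://localhost:5000/v1/health_check
```

The retry count and delay in the commented example mirror the `retries` and `interval` values of the Compose health check; adjust them to your environment.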

### 1.3 Run with Docker Compose (Option B)

```bash
cd comps/dataprep/deployment/docker_compose
docker compose -f compose.yaml up dataprep-mariadb-vector -d
```

## 🚀2. Consume Microservice

### 2.1 Consume Upload API

Once the data preparation microservice for MariaDB Vector is running, use the command below to convert documents or links to embeddings and save them to the vector store.

```bash
export document="/path/to/document"
# Double quotes around the body let the shell expand ${document}.
curl -X POST \
  -H "Content-Type: application/json" \
  -d "{\"path\":\"${document}\"}" \
  http://localhost:5000/v1/dataprep/ingest
```
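
A path that contains double quotes or backslashes would yield invalid JSON if it were interpolated into the request body as-is. The following is a minimal sketch of a helper that escapes those two characters before building the `{"path": ...}` body shown above. The function name `json_path_payload` is hypothetical, and control characters such as newlines are not handled.

```bash
# Hypothetical helper: print a JSON body of the form {"path":"<escaped path>"}.
json_path_payload() {
  # Escape backslashes first, then double quotes, so earlier escapes are not doubled.
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"path":"%s"}\n' "$escaped"
}

# Example use with the ingest endpoint, assuming the service listens on port 5000:
# curl -X POST -H "Content-Type: application/json" \
#   -d "$(json_path_payload "$document")" \
#   http://localhost:5000/v1/dataprep/ingest
```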

### 2.2 Consume get API

To get the structure of the uploaded files, use the `get` API endpoint:

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  http://localhost:5000/v1/dataprep/get
```

The response is a JSON array similar to the following:

```json
[
  {
    "name": "uploaded_file_1.txt",
    "id": "uploaded_file_1.txt",
    "type": "File",
    "parent": ""
  },
  {
    "name": "uploaded_file_2.txt",
    "id": "uploaded_file_2.txt",
    "type": "File",
    "parent": ""
  }
]
```
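
The `id` values in this response are what the delete API takes as `file_path`. As a rough sketch, they can be pulled out of the response with `grep` and `sed`. The function name `list_ids` is hypothetical, and the pattern assumes ids contain no escaped double quotes. A real JSON parser such as `jq` is the more robust choice where it is available.

```bash
# Hypothetical helper: read a JSON array on stdin and print each "id" value on its own line.
list_ids() {
  # grep -o emits every "id": "..." match, even when several share one line.
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed -e 's/^"id"[[:space:]]*:[[:space:]]*"//' -e 's/"$//'
}

# Example use, assuming the service listens on port 5000:
# curl -s -X POST -H "Content-Type: application/json" \
#   http://localhost:5000/v1/dataprep/get | list_ids
```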

### 2.3 Consume delete API

To delete uploaded files/links, use the `delete` API endpoint.

The `file_path` is the `id` returned by the `/v1/dataprep/get` API.

```bash
# delete link
curl -X POST "http://${HOST_IP}:5000/v1/dataprep/delete" \
-H "Content-Type: application/json" \
-d '{"file_path": "https://www.ces.tech/.txt"}'

# delete file
curl -X POST "http://${HOST_IP}:5000/v1/dataprep/delete" \
-H "Content-Type: application/json" \
-d '{"file_path": "uploaded_file_1.txt"}'

# delete all files and links
curl -X POST "http://${HOST_IP}:5000/v1/dataprep/delete" \
-H "Content-Type: application/json" \
-d '{"file_path": "all"}'
```