Real-Time Multimodal ETL Pipelines for GenAI

Overview

Mixpeek listens for changes to your database, then processes each change (a file_url or inline content) through an inference pipeline of extraction, generation, and embedding, so your database always holds fresh multimodal data.

It removes the need to build your own infrastructure for tracking database changes, extracting content, processing and embedding it, and treating each change as its own atomic unit.
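To make the idea of "each change as its own atomic unit" concrete, here is a minimal sketch of one change passing through the three stages. Everything in it is hypothetical: the `Change` class, the function names, and the toy stand-ins for extraction, generation, and embedding are illustrations only, not Mixpeek's actual code or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Change:
    """Hypothetical database change: carries a file URL or inline content."""
    doc_id: str
    file_url: Optional[str] = None
    inline: Optional[str] = None


def extract(change: Change) -> str:
    # Stand-in for real extraction: inline content is used directly,
    # a file URL is only represented by a placeholder string here.
    if change.inline is not None:
        return change.inline
    return f"<contents of {change.file_url}>"


def generate(text: str) -> str:
    # Stand-in for a generation step: a truncated "summary".
    return text[:40]


def embed(text: str) -> list:
    # Toy embedding: character count and word count, not a real model.
    return [float(len(text)), float(len(text.split()))]


def process_change(change: Change) -> dict:
    # One change goes through all three stages and yields one record.
    text = extract(change)
    return {
        "doc_id": change.doc_id,
        "text": text,
        "summary": generate(text),
        "embedding": embed(text),
    }
```

Calling `process_change(Change("1", inline="hello world"))` returns a single record with the text, a summary, and an embedding for that one change, independent of any other change.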

We support every modality: documents, images, video, audio and of course text.

Integrations

Architecture

Mixpeek is structured into three main services, each designed to handle a specific part of the process:

  • API Orchestrator: Coordinates the flow between services, ensuring smooth operation and handling failures gracefully.
  • Distributed Queue: A Celery worker that runs inside the api folder and shares the API's .env file.
  • Inference Service: Extraction, embedding, and generation of payloads.

These services are containerized and can be deployed on separate servers for optimal performance and scalability.

Getting Started

Clone the Mixpeek repository and navigate into it:

git clone git@github.com:mixpeek/server.git
cd server

All services use Poetry, and each also includes an optional Dockerfile. The setup below uses Poetry for quick deployment.

Setup

For each service you'll do the following:

  1. Create a virtual environment
poetry env use python3.10
  2. Activate the virtual environment
poetry shell
  3. Install the requirements
poetry install

API

.env file:

SERVICES_CONTAINER_URL=http://localhost:8001
PYTHON_VERSION=3.11.6
OPENAI_KEY=
ENCRYPTION_KEY=

REDIS_URL=

MONGO_URL=
MONGODB_ATLAS_PUBLIC_KEY=
MONGODB_ATLAS_PRIVATE_KEY=
MONGODB_ATLAS_GROUP_ID=

AWS_ACCESS_KEY=
AWS_SECRET_KEY=
AWS_REGION=
AWS_ARN_LAMBDA=

MIXPEEK_ADMIN_TOKEN=

Then run it:

poetry run python3 -m uvicorn main:app --reload
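Before starting the API, it can help to check that the .env file has no empty values. The sketch below parses simple KEY=VALUE lines and reports which names are missing or blank. The helper names and the choice of which variables to treat as required are assumptions for illustration; the source does not say which variables are mandatory.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Assumed list for illustration; adjust to the variables your setup needs.
REQUIRED = [
    "SERVICES_CONTAINER_URL",
    "OPENAI_KEY",
    "ENCRYPTION_KEY",
    "REDIS_URL",
    "MONGO_URL",
    "MIXPEEK_ADMIN_TOKEN",
]


def missing_vars(env: dict, required=REQUIRED) -> list:
    # A variable counts as missing if it is absent or set to an empty string.
    return [name for name in required if not env.get(name)]
```

For example, `missing_vars(parse_env(open(".env").read()))` returns the names that still need a value. This parser does not handle quoted values or `export` prefixes.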

Inference Service

.env file:

S3_BUCKET=
AWS_ACCESS_KEY=
AWS_SECRET_KEY=
AWS_REGION=
PYTHON_VERSION=

Then run it:

poetry run python3 -m uvicorn main:app --reload --host 0.0.0.0 --port 8001

Distributed Queue

The queue also runs inside the api folder and uses the same .env file as the API:

celery -A db.service.celery_app worker --loglevel=info

You now have three services running!

API Interface

All methods are exposed as HTTP endpoints.

You'll first need to generate an API key with a POST request to /users/private. Pass the MIXPEEK_ADMIN_TOKEN you defined in the API .env file as the Authorization header.

curl --location 'http://localhost:8000/users/private' \
--header 'Authorization: MIXPEEK_ADMIN_TOKEN' \
--header 'Content-Type: application/json' \
--data-raw '{"email":"user@example.com"}'

You can use any email address; it doesn't matter which.
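The same request can be built from Python using only the standard library. This is a minimal sketch that mirrors the curl command above (same path, headers, and JSON body); the function name `build_user_request` is hypothetical, and the shape of the response is not documented here, so the sketch stops at building the request.

```python
import json
import urllib.request


def build_user_request(admin_token: str, email: str,
                       base_url: str = "http://localhost:8000"):
    """Build the POST /users/private request shown in the curl example."""
    body = json.dumps({"email": email}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/users/private",
        data=body,
        method="POST",
        headers={
            # The admin token is sent as the raw Authorization header value,
            # as in the curl command.
            "Authorization": admin_token,
            "Content-Type": "application/json",
        },
    )
```

To send it against a running API, pass the result to `urllib.request.urlopen(...)` and read the response body.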

Cloud Service

If you want a completely managed version of Mixpeek: https://mixpeek.com/start

We also have a transparent and predictable billing model: https://mixpeek.com/pricing

Are we missing anything?
