Quick Start Guide: Weaviate Vector Database with OpenAI Integration

This guide demonstrates how to use our read-only vector database powered by Weaviate with OpenAI integration for AI-enabled knowledge management. You'll need to provide your own OpenAI API key for generating embeddings and using generative AI features.

For complete documentation, visit the Weaviate Documentation.

Connection Details

REST Endpoint: https://larnd9rle1zwapsdcqw.c0.us-east1.gcp.weaviate.cloud
gRPC Endpoint: https://grpc-larnd9rle1zwapsdcqw.c0.us-east1.gcp.weaviate.cloud
ReadOnly API Key: Z0lQNXFDVnFnRmV6OTZRdF9KRVUwVzFEdjgxaHIvVFB3dkFyV0JqSnFxWFJYMWhqM0FTQ0NncU14S2hrPV92MjAw

1. Database Connection

See How to Connect for more details.

import weaviate
from weaviate.classes.init import Auth
import os

# Connect with read-only API key
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]
    }
)

2. Search Operations

See Search Guide for complete search documentation.

Vector Search

content = client.collections.get("Content")
response = content.query.near_text(
    query="factory farming investigation techniques",
    limit=3
)

for result in response.objects:
    print(result.properties)

Hybrid Search (Vector + Keyword)

response = content.query.hybrid(
    query="effective animal advocacy strategies",
    alpha=0.5,  # Balance between vector and keyword search
    limit=3
)

Filtered Search

from weaviate.classes.query import Filter

response = content.query.near_text(
    query="sanctuary best practices",
    filters=Filter.by_property("content_type").equal("report"),
    limit=3
)

3. Advanced Features

RAG (Retrieval Augmented Generation)

See RAG Documentation for more details.

Single Prompt (Per-Result Generation)

response = content.generate.near_text(
    query="factory farming investigation techniques",
    single_prompt="Summarize this content: {main_text}",
    limit=3
)

print(response.generated)

Grouped Task (Single Generation for All Results)

response = content.generate.near_text(
    query="successful campaign strategies",
    grouped_task="What patterns emerge from these successful campaigns?",
    limit=5
)

print(response.generated)

Aggregate Data

See Aggregations Documentation for more options.

response = content.aggregate.over_all(
    group_by=weaviate.classes.aggregate.GroupByAggregate(
        prop="content_type"
    )
)

for group in response.groups:
    print(f"Type: {group.grouped_by.value} Count: {group.total_count}")

4. Database Schema

This schema defines the structure of our unified, open-access database designed to support animal advocacy. By integrating data from diverse sources, it supports AI applications such as retrieval-augmented generation (RAG) and model training, as well as general strategic planning and referencing.

The Content class functions as the primary centralized repository for materials related to veganism and animal advocacy, such as articles, reports, blogs, and transcripts. It supports AI applications like semantic search, knowledge graphs, and retrieval-augmented generation (RAG).

Structure:

summary: text
main_text: text
content_type: text
source_url: array of texts
date: date
scores: object
- good_for_animals: number
- cultural_sensitivity: number
- relevance: number
- insight: number
- trustworthiness: number
- emotional_impact: number
- rationality: number
- influence: number
- alignment: number
- predicted_performance: number
- average_score: number
tags: array of text

5. Best Practices

General Recommendations

Use hybrid search for optimal retrieval
Leverage filters to narrow results
Consider RAG for dynamic content generation
Use aggregations for movement insights
Start with small result limits and increase as needed
Test queries with different alpha values in hybrid search to find the optimal balance

Optimized Retrieval Strategy

Retrieve and Re-rank Pattern: When quality matters more than speed, retrieve a larger set of results (e.g., 20-30 items) and then apply post-processing to select the best ones:

# Retrieve more results initially
response = content.query.near_text(
    query="effective outreach strategies",
    limit=20
)

# Sort by relevant scores and select top performers
sorted_results = sorted(
    response.objects,
    key=lambda x: x.properties['scores']['predicted_performance'],
    reverse=True
)

# Pass only top 5 to your application
top_results = sorted_results[:5]

Multi-Score Ranking: Combine multiple scores for more nuanced ranking:

def calculate_composite_score(item):
    scores = item.properties['scores']
    # Weight different scores based on your use case
    return (
        scores['relevance'] * 0.3 +
        scores['trustworthiness'] * 0.3 +
        scores['insight'] * 0.2 +
        scores['predicted_performance'] * 0.2
    )

sorted_results = sorted(
    response.objects,
    key=calculate_composite_score,
    reverse=True
)

Advanced Filtering for Better RAG Results

Score-Based Filtering: Use score thresholds to ensure quality before RAG generation:

from weaviate.classes.query import Filter

# Only retrieve highly trustworthy and relevant content
response = content.generate.near_text(
    query="campaign evaluation methods",
    filters=(
        Filter.by_property("scores.trustworthiness").greater_than(0.7) &
        Filter.by_property("scores.relevance").greater_than(0.6)
    ),
    grouped_task="Synthesize the most reliable evaluation methods",
    limit=10
)

Time-Aware Retrieval: Combine date filters with scores for recency-weighted results:

from datetime import datetime, timedelta

# Get high-quality content from the last 6 months
recent_date = datetime.now() - timedelta(days=180)

response = content.generate.near_text(
    query="emerging advocacy trends",
    filters=(
        Filter.by_property("date").greater_than(recent_date) &
        Filter.by_property("scores.insight").greater_than(0.5)
    ),
    grouped_task="What are the most promising recent trends?",
    limit=15
)

Progressive Filtering: Start broad, then narrow based on initial results:

# First pass: broad retrieval
initial_response = content.query.near_text(
    query="corporate campaigns",
    limit=30
)

# Analyze score distribution
avg_trustworthiness = sum(
    obj.properties['scores']['trustworthiness'] 
    for obj in initial_response.objects
) / len(initial_response.objects)

# Second pass: refined search with dynamic threshold
refined_response = content.generate.near_text(
    query="corporate campaigns",
    filters=Filter.by_property("scores.trustworthiness").greater_than(avg_trustworthiness),
    grouped_task="What makes these campaigns particularly effective?",
    limit=10
)

Note: Always refer to OpenAI's documentation and Weaviate's documentation for the most up-to-date information.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Quick Start Guide: Weaviate Vector Database with OpenAI Integration

Connection Details

1. Database Connection

2. Search Operations

Vector Search

Hybrid Search (Vector + Keyword)

Filtered Search

3. Advanced Features

RAG (Retrieval Augmented Generation)

Single Prompt (Per-Result Generation)

Grouped Task (Single Generation for All Results)

Aggregate Data

4. Database Schema

5. Best Practices

General Recommendations

Optimized Retrieval Strategy

Advanced Filtering for Better RAG Results

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Quick Start Guide: Weaviate Vector Database with OpenAI Integration

Connection Details

1. Database Connection

2. Search Operations

Vector Search

Hybrid Search (Vector + Keyword)

Filtered Search

3. Advanced Features

RAG (Retrieval Augmented Generation)

Single Prompt (Per-Result Generation)

Grouped Task (Single Generation for All Results)

Aggregate Data

4. Database Schema

5. Best Practices

General Recommendations

Optimized Retrieval Strategy

Advanced Filtering for Better RAG Results