Setup guide for the Wikipedia content pipeline on Fedora-based Strix Halo systems. This provides a local Wikipedia knowledge base with keyword search, semantic search, and an MCP server for VS Code Copilot integration.
The Wikipedia MCP Server pipeline:
- Extracts articles from a Wikipedia XML dump into clean JSON files
- Indexes articles into PostgreSQL (metadata + sections) and OpenSearch (full-text + vector search)
- Serves content via a FastAPI MCP server and React web GUI
| Feature | Description |
|---|---|
| ~7 million articles | English Wikipedia text content (no media) |
| Keyword search (BM25) | Full-text search via OpenSearch |
| Semantic search | Vector similarity search using nomic-embed-text-v1.5 embeddings |
| Hybrid search | Combined keyword + semantic with Reciprocal Rank Fusion |
| Section-level indexing | Precise retrieval at section granularity |
| MCP server | VS Code Copilot integration via SSE transport (port 7000) |
| Web GUI | Browser-based search and article browsing (port 8080) |
| REST API | Direct HTTP search and article retrieval |
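The hybrid mode's Reciprocal Rank Fusion can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual implementation; the smoothing constant k=60 is the conventional default and may differ from what mcp_server.py uses.

```python
# Sketch of Reciprocal Rank Fusion (RRF): each retriever contributes
# 1/(k + rank) per document; documents ranked well by both rise to the top.
from collections import defaultdict

def rrf_fuse(keyword_hits, semantic_hits, k=60):
    """Fuse two ranked lists of document IDs; returns IDs best-first."""
    scores = defaultdict(float)
    for hits in (keyword_hits, semantic_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" appears near the top of both lists, so it wins:
print(rrf_fuse(["a", "b", "c"], ["b", "c", "d"]))  # ['b', 'c', 'a', 'd']
```

Note that RRF only needs ranks, not raw scores, which is why it can fuse BM25 scores and cosine similarities without any normalization step.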
┌─────────────────────┐
│ VS Code Copilot │
│ (MCP client) │
└────────┬────────────┘
│ SSE/JSON-RPC
┌────────▼────────────┐
│ MCP Server │
┌──────────┐ │ (FastAPI :7000) │ ┌──────────────────┐
│ Web GUI │──────►│ │──────►│ llama.cpp embed │
│ (:8080) │ │ mcp_server.py │ │ (:1235) │
└──────────┘ └──┬──────────────┬───┘ │ nomic-embed-text │
│ │ └──────────────────┘
┌───────▼──────┐ ┌─────▼──────┐
│ PostgreSQL │ │ OpenSearch │
│ (:5432) │ │ (:9200) │
│ articles │ │ wikipedia │
│ sections │ │ index │
│ redirects │ │ (k-NN) │
└──────────────┘ └─────────────┘
The following must already be set up via StrixHalo-Fedora-Setup.md and setup_strixhalo.py:
| Component | Setup Stage | Verified By |
|---|---|---|
| Fedora 43 installed | Manual (Phase 1) | uname -r → 6.18.4+ |
| Data disk mounted at /mnt/data | Manual (Phase 1) | df -h /mnt/data |
| Repo cloned | Manual (Phase 1) | ls $DEEPRED_REPO |
| Python venv with dependencies | python_venv | source $DEEPRED_VENV/bin/activate |
| PostgreSQL running (user=wiki, db=wikidb) | postgresql | pg_isready |
| OpenSearch running | opensearch | curl localhost:9200 |
| Embedding server (nomic-embed-text-v1.5) | llama_server | curl localhost:1235/v1/models |
| MCP server deployed | mcp_server | curl localhost:7000/health |
| Firewall configured | firewall | firewall-cmd --list-ports |
All commands in this guide assume you have sourced the environment:

source /mnt/data/DeepRedAI/deepred-env.sh

This sets:

- WIKI_DATA → /mnt/data/wikipedia
- DEEPRED_VENV → /mnt/data/venv
- DEEPRED_REPO → /mnt/data/DeepRedAI
| Component | Size |
|---|---|
| Wikipedia dump (compressed) | ~25 GB |
| Extracted JSON files | ~25 GB |
| PostgreSQL database | ~45 GB |
| OpenSearch index + embeddings | ~40 GB |
| Total under $WIKI_DATA | ~135 GB |
The setup_strixhalo.py script creates the PostgreSQL database and user but does not apply the Wikipedia-specific schema. Apply it now.
Option A — Via setup script (recommended if running setup for the first time):
sudo -E python3 scripts/setup_strixhalo.py --stage wikipedia_schema

Option B — Via process_and_index.py (standalone, no sudo required):
source $DEEPRED_VENV/bin/activate
python3 scripts/process_and_index.py --init

Option C — Manual SQL:
# Install pg_trgm extension (requires postgres superuser)
sudo -u postgres psql -d wikidb -c 'CREATE EXTENSION IF NOT EXISTS pg_trgm;'
# Apply schema as wiki user
PGPASSWORD=wiki psql -h localhost -U wiki -d wikidb -f $WIKI_DATA/schema.sql

Verify:
PGPASSWORD=wiki psql -h localhost -U wiki -d wikidb -c '\dt'
# Should show: articles, sections, redirects

This phase extracts articles from the Wikipedia XML dump into clean JSON files. If you already have extracted data from a previous setup (check $WIKI_DATA/extracted/), you can skip this phase entirely.
ls $WIKI_DATA/extracted/*.json 2>/dev/null | wc -l
# If 7000+, extraction is already done — skip to Phase 3

mkdir -p $WIKI_DATA/dumps
cd $WIKI_DATA/dumps
wget -c --timeout=60 --tries=10 \
  https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

This is ~25 GB and takes 1–2 hours. The -c flag resumes interrupted downloads.
source $DEEPRED_VENV/bin/activate
python3 $DEEPRED_REPO/scripts/extract_wikipedia.py

What happens:
- Reads the compressed XML dump directly (no manual decompression)
- Filters out redirects, disambiguation pages, and special pages
- Cleans wikitext: removes templates, HTML tags, formatting codes
- Removes non-content sections (References, Bibliography, etc.)
- Skips very short articles (< 100 characters)
- Parallelizes cleaning across all CPU cores
- Outputs JSON files with 1,000 articles each
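The first step above — reading the compressed dump directly — can be sketched with the standard library. This is illustrative only; the real logic (including wikitext cleaning and parallelization) lives in scripts/extract_wikipedia.py.

```python
# Stream pages out of a .bz2 Wikipedia dump with constant memory:
# bz2.open decompresses on the fly, iterparse never builds the full tree.
import bz2
import xml.etree.ElementTree as ET

def iter_pages(dump_path):
    """Yield (title, text) for each non-redirect page in the dump."""
    with bz2.open(dump_path, "rb") as fh:
        for _, elem in ET.iterparse(fh):
            if elem.tag.rsplit("}", 1)[-1] == "page":
                title = elem.findtext("{*}title")
                text = elem.findtext("{*}revision/{*}text")
                if title and text and not text.lstrip().upper().startswith("#REDIRECT"):
                    yield title, text
                elem.clear()  # release the processed subtree
```

The `{*}` wildcard (Python 3.8+) matches elements in any namespace, so the same code handles both namespaced MediaWiki exports and plain test fixtures.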
Performance benchmarks:
| System | Duration | Articles |
|---|---|---|
| 8-core 1.8 GHz Xeon VM | ~21 hours | 7.0M |
| 16-core 3 GHz Ryzen AI | ~3 hours | 7.0M |
Output:
$WIKI_DATA/extracted/
├── wikipedia_batch_00000.json
├── wikipedia_batch_00001.json
├── ...
└── wikipedia_batch_07036.json (~7,037 files)
Each line is one article:
{"id": "12", "title": "Article Title", "url": "https://en.wikipedia.org/wiki?curid=12", "text": "Clean article text..."}

Verify:
ls $WIKI_DATA/extracted/*.json | wc -l # Should be ~7000+
head -n1 $WIKI_DATA/extracted/wikipedia_batch_00000.json | python3 -m json.tool

Command-line options:
python3 scripts/extract_wikipedia.py --help
python3 scripts/extract_wikipedia.py --batch-size 500 # Smaller batch files
python3 scripts/extract_wikipedia.py --dump-file /path/to/custom/dump.xml.bz2

The script includes a self-test mode that generates a small synthetic Wikipedia dump in /tmp, runs the full extraction pipeline against it, and displays the results. This is useful for verifying the parser after making changes — no real data or existing files are touched.
# Run the test
python3 scripts/extract_wikipedia.py --test

The test dump contains 4 pages that exercise the main code paths:
- Test Article Alpha — wiki headings, categories, inline URLs
- Test Article Beta — bulleted lists, wiki tables, headings
- Redirect page — should be filtered out (#REDIRECT)
- Namespace page — should be filtered out (colon in title)
Expected output: 2 extracted articles (Alpha and Beta). Each article's ID, title, URL, text length, and a 500-character preview are printed to stdout.
At the end the script logs the temp directory path. Clean up with:
# Remove test output (substitute the actual path from the log)
python3 scripts/extract_wikipedia.py --test-cleanup /tmp/wiki_extract_test_<suffix>

The cleanup command refuses to delete any directory that doesn't contain wiki_extract_test_ in its name.
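The refusal guard amounts to a name check before deletion. A minimal sketch (the actual check in scripts/extract_wikipedia.py may differ in detail):

```python
# Safety guard: only directories carrying the test marker may be deleted.
import shutil
from pathlib import Path

TEST_MARKER = "wiki_extract_test_"

def cleanup_test_dir(path):
    """Delete a test output directory; refuse anything without the marker."""
    p = Path(path)
    if TEST_MARKER not in p.name:
        raise ValueError(f"refusing to delete non-test directory: {p}")
    shutil.rmtree(p, ignore_errors=True)
```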
This is the longest phase. It reads all extracted JSON files, stores articles/sections in PostgreSQL, generates embeddings via the llama.cpp embedding server, and indexes everything to OpenSearch with k-NN vectors.
source $DEEPRED_VENV/bin/activate
# Verify all services are up and schema is ready
python3 scripts/process_and_index.py --test

This runs connectivity tests for:
- PostgreSQL connection and schema verification
- OpenSearch connection and k-NN plugin
- Embedding server (llama.cpp on port 1235)
- Full end-to-end pipeline with dummy data
All tests must pass before proceeding.
source $DEEPRED_VENV/bin/activate
python3 $DEEPRED_REPO/scripts/process_and_index.py

What happens:
- Reads each extracted JSON file line by line
- Splits each article into sections (based on markdown headers)
- Inserts article metadata and section text into PostgreSQL
- Generates embeddings for each section via the embedding server (port 1235)
- Bulk indexes documents with embeddings into OpenSearch
- Saves progress to a checkpoint file every 10 batches
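The section split in step 2 can be sketched as follows. This is a simplified illustration of splitting on markdown-style headers; the header pattern and the "Introduction" label for the leading chunk are assumptions, not necessarily what process_and_index.py does.

```python
# Split article text into (section_title, section_text) pairs on
# markdown-style headers; text before the first header becomes the intro.
import re

HEADER = re.compile(r"^#{1,6}\s+(.*)")

def split_sections(text):
    sections, title, buf = [], "Introduction", []
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            if buf:
                sections.append((title, "\n".join(buf).strip()))
            title, buf = m.group(1).strip(), []
        else:
            buf.append(line)
    if buf:
        sections.append((title, "\n".join(buf).strip()))
    return sections

print(split_sections("Intro text.\n## History\nFounded long ago."))
# → [('Introduction', 'Intro text.'), ('History', 'Founded long ago.')]
```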
Checkpoint/Resume: If interrupted (Ctrl+C, crash, reboot), simply run the same command again. The script automatically resumes from the last checkpoint.
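The checkpoint pattern behind this resume behavior can be sketched as below. The filename and JSON shape here are assumptions for illustration; the script's actual checkpoint format may differ.

```python
# Checkpoint save/load sketch: record which batch files are done, and
# write atomically so a crash never leaves a half-written checkpoint.
import json
import os

def load_done(path="processing_checkpoint.json"):
    """Return the set of completed batch names, or empty on first run."""
    if os.path.exists(path):
        with open(path) as fh:
            return set(json.load(fh)["completed_batches"])
    return set()

def save_done(done, path="processing_checkpoint.json"):
    """Write to a temp file, then rename (atomic on POSIX)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"completed_batches": sorted(done)}, fh)
    os.replace(tmp, path)
```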
Live Querying: The MCP server, search API, and VS Code Copilot integration all work while processing is still running. Results will be partial (only articles indexed so far) but grow as processing continues. This is useful for verifying the pipeline is working correctly without waiting for the full ~65-hour run.
# Check progress
python3 scripts/process_and_index.py --status
# Reset and start over
python3 scripts/process_and_index.py --reset

Performance: Embedding generation is the bottleneck. The script sends concurrent requests to utilize all parallel server slots.
| Configuration | Throughput | Duration (~7M articles) |
|---|---|---|
| Strix Halo only (local embedding) | ~30 articles/sec | ~65 hours |
| Strix Halo + remote A4000 (dual-endpoint) | ~75–80 articles/sec | ~25 hours |
When REMOTE_HOST is set and reachable, batches are distributed round-robin across local and remote embedding servers.
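The round-robin distribution can be sketched in a few lines. Endpoint URLs below are assumptions built from the defaults in this guide (local port 1235, remote host A4000AI); the real dispatcher in process_and_index.py also handles failures and concurrency.

```python
# Round-robin assignment of embedding batches across endpoints:
# cycle() repeats the endpoint list indefinitely, zip() pairs them up.
from itertools import cycle

ENDPOINTS = [
    "http://localhost:1235/v1/embeddings",  # local Strix Halo server
    "http://A4000AI:1235/v1/embeddings",    # remote GPU (REMOTE_HOST), assumed name
]

def assign_batches(batches, endpoints=ENDPOINTS):
    """Pair each batch with the next endpoint in rotation."""
    return list(zip(batches, cycle(endpoints)))

# Batches alternate local / remote / local / remote ...
print(assign_batches(["b0", "b1", "b2", "b3"]))
```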
Tip: The embedding server container config is at /etc/containers/systemd/llama-server-embed.container. Key tuning parameters:

- --batch-size 32768 — logical batch size (must hold all tokens in a single API request: 16 texts × up to 2048 tokens)
- --ubatch-size 2048 — physical batch size (must be ≥ --ctx-size so any single text fits in one GPU call)
- --cont-batching — enables continuous batching across parallel slots
- --parallel — auto-detected (defaults to 4 slots on Strix Halo)

After changing, reload with:
sudo systemctl daemon-reload && sudo systemctl restart llama-server-embed
Monitor progress:
# In another terminal, watch the checkpoint file
watch -n 30 cat $WIKI_DATA/processing_checkpoint.json
# Or check database and index growth
PGPASSWORD=wiki psql -h localhost -U wiki -d wikidb -c "SELECT count(*) FROM articles;"
curl -s localhost:9200/_cat/indices?v

Monitor GPU utilization:
# radeontop — real-time GPU stats (installed via setup_strixhalo.py)
# Note: "Unknown Radeon card" warning is benign on Strix Halo — data is correct
radeontop
# Alternative: kernel sysfs (no install needed)
watch -n 1 cat /sys/class/drm/card0/device/gpu_busy_percent

| Metric | Healthy | Problem |
|---|---|---|
| Graphics pipe | 50–80% | 0% (CPU fallback) |
| VRAM | ~370M used | 0M (model not loaded) |
| Shader Clock | ~2.7 GHz (boosted) | ~200 MHz (idle) |
After processing completes:
# PostgreSQL article count (should be ~7 million)
PGPASSWORD=wiki psql -h localhost -U wiki -d wikidb -c "SELECT count(*) FROM articles;"
# PostgreSQL section count
PGPASSWORD=wiki psql -h localhost -U wiki -d wikidb -c "SELECT count(*) FROM sections;"
# OpenSearch index status
curl -s localhost:9200/wikipedia/_count | python3 -m json.tool
# Test a search
curl -s -X POST localhost:7000/mcp/search \
-H 'Content-Type: application/json' \
  -d '{"query": "Apollo 11 moon landing", "mode": "hybrid", "limit": 3}' | python3 -m json.tool

All services are managed by systemd and start automatically on boot.
The MCP server (scripts/mcp_server.py) provides:
- POST /mcp/search — keyword, semantic, or hybrid search
- GET /mcp/article/{id} — retrieve article by database ID
- GET /mcp/article?title=... — retrieve article by title
- GET /health — health check for all backends
- GET /sse — Server-Sent Events endpoint for MCP protocol
- POST /messages — MCP message handler
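A minimal Python client for the search endpoint, using only the standard library. The request body mirrors the curl examples in this guide; the timeout value and error handling are illustrative assumptions.

```python
# Stdlib client for POST /mcp/search on the MCP server (port 7000).
import json
import urllib.request

def build_payload(query, mode="hybrid", limit=3):
    """Request body for /mcp/search (modes: keyword, semantic, hybrid)."""
    return {"query": query, "mode": mode, "limit": limit}

def search(query, mode="hybrid", limit=3, base="http://localhost:7000"):
    req = urllib.request.Request(
        f"{base}/mcp/search",
        data=json.dumps(build_payload(query, mode, limit)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

Usage: `search("Apollo 11 moon landing")` returns the parsed JSON response; results are partial while Phase 3 processing is still running.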
# Service management
systemctl status mcp
systemctl restart mcp
journalctl -u mcp -f
# Quick health check
curl localhost:7000/health

The service file is generated by setup_strixhalo.py at /etc/systemd/system/mcp.service. See also services/mcp.service in the repo for the reference template.
The React web GUI provides a browser-based search interface. It communicates with the MCP server API on port 7000.
The web app source is in the webapp/ directory. The web_gui stage of setup_strixhalo.py handles everything:
- Installs Node.js and npm via dnf
- Copies webapp source to $WIKI_DATA/frontend/
- Runs npm install and npm run build
- Applies SELinux labels for node_modules/.bin
- Deploys and enables the wiki-gui systemd service
# Automatic (recommended)
sudo -E python3 scripts/setup_strixhalo.py --stage web_gui
# Or rebuild manually
cd $WIKI_DATA/frontend
sudo npm install
sudo npm run build
sudo chown -R wiki:wiki $WIKI_DATA/frontend
sudo systemctl restart wiki-gui

# Service management
systemctl status wiki-gui
systemctl restart wiki-gui
journalctl -u wiki-gui -f

Access at: http://<hostname>:8080
# Check cluster health
curl localhost:9200/_cluster/health?pretty
# Check wikipedia index
curl localhost:9200/wikipedia/_count
curl 'localhost:9200/wikipedia/_search?size=1&pretty'

The embedding server runs as a Podman Quadlet container (llama-server-embed):
systemctl status llama-server-embed
curl localhost:1235/v1/modelsCurrent tuned configuration:
--ctx-size 2048 # Model context window (nomic-bert native)
--batch-size 32768 # Logical batch size (16 texts × 2048 tokens per API call)
--ubatch-size 2048 # Physical batch size (≥ ctx-size for single-text processing)
--cont-batching # Continuous batching across parallel slots
--parallel auto # Auto-detects 4 slots on Strix Halo
--flash-attn on # Flash attention for GPU efficiency
Add to your VS Code settings.json:
{
"github.copilot.chat.mcp": {
"servers": {
"wikipedia": {
"type": "sse",
"url": "http://localhost:7000/sse"
}
}
}
}

If the MCP server runs on a different machine on your LAN:
{
"github.copilot.chat.mcp": {
"servers": {
"wikipedia": {
"type": "sse",
"url": "http://192.168.x.y:7000/sse"
}
}
}
}

Once connected, Copilot can use these tools:
| Tool | Description |
|---|---|
| search_wikipedia | Search articles (keyword, semantic, or hybrid) |
| get_article | Retrieve full article by title |
| get_article_by_id | Retrieve full article by database ID |
| health_check | Check server and backend health |
All variables are set automatically by deepred-env.sh. To change one, either export it before sourcing the script or export it manually afterwards.
| Variable | Default | Description |
|---|---|---|
| DEEPRED_ROOT | /mnt/data | Base data directory |
| WIKI_DATA | $DEEPRED_ROOT/wikipedia | Wikipedia data directory |
| DEEPRED_VENV | $DEEPRED_ROOT/venv | Python virtual environment |
| PG_HOST | localhost | PostgreSQL host |
| PG_PORT | 5432 | PostgreSQL port |
| PG_USER | wiki | PostgreSQL user |
| PG_PASSWORD | wiki | PostgreSQL password |
| PG_DATABASE | wikidb | PostgreSQL database name |
| OS_HOST | localhost | OpenSearch host |
| OS_PORT | 9200 | OpenSearch port |
| INFERENCE_HOST | localhost | Inference server host (LLM + embedding) |
| INFERENCE_PORT | 1234 | LLM inference server port |
| EMBEDDING_PORT | 1235 | Embedding server port |
| REMOTE_HOST | (blank) | Optional remote GPU server hostname/IP (see Remote GPU Server) |
| REMOTE_LLM_PORT | 1234 | LLM port on the remote server |
| REMOTE_EMBED_PORT | 1235 | Embedding port on the remote server |
-- articles: one row per Wikipedia article
CREATE TABLE articles (
id SERIAL PRIMARY KEY,
title TEXT NOT NULL,
content TEXT,
url TEXT,
wikipedia_page_id INTEGER,
has_temporal_info BOOLEAN DEFAULT FALSE,
earliest_date DATE,
latest_date DATE,
created_at TIMESTAMP DEFAULT NOW()
);
-- sections: one row per article section (Introduction, History, etc.)
CREATE TABLE sections (
id SERIAL PRIMARY KEY,
article_id INTEGER REFERENCES articles(id) ON DELETE CASCADE,
section_title TEXT,
section_text TEXT,
section_order INTEGER
);
-- redirects: Wikipedia redirect mappings
CREATE TABLE redirects (
source_title TEXT PRIMARY KEY,
target_title TEXT NOT NULL
);

The wikipedia index uses:

- English text analyzer for title, section_title, and text fields
- k-NN vector field (embedding, 768 dimensions, HNSW/cosine, Lucene engine)
- Single shard, no replicas (single-node deployment)
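An index body matching that description would look roughly like the following. This is a sketch assembled from the settings listed above using standard OpenSearch k-NN mapping syntax; field names beyond those listed, and HNSW build parameters, are assumptions.

```python
# Sketch of the wikipedia index body: English analyzer on text fields,
# a 768-dim Lucene HNSW k-NN vector, one shard, no replicas.
WIKIPEDIA_INDEX_BODY = {
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "knn": True,  # enable the k-NN plugin for this index
        }
    },
    "mappings": {
        "properties": {
            "title":         {"type": "text", "analyzer": "english"},
            "section_title": {"type": "text", "analyzer": "english"},
            "text":          {"type": "text", "analyzer": "english"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "lucene",
                },
            },
        }
    },
}
```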
If a remote GPU inference server is available (e.g. an NVIDIA A4000 system — see A4000-Fedora-Setup.md), it can offload embedding and LLM inference from the primary StrixHalo system. This is optional — all pipeline scripts work with local services only.
Set the REMOTE_HOST environment variable to the hostname or IP address of the remote system. The variable is defined in deepred-env.sh and defaults to blank (disabled).
Temporarily (current session only):
export REMOTE_HOST="A4000AI"
source /mnt/data/DeepRedAI/deepred-env.sh

Permanently (auto-load on login) — add to ~/.bashrc before the deepred-env.sh source line:
# ── DeepRedAI environment
export DEEPRED_ROOT="/mnt/data"
export REMOTE_HOST="A4000AI" # ← enable remote GPU server
[ -f "$DEEPRED_ROOT/DeepRedAI/deepred-env.sh" ] && source "$DEEPRED_ROOT/DeepRedAI/deepred-env.sh"After editing, log out and back in (or source ~/.bashrc) to apply. The env summary will confirm:
DeepRedAI environment loaded
...
REMOTE_HOST = A4000AI (LLM :1234, embed :1235)
If the remote services run on non-default ports, override before sourcing:
export REMOTE_HOST="A4000AI"
export REMOTE_LLM_PORT=1234 # default
export REMOTE_EMBED_PORT=1235   # default

Use the provided test script to verify the remote server is reachable and produces identical embeddings to the local server:
source $DEEPRED_VENV/bin/activate
python3 $DEEPRED_REPO/scripts/test_remote.py

The script performs:

- Availability check — calls the remote embedding and LLM /v1/models endpoints
- Embedding comparison — generates embeddings for test sentences on both local and remote servers, then verifies they are identical (cosine similarity ≈ 1.0)
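The embedding comparison reduces to a cosine-similarity check between the two vectors. A minimal sketch (the tolerance value is an assumption, not necessarily the one test_remote.py uses):

```python
# Local and remote embeddings for the same input should be near-identical
# if both servers run the same model at the same precision.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def embeddings_match(local_vec, remote_vec, tol=1e-4):
    """True if cosine similarity is within tol of 1.0."""
    return abs(1.0 - cosine(local_vec, remote_vec)) < tol
```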
All tests must pass before using the remote server for pipeline processing.
# Check service logs
journalctl -u mcp -n 50
# Common issues:
# 1. PostgreSQL not ready → systemctl status postgresql
# 2. OpenSearch not ready → curl localhost:9200
# 3. Schema not applied → python3 scripts/process_and_index.py --init
# 4. Wrong password → check PG_PASSWORD env var vs PostgreSQL user password

# Check container status
systemctl status llama-server-embed
podman logs llama-server-embed
# Test embedding generation
curl -s -X POST localhost:1235/v1/embeddings \
-H 'Content-Type: application/json' \
  -d '{"model":"text-embedding-nomic-embed-text-v1.5@f16","input":"test"}' | python3 -m json.tool

If you see "input (N tokens) is too large to process" in the server logs:
# Increase --batch-size and --ubatch-size in the container config
sudo vi /etc/containers/systemd/llama-server-embed.container
# Set --batch-size 4096 --ubatch-size 4096 (must be ≥ max tokens per text)
sudo systemctl daemon-reload && sudo systemctl restart llama-server-embed

If the wikipedia index doesn't exist after processing:
# Check index list
curl localhost:9200/_cat/indices?v
# Process and index will create the index automatically
# Reset checkpoint if needed and re-run
python3 scripts/process_and_index.py --reset
python3 scripts/process_and_index.py

The bottleneck is embedding generation. The script sends concurrent requests to utilize all server slots. Verify the GPU-accelerated embedding server is running with --cont-batching:
# Check server config
sudo journalctl -u llama-server-embed | grep -E 'n_parallel|n_batch|cont'
# Should show n_parallel = 4, n_batch = 4096
# If not, update /etc/containers/systemd/llama-server-embed.container
# Verify GPU is being used (should respond in <100ms)
time curl -s -X POST localhost:1235/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{"model":"text-embedding-nomic-embed-text-v1.5@f16","input":"test query"}' > /dev/null
# If > 1 second, check GPU access
systemctl status llama-server-embed

# View checkpoint status
python3 scripts/process_and_index.py --status
# Resume from checkpoint (automatic)
python3 scripts/process_and_index.py
# Start fresh (deletes checkpoint and recreates index)
python3 scripts/process_and_index.py --reset

| Script | Purpose |
|---|---|
| scripts/extract_wikipedia.py | Extract articles from Wikipedia XML dump to JSON |
| scripts/process_and_index.py | Index JSON articles to PostgreSQL and OpenSearch |
| scripts/mcp_server.py | FastAPI MCP server for search and retrieval |
| scripts/setup_strixhalo.py | Automated system setup (includes wikipedia_schema stage) |
python3 scripts/extract_wikipedia.py [--dump-file PATH] [--output-dir PATH] [--batch-size N]
python3 scripts/extract_wikipedia.py --test # Run test extraction with synthetic data
python3 scripts/extract_wikipedia.py --test-cleanup DIR  # Remove test output directory

python3 scripts/process_and_index.py            # Run full processing (resumes from checkpoint)
python3 scripts/process_and_index.py --init # Create database schema only
python3 scripts/process_and_index.py --test # Run connectivity tests
python3 scripts/process_and_index.py --status # Show checkpoint progress
python3 scripts/process_and_index.py --reset # Delete checkpoint, start fresh
python3 scripts/process_and_index.py --provider local # Use CPU embeddings (slow)