WIP: Introduce typesense search #7877
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
export TYPESENSE_API_KEY=test_api_key | ||
export TYPESENSE_HOSTNAME=localhost | ||
export TYPESENSE_ORIGIN=http://localhost:8108 | ||
export TYPESENSE_PORT=8108 | ||
export TYPESENSE_PROTOCOL=http | ||
|
||
export TYPESENSE_COLLECTION_NAME="mm_product_docs_1745019244" | ||
export DOCS_SITE_ORIGIN="http://localhost:8000" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
.env |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
# Using Typesense for documentation search | ||
|
||
With three terminals open, run the following in the first two: | ||
- `make livehtml` - Run the Sphinx docs server | ||
- `cd typesense && docker compose up` - Run the Typesense server and Typesense dashboard. You can access the Typesense dashboard at http://localhost:8001 | ||
|
||
After those are up and running, run this in the third terminal: | ||
- `cd typesense && docker compose --profile optional up scraper` - Run scraper to populate Typesense. The process will exit once complete. | ||
|
||
After running the scraper, we need to post-process the indexed data so that search result URLs are relative to the docs site. | ||
- `cd typesense && ./post-process-typesense-data.sh` | ||
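
As a rough illustration of what this post-processing step does to each exported document, here is a minimal Python sketch. The function name `make_relative` and the sample document are hypothetical; the three URL fields are the ones the cleaning script in this PR rewrites.

```python
import json

# URL fields that the cleaning script in this PR rewrites.
URL_FIELDS = ("url", "url_without_anchor", "url_without_variables")


def make_relative(doc: dict, origin: str) -> dict:
    """Return a copy of doc with the docs-site origin stripped from URL fields."""
    cleaned = dict(doc)
    for key in URL_FIELDS:
        value = cleaned.get(key)
        if isinstance(value, str) and value.startswith(origin):
            cleaned[key] = value[len(origin):]
    return cleaned


# Hypothetical document, shaped like one line of documents.jsonl.
sample = {
    "url": "http://localhost:8000/guides/install.html#docker",
    "url_without_anchor": "http://localhost:8000/guides/install.html",
    "anchor": "docker",
}
print(json.dumps(make_relative(sample, "http://localhost:8000")))
```

Fields that do not start with the origin, and fields that are not URLs at all, pass through unchanged.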
|
||
If you'd like to re-index the Typesense collection, you can run: | ||
|
||
```sh | ||
cd typesense | ||
|
||
# Optionally delete all existing documents in the collection. Typesense will de-duplicate docs naturally, but this reset operation forces it to remove metadata from previous runs that we may want to remove as we change the schema/filters. | ||
./scripts/reset-typesense-collection.sh | ||
|
||
# Re-run scraper to populate Typesense | ||
docker compose --profile optional up scraper | ||
``` | ||
|
||
To export the index into a jsonl file, run: | ||
|
||
```sh | ||
cd typesense | ||
|
||
./scripts/download-typesense-collection.sh | ||
``` | ||
|
||
The output of the command will be a `documents.jsonl` file in the current directory. | ||
|
||
--- | ||
|
||
The scripts mentioned above support the following environment variables for configuration: | ||
|
||
- `TYPESENSE_API_KEY` - Defaults to `test_api_key` | ||
- `TYPESENSE_ORIGIN` - Defaults to `http://localhost:8108` | ||
- `TYPESENSE_HOSTNAME` - Defaults to `localhost` | ||
- `TYPESENSE_PORT` - Defaults to `8108` | ||
- `TYPESENSE_PROTOCOL` - Defaults to `http` |
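
The defaults above follow the shell's `${VAR:-default}` rule, where an unset or empty variable falls back to the default. A minimal Python sketch of the same resolution, assuming a hypothetical helper named `resolve_settings`:

```python
import os
from typing import Mapping

# Defaults as listed in the README in this PR.
DEFAULTS = {
    "TYPESENSE_API_KEY": "test_api_key",
    "TYPESENSE_ORIGIN": "http://localhost:8108",
    "TYPESENSE_HOSTNAME": "localhost",
    "TYPESENSE_PORT": "8108",
    "TYPESENSE_PROTOCOL": "http",
}


def resolve_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Mimic ${VAR:-default}: unset or empty values fall back to the default."""
    return {name: env.get(name) or default for name, default in DEFAULTS.items()}


# Example: override the port, and pass an empty API key to show the fallback.
settings = resolve_settings({"TYPESENSE_PORT": "9108", "TYPESENSE_API_KEY": ""})
print(settings["TYPESENSE_PORT"], settings["TYPESENSE_API_KEY"])
```

Note that the scripts read `TYPESENSE_ORIGIN`, while the scraper service reads the hostname, port, and protocol separately, so overriding one group does not change the other.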
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
{ | ||
"index_name": "mm_product_docs", | ||
"allowed_domains": [ | ||
"localhost", | ||
"mattermost-docs-preview-pulls.s3-website-us-east-1.amazonaws.com" | ||
], | ||
"start_urls": [ | ||
{ | ||
"url": "http://localhost:8000", | ||
"tags": [] | ||
} | ||
], | ||
"sitemap_urls": [ | ||
"http://localhost:8000/sitemap.xml" | ||
], | ||
"selectors": { | ||
"default": { | ||
"lvl0": "article h1", | ||
"lvl1": "article h2", | ||
"lvl2": "article h3", | ||
"lvl3": "article h4", | ||
"lvl4": "article h5", | ||
"lvl5": "article h6", | ||
"text": "article p, article li" | ||
} | ||
}, | ||
Comment on lines +7 to +26

We can add selectors for config settings, and section off user/admin docs. Below is an example from the docsearch docs for sectioning off different content:
{
"start_urls": [
{
"url": "http://www.example.com/docs/faq/",
"selectors_key": "faq"
},
{
"url": "http://www.example.com/docs/"
}
],
"selectors": {
"default": {
"lvl0": ".docs h1",
"lvl1": ".docs h2",
"lvl2": ".docs h3",
"lvl3": ".docs h4",
"lvl4": ".docs h5",
"text": ".docs p, .docs li"
},
"faq": {
"lvl0": ".faq h1",
"lvl1": ".faq h2",
"lvl2": ".faq h3",
"lvl3": ".faq h4",
"lvl4": ".faq h5",
"text": ".faq p, .faq li"
}
}
} |
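
To make the suggestion concrete, here is a small Python sketch of how a `selectors_key` on a start URL could route pages to a named selector set. It is a simplified model, not the scraper's actual implementation: the helper `selectors_for` is hypothetical, and plain longest-prefix matching is an assumption where the real scraper may match URLs differently.

```python
def selectors_for(config: dict, page_url: str) -> dict:
    """Pick the selector set for page_url (hypothetical helper).

    Assumption: the longest start URL that prefixes page_url wins, and
    entries without a selectors_key fall back to "default".
    """
    best_key, best_len = "default", -1
    for entry in config["start_urls"]:
        prefix = entry["url"]
        if page_url.startswith(prefix) and len(prefix) > best_len:
            best_key = entry.get("selectors_key", "default")
            best_len = len(prefix)
    return config["selectors"][best_key]


# Trimmed version of the docsearch example quoted in the comment above.
config = {
    "start_urls": [
        {"url": "http://www.example.com/docs/faq/", "selectors_key": "faq"},
        {"url": "http://www.example.com/docs/"},
    ],
    "selectors": {
        "default": {"lvl0": ".docs h1", "text": ".docs p, .docs li"},
        "faq": {"lvl0": ".faq h1", "text": ".faq p, .faq li"},
    },
}

print(selectors_for(config, "http://www.example.com/docs/faq/billing.html")["lvl0"])
print(selectors_for(config, "http://www.example.com/docs/setup.html")["lvl0"])
```

For this PR, the same shape could carry separate selector sets for config-setting pages and for user versus admin docs, keyed by their start URLs.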
||
"custom_settings": { | ||
"token_separators": [ | ||
"-" | ||
], | ||
"symbols_to_index": [ | ||
"@" | ||
] | ||
}, | ||
"strip_chars": " .,;:#", | ||
"stop_urls": [], | ||
"scrape_start_urls": false, | ||
"nb_hits": 64205 | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
services: | ||
  typesense: | ||
    image: typesense/typesense:0.24.0 | ||
    environment: | ||
      - TYPESENSE_API_KEY=test_api_key | ||
      - TYPESENSE_DATA_DIR=/data | ||
      - TYPESENSE_ENABLE_CORS=true | ||
    ports: | ||
      - "8108:8108" | ||
    volumes: | ||
      - typesense-data:/data | ||
|
||
  typesense-dashboard: | ||
    image: ghcr.io/bfritscher/typesense-dashboard:latest | ||
    ports: | ||
      - "8001:80" | ||
|
||
  scraper: | ||
    image: typesense/docsearch-scraper | ||
    profiles: | ||
      - optional | ||
    volumes: | ||
      - ./config.json:/app/config.json | ||
    network_mode: "host" | ||
    environment: | ||
      - CONFIG=/app/config.json | ||
      - TYPESENSE_DATA_DIR=/data | ||
      - TYPESENSE_ENABLE_CORS=true | ||
      - TYPESENSE_API_KEY=${TYPESENSE_API_KEY:-test_api_key} | ||
      - TYPESENSE_HOST=${TYPESENSE_HOSTNAME:-localhost} | ||
      - TYPESENSE_PORT=${TYPESENSE_PORT:-8108} | ||
      - TYPESENSE_PROTOCOL=${TYPESENSE_PROTOCOL:-http} | ||
|
||
volumes: | ||
  typesense-data: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
set -e | ||
|
||
export TYPESENSE_ORIGIN="${TYPESENSE_ORIGIN:-http://localhost:8108}" | ||
export TYPESENSE_API_KEY="${TYPESENSE_API_KEY:-test_api_key}" | ||
export DOCS_SITE_ORIGIN="${DOCS_SITE_ORIGIN:-http://localhost:8000}" | ||
export TYPESENSE_COLLECTION_NAME="${TYPESENSE_COLLECTION_NAME:-mm_product_docs}" | ||
|
||
echo "Downloading typesense collection" | ||
./scripts/download-typesense-collection.sh | ||
|
||
echo "Cleaning relative links in typesense collection" | ||
./scripts/clean-relative-links-in-typesense-collection.sh | ||
|
||
echo "Importing typesense collection" | ||
./scripts/import-typesense-collection.sh | ||
|
||
echo "Making alias for typesense collection" | ||
./scripts/make-alias-for-typesense-collection.sh |
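
The ordering matters here: because of `set -e`, a failed download or clean step stops the run before anything is imported or aliased. A minimal Python sketch of that fail-fast sequencing, using a hypothetical `run_steps` helper and a dry-run runner so that nothing is actually executed:

```python
import subprocess
from typing import Callable, Sequence

# The four steps, in the order the post-processing script runs them.
STEPS = [
    ("Downloading typesense collection",
     "./scripts/download-typesense-collection.sh"),
    ("Cleaning relative links in typesense collection",
     "./scripts/clean-relative-links-in-typesense-collection.sh"),
    ("Importing typesense collection",
     "./scripts/import-typesense-collection.sh"),
    ("Making alias for typesense collection",
     "./scripts/make-alias-for-typesense-collection.sh"),
]


def run_steps(steps, run: Callable[[Sequence[str]], int]) -> list:
    """Run steps in order, stopping at the first non-zero exit (like set -e)."""
    completed = []
    for label, script in steps:
        print(label)
        if run([script]) != 0:
            raise RuntimeError(f"step failed: {script}")
        completed.append(script)
    return completed


def shell_runner(cmd: Sequence[str]) -> int:
    """Real runner: execute the script and return its exit code."""
    return subprocess.run(cmd).returncode


# Dry run: pretend every script exits 0. Pass shell_runner to run for real.
run_steps(STEPS, run=lambda cmd: 0)
```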
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
input_file="documents.jsonl" | ||
output_file="processed_documents.jsonl" | ||
|
||
DOCS_SITE_ORIGIN="${DOCS_SITE_ORIGIN:-http://localhost:8000}" | ||
|
||
# Strip the docs site origin from URL fields so search results use relative links. | ||
python3 -c " | ||
import sys, json | ||
|
||
for line in sys.stdin: | ||
    try: | ||
        doc = json.loads(line) | ||
        for key in ['url', 'url_without_anchor', 'url_without_variables']: | ||
            if key in doc and doc[key].startswith('${DOCS_SITE_ORIGIN}'): | ||
                doc[key] = doc[key][len('${DOCS_SITE_ORIGIN}'):] | ||
        print(json.dumps(doc)) | ||
    except Exception: | ||
        print(f'// skipped invalid line: {line.strip()}', file=sys.stderr) | ||
" < "$input_file" > "$output_file" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
TYPESENSE_ORIGIN="${TYPESENSE_ORIGIN:-http://localhost:8108}" | ||
TYPESENSE_API_KEY="${TYPESENSE_API_KEY:-test_api_key}" | ||
|
||
curl "${TYPESENSE_ORIGIN}/collections" \ | ||
-X POST \ | ||
-H "Content-Type: application/json" \ | ||
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \ | ||
-d '{ | ||
"name": "mm_product_docs", | ||
"fields": [ | ||
{"name": "category", "type": "string" }, | ||
{"name": "weight", "type": "int32" } | ||
], | ||
"default_sorting_field": "weight" | ||
}' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
TYPESENSE_ORIGIN="${TYPESENSE_ORIGIN:-http://localhost:8108}" | ||
TYPESENSE_API_KEY="${TYPESENSE_API_KEY:-test_api_key}" | ||
TYPESENSE_COLLECTION_NAME="${TYPESENSE_COLLECTION_NAME:-mm_product_docs}" | ||
|
||
curl -X DELETE "${TYPESENSE_ORIGIN}/collections/${TYPESENSE_COLLECTION_NAME}" \ | ||
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
TYPESENSE_ORIGIN="${TYPESENSE_ORIGIN:-http://localhost:8108}" | ||
TYPESENSE_API_KEY="${TYPESENSE_API_KEY:-test_api_key}" | ||
TYPESENSE_COLLECTION_NAME="${TYPESENSE_COLLECTION_NAME:-mm_product_docs}" | ||
|
||
curl "${TYPESENSE_ORIGIN}/collections/${TYPESENSE_COLLECTION_NAME}/documents/export" \ | ||
-X GET \ | ||
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \ | ||
-o documents.jsonl |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
TYPESENSE_ORIGIN="${TYPESENSE_ORIGIN:-http://localhost:8108}" | ||
TYPESENSE_API_KEY="${TYPESENSE_API_KEY:-test_api_key}" | ||
TYPESENSE_COLLECTION_NAME="${TYPESENSE_COLLECTION_NAME:-mm_product_docs}" | ||
|
||
curl -X POST "${TYPESENSE_ORIGIN}/collections/${TYPESENSE_COLLECTION_NAME}/documents/import?action=upsert" \ | ||
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \ | ||
-H "Content-Type: text/plain" \ | ||
--data-binary @processed_documents.jsonl |
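
One thing worth noting about the import call: Typesense answers a bulk import with one JSON object per line, each carrying a `success` flag, so individual documents can fail even when the request as a whole returns successfully. A minimal Python sketch of checking that output follows; the function name `summarize_import` and the sample response are illustrative, not taken from this PR.

```python
import json


def summarize_import(response_text: str):
    """Count successful lines and collect error messages from an import response."""
    ok, errors = 0, []
    for line in response_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        result = json.loads(line)
        if result.get("success"):
            ok += 1
        else:
            errors.append(result.get("error", "unknown error"))
    return ok, errors


# Illustrative response body: two documents imported, one rejected.
sample_response = "\n".join([
    '{"success": true}',
    '{"success": false, "error": "Bad JSON.", "document": "[bad doc]"}',
    '{"success": true}',
])
ok, errors = summarize_import(sample_response)
print(f"{ok} imported, {len(errors)} failed: {errors}")
```

Piping the curl output through a check like this would let the post-processing script fail loudly instead of silently dropping documents.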
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
TYPESENSE_ORIGIN="${TYPESENSE_ORIGIN:-http://localhost:8108}" | ||
TYPESENSE_API_KEY="${TYPESENSE_API_KEY:-test_api_key}" | ||
TYPESENSE_COLLECTION_NAME="${TYPESENSE_COLLECTION_NAME:-mm_product_docs}" | ||
|
||
curl "${TYPESENSE_ORIGIN}/aliases/mm_product_docs" -X PUT \ | ||
-H "Content-Type: application/json" \ | ||
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d "{ | ||
\"collection_name\": \"${TYPESENSE_COLLECTION_NAME}\" | ||
}" |
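
The alias step is what lets a suffixed collection (such as the `mm_product_docs_1745019244` name in the sample `.env`, whose suffix looks like a Unix timestamp) be served under the stable name `mm_product_docs`. The Python sketch below builds the same `PUT /aliases/...` request the script sends, without sending it; `timestamped_name` and `build_alias_request` are hypothetical helpers.

```python
import json
import time
from typing import Optional


def timestamped_name(base: str, now: Optional[float] = None) -> str:
    """Append a Unix timestamp to base, e.g. mm_product_docs_1745019244."""
    return f"{base}_{int(time.time() if now is None else now)}"


def build_alias_request(origin: str, alias: str, collection: str, api_key: str) -> dict:
    """Describe the alias upsert as plain data (nothing is sent over the network)."""
    return {
        "method": "PUT",
        "url": f"{origin}/aliases/{alias}",
        "headers": {
            "Content-Type": "application/json",
            "X-TYPESENSE-API-KEY": api_key,
        },
        "body": json.dumps({"collection_name": collection}),
    }


collection = timestamped_name("mm_product_docs", now=1745019244)
request = build_alias_request(
    "http://localhost:8108", "mm_product_docs", collection, "test_api_key"
)
print(request["url"], request["body"])
```

Because clients query the alias rather than the suffixed collection, a re-index can be swapped in by repointing the alias once the new collection is fully populated.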
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
TYPESENSE_ORIGIN="${TYPESENSE_ORIGIN:-http://localhost:8108}" | ||
TYPESENSE_API_KEY="${TYPESENSE_API_KEY:-test_api_key}" | ||
TYPESENSE_COLLECTION_NAME="mm_product_docs" | ||
|
||
curl -X DELETE \ | ||
"${TYPESENSE_ORIGIN}/collections/${TYPESENSE_COLLECTION_NAME}/documents?truncate=true&filter_by=anchor:!=none" \ | ||
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" |
We probably want to bundle this instead of serving it from a CDN, to be consistent with existing assets.