Add Multi Query + LLM Filtering Retrieval, and PostHog (#130)
* initial attempt

* add parallel calls to local LLM for filtering. It's fully working, but it's too slow

* add newrelic logging

* add langhub prompt stuffing, works great. prep newrelic logging

* optimize load time of hub.pull(prompt)

* Working filtering with a time limit, but the limit is not fully respected: it only returns after the next in-flight call completes once the limit expires

* Working stably, but it's too slow and under-utilizes the GPUs. Need vLLM or Ray Serve to increase GPU utilization
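
The "time limit not fully respected" behavior above matches how `concurrent.futures` timeouts work: `as_completed(timeout=...)` only raises while it is *waiting* for the next future, so a call already in flight when the deadline passes still runs to completion. A minimal sketch of that semantics (the worker is a hypothetical stand-in for a filtering LLM call):

```python
import concurrent.futures
import time

def slow_task(seconds):
    # Stand-in for a filtering LLM call (hypothetical).
    time.sleep(seconds)
    return seconds

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(slow_task, s) for s in (0.1, 0.2, 2.0)]
    done = []
    try:
        # The timeout applies to the iteration as a whole, not per task;
        # tasks already running are NOT cancelled when it expires.
        for fut in concurrent.futures.as_completed(futures, timeout=0.5):
            done.append(fut.result())
    except concurrent.futures.TimeoutError:
        pass  # the 2.0 s task keeps running in the background

print(sorted(done))  # only the two fast tasks finish within the limit
```

Note the executor's `with` block still waits for the slow task on exit, which is exactly why the overall request can overrun the limit.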

* Adding Replicate model run to our utils... but the concurrency is not good enough

* Initial commit for multi query retriever

* Integrating multi-query retriever with in-context padding.
Replaced LCEL with a custom implementation for retrieval and reciprocal rank fusion.
Added llm to Ingest()
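
For reference, the reciprocal rank fusion mentioned above can be implemented in a few lines. This is a generic sketch of the standard RRF score `1 / (k + rank)`, not the repo's actual code; the document IDs are illustrative:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each document's score is the sum of 1 / (k + rank) over every
    list it appears in; k=60 is the commonly used RRF constant.
    """
    scores = defaultdict(float)
    for docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three query variants retrieve overlapping results:
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["a", "d", "e"],
])
print(fused[0])  # "a" ranks first: it appears near the top of every list
```

Documents retrieved by several query variants accumulate score from each list, which is what lets multi-query retrieval promote consistently relevant contexts.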

* Bumping up langchain version for new imports

* Adding langchainhub to requirements

* Using gpt3.5 instead of llm server

* Updating python version in railway

* Updated Nomic in requirements.txt

* fix openai version to pre 1.0

* Anyscale LLM inference is faster than Replicate or kastan.ai: 10 seconds for 80 inferences

* upgrade python from 3.8 to 3.10

* trying to fix tesseract // pdfminer requirements for image ingest

* adding strict versions to all requirements

* Bump pymupdf from 1.22.5 to 1.23.6 (#136)

Bumps [pymupdf](https://github.com/pymupdf/pymupdf) from 1.22.5 to 1.23.6.
- [Release notes](https://github.com/pymupdf/pymupdf/releases)
- [Changelog](https://github.com/pymupdf/PyMuPDF/blob/main/changes.txt)
- [Commits](pymupdf/PyMuPDF@1.22.5...1.23.6)

---
updated-dependencies:
- dependency-name: pymupdf
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* compatible wheel version

* upgrade pip during image startup

* properly upgrade pip

* Fully lock ALL requirements. Hopefully speed up build times, too

* Limit unstructured dependencies; the image ballooned from 700 MB to 6 GB. Hopefully resolved

* Lock version of pip

* Lock (correct) version of pip

* add libgl1 for cv2 in Docker (for unstructured)

* adding proper error logging to image ingest

* Installing unstructured requirements individually to hopefully reduce bundle size by 5GB

* Downgrading openai package version to pre-vision release

* Update requirements.txt to latest on main

* Add langchainhub to requirements

* Reduce use of unstructured, hopefully the install is much smaller now

* Guarantee Unique S3 Upload paths (#137)

* should be fully working, in final testing

* trying to fix double nested kwargs

* fixing readable_filename in pdf ingest

* apt install tesseract-ocr, LAME

* remove stupid typo

* minor bug

* Finally fix **kwargs passing

* minor fix

* guarding against webscrape kwargs in pdf

* guarding against webscrape kwargs in pdf

* guarding against webscrape kwargs in pdf
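The kwargs-guarding commits above address a common ingest pattern: a webscrape caller forwards extra keyword arguments that a PDF ingest function does not accept, raising `TypeError`. A defensive sketch (function and argument names are illustrative, not the repo's actual signatures):

```python
def ingest_pdf(s3_path, course_name, **kwargs):
    """Accept and discard unexpected kwargs instead of raising TypeError."""
    # Pull optional webscrape-only metadata out with safe defaults.
    url = kwargs.get("url")            # present only for /webscrape ingests
    base_url = kwargs.get("base_url")  # likewise
    unexpected = set(kwargs) - {"url", "base_url", "readable_filename"}
    if unexpected:
        print(f"Ignoring unexpected kwargs: {unexpected}")
    return {"s3_path": s3_path, "course_name": course_name,
            "url": url, "base_url": base_url}

# A webscrape call site can now forward everything without crashing:
result = ingest_pdf("courses/ece408/lec1.pdf", "ece408",
                    url="https://example.com/lec1", depth=2)
```

Catch-all `**kwargs` plus `kwargs.get(...)` with defaults is what makes the same ingest function callable from both the plain-upload and webscrape paths.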

* adding better error messages

* revert req changes

* simplify prints

* Bump typing-extensions from 4.7.1 to 4.8.0 (#90)

Bumps [typing-extensions](https://github.com/python/typing_extensions) from 4.7.1 to 4.8.0.
- [Release notes](https://github.com/python/typing_extensions/releases)
- [Changelog](https://github.com/python/typing_extensions/blob/main/CHANGELOG.md)
- [Commits](python/typing_extensions@4.7.1...4.8.0)

---
updated-dependencies:
- dependency-name: typing-extensions
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kastan Day <[email protected]>

* Bump flask from 2.3.3 to 3.0.0 (#101)

Bumps [flask](https://github.com/pallets/flask) from 2.3.3 to 3.0.0.
- [Release notes](https://github.com/pallets/flask/releases)
- [Changelog](https://github.com/pallets/flask/blob/main/CHANGES.rst)
- [Commits](pallets/flask@2.3.3...3.0.0)

---
updated-dependencies:
- dependency-name: flask
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kastan Day <[email protected]>

* modified context_padding function to handle edge case

* added print statements for testing

* added print statements for multi-query test

* added prints for valid docs

* print statements for valid docs metadata

* Guard against kwargs failures during webscrape

* added fix for url parameter

* HOTFIX: kwargs in html and pdf ingest for /webscrape

* fix for pagenumber error

* removed timestamp parameter

* url fix

* guard against missing URL metadata

* minor refactor & cleanup

* modified function to only pad first 5 docs

* modified context_padding with multi-threading

* modified context padding for only first 5 docs

* modified function for removing duplicates in padded contexts

* minor changes

* Export conversation history on /analysis page (#141)

* updated nomic version in requirements.txt

* initial commit to PR

* created API endpoint

* completed export function

* testing csv export on railway

* code to remove file from repo after download

* moved file storing out of docs folder

* created a separate endpoint for multi-query-retrieval

* added similar format_for_json for MQR

* added option to extend one URL out when on the base URL, or to opt out of it

* merged context_filtering with MQR

* added replicate to requirements.txt

* added openai type to all openai functions

* added filtering to the retrieval pipeline

* moved filtering after context padding

* changed model in run_anyscale()

* minor string formatting in print statements

* added ray.init() before calling filter function

* added a wrapper function for run()

* modified the code to use thread pool processor

* fixed pool execution errors

* replaced threadpool with processpool

* testing multiprocessing with 10 contexts

* restored to using all contexts

* changed max_workers to 100

* changed max_workers to 100
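
For parallel LLM filtering calls like those above, a `ThreadPoolExecutor` is generally the better fit: the work is network-bound, so threads avoid the process start-up and pickling overhead of a process pool, and a high `max_workers` (the commits settle between 30 and 100) simply caps the number of in-flight requests. A minimal sketch with a hypothetical `filter_context` stand-in:

```python
from concurrent.futures import ThreadPoolExecutor

def filter_context(context):
    # Stand-in for a network-bound LLM relevance check (hypothetical);
    # the real call would hit an inference endpoint and return a bool.
    return len(context) > 10

contexts = ["short", "a sufficiently long context about GPUs",
            "tiny", "another long context worth keeping around"]

# Threads suit I/O-bound calls; max_workers caps concurrent requests.
with ThreadPoolExecutor(max_workers=30) as pool:
    keep_flags = list(pool.map(filter_context, contexts))

filtered = [c for c, keep in zip(contexts, keep_flags) if keep]
```

`pool.map` preserves input order, so the keep-flags can be zipped straight back onto the contexts.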

* Guarantee unique S3 upload paths, support file updates (e.g. duplicate-file guard for cron jobs) (#99)

* added the add_users() for Canvas

* added canvas course ingest

* updated requirements

* added .md ingest and fixed .py ingest

* deleted test ipynb file

* added nomic viz

* added canvas file update function

* completed update function

* updated course export to include all contents

* modified to handle diff file structures of downloaded content

* modified canvas update

* modified ingest function

* modified update_files() for file replacement

* removed the extra os.remove()

* fix underscore to dash for pip

* removed json import and added abort to canvas functions

* created separate PR for file update

* added file-update logic in ingest, WIP

* removed irrelevant text files

* modified pdf ingest function

* fixed PDF duplicate issue

* removed unwanted files

* updated nomic version in requirements.txt

* modified s3_paths

* testing unique filenames in aws upload

* added missing library to requirements.txt

* finished check_for_duplicates()

* fixed filename errors

* minor corrections

* added a uuid check in check_for_duplicates()

* regex depends on this being a dash

* regex depends on this being a dash

* Fix bug when no duplicate exists.
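
The unique-path scheme these commits converge on can be sketched as follows: append a short uuid with a dash separator (the separator the regex depends on) before the extension, so re-uploads of the same filename never collide, while the original readable filename stays recoverable. Path layout and helper names here are illustrative, not the repo's exact code:

```python
import os
import re
import uuid

def unique_s3_path(course_name, filename):
    """Append a short uuid before the extension: 'lec1.pdf' -> 'lec1-<uuid>.pdf'."""
    base, ext = os.path.splitext(filename)
    unique_id = uuid.uuid4().hex[:8]
    # The dash separator matters: a regex can later strip the uuid
    # to recover the original readable filename.
    return f"courses/{course_name}/{base}-{unique_id}{ext}"

def original_filename(s3_path):
    """Invert the scheme; the regex depends on the dash separator."""
    name = os.path.basename(s3_path)
    return re.sub(r"-[0-9a-f]{8}(?=\.[^.]+$)", "", name)

path = unique_s3_path("ece408", "lec1.pdf")
```

Because the uuid is stripped by pattern rather than stored, changing the separator from a dash would silently break `original_filename` — which is why two commits above insist on the dash.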

* cleaning up prints, testing looks good. ready to merge

* Further print and logging refinement

* Remove S3-based method for de-duplication, use Supabase only

* remove duplicate imports

* remove new requirement

* Final print cleanups

* remove pypdf import

---------

Co-authored-by: root <root@ASMITA>
Co-authored-by: Kastan Day <[email protected]>

* changed workers to 30 in run.sh

* Add Trunk Superlinter on-commit hooks (#164)

* First attempt, should auto format on commit

* maybe fix my yapf github action? Just bad formatting.

* Finalized, excellent Trunk configs for my desired formatting

* Further fix yapf GH Action

* Full format of all files with Trunk

* Fix more linting errors

* Ignore .vscode folder

* Reduce max line size to 120 (from 140)

* Format code

* Delete GH Action & Revert formatting in favor of Trunk.

* Ignore the Readme

* Remove trufflehog -- failing too much, confusing to new devs

* Minor docstring update

* trivial commit for testing

* removing trivial commit for testing

* Merge main into branch, vector_database.py probably needs work

* Cleanup all Trunk lint errors that I can

---------

Co-authored-by: KastanDay <[email protected]>
Co-authored-by: Rohan Marwaha <[email protected]>

* changed workers to 3

* logging time in API calling

* removed wait parameter from executor.shutdown()

* added timelog after openai completion

* set openai api type as global variable

* reduced max workers to 30

* moved filtering after MQR and modified the filtering code

* minor function name change

* minor changes

* minor changes to print statements

* Add example usage of our public API for chat calls

* Add timeout to request, best practice

* Add example usage notebook for our public API

* Improve usage example to return model's response for easy storage. Fix linter infinite loop

* Final fix: Switch to https connections

* Enhance logging in getTopContexts(), improve usage example

* Working implementation. Using ray, tested end to end locally

* cleanup imports and dependencies

* hard lock requirement versions

* fix requirements hard locks

* slim down reqs

* Merge main.. touching up lint errors

* Add pydantic req

* fix ray start syntax

* Improve prints logging

* Add PostHog logging for filter_top_contexts

* Add course name to posthog logs

* Remove langsmith hub for prompts because too unstable, hardcoded instead

* remove osv-scanner from trunk linting runs

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Kastan Day <[email protected]>
Co-authored-by: Asmita Dabholkar <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jkmin3 <[email protected]>
Co-authored-by: root <root@ASMITA>
Co-authored-by: KastanDay <[email protected]>
7 people authored Dec 15, 2023
1 parent 6781977 commit 7306dc3
Showing 35 changed files with 3,295 additions and 2,295 deletions.
50 changes: 25 additions & 25 deletions .github/workflows/mkdocs_deploy.yml
Original file line number Diff line number Diff line change
@@ -1,25 +1,25 @@
name: mkdocs gh-pages deploy
on:
push:
branches:
- master
- main
permissions:
contents: write
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: 3.x
- run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
- uses: actions/cache@v3
with:
key: mkdocs-material-${{ env.cache_id }}
path: .cache
restore-keys: |
mkdocs-material-
- run: pip install mkdocs-material mkdocstrings[python]
- run: mkdocs gh-deploy --force
25 changes: 0 additions & 25 deletions .github/workflows/yapf-format.yml

This file was deleted.

1 change: 1 addition & 0 deletions .gitignore
@@ -3,6 +3,7 @@ coursera-dl/
*parsed.json
wandb
*.ipynb
*.pem

# don't expose env files
.env
8 changes: 8 additions & 0 deletions .trunk/.gitignore
@@ -0,0 +1,8 @@
*out
*logs
*actions
*notifications
*tools
plugins
user_trunk.yaml
user.yaml
2 changes: 2 additions & 0 deletions .trunk/configs/.isort.cfg
@@ -0,0 +1,2 @@
[settings]
profile=black
10 changes: 10 additions & 0 deletions .trunk/configs/.markdownlint.yaml
@@ -0,0 +1,10 @@
# Autoformatter friendly markdownlint config (all formatting rules disabled)
default: true
blank_lines: false
bullet: false
html: false
indentation: false
line_length: false
spaces: false
url: false
whitespace: false
7 changes: 7 additions & 0 deletions .trunk/configs/.shellcheckrc
@@ -0,0 +1,7 @@
enable=all
source-path=SCRIPTDIR
disable=SC2154

# If you're having issues with shellcheck following source, disable the errors via:
# disable=SC1090
# disable=SC1091
4 changes: 4 additions & 0 deletions .trunk/configs/.style.yapf
@@ -0,0 +1,4 @@
[style]
based_on_style = google
column_limit = 120
indent_width = 2
10 changes: 10 additions & 0 deletions .trunk/configs/.yamllint.yaml
@@ -0,0 +1,10 @@
rules:
quoted-strings:
required: only-when-needed
extra-allowed: ["{|}"]
empty-values:
forbid-in-block-mappings: true
forbid-in-flow-mappings: true
key-duplicates: {}
octal-values:
forbid-implicit-octal: true
5 changes: 5 additions & 0 deletions .trunk/configs/ruff.toml
@@ -0,0 +1,5 @@
# Generic, formatter-friendly config.
select = ["B", "D3", "E", "F"]

# Never enforce `E501` (line length violations). This should be handled by formatters.
ignore = ["E501"]
49 changes: 49 additions & 0 deletions .trunk/trunk.yaml
@@ -0,0 +1,49 @@
# This file controls the behavior of Trunk: https://docs.trunk.io/cli
# To learn more about the format of this file, see https://docs.trunk.io/reference/trunk-yaml
version: 0.1
cli:
version: 1.18.0
# Trunk provides extensibility via plugins. (https://docs.trunk.io/plugins)
plugins:
sources:
- id: trunk
ref: v1.3.0
uri: https://github.com/trunk-io/plugins
# Many linters and tools depend on runtimes - configure them here. (https://docs.trunk.io/runtimes)
runtimes:
enabled:
- [email protected]
- [email protected]
- [email protected]
# This is the section where you manage your linters. (https://docs.trunk.io/check/configuration)
# - [email protected] # too sensitive, causing failures that make devs skip checks.
lint:
enabled:
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- git-diff-check
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
ignore:
- linters: [ALL]
paths:
- .github/**/*
- .trunk/**/*
- mkdocs.yml
- .DS_Store
- .vscode/**/*
- README.md
actions:
enabled:
- trunk-announce
- trunk-check-pre-push
- trunk-fmt-pre-commit
- trunk-upgrade-available
76 changes: 38 additions & 38 deletions .vscode/settings.json
@@ -1,38 +1,38 @@
{
"yaml.schemas": {
"https://squidfunk.github.io/mkdocs-material/schema.json": "mkdocs.yml"
},
"yaml.customTags": [
"!ENV scalar",
"!ENV sequence",
"tag:yaml.org,2002:python/name:materialx.emoji.to_svg",
"tag:yaml.org,2002:python/name:materialx.emoji.twemoji",
"tag:yaml.org,2002:python/name:pymdownx.superfences.fence_code_format"
],
"python.analysis.typeCheckingMode": "basic",
"cSpell.words": [
"availabe",
"boto",
"callout",
"Callout",
"dotenv",
"fitz",
"jsonify",
"langchain",
"metadatas",
"NLTK",
"numpy",
"qdrant",
"Qdrant",
"QDRANT",
"Spacy",
"sqlalchemy",
"supabase",
"Supabase",
"SUPABASE",
"tiktoken",
"UIUC",
"vectorstore",
"vectorstores"
]
}
72 changes: 38 additions & 34 deletions README.md
@@ -1,34 +1,38 @@
## AI TA Backend for UIUC's Course Assistant Chatbot
A Flask application hosting endpoints for AI TA backend.

### 👉 See the main app for details: https://github.com/UIUC-Chatbot/ai-teaching-assistant-uiuc

### 🛠️ Technical Architecture
Hosted (mostly for free) on [Railway](https://railway.app/).
Architecture diagram of Flask + Next.js & React hosted on Vercel.
![Architecture diagram](https://github.com/UIUC-Chatbot/ai-ta-backend/assets/13607221/bda7b4d6-79ce-4d12-bf8f-cff9207c37af)

## Documentation
Automatic [API Reference](https://uiuc-chatbot.github.io/ai-ta-backend/reference/)

## 📣 Development

1. Rename `.env.template` to `.env` and fill in the required variables
2. Install Python requirements `pip install -r requirements.txt`
3. Start the server for development (with live reloads) `cd ai_ta_backend` then `flask --app ai_ta_backend.main:app --debug run --port 8000`

The docs are auto-built and deployed to [our docs website](https://uiuc-chatbot.github.io/ai-ta-backend/) on every push. Or you can build the docs locally when writing:
- `mkdocs serve`


### Course metadata structure
```
'text': doc.page_content,
'readable_filename': doc.metadata['readable_filename'],
'course_name ': doc.metadata['course_name'],
's3_path': doc.metadata['s3_path'],
'pagenumber': doc.metadata['pagenumber_or_timestamp'], # this is the recent breaking change!!
# OPTIONAL properties
'url': doc.metadata.get('url'), # wouldn't this error out?
'base_url': doc.metadata.get('base_url'),
```
# AI TA Backend for UIUC's Course Assistant Chatbot

A Flask application hosting endpoints for AI TA backend.

### 👉 See the main app for details: https://github.com/UIUC-Chatbot/ai-teaching-assistant-uiuc

### 🛠️ Technical Architecture

Hosted (mostly for free) on [Railway](https://railway.app/).
Architecture diagram of Flask + Next.js & React hosted on Vercel.
![Architecture diagram](https://github.com/UIUC-Chatbot/ai-ta-backend/assets/13607221/bda7b4d6-79ce-4d12-bf8f-cff9207c37af)

## Documentation

Automatic [API Reference](https://uiuc-chatbot.github.io/ai-ta-backend/reference/)

## 📣 Development

1. Rename `.env.template` to `.env` and fill in the required variables
2. Install Python requirements `pip install -r requirements.txt`
3. Start the server for development (with live reloads) `cd ai_ta_backend` then `flask --app ai_ta_backend.main:app --debug run --port 8000`

The docs are auto-built and deployed to [our docs website](https://uiuc-chatbot.github.io/ai-ta-backend/) on every push. Or you can build the docs locally when writing:

- `mkdocs serve`

### Course metadata structure

```text
'text': doc.page_content,
'readable_filename': doc.metadata['readable_filename'],
'course_name ': doc.metadata['course_name'],
's3_path': doc.metadata['s3_path'],
'pagenumber': doc.metadata['pagenumber_or_timestamp'], # this is the recent breaking change!!
# OPTIONAL properties
'url': doc.metadata.get('url'), # wouldn't this error out?
'base_url': doc.metadata.get('base_url'),
```