Add Multi Query + LLM Filtering Retrieval, and PostHog (#130)
* initial attempt

* add parallel calls to local LLM for filtering. It's fully working, but it's too slow

* add newrelic logging

* add langhub prompt stuffing, works great. prep newrelic logging

* optimize load time of hub.pull(prompt)

* Working filtering with a time limit, but the limit is not fully respected: it only returns after the next in-flight call completes once the limit expires

* Working stably, but it's too slow and under-utilizes the GPUs. Need vLLM or Ray Serve to increase GPU utilization
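
The "time limit not fully respected" behavior above matches how `concurrent.futures` timeouts work: `as_completed(timeout=...)` only raises while it is *waiting* for the next future, so a call already in flight when the deadline passes still runs to completion. A minimal sketch of that semantics (the worker is a hypothetical stand-in for a filtering LLM call):

```python
import concurrent.futures
import time

def slow_task(seconds):
    # Stand-in for a filtering LLM call (hypothetical).
    time.sleep(seconds)
    return seconds

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(slow_task, s) for s in (0.1, 0.2, 2.0)]
    done = []
    try:
        # The timeout applies to the iteration as a whole, not per task;
        # tasks already running are NOT cancelled when it expires.
        for fut in concurrent.futures.as_completed(futures, timeout=0.5):
            done.append(fut.result())
    except concurrent.futures.TimeoutError:
        pass  # the 2.0 s task keeps running in the background

print(sorted(done))  # only the two fast tasks finish within the limit
```

Note the executor's `with` block still waits for the slow task on exit, which is exactly why the overall request can overrun the limit.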

* Adding Replicate model run to our utils... but the concurrency is not good enough

* Initial commit for multi query retriever

* Integrating multi-query retriever with in-context padding.
Replaced LCEL with a custom implementation for retrieval and reciprocal rank fusion.
Added llm to Ingest()
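
For reference, the reciprocal rank fusion mentioned above can be implemented in a few lines. This is a generic sketch of the standard RRF score `1 / (k + rank)`, not the repo's actual code; the document IDs are illustrative:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each document's score is the sum of 1 / (k + rank) over every
    list it appears in; k=60 is the commonly used RRF constant.
    """
    scores = defaultdict(float)
    for docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three query variants retrieve overlapping results:
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["a", "d", "e"],
])
print(fused[0])  # "a" ranks first: it appears near the top of every list
```

Documents retrieved by several query variants accumulate score from each list, which is what lets multi-query retrieval promote consistently relevant contexts.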

* Bumping up langchain version for new imports

* Adding langchainhub to requirements

* Using gpt3.5 instead of llm server

* Updating python version in railway

* Updated Nomic in requirements.txt

* fix openai version to pre 1.0

* Anyscale LLM inference is faster than Replicate or kastan.ai: 10 seconds for 80 inferences

* upgrade python from 3.8 to 3.10

* trying to fix tesseract // pdfminer requirements for image ingest

* adding strict versions to all requirements

* Bump pymupdf from 1.22.5 to 1.23.6 (#136)

Bumps [pymupdf](https://github.com/pymupdf/pymupdf) from 1.22.5 to 1.23.6.
- [Release notes](https://github.com/pymupdf/pymupdf/releases)
- [Changelog](https://github.com/pymupdf/PyMuPDF/blob/main/changes.txt)
- [Commits](pymupdf/PyMuPDF@1.22.5...1.23.6)

---
updated-dependencies:
- dependency-name: pymupdf
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* compatible wheel version

* upgrade pip during image startup

* properly upgrade pip

* Fully lock ALL requirements. Hopefully speed up build times, too

* Limit unstructured dependencies; the image ballooned from 700 MB to 6 GB. Hopefully resolved

* Lock version of pip

* Lock (correct) version of pip

* add libgl1 for cv2 in Docker (for unstructured)

* adding proper error logging to image ingest

* Installing unstructured requirements individually to hopefully reduce bundle size by 5GB

* Downgrading openai package version to pre-vision release

* Update requirements.txt to latest on main

* Add langchainhub to requirements

* Reduce use of unstructured, hopefully the install is much smaller now

* Guarantee Unique S3 Upload paths (#137)

* should be fully working, in final testing

* trying to fix double nested kwargs

* fixing readable_filename in pdf ingest

* apt install tesseract-ocr, LAME

* remove stupid typo

* minor bug

* Finally fix **kwargs passing

* minor fix

* guarding against webscrape kwargs in pdf

* guarding against webscrape kwargs in pdf

* guarding against webscrape kwargs in pdf
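The kwargs-guarding commits above address a common ingest pattern: a webscrape caller forwards extra keyword arguments that a PDF ingest function does not accept, raising `TypeError`. A defensive sketch (function and argument names are illustrative, not the repo's actual signatures):

```python
def ingest_pdf(s3_path, course_name, **kwargs):
    """Accept and discard unexpected kwargs instead of raising TypeError."""
    # Pull optional webscrape-only metadata out with safe defaults.
    url = kwargs.get("url")            # present only for /webscrape ingests
    base_url = kwargs.get("base_url")  # likewise
    unexpected = set(kwargs) - {"url", "base_url", "readable_filename"}
    if unexpected:
        print(f"Ignoring unexpected kwargs: {unexpected}")
    return {"s3_path": s3_path, "course_name": course_name,
            "url": url, "base_url": base_url}

# A webscrape call site can now forward everything without crashing:
result = ingest_pdf("courses/ece408/lec1.pdf", "ece408",
                    url="https://example.com/lec1", depth=2)
```

Catch-all `**kwargs` plus `kwargs.get(...)` with defaults is what makes the same ingest function callable from both the plain-upload and webscrape paths.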

* adding better error messages

* revert req changes

* simplify prints

* Bump typing-extensions from 4.7.1 to 4.8.0 (#90)

Bumps [typing-extensions](https://github.com/python/typing_extensions) from 4.7.1 to 4.8.0.
- [Release notes](https://github.com/python/typing_extensions/releases)
- [Changelog](https://github.com/python/typing_extensions/blob/main/CHANGELOG.md)
- [Commits](python/typing_extensions@4.7.1...4.8.0)

---
updated-dependencies:
- dependency-name: typing-extensions
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kastan Day <[email protected]>

* Bump flask from 2.3.3 to 3.0.0 (#101)

Bumps [flask](https://github.com/pallets/flask) from 2.3.3 to 3.0.0.
- [Release notes](https://github.com/pallets/flask/releases)
- [Changelog](https://github.com/pallets/flask/blob/main/CHANGES.rst)
- [Commits](pallets/flask@2.3.3...3.0.0)

---
updated-dependencies:
- dependency-name: flask
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kastan Day <[email protected]>

* modified context_padding function to handle edge case

* added print statements for testing

* added print statements for multi-query test

* added prints for valid docs

* print statements for valid docs metadata

* Guard against kwargs failures during webscrape

* added fix for url parameter

* HOTFIX: kwargs in html and pdf ingest for /webscrape

* fix for pagenumber error

* removed timestamp parameter

* url fix

* guard against missing URL metadata

* minor refactor & cleanup

* modified function to only pad first 5 docs

* modified context_padding with multi-threading

* modified context padding for only first 5 docs

* modified function for removing duplicates in padded contexts

* minor changes

* Export conversation history on /analysis page (#141)

* updated nomic version in requirements.txt

* initial commit to PR

* created API endpoint

* completed export function

* testing csv export on railway

* code to remove file from repo after download

* moved file storing out of docs folder

* created a separate endpoint for multi-query-retrieval

* added similar format_for_json for MQR

* added option to extend one URL out when on the base URL, or to opt out of it

* merged context_filtering with MQR

* added replicate to requirements.txt

* added openai type to all openai functions

* added filtering to the retrieval pipeline

* moved filtering after context padding

* changed model in run_anyscale()

* minor string formatting in print statements

* added ray.init() before calling filter function

* added a wrapper function for run()

* modified the code to use thread pool processor

* fixed pool execution errors

* replaced threadpool with processpool

* testing multiprocessing with 10 contexts

* restored to using all contexts

* changed max_workers to 100

* changed max_workers to 100
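
For parallel LLM filtering calls like those above, a `ThreadPoolExecutor` is generally the better fit: the work is network-bound, so threads avoid the process start-up and pickling overhead of a process pool, and a high `max_workers` (the commits settle between 30 and 100) simply caps the number of in-flight requests. A minimal sketch with a hypothetical `filter_context` stand-in:

```python
from concurrent.futures import ThreadPoolExecutor

def filter_context(context):
    # Stand-in for a network-bound LLM relevance check (hypothetical);
    # the real call would hit an inference endpoint and return a bool.
    return len(context) > 10

contexts = ["short", "a sufficiently long context about GPUs",
            "tiny", "another long context worth keeping around"]

# Threads suit I/O-bound calls; max_workers caps concurrent requests.
with ThreadPoolExecutor(max_workers=30) as pool:
    keep_flags = list(pool.map(filter_context, contexts))

filtered = [c for c, keep in zip(contexts, keep_flags) if keep]
```

`pool.map` preserves input order, so the keep-flags can be zipped straight back onto the contexts.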

* Guarantee unique S3 upload paths, support file updates (e.g. duplicate-file guard for cron jobs) (#99)

* added the add_users() for Canvas

* added canvas course ingest

* updated requirements

* added .md ingest and fixed .py ingest

* deleted test ipynb file

* added nomic viz

* added canvas file update function

* completed update function

* updated course export to include all contents

* modified to handle diff file structures of downloaded content

* modified canvas update

* modified ingest function

* modified update_files() for file replacement

* removed the extra os.remove()

* fix underscore to dash for pip

* removed json import and added abort to canvas functions

* created separate PR for file update

* added file-update logic in ingest, WIP

* removed irrelevant text files

* modified pdf ingest function

* fixed PDF duplicate issue

* removed unwanted files

* updated nomic version in requirements.txt

* modified s3_paths

* testing unique filenames in aws upload

* added missing library to requirements.txt

* finished check_for_duplicates()

* fixed filename errors

* minor corrections

* added a uuid check in check_for_duplicates()

* regex depends on this being a dash

* regex depends on this being a dash

* Fix bug when no duplicate exists.
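
The unique-path scheme these commits converge on can be sketched as follows: append a short uuid with a dash separator (the separator the regex depends on) before the extension, so re-uploads of the same filename never collide, while the original readable filename stays recoverable. Path layout and helper names here are illustrative, not the repo's exact code:

```python
import os
import re
import uuid

def unique_s3_path(course_name, filename):
    """Append a short uuid before the extension: 'lec1.pdf' -> 'lec1-<uuid>.pdf'."""
    base, ext = os.path.splitext(filename)
    unique_id = uuid.uuid4().hex[:8]
    # The dash separator matters: a regex can later strip the uuid
    # to recover the original readable filename.
    return f"courses/{course_name}/{base}-{unique_id}{ext}"

def original_filename(s3_path):
    """Invert the scheme; the regex depends on the dash separator."""
    name = os.path.basename(s3_path)
    return re.sub(r"-[0-9a-f]{8}(?=\.[^.]+$)", "", name)

path = unique_s3_path("ece408", "lec1.pdf")
```

Because the uuid is stripped by pattern rather than stored, changing the separator from a dash would silently break `original_filename` — which is why two commits above insist on the dash.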

* cleaning up prints, testing looks good. ready to merge

* Further print and logging refinement

* Remove S3-based method for de-duplication, use Supabase only

* remove duplicate imports

* remove new requirement

* Final print cleanups

* remove pypdf import

---------

Co-authored-by: root <root@ASMITA>
Co-authored-by: Kastan Day <[email protected]>

* changed workers to 30 in run.sh

* Add Trunk Superlinter on-commit hooks (#164)

* First attempt, should auto format on commit

* maybe fix my yapf github action? Just bad formatting.

* Finalized, excellent Trunk configs for my desired formatting

* Further fix yapf GH Action

* Full format of all files with Trunk

* Fix more linting errors

* Ignore .vscode folder

* Reduce max line size to 120 (from 140)

* Format code

* Delete GH Action & Revert formatting in favor of Trunk.

* Ignore the Readme

* Remove trufflehog -- failing too much, confusing to new devs

* Minor docstring update

* trivial commit for testing

* removing trivial commit for testing

* Merge main into branch, vector_database.py probably needs work

* Cleanup all Trunk lint errors that I can

---------

Co-authored-by: KastanDay <[email protected]>
Co-authored-by: Rohan Marwaha <[email protected]>

* changed workers to 3

* logging time in API calling

* removed wait parameter from executor.shutdown()

* added timelog after openai completion

* set openai api type as global variable

* reduced max workers to 30

* moved filtering after MQR and modified the filtering code

* minor function name change

* minor changes

* minor changes to print statements

* Add example usage of our public API for chat calls

* Add timeout to request, best practice

* Add example usage notebook for our public API

* Improve usage example to return model's response for easy storage. Fix linter infinite loop

* Final fix: Switch to https connections

* Enhance logging in getTopContexts(), improve usage example

* Working implementation. Using ray, tested end to end locally

* cleanup imports and dependencies

* hard lock requirement versions

* fix requirements hard locks

* slim down reqs

* Merge main.. touching up lint errors

* Add pydantic req

* fix ray start syntax

* Improve prints logging

* Add PostHog logging for filter_top_contexts

* Add course name to posthog logs

* Remove langsmith hub for prompts because too unstable, hardcoded instead

* remove osv-scanner from trunk linting runs

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Kastan Day <[email protected]>
Co-authored-by: Asmita Dabholkar <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jkmin3 <[email protected]>
Co-authored-by: root <root@ASMITA>
Co-authored-by: KastanDay <[email protected]>
7 people authored Dec 15, 2023
1 parent 6781977 commit 7306dc3
Showing 35 changed files with 3,295 additions and 2,295 deletions.
50 changes: 25 additions & 25 deletions .github/workflows/mkdocs_deploy.yml
Original file line number Diff line number Diff line change
@@ -1,25 +1,25 @@
name: mkdocs gh-pages deploy
on:
push:
branches:
- master
- main
permissions:
contents: write
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: 3.x
- run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
- uses: actions/cache@v3
with:
key: mkdocs-material-${{ env.cache_id }}
path: .cache
restore-keys: |
mkdocs-material-
- run: pip install mkdocs-material mkdocstrings[python]
- run: mkdocs gh-deploy --force
25 changes: 0 additions & 25 deletions .github/workflows/yapf-format.yml

This file was deleted.

1 change: 1 addition & 0 deletions .gitignore
@@ -3,6 +3,7 @@ coursera-dl/
*parsed.json
wandb
*.ipynb
*.pem

# don't expose env files
.env
8 changes: 8 additions & 0 deletions .trunk/.gitignore
@@ -0,0 +1,8 @@
*out
*logs
*actions
*notifications
*tools
plugins
user_trunk.yaml
user.yaml
2 changes: 2 additions & 0 deletions .trunk/configs/.isort.cfg
@@ -0,0 +1,2 @@
[settings]
profile=black
10 changes: 10 additions & 0 deletions .trunk/configs/.markdownlint.yaml
@@ -0,0 +1,10 @@
# Autoformatter friendly markdownlint config (all formatting rules disabled)
default: true
blank_lines: false
bullet: false
html: false
indentation: false
line_length: false
spaces: false
url: false
whitespace: false
7 changes: 7 additions & 0 deletions .trunk/configs/.shellcheckrc
@@ -0,0 +1,7 @@
enable=all
source-path=SCRIPTDIR
disable=SC2154

# If you're having issues with shellcheck following source, disable the errors via:
# disable=SC1090
# disable=SC1091
4 changes: 4 additions & 0 deletions .trunk/configs/.style.yapf
@@ -0,0 +1,4 @@
[style]
based_on_style = google
column_limit = 120
indent_width = 2
10 changes: 10 additions & 0 deletions .trunk/configs/.yamllint.yaml
@@ -0,0 +1,10 @@
rules:
quoted-strings:
required: only-when-needed
extra-allowed: ["{|}"]
empty-values:
forbid-in-block-mappings: true
forbid-in-flow-mappings: true
key-duplicates: {}
octal-values:
forbid-implicit-octal: true
5 changes: 5 additions & 0 deletions .trunk/configs/ruff.toml
@@ -0,0 +1,5 @@
# Generic, formatter-friendly config.
select = ["B", "D3", "E", "F"]

# Never enforce `E501` (line length violations). This should be handled by formatters.
ignore = ["E501"]
49 changes: 49 additions & 0 deletions .trunk/trunk.yaml
@@ -0,0 +1,49 @@
# This file controls the behavior of Trunk: https://docs.trunk.io/cli
# To learn more about the format of this file, see https://docs.trunk.io/reference/trunk-yaml
version: 0.1
cli:
version: 1.18.0
# Trunk provides extensibility via plugins. (https://docs.trunk.io/plugins)
plugins:
sources:
- id: trunk
ref: v1.3.0
uri: https://github.com/trunk-io/plugins
# Many linters and tools depend on runtimes - configure them here. (https://docs.trunk.io/runtimes)
runtimes:
enabled:
- [email protected]
- [email protected]
- [email protected]
# This is the section where you manage your linters. (https://docs.trunk.io/check/configuration)
# - [email protected] # too sensitive, causing failures that make devs skip checks.
lint:
enabled:
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- git-diff-check
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
- [email protected]
ignore:
- linters: [ALL]
paths:
- .github/**/*
- .trunk/**/*
- mkdocs.yml
- .DS_Store
- .vscode/**/*
- README.md
actions:
enabled:
- trunk-announce
- trunk-check-pre-push
- trunk-fmt-pre-commit
- trunk-upgrade-available
76 changes: 38 additions & 38 deletions .vscode/settings.json
@@ -1,38 +1,38 @@
{
"yaml.schemas": {
"https://squidfunk.github.io/mkdocs-material/schema.json": "mkdocs.yml"
},
"yaml.customTags": [
"!ENV scalar",
"!ENV sequence",
"tag:yaml.org,2002:python/name:materialx.emoji.to_svg",
"tag:yaml.org,2002:python/name:materialx.emoji.twemoji",
"tag:yaml.org,2002:python/name:pymdownx.superfences.fence_code_format"
],
"python.analysis.typeCheckingMode": "basic",
"cSpell.words": [
"availabe",
"boto",
"callout",
"Callout",
"dotenv",
"fitz",
"jsonify",
"langchain",
"metadatas",
"NLTK",
"numpy",
"qdrant",
"Qdrant",
"QDRANT",
"Spacy",
"sqlalchemy",
"supabase",
"Supabase",
"SUPABASE",
"tiktoken",
"UIUC",
"vectorstore",
"vectorstores"
]
}
72 changes: 38 additions & 34 deletions README.md
@@ -1,34 +1,38 @@
## AI TA Backend for UIUC's Course Assistant Chatbot
A Flask application hosting endpoints for AI TA backend.

### 👉 See the main app for details: https://github.com/UIUC-Chatbot/ai-teaching-assistant-uiuc

### 🛠️ Technical Architecture
Hosted (mostly for free) on [Railway](https://railway.app/).
Architecture diagram of Flask + Next.js & React hosted on Vercel.
![Architecture diagram](https://github.com/UIUC-Chatbot/ai-ta-backend/assets/13607221/bda7b4d6-79ce-4d12-bf8f-cff9207c37af)

## Documentation
Automatic [API Reference](https://uiuc-chatbot.github.io/ai-ta-backend/reference/)

## 📣 Development

1. Rename `.env.template` to `.env` and fill in the required variables
2. Install Python requirements `pip install -r requirements.txt`
3. Start the server for development (with live reloads) `cd ai_ta_backend` then `flask --app ai_ta_backend.main:app --debug run --port 8000`

The docs are auto-built and deployed to [our docs website](https://uiuc-chatbot.github.io/ai-ta-backend/) on every push. Or you can build the docs locally when writing:
- `mkdocs serve`


### Course metadata structure
```
'text': doc.page_content,
'readable_filename': doc.metadata['readable_filename'],
'course_name ': doc.metadata['course_name'],
's3_path': doc.metadata['s3_path'],
'pagenumber': doc.metadata['pagenumber_or_timestamp'], # this is the recent breaking change!!
# OPTIONAL properties
'url': doc.metadata.get('url'), # wouldn't this error out?
'base_url': doc.metadata.get('base_url'),
```
# AI TA Backend for UIUC's Course Assistant Chatbot

A Flask application hosting endpoints for AI TA backend.

### 👉 See the main app for details: https://github.com/UIUC-Chatbot/ai-teaching-assistant-uiuc

### 🛠️ Technical Architecture

Hosted (mostly for free) on [Railway](https://railway.app/).
Architecture diagram of Flask + Next.js & React hosted on Vercel.
![Architecture diagram](https://github.com/UIUC-Chatbot/ai-ta-backend/assets/13607221/bda7b4d6-79ce-4d12-bf8f-cff9207c37af)

## Documentation

Automatic [API Reference](https://uiuc-chatbot.github.io/ai-ta-backend/reference/)

## 📣 Development

1. Rename `.env.template` to `.env` and fill in the required variables
2. Install Python requirements `pip install -r requirements.txt`
3. Start the server for development (with live reloads) `cd ai_ta_backend` then `flask --app ai_ta_backend.main:app --debug run --port 8000`

The docs are auto-built and deployed to [our docs website](https://uiuc-chatbot.github.io/ai-ta-backend/) on every push. Or you can build the docs locally when writing:

- `mkdocs serve`

### Course metadata structure

```text
'text': doc.page_content,
'readable_filename': doc.metadata['readable_filename'],
'course_name ': doc.metadata['course_name'],
's3_path': doc.metadata['s3_path'],
'pagenumber': doc.metadata['pagenumber_or_timestamp'], # this is the recent breaking change!!
# OPTIONAL properties
'url': doc.metadata.get('url'), # wouldn't this error out?
'base_url': doc.metadata.get('base_url'),
```