
File update #99

Merged
merged 47 commits into main from file-update
Dec 12, 2023
Conversation

@star-nox (Member) commented Oct 2, 2023

Created a new branch from the canvas PR for the file-update mechanism. This PR contains the canvas functions too.


gitguardian bot commented Oct 2, 2023

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request

GitGuardian id | Secret                    | Commit  | Filename
7724259        | Supabase Service Role JWT | deceb15 | ai_ta_backend/nomic.ipynb
7724259        | Supabase Service Role JWT | 07238a2 | ai_ta_backend/nomic.ipynb
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely, following best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.



railway-app bot commented Oct 2, 2023

This PR is being deployed to Railway 🚅

flask: ◻️ REMOVED

@star-nox self-assigned this Oct 4, 2023
@star-nox
Member Author

All other file types are working fine and duplicates are getting detected. PDF is still a work in progress; I tried changing the text-extraction method for PDFs, but the problem persists.

@star-nox
Member Author

star-nox commented Oct 20, 2023

@KastanDay the basic function is complete: if the file is a duplicate, we skip ingestion. I'm unsure what to do when the file already exists in the database but has since been updated/changed.
I was working on deleting the previous file and continuing to ingest the new one. But in the current workflow, we check for duplicates in split_and_upload(), after the new file is already in S3. If I use our current delete_data(), it will remove the new file from S3 along with the old one. How are we handling duplication in S3?
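One way out of this dilemma is to select only the *stale* S3 keys for deletion, comparing files by their uuid-stripped original name and explicitly excluding the freshly uploaded key. The sketch below is an assumption about how that could look, not code from this PR; the row shape mirrors the `s3_path` field seen in the logs, and all function names are hypothetical:

```python
import re

# uuid4 prefix joined to the original name by a dash, e.g.
# "courses/ag-test-v1/<uuid4>-notes.pdf" (the trailing dash matters).
UUID_PREFIX = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}-")

def original_filename(s3_key: str) -> str:
    """Strip the directory part and the uuid4 prefix from an S3 key."""
    return UUID_PREFIX.sub("", s3_key.rsplit("/", 1)[-1])

def stale_keys(existing_rows: list, new_s3_path: str) -> list:
    """Return the S3 keys of previous versions of the updated file,
    excluding the newly uploaded key itself, so deleting these keys
    cannot remove the new object."""
    new_name = original_filename(new_s3_path)
    return [
        row["s3_path"]
        for row in existing_rows
        if original_filename(row["s3_path"]) == new_name
        and row["s3_path"] != new_s3_path
    ]
```

Feeding `stale_keys(...)` to the existing delete routine, instead of deleting by original filename, would leave the new upload untouched.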

@star-nox
Member Author

star-nox commented Nov 16, 2023

Encountering minor errors when creating filenames in webscrape. Sometimes the filename is None, which causes an error when checking for duplicates. Need to fix this.

path name in webscrape:  c2c646ff-a1ca-43fa-a53d-57b0126b4bb6_
Uploading .html to S3
Top of ingest, Course_name ag-test-v1. S3 paths courses/ag-test-v1/c2c646ff-a1ca-43fa-a53d-57b0126b4bb6_.html
KWARGS:  {'url': 'https://python.langchain.com/', 'base_url': 'https://python.langchain.com'}
In split and upload
METADATAS:  [{'course_name': 'ag-test-v1', 's3_path': 'courses/ag-test-v1/c2c646ff-a1ca-43fa-a53d-57b0126b4bb6_.html', 'readable_filename': '', 'url': 'https://python.langchain.com/', 'base_url': 'https://python.langchain.com', 'pagenumber': '', 'timestamp': ''}]
in check_for_duplicates
original_filename:  .html
no. of docs previously present:  13

Update: I think this is a one-off thing. I was not able to reproduce this and there are no entries in the database where the filename is empty in case of webscrape.
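Even if it is a one-off, a defensive guard against the empty-filename case could fall back to a URL-derived name before the duplicate check runs. This is a sketch under assumptions; `safe_filename` is a hypothetical helper, not code from this PR:

```python
from urllib.parse import urlparse

def safe_filename(extracted, url: str, ext: str = ".html") -> str:
    """Return a non-empty filename for the duplicate check: prefer the
    extracted name, else derive one from the page URL."""
    if extracted:  # non-empty, non-None
        return extracted
    parsed = urlparse(url)
    # Last non-empty path segment, else the host, else a fixed fallback.
    stem = parsed.path.strip("/").split("/")[-1] or parsed.netloc or "untitled"
    return stem if stem.endswith(ext) else stem + ext
```

With the URL from the log above, an empty filename would become `python.langchain.com.html` instead of the bare `.html` that broke check_for_duplicates.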

@star-nox requested a review from KastanDay November 17, 2023 16:25
Member

@KastanDay left a comment


Tested and working great. Ready for merge.

@KastanDay marked this pull request as ready for review December 12, 2023 01:12
@KastanDay merged commit 5c93fe5 into main Dec 12, 2023
2 checks passed
@KastanDay deleted the file-update branch December 12, 2023 01:18
KastanDay added a commit that referenced this pull request Dec 15, 2023
* initial attempt

* add parallel calls to local LLM for filtering. It's fully working, but it's too slow

* add newrelic logging

* add langhub prompt stuffing, works great. prep newrelic logging

* optimize load time of hub.pull(prompt)

* Working filtering with time limit, but the time limit is not fully respected, it will only return the next one after your time limit expires

* Working stably, but it's too slow and under-utilizing the GPUs. Need VLLM or Ray Serve to increase GPU Util

* Adding replicate model run to our utils... but the concurrency is not good enough

* Initial commit for multi query retriever

* Integrating Multi query retriever with in context padding.
Replaced LCEL with custom implementation for retrieval and reciprocal rank fusion.
Added llm to Ingest()

* Bumping up langchain version for new imports

* Adding langchainhub to requirements

* Using gpt3.5 instead of llm server

* Updating python version in railway

* Updated Nomic in requirements.txt

* fix openai version to pre 1.0

* anyscale LLM inference is faster than replicate or kastan.ai: 10 seconds for 80 inferences

* upgrade python from 3.8 to 3.10

* trying to fix tesseract // pdfminer requirements for image ingest

* adding strict versions to all requirements

* Bump pymupdf from 1.22.5 to 1.23.6 (#136)

Bumps [pymupdf](https://github.com/pymupdf/pymupdf) from 1.22.5 to 1.23.6.
- [Release notes](https://github.com/pymupdf/pymupdf/releases)
- [Changelog](https://github.com/pymupdf/PyMuPDF/blob/main/changes.txt)
- [Commits](pymupdf/PyMuPDF@1.22.5...1.23.6)

---
updated-dependencies:
- dependency-name: pymupdf
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* compatible wheel version

* upgrade pip during image startup

* properly upgrade pip

* Fully lock ALL requirements. Hopefully speed up build times, too

* Limit unstructured dependencies, image ballooned from 700MB to 6GB. Hopefully resolved

* Lock version of pip

* Lock (correct) version of pip

* add libgl1 for cv2 in Docker (for unstructured)

* adding proper error logging to image ingest

* Installing unstructured requirements individually to hopefully reduce bundle size by 5GB

* Downgrading openai package version to pre-vision release

* Update requirements.txt to latest on main

* Add langchainhub to requirements

* Reduce use of unstructured, hopefully the install is much smaller now

* Guarantee Unique S3 Upload paths (#137)

* should be fully working, in final testing

* trying to fix double nested kwargs

* fixing readable_filename in pdf ingest

* apt install tesseract-ocr, LAME

* remove stupid typo

* minor bug

* Finally fix **kwargs passing

* minor fix

* guarding against webscrape kwargs in pdf

* guarding against webscrape kwargs in pdf

* guarding against webscrape kwargs in pdf

* adding better error messages

* revert req changes

* simplify prints

* Bump typing-extensions from 4.7.1 to 4.8.0 (#90)

Bumps [typing-extensions](https://github.com/python/typing_extensions) from 4.7.1 to 4.8.0.
- [Release notes](https://github.com/python/typing_extensions/releases)
- [Changelog](https://github.com/python/typing_extensions/blob/main/CHANGELOG.md)
- [Commits](python/typing_extensions@4.7.1...4.8.0)

---
updated-dependencies:
- dependency-name: typing-extensions
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kastan Day <[email protected]>

* Bump flask from 2.3.3 to 3.0.0 (#101)

Bumps [flask](https://github.com/pallets/flask) from 2.3.3 to 3.0.0.
- [Release notes](https://github.com/pallets/flask/releases)
- [Changelog](https://github.com/pallets/flask/blob/main/CHANGES.rst)
- [Commits](pallets/flask@2.3.3...3.0.0)

---
updated-dependencies:
- dependency-name: flask
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kastan Day <[email protected]>

* modified context_padding function to handle edge case

* added print statements for testing

* added print statements for multi-query test

* added prints for valid docs

* print statements for valid docs metadata

* Guard against kwargs failures during webscrape

* added fix for url parameter

* HOTFIX: kwargs in html and pdf ingest for /webscrape

* fix for pagenumber error

* removed timestamp parameter

* url fix

* guard against missing URL metadata

* minor refactor & cleanup

* modified function to only pad first 5 docs

* modified context_padding with multi-threading

* modified context padding for only first 5 docs

* modified function for removing duplicates in padded contexts

* minor changes

* Export conversation history on /analysis page (#141)

* updated nomic version in requirements.txt

* initial commit to PR

* created API endpoint

* completed export function

* testing csv export on railway

* code to remove file from repo after download

* moved file storing out of docs folder

* created a separate endpoint for multi-query-retrieval

* added similar format_for_json for MQR

* added option for extending one URL out when on baseurl, or to opt out of it

* merged context_filtering with MQR

* added replicate to requirements.txt

* added openai type to all openai functions

* added filtering to the retrieval pipeline

* moved filtering after context padding

* changed model in run_anyscale()

* minor string formatting in print statements

* added ray.init() before calling filter function

* added a wrapper function for run()

* modified the code to use thread pool processor

* fixed pool execution errors

* replaced threadpool with processpool

* testing multiprocessing with 10 contexts

* restored to using all contexts

* changed max_workers to 100

* changed max_workers to 100

* Guarantee unique s3 upload paths, support file updates (e.g. duplicate file guard for Cron jobs) (#99)

* added the add_users() for Canvas

* added canvas course ingest

* updated requirements

* added .md ingest and fixed .py ingest

* deleted test ipynb file

* added nomic viz

* added canvas file update function

* completed update function

* updated course export to include all contents

* modified to handle diff file structures of downloaded content

* modified canvas update

* modified ingest function

* modified update_files() for file replacement

* removed the extra os.remove()

* fix underscore to dash in for pip

* removed json import and added abort to canvas functions

* created separate PR for file update

* added file-update logic in ingest, WIP

* removed irrelevant text files

* modified pdf ingest function

* fixed PDF duplicate issue

* removed unwanted files

* updated nomic version in requirements.txt

* modified s3_paths

* testing unique filenames in aws upload

* added missing library to requirements.txt

* finished check_for_duplicates()

* fixed filename errors

* minor corrections

* added a uuid check in check_for_duplicates()

* regex depends on this being a dash

* regex depends on this being a dash

* Fix bug when no duplicate exists.

* cleaning up prints, testing looks good. ready to merge

* Further print and logging refinement

* Remove S3-based method for de-duplication, use Supabase only

* remove duplicate imports

* remove new requirement

* Final print cleanups

* remove pypdf import

---------

Co-authored-by: root <root@ASMITA>
Co-authored-by: Kastan Day <[email protected]>

* changed workers to 30 in run.sh

* Add Trunk Superlinter on-commit hooks (#164)

* First attempt, should auto format on commit

* maybe fix my yapf github action? Just bad formatting.

* Finalized, excellent Trunk configs for my desired formatting

* Further fix yapf GH Action

* Full format of all files with Trunk

* Fix more linting errors

* Ignore .vscode folder

* Reduce max line size to 120 (from 140)

* Format code

* Delete GH Action & Revert formatting in favor of Trunk.

* Ignore the Readme

* Remove trufflehog -- failing too much, confusing to new devs

* Minor docstring update

* trivial commit for testing

* removing trivial commit for testing

* Merge main into branch, vector_database.py probably needs work

* Cleanup all Trunk lint errors that I can

---------

Co-authored-by: KastanDay <[email protected]>
Co-authored-by: Rohan Marwaha <[email protected]>

* changed workers to 3

* logging time in API calling

* removed wait parameter from executor.shutdown()

* added timelog after openai completion

* set openai api type as global variable

* reduced max workers to 30

* moved filtering after MQR and modified the filtering code

* minor function name change

* minor changes

* minor changes to print statements

* Add example usage of our public API for chat calls

* Add timeout to request, best practice

* Add example usage notebook for our public API

* Improve usage example to return model's response for easy storage. Fix linter inf loop

* Final fix: Switch to https connections

* Enhance logging in getTopContexts(), improve usage example

* Working implementation. Using ray, tested end to end locally

* cleanup imports and dependencies

* hard lock requirement versions

* fix requirements hard locks

* slim down reqs

* Merge main.. touching up lint errors

* Add pydantic req

* fix ray start syntax

* Improve prints logging

* Add posthot logging for filter_top_contexts

* Add course name to posthog logs

* Remove langsmith hub for prompts because too unstable, hardcoded instead

* remove osv-scanner from trunk linting runs

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Kastan Day <[email protected]>
Co-authored-by: Asmita Dabholkar <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jkmin3 <[email protected]>
Co-authored-by: root <root@ASMITA>
Co-authored-by: KastanDay <[email protected]>
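The "Guarantee Unique S3 Upload paths" commits above hinge on a uuid4 prefix joined to the filename by a dash, which the later "regex depends on this being a dash" commits anchor on when stripping the prefix for duplicate checks. A minimal sketch of that round trip, with hypothetical function names:

```python
import re
import uuid

def unique_s3_path(course_name: str, filename: str) -> str:
    """Prefix the upload with a uuid4 joined by a dash, so repeated
    uploads of the same filename never collide in S3."""
    return f"courses/{course_name}/{uuid.uuid4()}-{filename}"

# The duplicate check strips exactly that prefix; the trailing dash
# is what the pattern anchors on.
UUID_DASH = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}-")

def strip_uuid(s3_path: str) -> str:
    """Recover the original filename from a uuid-prefixed S3 key."""
    return UUID_DASH.sub("", s3_path.rsplit("/", 1)[-1])
```

Because `str(uuid.uuid4())` is always 36 lowercase hex characters and dashes, the strip is unambiguous even when the original filename itself contains dashes.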
KastanDay added a commit that referenced this pull request Dec 19, 2023
* updated nomic version in requirements.txt

* Updated Nomic in requirements.txt

* fix openai version to pre 1.0

* upgrade python from 3.8 to 3.10

* trying to fix tesseract // pdfminer requirements for image ingest

* adding strict versions to all requirements

* Bump pymupdf from 1.22.5 to 1.23.6 (#136)

Bumps [pymupdf](https://github.com/pymupdf/pymupdf) from 1.22.5 to 1.23.6.
- [Release notes](https://github.com/pymupdf/pymupdf/releases)
- [Changelog](https://github.com/pymupdf/PyMuPDF/blob/main/changes.txt)
- [Commits](pymupdf/PyMuPDF@1.22.5...1.23.6)

---
updated-dependencies:
- dependency-name: pymupdf
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* compatible wheel version

* upgrade pip during image startup

* properly upgrade pip

* Fully lock ALL requirements. Hopefully speed up build times, too

* Limit unstructured dependencies, image ballooned from 700MB to 6GB. Hopefully resolved

* Lock version of pip

* Lock (correct) version of pip

* add libgl1 for cv2 in Docker (for unstructured)

* adding proper error logging to image ingest

* Installing unstructured requirements individually to hopefully reduce bundle size by 5GB

* Reduce use of unstructured, hopefully the install is much smaller now

* Guarantee Unique S3 Upload paths (#137)

* should be fully working, in final testing

* trying to fix double nested kwargs

* fixing readable_filename in pdf ingest

* apt install tesseract-ocr, LAME

* remove stupid typo

* minor bug

* Finally fix **kwargs passing

* minor fix

* guarding against webscrape kwargs in pdf

* guarding against webscrape kwargs in pdf

* guarding against webscrape kwargs in pdf

* adding better error messages

* revert req changes

* simplify prints

* Bump typing-extensions from 4.7.1 to 4.8.0 (#90)

Bumps [typing-extensions](https://github.com/python/typing_extensions) from 4.7.1 to 4.8.0.
- [Release notes](https://github.com/python/typing_extensions/releases)
- [Changelog](https://github.com/python/typing_extensions/blob/main/CHANGELOG.md)
- [Commits](python/typing_extensions@4.7.1...4.8.0)

---
updated-dependencies:
- dependency-name: typing-extensions
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kastan Day <[email protected]>

* Bump flask from 2.3.3 to 3.0.0 (#101)

Bumps [flask](https://github.com/pallets/flask) from 2.3.3 to 3.0.0.
- [Release notes](https://github.com/pallets/flask/releases)
- [Changelog](https://github.com/pallets/flask/blob/main/CHANGES.rst)
- [Commits](pallets/flask@2.3.3...3.0.0)

---
updated-dependencies:
- dependency-name: flask
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kastan Day <[email protected]>

* Guard against kwargs failures during webscrape

* HOTFIX: kwargs in html and pdf ingest for /webscrape

* Export conversation history on /analysis page (#141)

* updated nomic version in requirements.txt

* initial commit to PR

* created API endpoint

* completed export function

* testing csv export on railway

* code to remove file from repo after download

* moved file storing out of docs folder

* added option for extending one URL out when on baseurl, or to opt out of it

* Guarantee unique s3 upload paths, support file updates (e.g. duplicate file guard for Cron jobs) (#99)

* added the add_users() for Canvas

* added canvas course ingest

* updated requirements

* added .md ingest and fixed .py ingest

* deleted test ipynb file

* added nomic viz

* added canvas file update function

* completed update function

* updated course export to include all contents

* modified to handle diff file structures of downloaded content

* modified canvas update

* modified ingest function

* modified update_files() for file replacement

* removed the extra os.remove()

* fix underscore to dash in for pip

* removed json import and added abort to canvas functions

* created separate PR for file update

* added file-update logic in ingest, WIP

* removed irrelevant text files

* modified pdf ingest function

* fixed PDF duplicate issue

* removed unwanted files

* updated nomic version in requirements.txt

* modified s3_paths

* testing unique filenames in aws upload

* added missing library to requirements.txt

* finished check_for_duplicates()

* fixed filename errors

* minor corrections

* added a uuid check in check_for_duplicates()

* regex depends on this being a dash

* regex depends on this being a dash

* Fix bug when no duplicate exists.

* cleaning up prints, testing looks good. ready to merge

* Further print and logging refinement

* Remove S3-based method for de-duplication, use Supabase only

* remove duplicate imports

* remove new requirement

* Final print cleanups

* remove pypdf import

---------

Co-authored-by: root <root@ASMITA>
Co-authored-by: Kastan Day <[email protected]>

* Add Trunk Superlinter on-commit hooks (#164)

* First attempt, should auto format on commit

* maybe fix my yapf github action? Just bad formatting.

* Finalized, excellent Trunk configs for my desired formatting

* Further fix yapf GH Action

* Full format of all files with Trunk

* Fix more linting errors

* Ignore .vscode folder

* Reduce max line size to 120 (from 140)

* Format code

* Delete GH Action & Revert formatting in favor of Trunk.

* Ignore the Readme

* Remove trufflehog -- failing too much, confusing to new devs

* Minor docstring update

* trivial commit for testing

* removing trivial commit for testing

* Merge main into branch, vector_database.py probably needs work

* Cleanup all Trunk lint errors that I can

---------

Co-authored-by: KastanDay <[email protected]>
Co-authored-by: Rohan Marwaha <[email protected]>

* Add example usage of our public API for chat calls

* Add timeout to request, best practice

* Add example usage notebook for our public API

* Improve usage example to return model's response for easy storage. Fix linter inf loop

* Final fix: Switch to https connections

* Enhance logging in getTopContexts(), improve usage example

* minor changes for postman testing

* minor changes for testing

* added print statements

* re-creating error

* added condition to check if content is a list

* added json handling needed to test with Postman

* exception handling for get-nomic-map

* json decoding for testing

* added prints for testing

* added prints for testing

* added prints for testing

* added prints for testing

* fix for string error in nomic log

* removed json debugging code

* Cleanup comments

* Enhance type checking, cleanup formatting

* formatting

* Fix type checks to isinstance()

* Revert vector_database.py to status on main

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Kastan Day <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jkmin3 <[email protected]>
Co-authored-by: root <root@ASMITA>
Co-authored-by: KastanDay <[email protected]>
Co-authored-by: Rohan Marwaha <[email protected]>