Fix Type Error in Nomic Logging (#174)
* updated nomic version in requirements.txt

* Updated Nomic in requirements.txt

* fix openai version to pre 1.0

* upgrade python from 3.8 to 3.10

* trying to fix tesseract // pdfminer requirements for image ingest

* adding strict versions to all requirements

* Bump pymupdf from 1.22.5 to 1.23.6 (#136)

Bumps [pymupdf](https://github.com/pymupdf/pymupdf) from 1.22.5 to 1.23.6.
- [Release notes](https://github.com/pymupdf/pymupdf/releases)
- [Changelog](https://github.com/pymupdf/PyMuPDF/blob/main/changes.txt)
- [Commits](pymupdf/PyMuPDF@1.22.5...1.23.6)

---
updated-dependencies:
- dependency-name: pymupdf
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* compatible wheel version

* upgrade pip during image startup

* properly upgrade pip

* Fully lock ALL requirements. Hopefully speed up build times, too

* Limit unstructured dependencies, image ballooned from 700MB to 6GB. Hopefully resolved

* Lock version of pip

* Lock (correct) version of pip

* add libgl1 for cv2 in Docker (for unstructured)

* adding proper error logging to image ingest

* Installing unstructured requirements individually to hopefully reduce bundle size by 5GB

* Reduce use of unstructured, hopefully the install is much smaller now

* Guarantee Unique S3 Upload paths (#137)

* should be fully working, in final testing

* trying to fix double nested kwargs

* fixing readable_filename in pdf ingest

* apt install tesseract-ocr, LAME

* remove stupid typo

* minor bug

* Finally fix **kwargs passing

* minor fix

* guarding against webscrape kwargs in pdf

* guarding against webscrape kwargs in pdf

* guarding against webscrape kwargs in pdf

* adding better error messages

* revert req changes

* simplify prints

* Bump typing-extensions from 4.7.1 to 4.8.0 (#90)

Bumps [typing-extensions](https://github.com/python/typing_extensions) from 4.7.1 to 4.8.0.
- [Release notes](https://github.com/python/typing_extensions/releases)
- [Changelog](https://github.com/python/typing_extensions/blob/main/CHANGELOG.md)
- [Commits](python/typing_extensions@4.7.1...4.8.0)

---
updated-dependencies:
- dependency-name: typing-extensions
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kastan Day <[email protected]>

* Bump flask from 2.3.3 to 3.0.0 (#101)

Bumps [flask](https://github.com/pallets/flask) from 2.3.3 to 3.0.0.
- [Release notes](https://github.com/pallets/flask/releases)
- [Changelog](https://github.com/pallets/flask/blob/main/CHANGES.rst)
- [Commits](pallets/flask@2.3.3...3.0.0)

---
updated-dependencies:
- dependency-name: flask
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kastan Day <[email protected]>

* Guard against kwargs failures during webscrape

* HOTFIX: kwargs in html and pdf ingest for /webscrape

* Export conversation history on /analysis page (#141)

* updated nomic version in requirements.txt

* initial commit to PR

* created API endpoint

* completed export function

* testing csv export on railway

* code to remove file from repo after download

* moved file storing out of docs folder

* added option for extending one URL out when on baseurl or to opt out of it

* Guarantee unique s3 upload paths, support file updates (e.g. duplicate file guard for Cron jobs) (#99)

* added the add_users() for Canvas

* added canvas course ingest

* updated requirements

* added .md ingest and fixed .py ingest

* deleted test ipynb file

* added nomic viz

* added canvas file update function

* completed update function

* updated course export to include all contents

* modified to handle diff file structures of downloaded content

* modified canvas update

* modified ingest function

* modified update_files() for file replacement

* removed the extra os.remove()

* fix underscore to dash for pip

* removed json import and added abort to canvas functions

* created separate PR for file update

* added file-update logic in ingest, WIP

* removed irrelevant text files

* modified pdf ingest function

* fixed PDF duplicate issue

* removed unwanted files

* updated nomic version in requirements.txt

* modified s3_paths

* testing unique filenames in aws upload

* added missing library to requirements.txt

* finished check_for_duplicates()

* fixed filename errors

* minor corrections

* added a uuid check in check_for_duplicates()

* regex depends on this being a dash

* regex depends on this being a dash

* Fix bug when no duplicate exists.

* cleaning up prints, testing looks good. ready to merge

* Further print and logging refinement

* Remove s3-based method for de-duplication, use Supabase only

* remove duplicate imports

* remove new requirement

* Final print cleanups

* remove pypdf import

---------

Co-authored-by: root <root@ASMITA>
Co-authored-by: Kastan Day <[email protected]>

* Add Trunk Superlinter on-commit hooks (#164)

* First attempt, should auto format on commit

* maybe fix my yapf github action? Just bad formatting.

* Finalized, excellent Trunk configs for my desired formatting

* Further fix yapf GH Action

* Full format of all files with Trunk

* Fix more linting errors

* Ignore .vscode folder

* Reduce max line size to 120 (from 140)

* Format code

* Delete GH Action & Revert formatting in favor of Trunk.

* Ignore the Readme

* Remove trufflehog -- failing too much, confusing to new devs

* Minor docstring update

* trivial commit for testing

* removing trivial commit for testing

* Merge main into branch, vector_database.py probably needs work

* Cleanup all Trunk lint errors that I can

---------

Co-authored-by: KastanDay <[email protected]>
Co-authored-by: Rohan Marwaha <[email protected]>

* Add example usage of our public API for chat calls

* Add timeout to request, best practice

* Add example usage notebook for our public API

* Improve usage example to return model's response for easy storage. Fix linter inf loop

* Final fix: Switch to https connections

* Enhance logging in getTopContexts(), improve usage example

* minor changes for postman testing

* minor changes for testing

* added print statements

* re-creating error

* added condition to check if content is a list

* added json handling needed to test with Postman

* exception handling for get-nomic-map

* json decoding for testing

* added prints for testing

* added prints for testing

* added prints for testing

* added prints for testing

* fix for string error in nomic log

* removed json debugging code

* Cleanup comments

* Enhance type checking, cleanup formatting

* formatting

* Fix type checks to isinstance()

* Revert vector_database.py to status on main

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Kastan Day <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jkmin3 <[email protected]>
Co-authored-by: root <root@ASMITA>
Co-authored-by: KastanDay <[email protected]>
Co-authored-by: Rohan Marwaha <[email protected]>
7 people authored Dec 19, 2023
1 parent a619243 commit 90ec8b9
Showing 3 changed files with 64 additions and 25 deletions.
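Context for the fix: with OpenAI-style chat payloads, `message['content']` can be either a plain string or a list of content parts (e.g. `[{'type': 'text', 'text': '...'}]`). The old logging code concatenated `message['content']` straight into a log string, which raises a `TypeError` whenever the content arrives as a list. A minimal sketch of the failure and the normalization this diff applies (variable names here are illustrative, not the repo's code):

```python
# Illustrative sketch of the bug this commit fixes, not the repo's code.
message = {"role": "user", "content": [{"type": "text", "text": "hello"}]}

# Old behavior: concatenating a list onto a str raises
# TypeError: can only concatenate str (not "list") to str
# convo = ">>> " + message["role"] + ": " + message["content"] + "\n"

# The fix applied throughout nomic_logging.py: normalize to str first.
if isinstance(message["content"], list):
  text = message["content"][0]["text"]
else:
  text = message["content"]

convo = ">>> " + message["role"] + ": " + text + "\n"
print(convo)  # >>> user: hello
```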
1 change: 1 addition & 0 deletions ai_ta_backend/export_data.py
@@ -5,6 +5,7 @@
 import supabase
+import sentry_sdk
 
 
 def export_convo_history_csv(course_name: str, from_date='', to_date=''):
   """
   This function exports the conversation history to a csv file.
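The diff for this file only adds the `sentry_sdk` import; for context, a minimal sketch of what an export like `export_convo_history_csv` could look like, assuming a Supabase table named `llm-convo-monitor`, standard `supabase-py` query calls, and pandas for the CSV step (the table name, column names, and error handling are assumptions, not shown in this diff):

```python
import os

import pandas as pd
import sentry_sdk
import supabase

# assumes SUPABASE_URL / SUPABASE_API_KEY are set in the environment
client = supabase.create_client(os.environ['SUPABASE_URL'], os.environ['SUPABASE_API_KEY'])


def export_convo_history_csv(course_name: str, from_date='', to_date=''):
  """Sketch: dump a course's conversation rows to CSV; table/column names are assumed."""
  try:
    query = client.table('llm-convo-monitor').select('*').eq('course_name', course_name)
    if from_date:
      query = query.gte('created_at', from_date)
    if to_date:
      query = query.lte('created_at', to_date)
    rows = query.execute().data

    out_path = f"{course_name}_convo_history.csv"
    pd.DataFrame(rows).to_csv(out_path, index=False)
    return out_path
  except Exception as e:
    # the commit wires in sentry_sdk so failures are reported, not swallowed
    sentry_sdk.capture_exception(e)
    raise
```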
4 changes: 2 additions & 2 deletions ai_ta_backend/main.py
@@ -33,8 +33,7 @@
     # Set profiles_sample_rate to 1.0 to profile 100% of sampled transactions.
     # We recommend adjusting this value in production.
     profiles_sample_rate=1.0,
-    enable_tracing=True
-)
+    enable_tracing=True)
 
 app = Flask(__name__)
 CORS(app)
@@ -491,6 +490,7 @@ def logToNomic():
   data = request.get_json()
   course_name = data['course_name']
   conversation = data['conversation']
+
   if course_name == '' or conversation == '':
     # proper web error "400 Bad request"
     abort(
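The hunk above is truncated at the `abort(` call; a minimal sketch of the validation pattern it implements, using Flask's `abort` (the route path and the error message text are assumptions beyond what the hunk shows):

```python
from flask import Flask, abort, jsonify, request

app = Flask(__name__)


@app.route('/onResponseCompletion', methods=['POST'])  # route name assumed
def logToNomic():
  data = request.get_json()
  course_name = data['course_name']
  conversation = data['conversation']

  if course_name == '' or conversation == '':
    # proper web error "400 Bad request"
    abort(400, description="course_name and conversation are required.")

  # hand off to the logging helper (sketch; the real code calls log_convo_to_nomic)
  return jsonify({'status': 'ok'})
```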
84 changes: 61 additions & 23 deletions ai_ta_backend/nomic_logging.py
@@ -18,14 +18,15 @@ def log_convo_to_nomic(course_name: str, conversation) -> str:
   NOMIC_MAP_NAME_PREFIX = 'Conversation Map for '
   """
   Logs conversation to Nomic.
   1. Check if map exists for given course
   2. Check if conversation ID exists
     - if yes, delete and add new data point
     - if no, add new data point
   3. Keep current logic for map doesn't exist - update metadata
   """
-  print(f"in log_convo_to_nomic() for course: {course_name}")
+
+  print(f"in log_convo_to_nomic() for course: {course_name}")
   messages = conversation['conversation']['messages']
   user_email = conversation['conversation']['user_email']
   conversation_id = conversation['conversation']['id']
@@ -42,6 +43,7 @@ def log_convo_to_nomic(course_name: str, conversation) -> str:
   try:
     # fetch project metadata and embeddings
     project = AtlasProject(name=project_name, add_datums_if_exists=True)
+
     map_metadata_df = project.maps[1].data.df  # type: ignore
     map_embeddings_df = project.maps[1].embeddings.latent
     map_metadata_df['id'] = map_metadata_df['id'].astype(int)
@@ -70,7 +72,12 @@ def log_convo_to_nomic(course_name: str, conversation) -> str:
         else:
           emoji = "πŸ€– "
 
-        prev_convo += "\n>>> " + emoji + message['role'] + ": " + message['content'] + "\n"
+        if isinstance(message['content'], list):
+          text = message['content'][0]['text']
+        else:
+          text = message['content']
+
+        prev_convo += "\n>>> " + emoji + message['role'] + ": " + text + "\n"
 
       # modified timestamp
       current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
@@ -92,15 +99,24 @@ def log_convo_to_nomic(course_name: str, conversation) -> str:
       # add new data point
       user_queries = []
       conversation_string = ""
+
       first_message = messages[0]['content']
+      if isinstance(first_message, list):
+        first_message = first_message[0]['text']
       user_queries.append(first_message)
+
       for message in messages:
         if message['role'] == 'user':
           emoji = "πŸ™‹ "
         else:
           emoji = "πŸ€– "
-        conversation_string += "\n>>> " + emoji + message['role'] + ": " + message['content'] + "\n"
+
+        if isinstance(message['content'], list):
+          text = message['content'][0]['text']
+        else:
+          text = message['content']
+
+        conversation_string += "\n>>> " + emoji + message['role'] + ": " + text + "\n"
 
       # modified timestamp
       current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
@@ -163,18 +179,19 @@ def get_nomic_map(course_name: str):
 
   try:
     project = atlas.AtlasProject(name=project_name, add_datums_if_exists=True)
-  except Exception as e:
-    err = f"Nomic map does not exist yet, probably because you have less than 20 queries on your project: {e}"
+    map = project.get_map(project_name)
+
+    print(f"⏰ Nomic Full Map Retrieval: {(time.monotonic() - start_time):.2f} seconds")
+    return {"map_id": f"iframe{map.id}", "map_link": map.map_link}
+  except ValueError as ve:
+    # Error: ValueError: You must specify a unique_id_field when creating a new project.
+    err = f"Nomic map does not exist yet, probably because you have less than 20 queries on your project: {ve}"
     print(err)
     return {"map_id": None, "map_link": None}
+  except Exception as e:
+    sentry_sdk.capture_exception(e)
+    return {"map_id": None, "map_link": None}
-
-  map = project.get_map(project_name)
-
-  print(f"⏰ Nomic Full Map Retrieval: {(time.monotonic() - start_time):.2f} seconds")
-
-  return {"map_id": f"iframe{map.id}", "map_link": map.map_link}
 
 
 def create_nomic_map(course_name: str, log_data: list):
   """
@@ -216,28 +233,44 @@ def create_nomic_map(course_name: str, log_data: list):
       created_at = pd.to_datetime(row['created_at']).strftime('%Y-%m-%d %H:%M:%S')
       convo = row['convo']
       messages = convo['messages']
+
       first_message = messages[0]['content']
+      if isinstance(first_message, list):
+        first_message = first_message[0]['text']
+
       user_queries.append(first_message)
+
       # create metadata for multi-turn conversation
       conversation = ""
-      if message['role'] == 'user':  # type: ignore
-        emoji = "πŸ™‹ "
-      else:
-        emoji = "πŸ€– "
       for message in messages:
         # string of role: content, role: content, ...
-        conversation += "\n>>> " + emoji + message['role'] + ": " + message['content'] + "\n"
+        if message['role'] == 'user':  # type: ignore
+          emoji = "πŸ™‹ "
+        else:
+          emoji = "πŸ€– "
+
+        if isinstance(message['content'], list):
+          text = message['content'][0]['text']
+        else:
+          text = message['content']
+
+        conversation += "\n>>> " + emoji + message['role'] + ": " + text + "\n"
 
       # append current chat to previous chat if convo already exists
       if convo['id'] == log_conversation_id:
         conversation_exists = True
-        if m['role'] == 'user':  # type: ignore
-          emoji = "πŸ™‹ "
-        else:
-          emoji = "πŸ€– "
 
         for m in log_messages:
-          conversation += "\n>>> " + emoji + m['role'] + ": " + m['content'] + "\n"
+          if m['role'] == 'user':  # type: ignore
+            emoji = "πŸ™‹ "
+          else:
+            emoji = "πŸ€– "
+
+          if isinstance(m['content'], list):
+            text = m['content'][0]['text']
+          else:
+            text = m['content']
+          conversation += "\n>>> " + emoji + m['role'] + ": " + text + "\n"
 
       # adding modified timestamp
       current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
@@ -265,7 +298,12 @@ def create_nomic_map(course_name: str, log_data: list):
           emoji = "πŸ™‹ "
         else:
           emoji = "πŸ€– "
-        conversation += "\n>>> " + emoji + message['role'] + ": " + message['content'] + "\n"
+
+        if isinstance(message['content'], list):
+          text = message['content'][0]['text']
+        else:
+          text = message['content']
+        conversation += "\n>>> " + emoji + message['role'] + ": " + text + "\n"
 
       # adding timestamp
       current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
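The same `isinstance` check now appears in several places across nomic_logging.py. A possible follow-up (not part of this commit) would be a single helper that normalizes a message's content to plain text once:

```python
def _extract_text(content) -> str:
  """Normalize an OpenAI-style message content field to plain text.

  Hypothetical helper, not in the commit: content may be a plain string or a
  list of content parts like [{'type': 'text', 'text': '...'}].
  """
  if isinstance(content, list):
    return content[0]['text']
  return str(content)

# usage inside the logging loops:
# conversation += "\n>>> " + emoji + message['role'] + ": " + _extract_text(message['content']) + "\n"
```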
