Update PDF parsing to use utf-8 chars instead of ascii
KastanDay authored Oct 10, 2023
1 parent 6fbd1bb commit 97a300d
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion ai_ta_backend/vector_database.py
@@ -496,7 +496,7 @@ def _ingest_single_pdf(self, s3_path: str, course_name: str, **kwargs):
self.s3_client.upload_fileobj(f, os.getenv('S3_BUCKET_NAME'), s3_upload_path)

# Extract text
- text = page.get_text().encode("utf8").decode('ascii', errors='ignore') # get plain text (is in UTF-8)
+ text = page.get_text().encode("utf8").decode("utf8", errors='ignore') # get plain text (is in UTF-8)
pdf_pages_OCRed.append(dict(text=text, page_number=i, readable_filename=Path(s3_path).name))

if kwargs['kwargs'] == {}:
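
To illustrate what this one-line change does, here is a minimal sketch. The `sample` string is a hypothetical stand-in for the value returned by `page.get_text()`; it is not taken from the repository. Decoding the UTF-8 bytes as ASCII with `errors='ignore'` silently drops every non-ASCII character. Decoding them as UTF-8 keeps the text intact, since the round trip returns a well-formed `str` unchanged.

```python
# Hypothetical sample text standing in for the str returned by page.get_text().
sample = "naïve café"

# Behaviour before the commit: UTF-8 bytes decoded as ASCII.
# Bytes outside the ASCII range are skipped, so accented letters vanish.
old_text = sample.encode("utf8").decode("ascii", errors="ignore")

# Behaviour after the commit: UTF-8 bytes decoded as UTF-8.
# Every character survives the round trip.
new_text = sample.encode("utf8").decode("utf8", errors="ignore")

print(repr(old_text))
print(repr(new_text))
```

With this input, the old expression yields `'nave caf'`, while the new one returns the original string.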