Commit

supporting docx and pptx in offline ingestion script via form recognizer (#474)

Co-authored-by: FARHAD SHAKERIN <[email protected]>
fxs130430 and FARHAD SHAKERIN committed Dec 31, 2023
1 parent 62b4f97 commit 2b92ab3
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions scripts/data_utils.py
@@ -34,7 +34,9 @@
 "shtml": "html",
 "htm": "html",
 "py": "python",
-"pdf": "pdf"
+"pdf": "pdf",
+"docx": "docx",
+"pptx": "pptx"
 }

 RETRY_COUNT = 5
@@ -800,7 +802,7 @@ def chunk_file(
 raise UnsupportedFormatError(f"{file_name} is not supported")

 cracked_pdf = False
-if file_format == "pdf":
+if file_format in ["pdf", "docx", "pptx"]:
 if form_recognizer_client is None:
 raise UnsupportedFormatError("form_recognizer_client is required for pdf files")
 content = extract_pdf_content(file_path, form_recognizer_client, use_layout=use_layout)
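
The change above does two things: it registers the `docx` and `pptx` extensions, and it sends those formats down the same Form Recognizer path as `pdf`. The following is a minimal, self-contained sketch of that routing, not the code from `scripts/data_utils.py`. The names `EXTENSION_TO_FORMAT`, `FORM_RECOGNIZER_FORMATS`, `get_file_format`, and `route_file` are hypothetical, and the extension map is trimmed to the entries visible in the diff.

```python
from typing import Optional

# Hypothetical, trimmed-down extension map mirroring the entries in the diff.
EXTENSION_TO_FORMAT = {
    "shtml": "html",
    "htm": "html",
    "py": "python",
    "pdf": "pdf",
    "docx": "docx",
    "pptx": "pptx",
}

# Formats that, after this commit, are cracked via Form Recognizer.
FORM_RECOGNIZER_FORMATS = {"pdf", "docx", "pptx"}


class UnsupportedFormatError(Exception):
    """Raised when a file cannot be processed (stand-in for the real class)."""


def get_file_format(file_name: str) -> Optional[str]:
    """Return the format label for a file name, or None if unknown."""
    if "." not in file_name:
        return None
    extension = file_name.rsplit(".", 1)[-1].lower()
    return EXTENSION_TO_FORMAT.get(extension)


def route_file(file_name: str, form_recognizer_client: object = None) -> str:
    """Decide which extraction path a file would take (illustrative only)."""
    file_format = get_file_format(file_name)
    if file_format is None:
        raise UnsupportedFormatError(f"{file_name} is not supported")
    if file_format in FORM_RECOGNIZER_FORMATS:
        if form_recognizer_client is None:
            # Assumption: this sketch names the actual format in the message.
            raise UnsupportedFormatError(
                f"form_recognizer_client is required for {file_format} files"
            )
        return "form_recognizer"
    return "plain_text"
```

In this sketch, `route_file("deck.pptx", client)` selects the Form Recognizer path for any non-`None` client, while `route_file("page.htm")` stays on the plain-text path.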

1 comment on commit 2b92ab3

@mfugate-ywcss

Getting this error:

Message: Invalid request.
Inner error: { "code": "InvalidContent", "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats."}
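
The error names two possible causes: a corrupted file or an unsupported format. One way to rule out the first cause locally, before the file is sent anywhere, is to check that it is a readable Office Open XML package. A `.docx` or `.pptx` file is a ZIP archive that contains a `[Content_Types].xml` part. The helper below, `looks_like_ooxml`, is a hypothetical sketch and not part of the ingestion script. It says nothing about whether the service model or API version in use accepts Office formats.

```python
import zipfile


def looks_like_ooxml(path: str) -> bool:
    """Return True if `path` is a ZIP archive with a [Content_Types].xml part.

    This is a cheap sanity check only: it does not validate the document
    contents and does not prove that the remote service supports the format.
    """
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return "[Content_Types].xml" in archive.namelist()
```

If this check passes for the failing file, corruption becomes less likely and the supported-formats list for the service is the next place to look.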
