You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I'm assuming this isn't supported out of the box? I tried this PDF with allenai/olmOCR-7B-0225-preview and did not get good results.
{"id": "033dae2f4c12b9b07d00a72702f03ac0639292e4", "text": "The quick fox jumps over the lazy, brave dog, or orangutan.", "source": "olmocr", "added": "2025-02-27", "created": "2025-02-27", "metadata": {"Source-File": "/workspaces/olmocr/tests/gnarly_pdfs/strikethrough_sample.pdf", "olmocr-version": "0.1.58", "pdf-total-pages": 1, "total-input-tokens": 1129, "total-output-tokens": 48, "total-fallback-pages": 0}, "attributes": {"pdf_page_numbers": [[0, 59, 1]]}}
From a sample PDF: The quick fox jumps over the lazybrave dogorangutan.
1
Modified the prompt and added:
def build_finetuning_prompt(base_text: str) -> str:
return (
f"Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. "
f"Just return the plain text representation of this document as if you were reading it naturally.\n"
f"The text may or may not contain strike-throughs. Return only the text that is NOT struck through.\n"
f"Do not hallucinate.\n"
f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"
)
Alternatives
OpenAI will process this fine. If you instruct it to return non-struck-through texts. But this gets really expensive.
Additional context
No response
The text was updated successfully, but these errors were encountered:
🚀 The feature, motivation and pitch
I'm assuming this isn't supported out of the box? I tried this PDF with
allenai/olmOCR-7B-0225-preview
and did not get good results.{"id": "033dae2f4c12b9b07d00a72702f03ac0639292e4", "text": "The quick fox jumps over the lazy, brave dog, or orangutan.", "source": "olmocr", "added": "2025-02-27", "created": "2025-02-27", "metadata": {"Source-File": "/workspaces/olmocr/tests/gnarly_pdfs/strikethrough_sample.pdf", "olmocr-version": "0.1.58", "pdf-total-pages": 1, "total-input-tokens": 1129, "total-output-tokens": 48, "total-fallback-pages": 0}, "attributes": {"pdf_page_numbers": [[0, 59, 1]]}}
From a sample PDF: The
quickfox jumps over thelazybravedogorangutan.1
Modified the prompt and added:
Alternatives
Additional context
No response
The text was updated successfully, but these errors were encountered: