Support of formattings (strikethroughs, etc.) #53

Open
knguyen1 opened this issue Feb 27, 2025 · 1 comment
Comments

@knguyen1
knguyen1 commented Feb 27, 2025

🚀 The feature, motivation and pitch

I'm assuming this isn't supported out of the box? I tried this PDF with allenai/olmOCR-7B-0225-preview and did not get good results.

{"id": "033dae2f4c12b9b07d00a72702f03ac0639292e4", "text": "The quick fox jumps over the lazy, brave dog, or orangutan.", "source": "olmocr", "added": "2025-02-27", "created": "2025-02-27", "metadata": {"Source-File": "/workspaces/olmocr/tests/gnarly_pdfs/strikethrough_sample.pdf", "olmocr-version": "0.1.58", "pdf-total-pages": 1, "total-input-tokens": 1129, "total-output-tokens": 48, "total-fallback-pages": 0}, "attributes": {"pdf_page_numbers": [[0, 59, 1]]}}

From a sample PDF: The quick fox jumps over the lazybrave dogorangutan.
Image

I modified the prompt and added an instruction about strike-throughs:

def build_finetuning_prompt(base_text: str) -> str:
    """Build the page prompt, asking the model to omit struck-through text."""
    return (
        "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. "
        "Just return the plain text representation of this document as if you were reading it naturally.\n"
        "The text may or may not contain strike-throughs. Return only the text that is NOT struck through.\n"
        "Do not hallucinate.\n"
        # Only this last piece interpolates a value, so only it needs an f-string.
        f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"
    )
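To judge whether a prompt change like this helps, one option is to scan the output record for words that should have been dropped. Below is a minimal sketch, not part of olmocr: `leaked_words` is a hypothetical helper, the record is trimmed to its `text` field, and the choice of "lazy" and "dog" as the struck-through words is an assumption, since the screenshot is what actually shows which words are struck.

```python
import json
import re


def leaked_words(record_line: str, struck_words: list[str]) -> list[str]:
    """Return the struck-through words that still appear in the OCR text.

    Hypothetical helper for illustration; not part of olmocr.
    """
    record = json.loads(record_line)
    # Tokenize on word characters so trailing punctuation does not hide a match.
    tokens = set(re.findall(r"\w+", record["text"].lower()))
    return [word for word in struck_words if word.lower() in tokens]


# Trimmed copy of the output record quoted in this issue (only "text" is used).
sample = json.dumps(
    {"text": "The quick fox jumps over the lazy, brave dog, or orangutan."}
)

# ASSUMPTION: "lazy" and "dog" are the struck-through words in the sample PDF.
print(leaked_words(sample, ["lazy", "dog"]))
```

An empty list would mean none of the listed words survived. Matching on whole tokens rather than substrings avoids false hits such as "dog" inside "dogma".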

Alternatives

  • OpenAI will process this fine if you instruct it to return only the text that is not struck through, but this gets really expensive.

Additional context

No response

@jakep-allenai
Collaborator

Good suggestion. Perhaps struck-through text is underrepresented in olmocr-mix right now.
