Image files missing in mPLUG/MP-DocStruct1M #115

Open
ymzhang0303 opened this issue Oct 8, 2024 · 5 comments

Comments

@ymzhang0303

Thanks for your work! I found that many image files are missing. For example, "pdfa-eng-wds/pdfa-eng-train-0034-pdfplumber/pages/6551295_page3.png" is referenced in multi_page_parsing.jsonl but is not present in imgs.zip.
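A check like the one described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the JSONL field name `image` (holding either one path or a list of paths) is a guess and not confirmed by the dataset card, and the function names `find_missing_images` and `check_files` are hypothetical. It also assumes the paths in the JSONL match the archive member names exactly; if imgs.zip wraps everything in a top-level folder, the names would need that prefix stripped first.

```python
import json
import zipfile
from typing import Iterable, List


def find_missing_images(
    jsonl_lines: Iterable[str],
    archive_names: Iterable[str],
    key: str = "image",  # assumed field name, adjust to the real schema
) -> List[str]:
    """Return the sorted image paths referenced in the JSONL but absent from the archive."""
    available = set(archive_names)
    missing = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        value = record.get(key, [])
        # Accept either a single path or a list of paths.
        paths = [value] if isinstance(value, str) else value
        for path in paths:
            if path not in available:
                missing.add(path)
    return sorted(missing)


def check_files(jsonl_path: str, zip_path: str, key: str = "image") -> List[str]:
    """Convenience wrapper that reads both files from disk."""
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    with open(jsonl_path, encoding="utf-8") as f:
        return find_missing_images(f, names, key=key)
```

Calling `check_files("multi_page_parsing.jsonl", "imgs.zip")` would then list every referenced page image that the archive does not contain.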

@HAWLYQ
Collaborator

HAWLYQ commented Oct 9, 2024

Hi @ymzhang0303, thanks for the reminder. We found that only the first image of each sample was uploaded. We will update the images soon~

@ymzhang0303
Author

Thanks! Looking forward to it!

@ymzhang0303
Author

I also have a question about how the PDFA dataset was processed. When a page has two columns, it looks like the text on the same line is simply concatenated from left to right across both columns, which is not the order a human would read it in. May I know why it was done this way? Thanks!

@HAWLYQ
Collaborator

HAWLYQ commented Oct 17, 2024

Hi, @ymzhang0303, images of MP-DocStruct1M have been updated on both HuggingFace and ModelScope~

@HAWLYQ
Collaborator

HAWLYQ commented Oct 17, 2024

> I also have a question about how the PDFA dataset was processed. When a page has two columns, it looks like the text on the same line is simply concatenated from left to right across both columns, which is not the order a human would read it in. May I know why it was done this way? Thanks!

Organizing the text in reading order is indeed better. However, we didn't find a satisfactory tool or human annotations for constructing such training samples~
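To make the difference between the two orderings concrete, here is a small illustrative sketch. It is not the pipeline used to build the dataset. The word tuples `(text, x0, top)`, the helper names, and the "split at the page midline" rule are all assumptions made for illustration. The midline split is a naive heuristic that would break on full-width titles, tables, or pages with more than two columns, which is part of why a reliable reading-order tool is hard to come by.

```python
from typing import List, Tuple

# (text, x0, top): x0 is the left edge, top is the distance from the page top.
Word = Tuple[str, float, float]


def row_major_order(words: List[Word]) -> List[str]:
    """Read line by line across the full page width, left to right.

    Assumes words on the same visual line share the same `top` value.
    """
    ordered = sorted(words, key=lambda w: (w[2], w[1]))
    return [w[0] for w in ordered]


def two_column_order(words: List[Word], page_width: float) -> List[str]:
    """Naive two-column order: all of the left half first, then the right half."""
    mid = page_width / 2
    left = [w for w in words if w[1] < mid]
    right = [w for w in words if w[1] >= mid]
    return row_major_order(left) + row_major_order(right)


# Toy page: column A on the left, column B on the right, two lines each.
words = [
    ("A1", 50.0, 100.0), ("B1", 350.0, 100.0),
    ("A2", 50.0, 120.0), ("B2", 350.0, 120.0),
]
print(row_major_order(words))          # ['A1', 'B1', 'A2', 'B2']
print(two_column_order(words, 600.0))  # ['A1', 'A2', 'B1', 'B2']
```

The first output corresponds to the left-to-right concatenation described in the question; the second is closer to how a person would read the page.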
