Image files missing in mPLUG/MP-DocStruct1M #115
Comments
Hi @ymzhang0303, thanks for the reminder. We found that only the first image of each sample was uploaded. We will update the images soon~
Thanks! Looking forward to it!
I also have a question about processing the PDFA dataset: when a page has two columns, it looks like you simply concatenate the text on the same line from left to right, which is not the order a human would read it in. May I know why you do that? Thanks!
Hi @ymzhang0303, the images of MP-DocStruct1M have been updated on both HuggingFace and ModelScope~
Organizing the text in reading order is indeed better. However, we didn't find a satisfactory tool or human annotations for constructing such training samples~
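To make the difference concrete, here is a minimal sketch contrasting the line-wise concatenation described above with a naive column-wise ordering. It is not the authors' pipeline: the `Word` tuple, the `tol` line tolerance, and the assumption that the column gutter sits exactly at the page midline are all hypothetical, and the heuristic only handles a clean two-column page, which is part of why a general-purpose tool is hard to come by.

```python
from collections import namedtuple

# Hypothetical word box: text plus the left edge (x0) and top edge of its bounding box.
Word = namedtuple("Word", "text x0 top")


def group_lines(words, tol=3.0):
    """Group words whose top edges are within `tol` of each other into lines."""
    lines = []
    for word in sorted(words, key=lambda w: (w.top, w.x0)):
        if lines and abs(lines[-1][0].top - word.top) <= tol:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words from left to right.
    return [sorted(line, key=lambda w: w.x0) for line in lines]


def line_wise_text(words):
    """Join text on the same visual line from left to right, ignoring columns."""
    return "\n".join(" ".join(w.text for w in line) for line in group_lines(words))


def column_wise_text(words, page_width):
    """Read the left half of the page first, then the right half.

    Assumption: exactly two columns, split at the page midline.
    """
    mid = page_width / 2
    left = [w for w in words if w.x0 < mid]
    right = [w for w in words if w.x0 >= mid]
    return "\n".join(line_wise_text(col) for col in (left, right) if col)


words = [
    Word("Alpha", 10, 10), Word("beta", 40, 10), Word("Delta", 110, 10),
    Word("gamma", 10, 20), Word("epsilon", 110, 20),
]
print(line_wise_text(words))         # rows cut across both columns
print(column_wise_text(words, 200))  # left column first, then right column
```

On this toy page the line-wise version interleaves the two columns row by row, while the column-wise version finishes the left column before starting the right one. Real pages with headers that span both columns, figures, or uneven gutters would break the midline assumption.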
Thanks for your work! I found that many image files are missing. For example, "pdfa-eng-wds/pdfa-eng-train-0034-pdfplumber/pages/6551295_page3.png" is referenced in multi_page_parsing.jsonl but was not found in imgs.zip.
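For anyone who wants to reproduce this check, here is a rough sketch of how the missing files can be listed. The JSONL schema is not documented in this thread, so rather than assume a field name the script collects every string that ends in an image extension. It also assumes the archive may nest files under a top-level folder, so a reference counts as present if it matches the tail of any archive member path. The function names are my own.

```python
import json
import zipfile

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg")


def iter_image_paths(value):
    """Yield every string in a nested JSON value that looks like an image path."""
    if isinstance(value, str):
        if value.lower().endswith(IMAGE_SUFFIXES):
            yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_image_paths(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_image_paths(item)


def archive_path_suffixes(zip_path):
    """Return every trailing sub-path of every member, e.g. a/b.png -> {a/b.png, b.png}."""
    suffixes = set()
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            parts = name.strip("/").split("/")
            for start in range(len(parts)):
                suffixes.add("/".join(parts[start:]))
    return suffixes


def find_missing_images(jsonl_path, zip_path):
    """List image paths referenced in the JSONL file that the archive does not contain."""
    available = archive_path_suffixes(zip_path)
    referenced = set()
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                referenced.update(iter_image_paths(json.loads(line)))
    return sorted(path for path in referenced if path not in available)


if __name__ == "__main__":
    for path in find_missing_images("multi_page_parsing.jsonl", "imgs.zip"):
        print(path)
```

Storing every trailing sub-path costs extra memory on a large archive. If the paths in the JSONL file are known to match the archive layout exactly, a plain set of `namelist()` entries is enough.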