Image files missing in mPLUG/MP-DocStruct1M #115

ymzhang0303 · 2024-10-08T21:33:31Z

Thanks for your work! I found that many image files are missing. For example, "pdfa-eng-wds/pdfa-eng-train-0034-pdfplumber/pages/6551295_page3.png" is referenced in multi_page_parsing.jsonl but is not in imgs.zip.
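
A check like the one described above can be scripted. The following is a minimal sketch, not the dataset's official tooling: the `image` field name, the list-or-string shape of that field, and the optional `prefix` for a top-level folder inside the archive are all assumptions, since the schema of multi_page_parsing.jsonl is not shown in this thread.

```python
import json
import zipfile


def find_missing_images(jsonl_path, zip_path, key="image", prefix=""):
    """Return sorted image paths referenced in a JSONL file but absent from a zip.

    `key` is an ASSUMED field name holding either one path or a list of paths.
    `prefix` is an ASSUMED optional folder name prepended to each path when
    looking it up in the archive (for example, if the zip wraps everything
    in a top-level directory).
    """
    with zipfile.ZipFile(zip_path) as zf:
        archived = set(zf.namelist())

    missing = set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            paths = record.get(key, [])
            if isinstance(paths, str):
                paths = [paths]
            for path in paths:
                if prefix + path not in archived:
                    missing.add(path)
    return sorted(missing)
```

Calling `find_missing_images("multi_page_parsing.jsonl", "imgs.zip")` would then list every referenced path that the archive lacks, provided the assumed field name matches the real one.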

HAWLYQ · 2024-10-09T03:03:43Z

Hi @ymzhang0303, thanks for the reminder. We found that only the first image of each sample was uploaded. We will update the images soon~

ymzhang0303 · 2024-10-09T03:16:51Z

Thanks! Looking forward to it!

ymzhang0303 · 2024-10-09T19:25:30Z

I also have a question about how the PDFA dataset was processed. When a page has two columns, it looks like you just concatenate the text on the same line from left to right, which is not the order a human would read it in. May I know the reason for that? Thanks!

HAWLYQ · 2024-10-17T07:29:05Z

Hi, @ymzhang0303, images of MP-DocStruct1M have been updated on both HuggingFace and ModelScope~

HAWLYQ · 2024-10-17T07:42:01Z

> I also have a question about how the PDFA dataset was processed. When a page has two columns, it looks like you just concatenate the text on the same line from left to right, which is not the order a human would read it in. May I know the reason for that? Thanks!

Organizing the text in reading order is indeed better. However, we didn't find a satisfactory tool or human annotations to construct such training samples~
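
To make the difference between the two orderings concrete, here is a small illustrative sketch. It is not the pipeline used to build MP-DocStruct1M. The `Word` type, the `line_tol` tolerance, and the fixed split at half the page width are all assumptions for illustration; real pages with headers, figures, or uneven columns would need proper layout analysis.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word box
    y: float  # top edge of the word box


def line_order(words, line_tol=2.0):
    """Row-major order: bucket words into lines by y, then sort by x.

    This mirrors the left-to-right concatenation discussed in the thread.
    `line_tol` is an assumed bucket height for grouping words into a line.
    """
    return sorted(words, key=lambda w: (round(w.y / line_tol), w.x))


def two_column_order(words, page_width, line_tol=2.0):
    """Naive reading order that ASSUMES exactly two columns split at mid-page."""
    mid = page_width / 2
    left = [w for w in words if w.x < mid]
    right = [w for w in words if w.x >= mid]
    # Read the whole left column first, then the whole right column.
    return line_order(left, line_tol) + line_order(right, line_tol)


words = [
    Word("L1", 0, 0), Word("R1", 60, 0),
    Word("L2", 0, 10), Word("R2", 60, 10),
]
print([w.text for w in line_order(words)])             # ['L1', 'R1', 'L2', 'R2']
print([w.text for w in two_column_order(words, 100)])  # ['L1', 'L2', 'R1', 'R2']
```

The fixed midpoint split is the weak point of this sketch: it mislabels any word that crosses the page center, which is one reason a robust tool or human annotation is needed for training data.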
