Thanks! #24

srush · 2020-04-25T19:12:16Z

Thanks so much for this repo, it has been amazingly useful. We used it to build the ICLR 2020 virtual edition and added all the pictures this way!

https://twitter.com/srush_nlp/status/1253788694739386371

zhxgj · 2020-04-25T22:12:07Z

Hi @srush, Thanks for using PubLayNet. Glad to hear it helped!

zhxgj · 2020-04-26T23:08:25Z

Hi @srush, we happen to be preparing a blog post about PubLayNet and would like to add this great news to it. Do you have an estimate of how much time PubLayNet saved you? Any kind of metric would be very useful for us. Thanks very much.

srush · 2020-04-26T23:15:12Z

Infinity time, I would have given up. I tried every other direct PDF extraction method and they all had intractable issues, e.g. they couldn't extract PDF images, or the output was too low-res, or it was just bad. I was about to give up until I found this tool, and with 20 lines of code and 2 hours on Google Colab (sorry), I had every image from 700 papers at high res with precise enough accuracy (it's unfortunately bad at columns? Guessing that is because of PubMed).

zhxgj · 2020-04-27T03:47:32Z

Thanks @srush for your feedback. Do you have any examples where the model is bad at columns? I can check whether that comes from bias or annotation errors in the data, and fix it if I can.

srush · 2020-04-27T04:14:21Z

I don't have an exact example, but it seemed to do worse on papers like this one, where the text wraps around the images: https://openreview.net/pdf?id=Bkxv90EKPB

zhxgj · 2020-04-27T07:29:00Z

Ah, yes. The dataset does not have many samples where text wraps around images, because most journals do not typeset that way. I also think our automated annotation algorithm does not handle this case well, and since poorly annotated pages are excluded, samples with this layout become even rarer.

zhxgj · 2020-04-27T07:29:45Z

Hi @srush, could I please quote your above feedback in our blog post?

srush · 2020-04-27T11:06:09Z

Yes, you can use whatever you need. Right, I figured this was nonstandard formatting. For whatever reason you do see it sometimes in NeurIPS and ICLR papers. For future years maybe we should generate a small fine-tuning set.

zhxgj · 2020-04-27T22:52:54Z

Thanks @srush. Yes, it is a good idea to have a small fine-tuning set for a specific template. The set can be pre-annotated with our model and then manually curated, which will save some time.
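
The pre-annotate-then-curate idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the `pages` structure, the `to_coco_preannotations` helper, and the `min_score` threshold are hypothetical and not part of the PubLayNet repo; the output follows the general COCO layout (`images`, `annotations`, `categories`, with `bbox` as `[x, y, width, height]`), which is the format PubLayNet annotations use.

```python
import json

# PubLayNet category ids as they appear in its COCO-style annotation files.
PUBLAYNET_CATEGORIES = {1: "text", 2: "title", 3: "list", 4: "table", 5: "figure"}


def to_coco_preannotations(pages, categories=PUBLAYNET_CATEGORIES, min_score=0.5):
    """Turn model detections into a COCO-style dict that an annotator can correct.

    `pages` is a hypothetical list of dicts, one per page image:
        {"file_name": str, "width": int, "height": int,
         "detections": [{"category_id": int, "score": float,
                         "bbox": [x, y, w, h]}, ...]}
    """
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for i, n in sorted(categories.items())],
    }
    ann_id = 1
    for image_id, page in enumerate(pages, start=1):
        coco["images"].append({
            "id": image_id,
            "file_name": page["file_name"],
            "width": page["width"],
            "height": page["height"],
        })
        for det in page["detections"]:
            # Drop low-confidence boxes so the annotator has less to delete.
            if det["score"] < min_score:
                continue
            x, y, w, h = det["bbox"]
            coco["annotations"].append({
                "id": ann_id,
                "image_id": image_id,
                "category_id": det["category_id"],
                "bbox": [x, y, w, h],
                "area": w * h,
                "iscrowd": 0,
                # Kept so low-confidence boxes can be reviewed first.
                "score": det["score"],
            })
            ann_id += 1
    return coco


example_pages = [{
    "file_name": "paper_0001_page_3.png",  # made-up file name
    "width": 612,
    "height": 792,
    "detections": [
        {"category_id": 5, "score": 0.92, "bbox": [50, 100, 100, 50]},
        {"category_id": 1, "score": 0.20, "bbox": [50, 200, 300, 40]},
    ],
}]
preannotations = to_coco_preannotations(example_pages)
preannotations_json = json.dumps(preannotations, indent=2)
```

The resulting JSON would then be loaded into whichever annotation tool is used for the manual pass; which tool that is, and whether it accepts the extra `score` field, is left open here.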

tshrjn · 2021-06-09T11:30:33Z

Hi there,

This is great!
However, I just wanted to clarify why something like PyMuPDF or pdfminer (and its CLI script for images) couldn't be used?

They seem straightforward. Am I missing something?
Does PubLayNet do something additional, like extracting the main image?

srush · 2021-06-09T12:37:50Z

PubLayNet extracts the images as they are shown in the paper (cropped, captioned, etc.), which is more interesting and harder.
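
To make the difference concrete: a layout model returns regions of the rendered page, so a figure is cropped as it appears on the page rather than pulled out as an embedded image object. Below is a minimal sketch of that cropping step, assuming COCO-style detections (`bbox` as `[x, y, width, height]`) and the PubLayNet annotation category ids. The `figure_crop_boxes` helper, the detection dicts, and the DPI, score, and padding values are hypothetical; depending on the framework, a model's predicted class indices may be 0-based and need remapping first.

```python
# PubLayNet category ids as they appear in its COCO-style annotation files.
PUBLAYNET_CATEGORIES = {1: "text", 2: "title", 3: "list", 4: "table", 5: "figure"}


def figure_crop_boxes(detections, detect_dpi, render_dpi, min_score=0.8, pad=4):
    """Return (left, top, right, bottom) pixel boxes for confident figure detections.

    Detection is assumed to run on a page rendered at `detect_dpi`; the boxes
    are rescaled so the crop can be taken from a sharper render at `render_dpi`.
    """
    scale = render_dpi / detect_dpi
    boxes = []
    for det in detections:
        if PUBLAYNET_CATEGORIES.get(det["category_id"]) != "figure":
            continue
        if det["score"] < min_score:
            continue
        x, y, w, h = det["bbox"]
        # Pad slightly so tight boxes do not clip the figure edge.
        left = max(0, int(round(x * scale)) - pad)
        top = max(0, int(round(y * scale)) - pad)
        right = int(round((x + w) * scale)) + pad
        bottom = int(round((y + h) * scale)) + pad
        boxes.append((left, top, right, bottom))
    return boxes


example_detections = [
    {"category_id": 5, "score": 0.95, "bbox": [10, 20, 100, 50]},   # figure, kept
    {"category_id": 1, "score": 0.99, "bbox": [10, 80, 300, 200]},  # text, skipped
    {"category_id": 5, "score": 0.30, "bbox": [200, 20, 50, 50]},   # low score, skipped
]
crop_boxes = figure_crop_boxes(example_detections, detect_dpi=72, render_dpi=144)
```

The `(left, top, right, bottom)` order matches the box tuple that Pillow's `Image.crop` accepts, so each box could be passed to it directly on the high-resolution page render. The right and bottom edges are not clamped to the page size here; a caller would want to do that.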
