This repository has been archived by the owner on May 29, 2018. It is now read-only.

arXiv data pull #1

Open
bmcfee opened this issue Oct 11, 2014 · 4 comments

Comments

@bmcfee
Member

bmcfee commented Oct 11, 2014

arXiv makes its data available via S3 buckets; see: http://arxiv.org/help/bulk_data_s3

Some highlights from this page:

  • Papers are available as both PDF and LaTeX source
  • Complete PDF data is about 270GB
  • Complete source data is about 190GB
  • All data lives in requester-pays buckets, so we'll have to cover the cost of the pull ($0.12/GB, about $50 total for both)
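As a quick sanity check on the cost estimate, here is a minimal sketch of the arithmetic. The sizes and the per-GB price are the figures quoted above, not current AWS pricing, and the function name is just for illustration.

```python
# Figures as quoted in this issue (2014); treat them as assumptions.
PDF_GB = 270
SOURCE_GB = 190
PRICE_PER_GB_USD = 0.12


def transfer_cost(gigabytes, price_per_gb=PRICE_PER_GB_USD):
    """Return the requester-pays transfer cost in USD for a pull of this size."""
    return gigabytes * price_per_gb


total_gb = PDF_GB + SOURCE_GB
total_cost = transfer_cost(total_gb)
print(f"{total_gb} GB -> ${total_cost:.2f}")
```

With these numbers the full pull (PDF plus source) comes to 460 GB, so a bit over $50.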

Questions:

  • Where are we going to host the local copies? Anyone want to volunteer server space?
  • How will analysis work?
    • Option 1: grep pdf text for urls and/or known DOIs for software
    • Option 2: parse the source directly, maybe with plasTeX? This could get expensive and difficult, but may give more reliable results
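For Option 1, a rough sketch of what the grep could look like once the PDF text has been extracted (the extraction step itself is not shown). The patterns are deliberately loose and are assumptions for illustration; the sample URL and DOI are made up.

```python
import re

# Loose, illustrative patterns; real PDF text will need more cleaning
# (line-wrapped URLs, ligatures, footnote markers, and so on).
URL_RE = re.compile(r"https?://[^\s<>)\]}]+")
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s<>]+")


def find_links(text):
    """Return (urls, dois) found in plain text, trailing punctuation stripped."""
    urls = [u.rstrip(".,;:") for u in URL_RE.findall(text)]
    dois = [d.rstrip(".,;:") for d in DOI_RE.findall(text)]
    return urls, dois


# Made-up sample text standing in for one page of extracted PDF text.
sample = (
    "Code is available at https://github.com/example/project. "
    "See also doi:10.1234/example.5678, which archives the release."
)
urls, dois = find_links(sample)
print(urls)
print(dois)
```

Matching against a list of known software DOIs would then just be a set lookup on the `dois` output.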
@dfm
Member

dfm commented Oct 11, 2014

I downloaded the full dataset onto a machine in the Physics dept. about a year ago. I'll look into running the incremental update.

My intuition is that mining the source would be easier because PDFs are such a pain in the ass and TeX is just text... why do you think it would be harder?

@cranmer

cranmer commented Oct 11, 2014

Hey

I sent out a message to the crew that did the URL link rot study for astrophysics arXiv stuff.

Surprisingly, INSPIRE extracts references from the PDF instead of the source. They say they get better results. I asked for a link to the code.
We can get an rtf from them, but only a small fraction will include code.

Impactstory and ORCID have APIs that explicitly tag code, I think. That will give us a highly curated author's view... a biased list with a strong signal.

Similarly, Zenodo and figshare can have metadata and collections specific to code.

Kyle


@bmcfee
Member Author

bmcfee commented Oct 11, 2014

> My intuition is that mining the source would be easier because PDFs are such a pain in the ass and TeX is just text... why do you think it would be harder?

TeX is just text, but parsing it correctly could be a substantial undertaking. For instance, we can't just mine the .bib files because not all entries get cited. We could just crawl the bits that get compiled down into .bbl, but then we'd be in the business of compiling gigabytes of TeX, which could easily take weeks. Similarly, parsing out links by grepping for http or href could get tricky, especially when dealing with things that are difficult to render in TeX directly (`url/~username/project.html` comes to mind).
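To make the tilde problem concrete, here is a hedged sketch of pulling URLs out of raw TeX without compiling it. The list of tilde spellings and the escape handling are assumptions about what authors commonly write, not an exhaustive treatment, and the sample addresses are made up.

```python
import re

# Common ways a literal "~" gets typeset in TeX (assumed list, not exhaustive).
# Longer forms come first so they are replaced before their prefixes.
TILDE_FORMS = [
    r"\textasciitilde{}",
    r"\textasciitilde",
    r"$\sim$",
    r"\string~",
    r"\~{}",
]

# After normalization, a URL runs until whitespace, a brace, or a backslash.
# This catches bare URLs as well as the arguments of \url{} and \href{}{}.
URL_RE = re.compile(r"https?://[^\s{}\\]+")


def extract_urls(tex):
    """Return URLs found in TeX source, with tildes and simple escapes undone."""
    for form in TILDE_FORMS:
        tex = tex.replace(form, "~")
    # Undo escaped specials such as \_ and \% that often appear in URLs.
    tex = re.sub(r"\\([_%#&])", r"\1", tex)
    return [u.rstrip(".,;:") for u in URL_RE.findall(tex)]


# Made-up sample covering \url, a bare URL with $\sim$, and \href.
sample = r"""
Code: \url{http://example.edu/~alice/project.html}.
Mirror at http://example.edu/$\sim$bob/tool\_v2.html, see also
\href{https://example.org/\~{}carol/pkg}{the package page}.
"""
found = extract_urls(sample)
print(found)
```

Note that this does not strip TeX comments: an unescaped `%` is legal inside `\url{}`, so naive comment removal would truncate percent-encoded links. It also cannot tell a URL tilde from a `~` used as a non-breaking space right after a link, which is the kind of ambiguity that makes source mining tricky.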

PDF could be easier (but lossier), since we're working directly with the rendered output.

@cranmer

cranmer commented Oct 11, 2014

Link rot study is here:
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0104798#s4
They have some discussion of their URL cleaning. Probably some lessons learned there. I asked if they had some code for this. OK, I just found a link to their extracted URLs here:
http://thedata.harvard.edu/dvn/dv/astrocite/faces/study/StudyPage.xhtml?globalId=hdl:10904/10214

I'm thinking it would be good to separate some common arXiv data mining parts into their own repository.

I've also contacted Thorsten Schwander, a buddy of mine who was part of the arXiv team for several years and is a PDF parsing expert for INSPIRE, to see if I can get my hands on code or if he has any words of wisdom.
