Skip to content

๐Ÿ“– Deduplicate and tidy a collection of arXiv.org RSS feeds for easier skimming.

License

Notifications You must be signed in to change notification settings

tomshafer/arxivrss

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

10 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

ArXiv.org RSS Scrubber

The arxivrss package is a simple tool that collectively deduplicates multiple arXiv.org RSS feeds and then reformats the output. The resulting feeds are dumped into an output directory specified using the argument -o. From the help:

$ python -m arxivrss -h
usage: arxivrss.py [-h] -o DIR SUBJ [SUBJ ...]

positional arguments:
  SUBJ

optional arguments:
  -h, --help  show this help message and exit
  -o DIR      output directory

You can execute the package directly via python -m arxivrss. I run the package via crontab on my web server.

$  python -m arxivrss -o output cs.CV cs.CL cs.SI nucl-th stat.AP cs.LG stat.ML
2020-05-19 22:00:44,599 [cs.CV] Collecting subject
2020-05-19 22:00:46,027 [cs.CL] Collecting subject
2020-05-19 22:00:47,360 [cs.SI] Collecting subject
2020-05-19 22:00:48,704 [nucl-th] Collecting subject
2020-05-19 22:00:49,447 [stat.AP] Collecting subject
2020-05-19 22:00:50,282 [cs.LG] Collecting subject
2020-05-19 22:00:52,078 [stat.ML] Collecting subject
2020-05-19 22:00:53,076 [cs.CV] Removing 33 UPDATED articles
2020-05-19 22:00:53,083 [cs.CL] Removing 17 UPDATED articles
2020-05-19 22:00:53,087 [cs.SI] Removing 3 UPDATED articles
2020-05-19 22:00:53,088 [nucl-th] Removing 5 UPDATED articles
2020-05-19 22:00:53,088 [stat.AP] Removing 3 UPDATED articles
2020-05-19 22:00:53,091 [cs.LG] Removing 62 UPDATED articles
2020-05-19 22:00:53,108 [stat.ML] Removing 43 UPDATED articles
2020-05-19 22:00:53,115 [cs.CV] Removing 4 CROSS POSTED articles
2020-05-19 22:00:53,118 [cs.CL] Removing 5 CROSS POSTED articles
2020-05-19 22:00:53,119 [cs.SI] Removing 0 CROSS POSTED articles
2020-05-19 22:00:53,119 [nucl-th] Removing 0 CROSS POSTED articles
2020-05-19 22:00:53,120 [stat.AP] Removing 0 CROSS POSTED articles
2020-05-19 22:00:53,121 [cs.LG] Removing 22 CROSS POSTED articles
2020-05-19 22:00:53,126 [stat.ML] Removing 28 CROSS POSTED articles
2020-05-19 22:00:53,130 [cs.CV] Removing 0 DUPLICATE articles
2020-05-19 22:00:53,131 [cs.CL] Removing 0 DUPLICATE articles
2020-05-19 22:00:53,132 [cs.SI] Removing 0 DUPLICATE articles
2020-05-19 22:00:53,132 [nucl-th] Removing 0 DUPLICATE articles
2020-05-19 22:00:53,133 [stat.AP] Removing 0 DUPLICATE articles
2020-05-19 22:00:53,134 [cs.LG] Removing 9 DUPLICATE articles
2020-05-19 22:00:53,137 [stat.ML] Removing 9 DUPLICATE articles
2020-05-19 22:00:53,139 [cs.CV] Final result: pre 80, post 43; reduction 37 (46.2%)
2020-05-19 22:00:53,139 [cs.CL] Final result: pre 53, post 31; reduction 22 (41.5%)
2020-05-19 22:00:53,140 [cs.SI] Final result: pre 9, post 6; reduction 3 (33.3%)
2020-05-19 22:00:53,140 [nucl-th] Final result: pre 12, post 7; reduction 5 (41.7%)
2020-05-19 22:00:53,140 [stat.AP] Final result: pre 13, post 10; reduction 3 (23.1%)
2020-05-19 22:00:53,141 [cs.LG] Final result: pre 154, post 61; reduction 93 (60.4%)
2020-05-19 22:00:53,141 [stat.ML] Final result: pre 84, post 4; reduction 80 (95.2%)
2020-05-19 22:00:53,143 [cs.CV] Writing to output/cs.CV.xml
2020-05-19 22:00:53,148 [cs.CL] Writing to output/cs.CL.xml
2020-05-19 22:00:53,149 [cs.SI] Writing to output/cs.SI.xml
2020-05-19 22:00:53,150 [nucl-th] Writing to output/nucl-th.xml
2020-05-19 22:00:53,151 [stat.AP] Writing to output/stat.AP.xml
2020-05-19 22:00:53,154 [cs.LG] Writing to output/cs.LG.xml
2020-05-19 22:00:53,156 [stat.ML] Writing to output/stat.ML.xml
2020-05-19 22:00:53,156 [TOTAL] Final result: pre 405, post 162; reduction 243 (60.0%)

About

๐Ÿ“– Deduplicate and tidy a collection of arXiv.org RSS feeds for easier skimming.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages