JanZahalka/dblp_survey

dblp_survey

Iterates over the entire DBLP database of scientific papers, creating CSV(s) of papers from the given year or newer coming from the given conferences/journals, whose title contains any of the given keywords. The CSV output contains the paper title, conference/journal, year, and a URL to the paper. Useful for scientific surveys.
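The three filters described above (year, venue, title keyword) can be sketched as a single predicate. This is a minimal illustration, not the code in dblp_survey.py; the function name paper_matches and its parameters are hypothetical.

```python
def paper_matches(title, venue, year, min_year, venues, keywords):
    """Return True if a paper passes the year, venue, and keyword filters.

    Hypothetical helper for illustration only; not part of the repo.
    """
    # Keep only papers from min_year or newer.
    if year < min_year:
        return False
    # Venue names are compared exactly, as they appear in the DBLP XML.
    if venue not in venues:
        return False
    # Case-insensitive substring match: any one keyword is enough.
    lowered = title.lower()
    return any(keyword.lower() in lowered for keyword in keywords)
```

For example, paper_matches("Evaluating Retrieval Systems", "SIGIR", 2019, 2017, {"SIGIR"}, ["evaluat"]) returns True, while the same call with year 2015 returns False.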

Usage

  1. Install the required Python packages with pip install -r requirements.txt.
  2. Download the DBLP database from this link and place it in the root directory of the repo. If the link is dead, let me know; going to dblp.org->XML Data->"raw dblp data in a single XML file" should also work. Download both the dblp.xml.gz file (unpack it to dblp.xml) and the dblp.dtd file.
  3. Set up the list of conferences/journals the papers should come from. The script expects it at dblp_survey/inputs/conf_journ.csv (an example is provided with the repo). Use one entry per line; each entry must exactly match the venue name used in the DBLP XML database. To find that name, open the XML database in a text editor that can handle large files, search for a paper that you are sure comes from the desired conference/journal, and copy the text between the <journal> (journal papers) or <booktitle> (conference papers) tags.
  4. Set up the list of keywords the script will search for in the titles. The script expects it at dblp_survey/inputs/keywords.csv (an example is provided with the repo). Again, use one entry per line. The script performs a case-insensitive search for each keyword exactly as written, so reducing the words you are searching for to their stems is strongly recommended. For example, if you are interested in papers on evaluation, use evaluat as a keyword, as that matches evaluation, evaluate, evaluating, etc.
  5. Run the script using python dblp_survey.py <year> --split <split_mode>. <year> is a mandatory argument: the oldest year from which papers will be considered (e.g., 2017 will consider papers from 2017 to now). --split <split_mode> is an optional parameter with two possible values: none does not split the papers and outputs a single CSV at dblp_survey/outputs/dblp_survey.csv, while per-venue outputs one CSV for each conference/journal listed in conf_journ.csv. The default value is per-venue.
  6. Each row of the output CSV(s) contains the paper title, conference/journal, year, and a link to the paper. The links are clickable if you import the CSV into Google Sheets and should be clickable in Excel; in LibreOffice they do not seem to be clickable.
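If no suitable text editor is at hand for step 3, the venue names can also be pulled out of dblp.xml with a short script. The following is a rough sketch, not part of the repo: the names count_venues and search_venues are hypothetical, and it assumes each <journal> or <booktitle> element sits on a single line and that the file can be read as Latin-1. XML entities such as &amp; are left exactly as they appear in the raw file.

```python
import re
from collections import Counter

# Matches <journal>...</journal> or <booktitle>...</booktitle> on one line.
VENUE_TAG = re.compile(r"<(journal|booktitle)>(.*?)</\1>")


def count_venues(lines):
    """Count how often each venue name occurs in an iterable of XML lines."""
    counts = Counter()
    for line in lines:
        for _tag, name in VENUE_TAG.findall(line):
            counts[name] += 1
    return counts


def search_venues(path, needle, limit=20):
    """List venue names containing `needle` (case-insensitive), most frequent first."""
    # Assumption: Latin-1 decoding; it accepts any byte, so reading never fails.
    with open(path, encoding="latin-1") as handle:
        counts = count_venues(handle)
    needle = needle.lower()
    hits = [(name, n) for name, n in counts.items() if needle in name.lower()]
    return sorted(hits, key=lambda item: -item[1])[:limit]
```

Calling search_venues("dblp.xml", "multimedia") streams the file line by line, so it does not need to hold the whole database in memory. The names it returns are candidates for conf_journ.csv; check them against an actual paper entry before relying on them.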

Happy surveying!
