Iterates over the entire DBLP database of scientific papers, creating CSV(s) of papers from the given year or newer coming from the given conferences/journals, whose title contains any of the given keywords. The CSV output contains the paper title, conference/journal, year, and a URL to the paper. Useful for scientific surveys.
- Install the required Python packages with `pip install -r requirements.txt`.
- Download the DBLP database to the root directory of the repo from this link (if it's dead, let me know; going to dblp.org -> XML Data -> "raw dblp data in a single XML file" should work). Download both the `dblp.xml.gz` file (unpack it to `dblp.xml`) and the `dblp.dtd` file.
- Set up the list of conferences/journals the papers should come from. The script expects it at `dblp_survey/inputs/conf_journ.csv` (an example is provided with the repo). Use one entry per line; the entries must exactly match the ones in the DBLP XML database. To find those, load the XML database in a text editor that can handle large files, search for a paper that you are sure comes from the desired conference/journal, and record what you see between the `<journal>` (journal papers) or `<booktitle>` (conference papers) tags.
- Set up the list of keywords the script will search for in the titles. The script expects it at `dblp_survey/inputs/keywords.csv` (an example is provided with the repo). Again, use one entry per line. The script performs a case-insensitive search for the exact string and does no stemming of its own, so stemming the words you are searching for is strongly recommended. For example, if you are interested in papers on evaluation, it's a good idea to use `evaluat` as a keyword, as that matches `evaluation`, `evaluate`, `evaluating`, etc.
- Run the script using `python dblp_survey.py <year> --split <split_mode>`. `<year>` is a mandatory argument: it is the oldest year from which papers will be considered (e.g., `2017` will consider papers from 2017 to now). `--split <split_mode>` is an optional parameter with two possible values: `none` does not split the papers and outputs a single CSV at `dblp_survey/outputs/dblp_survey.csv`; `per-venue` outputs one CSV for each conference/journal you specified in `dblp_survey/inputs/conf_journ.csv`. The default value is `per-venue`.
- For each paper, the output CSV(s) contain the title, conference/journal, year, and a link. The links are clickable if you import the CSV into Google Sheets and should be clickable in Excel; in LibreOffice they do not seem to be clickable.
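
To make the filtering rules above concrete, here is a minimal sketch of how the three criteria (year, venue, keyword) might combine. It is not the code from `dblp_survey.py`: the `matches` function and the record layout are hypothetical, and it assumes the keyword search is a plain substring test on the lowercased title.

```python
def matches(record, min_year, venues, keywords):
    """Return True if a paper record passes all three filters (illustrative only)."""
    title = record["title"].lower()
    return (
        record["year"] >= min_year                        # given year or newer
        and record["venue"] in venues                     # exact, case-sensitive venue match
        and any(kw.lower() in title for kw in keywords)   # case-insensitive substring match
    )


# Hypothetical records; real entries come from dblp.xml.
papers = [
    {"title": "Evaluating Neural Rankers", "venue": "SIGIR", "year": 2019},
    {"title": "An Evaluation of Hash Joins", "venue": "VLDB", "year": 2015},
    {"title": "Fast Graph Search", "venue": "SIGIR", "year": 2020},
    {"title": "How to Evaluate Recommenders", "venue": "sigir", "year": 2021},
]

selected = [
    p["title"]
    for p in papers
    if matches(p, 2017, {"SIGIR", "VLDB"}, ["evaluat"])
]
print(selected)
```

Only the first paper is selected: the second is too old, the third has no matching keyword, and the fourth fails because `sigir` does not exactly match the venue name `SIGIR`. This is why the venue entries need to be copied verbatim from the XML, while keywords can be written in any case.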
Happy surveying!