Mining Service Contribution from Web-Based Journal Listings

Setup

Python version: 3.6

Installation of pdftotext dependencies

https://pypi.org/project/pdftotext/

Installation of python packages:

pip install -r requirements.txt

Journal Crawling

Run Crawler

python crawl.py <CRAWL_TYPE>

Additional crawl configuration is located in cfp_crawl/config.py and journal_crawl/config.py. These files also specify the crawl log and data save directories.

<CRAWL_TYPE> options:

wikicfp_latest crawls details of the most recent conferences on the homepage of wikicfp at http://www.wikicfp.com

wikicfp_all traverses and scrapes information from every conference series on wikicfp, starting from http://www.wikicfp.com/cfp/series?t=c&i=A

conf_crawl assumes a database already populated with basic conference information obtained from either wikicfp_latest or wikicfp_all, and proceeds to store the HTML information of the specified conferences. Crawls for the directory specified in database/database_config.py.

journal_all traverses and scrapes information from every journal available from Springer/ACM

springer_crawl assumes a database already populated with basic journal URL information obtained from journal_all, and proceeds to store the HTML information of each Springer journal. Crawls for the directory specified in database/database_config.py.

acm_crawl assumes a database already populated with basic journal URL information obtained from journal_all, and proceeds to store the HTML information of each ACM journal. Crawls for the directory specified in database/database_config.py.
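
The options above imply an ordering: some crawls expect the database to have been populated by an earlier one. The following is a minimal sketch of how a command-line entry point might validate <CRAWL_TYPE> and report those prerequisites. It is not the actual contents of crawl.py; the names CRAWL_TYPES, PREREQUISITES, build_parser and parse_crawl_type are hypothetical, and only the option names and their dependencies are taken from this README.

```python
import argparse

# Option names as listed in this README.
CRAWL_TYPES = [
    "wikicfp_latest",
    "wikicfp_all",
    "conf_crawl",
    "journal_all",
    "springer_crawl",
    "acm_crawl",
]

# Crawls that assume an already populated database, mapped to the
# crawl(s) that can populate it (any one of the listed crawls suffices).
PREREQUISITES = {
    "conf_crawl": ["wikicfp_latest", "wikicfp_all"],
    "springer_crawl": ["journal_all"],
    "acm_crawl": ["journal_all"],
}


def build_parser():
    """Build a parser that accepts exactly one known crawl type."""
    parser = argparse.ArgumentParser(description="Run one crawl type.")
    parser.add_argument(
        "crawl_type",
        metavar="CRAWL_TYPE",
        choices=CRAWL_TYPES,
        help="one of: " + ", ".join(CRAWL_TYPES),
    )
    return parser


def parse_crawl_type(argv):
    """Return the validated crawl type from a list of CLI arguments."""
    return build_parser().parse_args(argv).crawl_type


def prerequisites_for(crawl_type):
    """Return the crawls that can populate the database for crawl_type."""
    return PREREQUISITES.get(crawl_type, [])
```

With this sketch, parse_crawl_type(["conf_crawl"]) returns "conf_crawl", prerequisites_for("conf_crawl") lists the two wikicfp crawls, and an unknown option makes argparse exit with a usage message.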

Notes on journal_all/ conf_crawl

Selenium chromedriver is needed to better simulate organic access to conference sites (e.g. waiting for JavaScript elements to load). The chromedriver must match your Chrome version and can be downloaded from https://chromedriver.chromium.org/. Move the executable into this repo, or to the location specified in cfp_crawl/config.py and journal_crawl/config.py.
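
A version mismatch between Chrome and chromedriver is a common cause of startup failures, so it can help to compare the two before crawling. The sketch below is not part of this repo. It assumes you have already captured the text printed by each executable's --version flag, and it treats equal major versions as a match, which is an assumption of this example. The helper names and the sample strings in the usage note are illustrative.

```python
import re


def major_version(version_output):
    """Extract the major version from a `--version` output string."""
    # Find the first dotted version number, e.g. "114.0.5735.90".
    match = re.search(r"(\d+)\.\d+(?:\.\d+)*", version_output)
    if match is None:
        raise ValueError("no version number found in: {!r}".format(version_output))
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """Return True when both outputs report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)
```

For example, versions_match("Google Chrome 114.0.5735.198", "ChromeDriver 114.0.5735.90") evaluates to True, while pairing the same Chrome string with "ChromeDriver 113.0.5672.63" evaluates to False.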
