This repo scrapes certain websites for their fiction and compiles the chapters into EPUB files, using Selenium for scraping and Calibre for EPUB packing.
- Selenium
sudo -H pip install selenium
- Gecko Driver Firefox
wget https://github.com/mozilla/geckodriver/releases/download/v0.19.0/geckodriver-v0.19.0-linux64.tar.gz
tar -xzvf geckodriver-v0.19.0-linux64.tar.gz
rm -rf geckodriver-v0.19.0-linux64.tar.gz
sudo ln -sf "$PWD/geckodriver" /usr/bin/
- PhantomJS Driver
sudo apt-get update
sudo apt-get install build-essential chrpath libssl-dev libxft-dev
sudo apt-get install libfreetype6 libfreetype6-dev
sudo apt-get install libfontconfig1 libfontconfig1-dev
export PHANTOM_JS="phantomjs-2.5.0-beta-linux-ubuntu-xenial-x86_64"
wget https://bitbucket.org/ariya/phantomjs/downloads/$PHANTOM_JS.tar.gz
sudo tar xvzf $PHANTOM_JS.tar.gz
rm -rf $PHANTOM_JS.tar.gz
sudo mv $PHANTOM_JS /usr/local/share
sudo ln -sf /usr/local/share/$PHANTOM_JS/bin/phantomjs /usr/local/bin
- Calibre
sudo -v && wget -nv -O- https://download.calibre-ebook.com/linux-installer.py | sudo python -c "import sys; main=lambda:sys.stderr.write('Download failed\n'); exec(sys.stdin.read()); main()"
- Java
sudo apt-get update
sudo apt-get install default-jdk
- Spark
sudo -H pip install pyspark
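After installing, it can help to confirm that each tool is actually on the PATH before running the drivers. The following is a minimal sketch, not part of this repo. The executable names are assumptions based on the install steps above (`ebook-convert` ships with Calibre, `spark-submit` with the pyspark package), and `find_tools` and `missing_tools` are hypothetical helper names.

```python
import shutil

# Executable names assumed from the install steps above; adjust if yours differ.
REQUIRED_TOOLS = ["geckodriver", "phantomjs", "ebook-convert", "java", "spark-submit"]


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the names that could not be found on PATH, in the given order."""
    found = find_tools(names)
    return [name for name in names if found[name] is None]


if __name__ == "__main__":
    # Print one line per tool so a missing install is easy to spot.
    for name, path in find_tools().items():
        print(f"{name:15} {path or 'NOT FOUND'}")
```

Anything reported as `NOT FOUND` most likely means the matching install step did not complete or the symlink points to the wrong place.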
Regular Version
./driver.py urlOfFirstChapter fictionName
Spark Version
spark-submit spark_driver.py urlOfLatestChapter fictionName
On a 4-core computer, the Spark version took about 40% less time than the regular version in a 50-chapter test and about 60% less time over 1300 chapters. Note that the number of partitions should be increased or decreased to match the number of cores available.
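One way to pick a partition count is sketched below. This is an illustration under assumptions, not how `spark_driver.py` necessarily does it: `choose_partitions`, `scrape_in_parallel`, and `scrape_one` are hypothetical names, and the default of two partitions per core is only a rule of thumb to tune from.

```python
import os


def choose_partitions(num_chapters, cores=None, per_core=2):
    """Pick a partition count: a few partitions per core, never more than chapters."""
    if num_chapters <= 0:
        return 1
    # Fall back to the local CPU count when the caller does not pass one.
    cores = cores or os.cpu_count() or 1
    return max(1, min(num_chapters, cores * per_core))


def scrape_in_parallel(sc, chapter_urls, scrape_one):
    """Distribute chapter_urls over Spark and apply scrape_one to each.

    sc is an existing pyspark SparkContext; scrape_one is a hypothetical
    function that takes a URL and returns that chapter's content.
    """
    n = choose_partitions(len(chapter_urls))
    return sc.parallelize(chapter_urls, numSlices=n).map(scrape_one).collect()
```

With 4 cores and the default `per_core=2`, a 50-chapter run would use 8 partitions; raising `per_core` may help when individual chapters take very uneven amounts of time to load.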
- Splitting epubs into books based on URLs
- Handling chapter titles
- Take out Calibre and Spark