rhett-g/novels

novels

This repo scrapes fiction from certain websites and compiles it into EPUB format, using Selenium for scraping and Calibre for EPUB packaging.
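As a rough illustration of the packaging half of this pipeline, here is a minimal sketch that joins already-scraped chapters into one HTML file and hands it to Calibre's `ebook-convert` command-line tool. The function names (`chapters_to_html`, `build_convert_command`, `pack_epub`) and the paragraph-splitting rule are assumptions made for this example; they are not taken from `driver.py`, and the Selenium scraping step is left out.

```python
import html
import subprocess


def chapters_to_html(title, chapters):
    """Join (chapter_title, chapter_text) pairs into a single HTML document."""
    body = []
    for chapter_title, text in chapters:
        body.append("<h1>{}</h1>".format(html.escape(chapter_title)))
        # Assumption: blank-line-separated blocks are paragraphs.
        for block in text.split("\n\n"):
            if block.strip():
                body.append("<p>{}</p>".format(html.escape(block.strip())))
    return (
        '<html><head><meta charset="utf-8"><title>{}</title></head><body>\n'
        "{}\n</body></html>"
    ).format(html.escape(title), "\n".join(body))


def build_convert_command(html_path, epub_path, title):
    """Build the ebook-convert invocation: input file, output file, then options."""
    return ["ebook-convert", html_path, epub_path, "--title", title]


def pack_epub(html_path, epub_path, title):
    """Run Calibre's converter; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_convert_command(html_path, epub_path, title), check=True)
```

In use, one would write the string returned by `chapters_to_html` to a file such as `fictionName.html` and then call `pack_epub("fictionName.html", "fictionName.epub", "fictionName")`. The output format is inferred by `ebook-convert` from the `.epub` extension.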

Requirements

  1. Selenium
sudo -H pip install selenium
  2. Gecko Driver Firefox
wget https://github.com/mozilla/geckodriver/releases/download/v0.19.0/geckodriver-v0.19.0-linux64.tar.gz
tar -xzvf geckodriver-v0.19.0-linux64.tar.gz
rm -rf geckodriver-v0.19.0-linux64.tar.gz
sudo ln -sf "$(pwd)/geckodriver" /usr/bin/geckodriver
  3. PhantomJS Driver
sudo apt-get update
sudo apt-get install build-essential chrpath libssl-dev libxft-dev
sudo apt-get install libfreetype6 libfreetype6-dev
sudo apt-get install libfontconfig1 libfontconfig1-dev
export PHANTOM_JS="phantomjs-2.5.0-beta-linux-ubuntu-xenial-x86_64"
wget https://bitbucket.org/ariya/phantomjs/downloads/$PHANTOM_JS.tar.gz
sudo tar xvzf $PHANTOM_JS.tar.gz
rm -rf $PHANTOM_JS.tar.gz
sudo mv $PHANTOM_JS /usr/local/share
sudo ln -sf /usr/local/share/$PHANTOM_JS/bin/phantomjs /usr/local/bin
  4. Calibre
sudo -v && wget -nv -O- https://download.calibre-ebook.com/linux-installer.py | sudo python -c "import sys; main=lambda:sys.stderr.write('Download failed\n'); exec(sys.stdin.read()); main()"
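After installing, it can help to confirm that the command-line tools are actually reachable on `PATH`. The following is a small sketch using only the standard library; the tool list is an assumption based on the steps above (`ebook-convert` is the converter that ships with Calibre), and `missing_tools` is a hypothetical helper, not part of this repo.

```python
import shutil

# Executables the steps above are expected to put on PATH (assumed list).
REQUIRED_TOOLS = ("geckodriver", "phantomjs", "ebook-convert")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]
```

Calling `missing_tools()` returns an empty list when everything is installed; any names it returns point at the step that needs to be repeated.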

Requirements for Spark

  1. Java
sudo apt-get update
sudo apt-get install default-jdk
  2. Spark
sudo -H pip install pyspark

How to run

Regular Version

./driver.py urlOfFirstChapter fictionName

Spark Version

spark-submit spark_driver.py urlOfLatestChapter fictionName

Spark vs Regular

On a 4-core computer, the Spark version runs in about 40% less time than the regular version on a 50-chapter test (roughly a 1.7x speedup) and about 60% less time on 1,300 chapters (a 2.5x speedup). The number of partitions should be increased or decreased to suit the number of cores available.
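One way to pick a partition count is sketched below. It assumes a rule of thumb of a couple of partitions per core, capped by the number of chapters; the helper names (`choose_partitions`, `split_into_partitions`, `scrape_with_spark`) and the `fetch_chapter` callback are hypothetical and not taken from `spark_driver.py`. `split_into_partitions` is only a pure-Python stand-in to show how chapter URLs end up divided; the Spark call itself is `SparkContext.parallelize` with its `numSlices` argument.

```python
import os


def choose_partitions(num_chapters, cores=None, per_core=2):
    """Pick a partition count: a small multiple of the core count,
    never more than the number of chapters and never less than 1."""
    cores = cores or os.cpu_count() or 1
    return max(1, min(num_chapters, cores * per_core))


def split_into_partitions(items, num_partitions):
    """Split items into num_partitions contiguous, near-equal chunks."""
    size, extra = divmod(len(items), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def scrape_with_spark(urls, fetch_chapter, num_partitions):
    """Distribute chapter URLs over num_partitions and fetch them in parallel.
    pyspark is imported here so the helpers above work without it."""
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    return sc.parallelize(urls, numSlices=num_partitions).map(fetch_chapter).collect()
```

On a 4-core machine with the default `per_core=2`, `choose_partitions(1300)` gives 8 partitions; the best value in practice depends on how long each chapter takes to fetch, so it is worth timing a few settings.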

Desired Features

  1. Splitting epubs into books based on urls

  2. Handling chapter titles

  3. Removing the Calibre and Spark dependencies

Supported Sites

Planned Sites
