A tool for crawling documents from any website, filtered by file extension or regex pattern. (Supports NJUSE Moodle.)

NintenSAGA/DocCrawler

DocCrawler

Updates

2022.09.21:

  • Now Moodle Crawler can download videos and folders

  • Now you can exclude particular courses in Moodle Crawler

  • Now Moodle Crawler supports login by scanning a WeChat QR code

  • Download path is changed to .../DocCrawler/Download

  • Now the crawler will show whether a file was updated

Setup

python = "^3.10"
rich = "^12.5.1"
PyYAML = "^6.0"
bs4 = "^0.0.1"
requests = "^2.28.1"
Pillow = "^9.2.0"
rarfile = "^4.0"
html5lib = "^1.1"

Dependencies are managed by Poetry. You can either install them manually or, if you have Poetry installed, run:

.../DocCrawler> poetry install
.../DocCrawler> poetry run python ./doccrawler/general_crawler.py # General
.../DocCrawler> poetry run python ./doccrawler/moodle_crawler.py # Moodle

Usage

DocCrawler contains two tools.

GeneralCrawler

Crawls documents from any website, filtered by file extension or regex pattern.

The default output directory is .../DocCrawler/Download
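The filtering idea can be sketched as follows. This is a minimal illustration, not the crawler's actual code: the function name `link_matches` and the example.com URLs are hypothetical, and the rule that a link is kept when it matches either the extension list or the regex is an assumption.

```python
import re
from posixpath import splitext
from urllib.parse import urlparse


def link_matches(url, extensions=None, regex=None, match_all=False):
    """Return True if `url` passes the given filters (hypothetical helper)."""
    if match_all:
        return True
    # Look at the extension of the URL path only, so query strings are ignored.
    suffix = splitext(urlparse(url).path)[1].lstrip(".").lower()
    if extensions and suffix in {e.lower().lstrip(".") for e in extensions}:
        return True
    # Otherwise try the regex against the whole URL.
    if regex and re.search(regex, url):
        return True
    return False


links = [
    "https://example.com/slides/lec01.pptx",
    "https://example.com/hw/hw1.pdf?version=2",
    "https://example.com/index.html",
]
selected = [u for u in links if link_matches(u, extensions=["pdf", "pptx"])]
```

Here `selected` keeps the `.pptx` and `.pdf` links and drops `index.html`.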

You can run it with CLI arguments:

usage: general_crawler.py [-h] [-u URL] [-r REGEX] [-e EX [EX ...]]
                          [-a] [-n] [-d DIR]
                          [-o] [-U] [-z]

options:
  -h, --help            show this help message and exit
  -u URL, --url URL     Target url
  -r REGEX, --regex REGEX
                        Target regex
  -e EX [EX ...], --ex EX [EX ...]
                        Target extensions
  -a, --all             Match all
  -n, --name            Use tag text as filename
  -d DIR, --dir DIR     Output directory
  -o, --order           Add order prefix
  -U, --update          Update existed file
  -z, --unzip           Unzip compressed files
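
To show how these options combine, here is a rough `argparse` reconstruction of the interface listed above. It is rebuilt from the help text only and is not copied from `general_crawler.py`; treating `-a`, `-n`, `-o`, `-U` and `-z` as plain boolean switches is an assumption, and the URL is a placeholder.

```python
import argparse


def build_parser():
    """Rebuild the documented options (a sketch, not the project's own parser)."""
    parser = argparse.ArgumentParser(prog="general_crawler.py")
    parser.add_argument("-u", "--url", help="Target url")
    parser.add_argument("-r", "--regex", help="Target regex")
    parser.add_argument("-e", "--ex", nargs="+", help="Target extensions")
    parser.add_argument("-a", "--all", action="store_true", help="Match all")
    parser.add_argument("-n", "--name", action="store_true",
                        help="Use tag text as filename")
    parser.add_argument("-d", "--dir", help="Output directory")
    parser.add_argument("-o", "--order", action="store_true",
                        help="Add order prefix")
    parser.add_argument("-U", "--update", action="store_true",
                        help="Update existed file")
    parser.add_argument("-z", "--unzip", action="store_true",
                        help="Unzip compressed files")
    return parser


# Parse an example command line: two extensions, tag text as filename, order prefix.
args = build_parser().parse_args(
    ["-u", "https://example.com/course/", "-e", "pdf", "pptx", "-n", "-o"]
)
```

Because `-e` accepts one or more values, the extension list ends at the next flag, so `-e pdf pptx -n` yields two extensions followed by the `--name` switch.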

Or run it without any arguments to enter the interactive setup:

(Screenshot: interactive setup)

Configs

The config file is .../DocCrawler/general_config.yaml, using YAML syntax.

You can add presets to the configs in the following manner:

websites:
  $Preset_name$:
    $arg0$: ...
    $arg1$: ...
    ...

Example:

websites:
  CAT Assignments:
    dir: "~/Library/CloudStorage/OneDrive-Personal/CAT - Concurrency-Algorithms and Theories/Assignments"
    ex:
      - pdf
    name: false
    url: https://h*******g.github.io/teaching/concurrency/
  CAT Slides:
    dir: "~/Library/CloudStorage/OneDrive-Personal/CAT - Concurrency-Algorithms and Theories/Slides"
    ex:
      - ppt
      - pptx
    name: false
    url: https://h*******g.github.io/teaching/concurrency/
  FLA Slides:
    dir: "~/Library/CloudStorage/OneDrive-Personal/FLA - Formal Languages and Automata/Slides"
    ex:
      - ppt
      - pptx
    name: true
    order: true
    url: https://c*******n/bulei/FLA22.html
  SPA - Slides:
    dir: "~/Library/CloudStorage/OneDrive-Personal/SPA - Static Program Analysis/Slides"
    ex:
      - pdf
    name: false
    order: true
    url: http://t*******b.net/lectures.html
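
The preset keys in the example (`url`, `dir`, `ex`, `name`, `order`) match the long option names of the CLI. Assuming that correspondence holds for every option, a preset could be translated into an argument list roughly as follows. The helper `preset_to_argv` and the sample values are hypothetical; how the tool itself consumes presets is not documented here.

```python
def preset_to_argv(preset):
    """Translate one preset mapping into CLI-style arguments (hypothetical helper)."""
    value_flags = ("url", "regex", "dir")                       # options taking one value
    switch_flags = ("all", "name", "order", "update", "unzip")  # boolean switches
    argv = []
    for key in value_flags:
        if preset.get(key) is not None:
            argv += [f"--{key}", str(preset[key])]
    if preset.get("ex"):
        # --ex takes one or more extensions.
        argv += ["--ex", *preset["ex"]]
    for key in switch_flags:
        if preset.get(key):
            argv.append(f"--{key}")
    return argv


# A preset shaped like the YAML entries above, with placeholder values.
slides_preset = {
    "dir": "~/Slides/FLA",
    "ex": ["ppt", "pptx"],
    "name": True,
    "order": True,
    "url": "https://example.com/FLA22.html",
}
argv = preset_to_argv(slides_preset)
```

A preset with `name: false` simply omits the `--name` switch.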


MoodleCrawler

Works with the new NJU SE Moodle website. It automatically scans all the courses you have joined and downloads their resources.

You must provide a valid cookie string the first time you run it, or whenever the previously saved cookies have become invalid.
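
A cookie string copied from a browser is a single `name=value; name=value` line. As a minimal sketch, it can be split into a dictionary with the standard library. The helper name `parse_cookie_string` and the cookie names and values are made up for illustration, and this is not necessarily how MoodleCrawler handles the string internally.

```python
from http.cookies import SimpleCookie


def parse_cookie_string(raw):
    """Split a browser-style cookie header into a {name: value} dict."""
    cookie = SimpleCookie()
    cookie.load(raw)
    return {name: morsel.value for name, morsel in cookie.items()}


# Placeholder cookie names and values, not real Moodle credentials.
cookies = parse_cookie_string("MoodleSession=abc123; MOODLEID1_=xyz")
```

A dictionary of this shape can be passed to `requests` through its `cookies=` parameter.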


Configs

The config file is .../DocCrawler/general_config.yaml. It uses YAML syntax and is generated automatically.

You can edit the configs in the following manner:

moodle:
  cookies: ...
  courses:
    $CourseID$: # Generated
      dir: ...
      my_args: [$arg1$, $arg2$, ...]
      name: ... # Generated
      exclude: ... # True or False
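
To illustrate the `exclude` field, the sketch below picks out the courses that would still be crawled from a config of the shape shown above. The function `active_courses`, the course IDs and the names are hypothetical, and treating a missing `exclude` key as `False` is an assumption.

```python
def active_courses(config):
    """Return the courses whose `exclude` flag is not set (hypothetical helper)."""
    courses = config.get("moodle", {}).get("courses", {})
    return {
        course_id: course
        for course_id, course in courses.items()
        if not course.get("exclude", False)
    }


# Sample data following the documented layout; IDs and names are made up.
config = {
    "moodle": {
        "cookies": "...",
        "courses": {
            "101": {"dir": "~/Courses/A", "name": "Course A", "exclude": False},
            "102": {"dir": "~/Courses/B", "name": "Course B", "exclude": True},
            "103": {"dir": "~/Courses/C", "name": "Course C"},
        },
    }
}
to_crawl = active_courses(config)
```

With this sample data, courses `101` and `103` remain and `102` is skipped.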
