This package is used to generate documentation for the strands project in the form of a readthedocs page.
```bash
sudo apt-get install pandoc
pip install requests pypandoc
```

The doc_scraper.py script is used to scrape documentation from specified GitHub
repositories and dataset webpages. It only needs to be run if the documentation
in one of those locations has changed; you don't need to run it if you're just
making changes in this repository.
```bash
python scripts/doc_scraper.py
```

On the first run, an OAuth header for GitHub will be generated, which allows the
script to make more requests. By default only public repositories will be
scraped, but you can also scrape private repositories using the `--private` flag.
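For example, to include the organisation's private repositories in the scrape (assuming your GitHub credentials grant access to them):

```bash
python scripts/doc_scraper.py --private
```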
The script will then download all repositories in the organisation, excluding
those specified in `conf/conf.yaml`. You can also exclude README files that
match specific strings on a per-repository basis, by adding a list below the
repo name in the `ignore_repos` list.
For example, the following would ignore the whole strands_utils repository:

```yaml
ignore_repos:
- strands_utils
```

But this would only ignore files which contain the string trash_file or bad_readme:

```yaml
ignore_repos:
- strands_utils:
  - trash_file
  - bad_readme
```

You can use a different config by passing a file to the `--conf` flag; it should
contain the same keys as the one in the conf directory. Packages with a wiki
page will also have those cloned and added to the docs directory. You can ignore
wikis using the `--nowiki` flag.
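For instance, to run the scraper with a different config and without cloning any wikis (the filename here is just a placeholder):

```bash
python scripts/doc_scraper.py --conf my_conf.yaml --nowiki
```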
With the `--datasets` flag, the scraper will go through the dataset URLs given in
`datasets/datasets.yaml` and download the HTML pages specified there, converting
them to markdown. Images on the pages will also be downloaded to the
`datasets/images` directory.
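To run the dataset scrape:

```bash
python scripts/doc_scraper.py --datasets
```

The HTML-to-markdown conversion is the kind of job pypandoc handles, which is why both pandoc and pypandoc are installed above. Below is a minimal sketch of that step, with a placeholder URL and filenames; the actual script may structure this differently.

```python
import requests
import pypandoc

# Fetch a dataset page and convert it to markdown, mirroring the
# conversion step the scraper performs. The URL and filenames here
# are illustrative placeholders only.
url = "https://example.org/dataset_page.html"
html = requests.get(url).text

# pypandoc wraps the system pandoc binary, which is why pandoc must
# be installed via apt-get in addition to the pip packages.
markdown = pypandoc.convert_text(html, "md", format="html")

with open("dataset_page.md", "w") as f:
    f.write(markdown)
```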
The documentation is monitored by readthedocs, and any changes in the master branch should be visible on the website after a short time.