Skip to content

Commit

Permalink
update github action config
Browse files Browse the repository at this point in the history
  • Loading branch information
zzeppozz committed Nov 6, 2023
1 parent 97558a3 commit 4008780
Show file tree
Hide file tree
Showing 11 changed files with 202 additions and 33 deletions.
25 changes: 2 additions & 23 deletions .github/workflows/build_sphinx_docs.yml
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
name: Sphinx to GitHub Pages
# Builds documentation residing at https://biotaphy.github.io/tutorials/
# Builds documentation residing at https://lifemapper.github.io/bison/
on:
push:
tags:
Expand All @@ -15,7 +15,7 @@ jobs:
- name: Install dependencies
run: pip install sphinx sphinx_rtd_theme myst_parser
- name: Sphinx build
run: sphinx-build doc _build
run: sphinx-build _sphinx_config _build
- name: Deploy to GitHub Pages
uses: peaceiris/actions-gh-pages@v3
if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
Expand All @@ -24,24 +24,3 @@ jobs:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: _build/
force_orphan: true

# steps:
# - name: Checkout
# uses: actions/checkout@master
# with:
# fetch-depth: 0
# - name: Install documentation dependencies
# #uses: py-actions/py-dependency-install@v2
# #with:
# # path: "requirements-docs.txt"
# run: |
# pip install -r requirements-docs.txt
# - name: Build and Commit
# uses: sphinx-notes/[email protected]
# with:
# documentation_path: _sphinx_config/
# - name: Push changes to GH Pages branch
# uses: ad-m/github-push-action@master
# with:
# github_token: ${{ secrets.GITHUB_TOKEN }}
# branch: gh-pages
File renamed without changes.
Original file line number Diff line number Diff line change
@@ -1,7 +1,5 @@
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)

==================
Year 3, 2021-2022 Data processing
Process Data Year 3, 2021-2022
==================

--------------
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
===============
Lifemapper-BISON processing
===============
==================
Process Data Year 4, 2022-2023
==================

Processing consists of 6 unique steps, each initiated with the process_gbif.py script
and 2 arguments: command and a parameter file.
Expand Down Expand Up @@ -75,8 +75,8 @@ Summarize annotated GBIF occurrence records (each subset file), by:
* scientific name, species name (for convenience in final aggregation outputs)
* count

Then summarize the summaries into a single file, and aggregate summary into files of
species and counts for each region:
Then summarize the subset summaries into a single file, and aggregate single summary
into files of species and counts for each region:

::

Expand Down
Empty file.
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
===============
Input data
Data Preparation
===============

2023, Year 4 SOW specifies:
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
===============
LMBison Preparation
Installation
===============

LMBison can be run either locally on a powerful machine with a large amount of storage,
Expand Down
Empty file.
File renamed without changes.
192 changes: 192 additions & 0 deletions _sphinx_config/pages/yr5_inputs.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,192 @@
===============
Data Preparation
===============

2023, Year 4 SOW specifies:
* US Registry of Introduced and Invasive Species
* GBIF occurrence data from the US with coordinates
* US Census state and county boundaries
* American Indian and Alaskan Native Land Area Representations (AIAN‐LAR)
* US Federal Protected Areas (US‐PAD)
* Summarize (count and proportion) regions by species name/RIIS

Data inputs may be updated regularly, so constants in some files may change with the
updates. Below are constants and their file locations that should be checked and
possibly modified anytime input data is updated.

The US-PAD dataset proved unsupportable in any configuration tried so far. More
information is below under **Protected Areas Database**.

Pre-Processing
----------------

Currently, much of our input data (GBIF, census county/state and AIANNH) are in
EPSG:4326, using decimal degrees. The DOI dataset is in NAD_1983_Albers/EPSG:6269, and
the PAD datasets are in USA_Contiguous_Albers_Equal_Area_Conic_USGS_version/EPSG:9822.
These, and possibly other updated datasets must be projected to EPSG:4326 before
intersecting points and annotating records. A sample script is in `project_doi_pad.sh
<https://github.com/lifemapper/bison/tree/main/bison/data/project_doi_pad.sh>`_

USGS may choose to change the geospatial regions for aggregation. If so, the REGION
class in `constants.py
<https://github.com/lifemapper/bison/tree/main/bison/common/constants.py>`_
must be changed, and code changed slightly. Only the county/state data is required for
matching RIIS records to occurrence records.

USGS RIIS data
----------------

US-RIIS V2.0, November 2022, available at https://doi.org/10.5066/P9KFFTOD
webpage: https://www.sciencebase.gov/catalog/item/62d59ae5d34e87fffb2dda99

US-RIIS records consist of a list of species and the areas in which they are considered
Introduced or Invasive. Any other species/region combinations encountered will be
identified as "presumed-native"

**Data location**: The RIIS data may be placed in any accessible directory, but must
be specified in the "riis_filename" value of the configuration file `process_bison.json
<https://github.com/lifemapper/bison/tree/main/data/config/process_bison.json>`_. The
RIIS annotation process will place the annotated file, with a postfix of "_annotated"
in the same directory.

The latest US-RIIS data is present in this Github repository in the `data/input
<https://github.com/lifemapper/bison/tree/main/data/input>`_ directory. If a new
version is available, update it, and the following:

* Check/modify attributes in the RIIS_DATA class in the `constants.py
<https://github.com/lifemapper/bison/tree/main/bison/common/constants.py>`_ file:
* Edit the filename in DATA_DICT_FNAME
* Check the file header, and if necessary, edit the fields in SPECIES_GEO_HEADER and
matching fields in SPECIES_GEO_KEY, GBIF_KEY, ITIS_KEY, LOCALITY_FLD, KINGDOM_FLD,
SCINAME_FLD, SCIAUTHOR_FLD, RANK_FLD, ASSESSMENT_FLD, TAXON_AUTHORITY_FLD.


GBIF data
----------------

To get a current version of GBIF data:
* Create a user account on the GBIF website, then login and
* request the data by putting the following URL in a browser:
https://www.gbif.org/occurrence/search?country=US&has_coordinate=true&has_geospatial_issue=false&occurrence_status=present
* adding a restriction to occurrence data identified to species or a lower rank
will reduce the amount of data that will be filtered out.

The query will request a download, which will take some time for GBIF to assemble.
GBIF will send an email with a link for downloading the Darwin Core Archive, a
very large zipped file. Only the occurrence.txt file is required for data processing.
Rename the file with the date for clarity on what data is being used. Use
the following pattern gbif_yyyy-mm-dd.csv so that interim data filenames can be
created and parsed consistently. Note the underscore (_) between 'gbif' and the date, and
the dash (-) between date elements.

**Data location**: The GBIF data may be placed in any accessible directory, but must
be specified in the "gbif_filename" value of the configuration file `process_bison.json
<https://github.com/lifemapper/bison/tree/main/data/config/process_bison.json>`_. The
temporary output files, such as raw chunks, annotated chunks, and summaries of chunks,
will be placed in the directory specified in the "process_path" value of the
configuration file, with postfixes "_raw", "_annotate", and "_summary" respectively.
Final output files will be placed in the directory specified in the "output_path" value.

Verify that the file occurrence.txt contains GBIF-annotated records that will be the
primary input file. The primary input file will contain fieldnames in the first line
of the file, and those listed as values for GBIF class attributes with (attribute)
names ending in _FLD or _KEY should all be among the fields.

.. code-block::
unzip <dwca zipfile> occurrence.txt
mv occurrence.txt gbif_2023-08-23.csv

Check/modify attributes in the GBIF class in the `constants.py
<https://github.com/lifemapper/bison/tree/main/bison/common/constants.py>`_ file:

* Edit the filename in DATA_DICT_FNAME
* Verify that the DWCA_META_FNAME is still the correct file for field definitions.


Geographic Data for aggregation
----------------

**Data location**: The geospatial data may be placed in any accessible directory, but
must be specified in the "geo_path" value of the configuration file `process_bison.json
<https://github.com/lifemapper/bison/tree/main/data/config/process_bison.json>`_.
Relative filepaths to the data are specified in the REGION class of the file
`constants.py <https://github.com/lifemapper/bison/tree/main/bison/common/constants.py>`_ .

Census: State and County
................
Up-to-date census data including state and county boundaries are available at:
https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html

Shapefiles used for 2023 processing (2022 was not yet available at time of download):
Census, Cartographic Boundary Files, 2021
* https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html

**Counties**
* 1:500,000, cb_2021_us_county_500k.zip

Check/modify attributes in the REGION class in the `constants.py
<https://github.com/lifemapper/bison/tree/main/bison/common/constants.py>`_ file:
including: COUNTY["file"] for the filename and the keys in COUNTY["map"] for
fieldnames within that shapefile.

Census: AIANNH
.........

Up-to-date census data, including American Indian, Alaska Native, and Native Hawaiian,
are available at:
https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html

**American Indian/Alaska Native Areas/Hawaiian Home Lands**, AIANNH
* 1:500,000, cb_2021_us_aiannh_500k.zip

Check/modify attributes in the REGION class in the `constants.py
<https://github.com/lifemapper/bison/tree/main/bison/common/constants.py>`_ file:
including: AIANNH["file"] for the filename and the keys in AIANNH["map"] for
fieldnames within that shapefile.

Protected Areas Database, US-PAD (not currently used)
...................................................

U.S. Geological Survey (USGS) Gap Analysis Project (GAP), 2022, Protected Areas Database
of the United States (PAD-US) 3.0: U.S. Geological Survey data release,
https://doi.org/10.5066/P9Q9LQ4B.

The US-PAD dataset proved too complex to intersect at an acceptable speed. Intersecting
with 900 million records was projected to take 60 days. I tested this data in
multiple implementations (local machine or Docker containers) and with multiple versions
of the data (split by Dept of Interior, DOI, regions, or by states) and with multiple
Docker configurations, with no success. For this reason, US-PAD was abandoned until a
good solution can be found.

The next configuration to try will use different AWS tools. I was unable to insert
these data into AWS RDS, PostgreSQL with PostGIS (other polygon datasets succeeded).

The PAD data is divided into datasets by Department of Interior (DOI) region, but
those datasets are still too large and complex.
Download the PAD data for states, this also removes the need for another intersect.

Project the dataset to EPSG:4326 with commands like A sample script is in
`project_doi_pad.sh
<https://github.com/lifemapper/bison/tree/main/bison/data/project_doi_pad.sh>`_

Reported problems with projected dataset:
* TopologyException: side location conflict
* Invalid polygon with 3 points instead of 0 or >= 4

* US_PAD for DOI regions 1-12
* https://www.sciencebase.gov/catalog/item/62226321d34ee0c6b38b6be3
* Metadata: https://www.sciencebase.gov/catalog/item/622262c8d34ee0c6b38b6bcf
* Citation:
U.S. Geological Survey (USGS) Gap Analysis Project (GAP), 2022,
Protected Areas Database of the United States (PAD-US) 3.0:
U.S. Geological Survey data release, https://doi.org/10.5066/P9Q9LQ4B.
* Geographic areas in separate shapefiles for Designation, Easement, Fee,
Proclamation, Marine
* target GAP status 1-3
* 1 - managed for biodiversity - disturbance events proceed or are mimicked
* 2 - managed for biodiversity - disturbance events suppressed
* 3 - managed for multiple uses - subject to extractive (e.g. mining or logging) or OHV use
* 4 - no known mandate for biodiversity protection
* Citation: U.S. Geological Survey (USGS) Gap Analysis Project (GAP), 2022, Protected
Areas Database of the United States (PAD-US) 3.0: U.S. Geological Survey data
release, https://doi.org/10.5066/P9Q9LQ4B.
Empty file.

0 comments on commit 4008780

Please sign in to comment.