Commit: doc, outputs
zzeppozz committed Feb 12, 2024
1 parent a936101 commit 7a81126
Showing 2 changed files with 26 additions and 26 deletions.
README.md (46 changes: 23 additions & 23 deletions)

@@ -7,7 +7,7 @@
3. Assemble static ancillary inputs (local, AWS)
4. Resolve RIIS records to GBIF accepted taxa (local, copy to AWS S3)
5. Subset GBIF data to BISON (AWS Redshift/Glue)
6. Load BISON subset and ancillary inputs to Redshift
7. Annotate BISON subset with regions and RIIS status (AWS Redshift)
8. Summarize BISON subset by regions and RIIS status (AWS Redshift)
9. Create Presence Absence Matrix (PAM) and compute statistics (local)
@@ -17,8 +17,8 @@
### Download the repository

The [LmBISON repository](https://github.com/lifemapper/bison) can be installed by
downloading it from GitHub. This code repository contains Python code, scripts for AWS
tools, Docker composition files, configuration files, and test data for creating the
outputs.

Type `git` at the command prompt to see if you have git installed. If you do not,
@@ -33,7 +33,7 @@ the command line:

When the clone is complete, move to the top directory of the repository, `bison`.
All hands-on commands will be executed in a command prompt window from this
directory location.
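
For reference, cloning the repository and moving into its top directory typically looks
like the following (a sketch, assuming the HTTPS clone URL of the repository linked above):

```commandline
git clone https://github.com/lifemapper/bison.git
cd bison
```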

### Install dependencies
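
The full dependency instructions are not shown here; as a rough sketch, a typical Python
virtual environment setup (assuming a requirements.txt file at the repository root) might be:

```commandline
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```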

@@ -63,8 +63,8 @@ Use the most current version of the United States Register of Introduced and Invasive Species (US-RIIS)
* Year 4 data: https://doi.org/10.5066/P95XL09Q
* Year 5 data: TBA

The current file is named US-RIIS_MasterList_2021.csv, and is available in the
data/input directory of this repository. Upload this file to
s3://<S3 bucket>/input_data
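
One way to do this upload, assuming the AWS CLI is installed and configured and
`<S3 bucket>` is replaced with the actual bucket name, is:

```commandline
aws s3 cp data/input/US-RIIS_MasterList_2021.csv s3://<S3 bucket>/input_data/
```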

### Census data for county/state
@@ -91,32 +91,32 @@ Upload the shapefile to s3://<S3 bucket>/input_data

### US Protected Areas Database (US-PAD)

We were unable to intersect these data with records because of the complexity of the
shapefiles. Next time we will try using AWS Redshift with a "flattened" version of the data.

Try:
* PAD-US 3.0 Vector Analysis File https://www.sciencebase.gov/catalog/item/6196b9ffd34eb622f691aca7
* PAD-US 3.0 Raster Analysis File https://www.sciencebase.gov/catalog/item/6196bc01d34eb622f691acb5

These are "flattened" though spatial analysis prioritized by GAP Status Code
(ie GAP 1 > GAP 2 > GAP > 3 > GAP 4), these are found on bottom of
https://www.usgs.gov/programs/gap-analysis-project/science/pad-us-data-download page.

The vector datasets are available only as ESRI Geodatabases. The raster datasets are in
Erdas Imagine format; they appear to contain integers between 0 and 92, but may have
additional attributes for those classifications. Try both in AWS Redshift.

Upload the raster and vector flattened zip files (test which works best later) to
s3://<S3 bucket>/input_data
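
The same AWS CLI approach works here; the archive names below are placeholders for
whichever PAD-US vector and raster analysis files were downloaded:

```commandline
aws s3 cp <PAD-US vector analysis>.zip s3://<S3 bucket>/input_data/
aws s3 cp <PAD-US raster analysis>.zip s3://<S3 bucket>/input_data/
```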

## 4. Resolve RIIS records to GBIF accepted taxa

Run this locally until it is converted to an AWS step. Make sure that the
data/config/process_gbif.json file is present. From the bison repo top directory,
making sure the virtual environment is activated, run:

```commandline
python process_gbif.py --config_file=data/config/process_gbif.json resolve
```

Upload the output file (like data/input/US-RIIS_MasterList_2021_annotated_2024-02-01.csv)
to s3://<S3 bucket>/input_data
@@ -129,7 +129,7 @@ For all Redshift steps, do the following with the designated script:

* In the AWS Redshift console, open `Query Editor`, and choose the `Script Editor` button.
* Open an existing script or create a new one (with +) and copy in the appropriate script.
* Update the date string for this processing step to the first day of the current month;
  for example, replace all occurrences of 2024_01_01 with 2024_02_01 (see the sketch
  after this list).
* Run
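
As a sketch of that date-string replacement, the substitution can be done with `sed` on a
local copy of a script before pasting it into the Query Editor (the script name here is
just one of the scripts referenced below):

```commandline
sed 's/2024_01_01/2024_02_01/g' aws_scripts/rs_load_ancillary_data.sql > /tmp/rs_load_ancillary_data_2024_02_01.sql
```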

@@ -139,12 +139,12 @@ GBIF Input

* Use the Global Biodiversity Information Facility (GBIF) Species Occurrences on the
AWS Open Data Registry (ODR) in S3. https://registry.opendata.aws/gbif/
* These data are updated on the first day of every month, with the date string in
the S3 address.
* The date string is appended to all outputs, and referenced in the subset scripts
(Redshift and Glue)
* The data are available in each region; stay within the same AWS ODR region as the
BISON bucket (see the listing sketch below).
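
To check which monthly snapshots are currently published, the ODR bucket can be listed
with the AWS CLI; the bucket name below is an assumption for the us-east-1 copy and
should be verified against the registry page:

```commandline
aws s3 ls --no-sign-request s3://gbif-open-data-us-east-1/occurrence/
```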

### Redshift subset (3 min)

@@ -157,7 +157,7 @@
* Run
* If this method is used, the results must still be loaded into Redshift for steps 7 and 8

## 6. Load ancillary inputs from AWS S3 to AWS Redshift

* Perform Redshift steps, using script: `aws_scripts/rs_load_ancillary_data.sql`

@@ -170,7 +170,7 @@ GBIF Input
## 8. Summarize BISON subset by regions then export to S3 (AWS Redshift)

* Perform Redshift steps, using script: `aws_scripts/rs_aggregate_export`
* Outputs annotated records as CSV files in bucket/folder
s3://bison-321942852011-us-east-1/out_data
* aiannh_lists_<datestr>_000.csv
* state_lists_<datestr>_000.csv
@@ -181,9 +181,9 @@ GBIF Input

## 9. Create Presence Absence Matrix (PAM) and compute statistics (local)

* On a local machine, with the virtual environment activated, run the script
aws_scripts/bison_matrix_stats.py

```commandline
python aws_scripts/bison_matrix_stats.py
```
aws_scripts/bison_matrix_stats.py (6 changes: 3 additions & 3 deletions)

@@ -661,7 +661,7 @@ def write_to_csv(self, filename):

"""
from aws_scripts.bison_matrix_stats import (
read_s3_parquet_to_pandas, reframe_to_heatmatrix, reframe_to_pam, get_logger,
SiteMatrix)
import boto3
@@ -737,14 +737,14 @@ def write_to_csv(self, filename):
# whittaker_new = pam.whittaker()
# species_ct_new = pam.num_species
# site_ct_new = pam.num_sites
#
# # new site stats
# beta_new = pam.beta()
# alpha_new = pam.alpha()
# alpha_proportional_new = pam.alpha_proportional()
# phi_new = pam.phi()
# phi_average_proportional_new = pam.phi_average_proportional()
#
# # new species stats
# omega_new = pam.omega()
# omega_proportional_new = pam.omega_proportional()
