From 601cd7a1dfaf5477a29afbc85016f6b615940259 Mon Sep 17 00:00:00 2001
From: cristinamullin <46969696+cristinamullin@users.noreply.github.com>
Date: Thu, 2 May 2024 13:13:27 -0400
Subject: [PATCH] Update Harmonize_Pensacola.Rmd

---
 demos/Harmonize_Pensacola.Rmd | 236 +++++++++++++++++++++-------------
 1 file changed, 150 insertions(+), 86 deletions(-)

diff --git a/demos/Harmonize_Pensacola.Rmd b/demos/Harmonize_Pensacola.Rmd
index faec75c..cf87051 100644
--- a/demos/Harmonize_Pensacola.Rmd
+++ b/demos/Harmonize_Pensacola.Rmd
@@ -1,112 +1,163 @@
 ---
 title: "harmonize-wq in R"
+format: html
+editor: visual
 author: "Justin Bousquin, Cristina Mullin, Marc Weber"
-date: '2022-08-31'
-output: rmarkdown::html_vignette
+date: "`r Sys.Date()`"
+output:
+  rmarkdown::html_vignette:
+    toc: true
+    fig_caption: yes
+    fig_height: 8
+    fig_width: 8
 vignette: >
+  %\VignetteEncoding{UTF-8}
   %\VignetteIndexEntry{harmonize-wq in R}
-  %\usepackage[utf8]{inputenc}
   %\VignetteEngine{knitr::rmarkdown}
-editor_options:
+editor_options:
   chunk_output_type: console
+  markdown:
+    wrap: 72
 ---
 ```{r setup, include = FALSE}
-# Set chunk options
+library(knitr)
+
 knitr::opts_chunk$set(
-  collapse = TRUE,
-  comment = "#>"
+  echo = TRUE,
+  warning = FALSE,
+  message = FALSE
 )
 ```
-
-
 ## Overview
-Standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats using the harmonize_wq package. US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource with tools to query and retrieval data using python or R. Given the variety of data and variety of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it in a more analytic-ready format. Recognizing the definition of analysis-ready varies depending on the analysis, the harmonize_wq package is intended to be a flexible water quality specific framework to help:
-
-* Identify differences in data units (including speciation and basis)
-* Identify differences in sampling or analytic methods
-* Resolve data errors using transparent assumptions
-* Reduce data to the columns that are most commonly needed
-* Transform data from long to wide format
-
-Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.
-
- -
+Standardize, clean, and wrangle Water Quality Portal data into more
+analytic-ready formats using the harmonize_wq package. US EPA's Water
+Quality Portal (WQP) aggregates water quality, biological, and physical
+data provided by many organizations and has become an essential resource
+with tools to query and retrieve data using Python or R. Given the
+variety of data and variety of data originators, using the data in
+analysis often requires data cleaning to ensure it meets the required
+quality standards and data wrangling to get it in a more analytic-ready
+format. Recognizing the definition of analysis-ready varies depending on
+the analysis, the harmonize_wq package is intended to be a flexible
+water quality specific framework to help:
+
+- Identify differences in data units (including speciation and basis)
+- Identify differences in sampling or analytic methods
+- Resolve data errors using transparent assumptions
+- Reduce data to the columns that are most commonly needed
+- Transform data from long to wide format
+
+Domain experts must decide what data meets their quality standards for
+data comparability and any thresholds for acceptance or rejection.
 ## Installation & Setup
-#### Install the harmonize-wq package (Command Line)
+#### Option 1: Install the harmonize-wq Package Using the Command Line
 To install and set up the harmonize-wq package using the command line:
-1. If needed, re-install [miniforge](https://github.com/conda-forge/miniforge). Once miniforge is installed. Go to your start menu and open the Miniforge Prompt.
-2. At the Miniforge Prompt:
-    - conda create --name wq_harmonize
-    - activate wq_harmonize
-    - conda install geopandas pip dataretrieval pint
-    - may need to update conda
-        - conda update -n base -c conda-forge conda
-    - pip install harmonize-wq
-    - pip install git+https://github.com/USEPA/harmonize-wq.git (dev version)
+1. If needed, re-install
+    [miniforge](https://github.com/conda-forge/miniforge). Once
+    miniforge is installed, go to your start menu and open the Miniforge
+    Prompt.
+2. At the Miniforge Prompt, run:
+    - conda create --name wq_harmonize
+    - activate wq_harmonize
+    - conda install geopandas pip dataretrieval pint
+    - may need to update conda
+        - conda update -n base -c conda-forge conda
+    - pip install harmonize-wq
+    - pip install git+ (dev
+      version)
-
+#### Option 2: Install the harmonize-wq Package Using R
-#### Install the harmonize-wq package (R)
+**Alternatively**, you may be able to set up your environment and import
+the required Python packages using R.
-**Alternatively**, you may be able to set up your environment and import the required Python packages using the block of R code below:
+First, run the chunk below to install the reticulate package to use Python in R.
-```{r, results = 'hide', eval=FALSE}
-# If needed, install the reticulate package to use Python in R
+```{r, results = 'hide'}
 install.packages("reticulate")
 library(reticulate)
+```
-# The reticulate package will automatically look for an installation of Conda
-# However, you may specify the location if needed using options(reticulate.conda_binary = 'dir')
-options(reticulate.conda_binary = '~/AppData/Local/miniforge3/Scripts/conda.exe')
+Conda is required to use EPA's harmonize-wq package.
-# Create a new Python environment called "wq-reticulate"
-# Note that the environment name may need to include the full path (e.g. "~/AppData/Local/miniforge3/envs/wq_harmonize")
-conda_create("wq-reticulate")
+There are multiple installers available for Conda
+(see: ).
-# Install the following packages to the newly created environment
-conda_install("wq-reticulate", "geopandas")
-conda_install("wq-reticulate", "pint")
-conda_install("wq-reticulate", "dataretrieval")
+One example installer is
+[miniforge](https://github.com/conda-forge/miniforge). We use miniforge3 in this
+example.
-# Install the harmonize-wq package
-# This only works with py_install() (pip), which defaults to virtualenvs
-# Note that the environment name may need to include the full path (e.g. "~/AppData/Local/miniforge3/envs/wq_harmonize")
-py_install("harmonize-wq", pip = TRUE, envname = "wq-reticulate")
+Once miniforge3 (or another installer of your choice) is installed, the
+reticulate package will automatically look for the installation of Conda (conda.exe)
+on your computer.
-# To install the dev version of harmonize-wq from GitHub
-# Note that the environment name may need to include the full path (e.g. "~/AppData/Local/miniforge3/envs/wq_harmonize")
-py_install("git+https://github.com/USEPA/harmonize-wq.git@new_release_0-3-8", pip = TRUE, envname = "wq-reticulate")
+```{r, results = 'hide'}
+# options(reticulate.conda_binary = 'dir')
+```
-# Specify the Python environment to be used
-use_condaenv("wq_harmonize")
+However, you may still need to specify the location. If needed, update the code chunk below to specify the location of conda.exe on your computer.
-# Test that your Python environment is correctly set up
-# Both imports should return "Module(package_name)"
-import("harmonize_wq")
-import("dataretrieval")
+```{r, results = 'hide'}
+# Update the path in this chunk to specify the location of conda.exe on your computer
+# Note that you may need to provide the full path (e.g. "C:/Users/USERNAME/AppData/Local/miniforge3/Scripts/conda.exe")
+options(reticulate.conda_binary = "C:/Users/CMULLI01/AppData/Local/miniforge3/Scripts/conda.exe")
 ```
-
+Next, update the code chunk below to create a new Python environment in the envs
+folder on your computer called "wq_harmonize".
+
+```{r, results = 'hide'}
+# Note that the environment name may need to include the full path (e.g. "C:/Users/USERNAME/AppData/Local/miniforge3/envs/wq_harmonize")
+reticulate::conda_create("C:/Users/CMULLI01/AppData/Local/miniforge3/envs/wq_harmonize")
+```
-#### Import required libraries
+Install the following Python packages to the newly created
+Python environment called "wq_harmonize".
-The full list of dependencies that should be installed to use the harmonize-wq package can be found in [`requirements.txt`](https://github.com/USEPA/harmonize-wq/blob/new_release_0-3-8/requirements.txt). **Note that `reticulate::repl_python()` must be called to execute these commands using the reticulate package in R.**
+```{r, results = 'hide'}
+reticulate::conda_install("wq_harmonize", "geopandas") # Python package
+reticulate::conda_install("wq_harmonize", "pint") # Python package
+reticulate::conda_install("wq_harmonize", "dataretrieval") # Python package
+```
+
+Install EPA's harmonize-wq package.
+
+```{r, results = 'hide'}
+# Install the most recent release of the harmonize-wq package
+# This only works with py_install() (pip = TRUE), which defaults to using virtualenvs
+reticulate::py_install("harmonize-wq", pip = TRUE, envname = "wq_harmonize")
+
+# Uncomment below to install the development version of harmonize-wq from GitHub instead (optional)
+# py_install("git+https://github.com/USEPA/harmonize-wq.git@new_release_0-3-8", pip = TRUE, envname = "wq_harmonize")
+```
+
+Specify the Python environment to be used, "wq_harmonize", and test that your Python
+environment is set up correctly.
 ```{r}
-# Use reticulate to execute python commands
-reticulate::repl_python()
+# Specify environment to be used
+reticulate::use_condaenv("wq_harmonize")
+
+# Test set up is correct
+# Both imports should return "Module(package_name)"
+reticulate::import("harmonize_wq")
+reticulate::import("dataretrieval")
 ```
-```{python}
+#### Import additional required libraries
+
+The full list of dependencies that should be installed to use the
+harmonize-wq package can be found in
+[`requirements.txt`](https://github.com/USEPA/harmonize-wq/blob/new_release_0-3-8/requirements.txt).
+
+```{python, results = 'hide'}
 # Use these reticulate imports to test the modules are installed
 import harmonize_wq
 import dataretrieval
@@ -114,6 +165,8 @@ import os
 import pandas
 import geopandas
 import dataretrieval.wqp as wqp
+import pint
+import mapclassify
 from harmonize_wq import harmonize
 from harmonize_wq import convert
 from harmonize_wq import wrangle
@@ -122,24 +175,30 @@ from harmonize_wq import location
 from harmonize_wq import visualize
 ```
-
+## harmonize-wq Usage: FL Bays Example
-
+The following example illustrates a typical harmonization process using
+the harmonize-wq package on WQP data retrieved from Perdido and
+Pensacola Bays, FL.
-## Usage
+**Note that `reticulate::repl_python()` must be called first to execute
+these commands using the reticulate package in R.**
-The following example illustrates a typical harmonization process using the harmonize-wq package on WQP data retrieved from Perdido and Pensacola Bays, FL.
+```{r, results = 'hide'}
+# Use reticulate to execute python commands
+reticulate::repl_python()
+```
-First, determine an area of interest (AOI), build a query, and retrieve water temperature and Secchi disk depth data from WQP for the AOI using the dataretrieval package:
+First, determine an area of interest (AOI), build a query, and retrieve
+water temperature and Secchi disk depth data from the Water Quality Portal (WQP)
+for the AOI using the dataretrieval package:
-```{python, message=FALSE, warning=FALSE, error=FALSE}
+```{python, error = F}
 # File for area of interest (Pensacola and Perdido Bays, FL)
 aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'
 # Build query and get WQP data with dataretrieval
-query = {'characteristicName': ['Temperature, water',
-                                'Depth, Secchi disk depth',
-                                ]}
+query = {'characteristicName': ['Temperature, water', 'Depth, Secchi disk depth',]}
 # Use harmonize-wq to wrangle
 query['bBox'] = wrangle.get_bounding_box(aoi_url)
@@ -152,10 +211,14 @@ res_narrow, md_narrow = wqp.get_results(**query)
 res_narrow
 ```
-Next, harmonize and clean all results:
+Next, harmonize and clean all results using the harmonize.harmonize_all,
+clean.datetime, and clean.harmonize_depth functions.
-```{python, message=FALSE, warning=FALSE, error=FALSE}
-df_harmonized = harmonize.harmonize_all(res_narrow, errors='raise')
+Enter a ? followed by the function name, for example ?harmonize.harmonize_all,
+into the console for more details.
+```{python, error = F}
+df_harmonized = harmonize.harmonize_all(res_narrow, errors = 'raise')
 df_harmonized
 # Clean up the datetime and sample depth columns
@@ -164,9 +227,14 @@ df_cleaned = clean.harmonize_depth(df_cleaned)
 df_cleaned
 ```
-There are many columns in the data frame that are characteristic specific, that is they have different values for the same sample depending on the characteristic. To ensure one result for each sample after the transformation of the data, these columns must either be split, generating a new column for each characteristic with values, or moved out from the table if not being used.
+There are many columns in the data frame that are characteristic
+specific, that is they have different values for the same sample
+depending on the characteristic. To ensure one result for each sample
+after the transformation of the data, these columns must either be
+split, generating a new column for each characteristic with values, or
+removed from the table if not needed.
-```{python, message=FALSE, warning=FALSE, error=FALSE}
+```{python, error = F}
 # Split the QA_flag column into multiple characteristic specific QA columns
 df_full = wrangle.split_col(df_cleaned)
@@ -183,15 +251,11 @@ df_wide.head()
 Finally, the cleaned and wrangled data may be visualized as a map:
-```{python, message=FALSE, warning=FALSE, error=FALSE}
+```{python, error = F}
 # Get harmonized stations clipped to the AOI
 stations_gdf, stations, site_md = location.get_harmonized_stations(query, aoi=aoi_url)
 # Map average temperature results at each station
 gdf_temperature = visualize.map_measure(df_wide, stations_gdf, 'Temperature')
-gdf_temperature.plot(column='mean', cmap='OrRd', legend=True)
+gdf_temperature.plot(column = 'mean', cmap = 'OrRd', legend = True)
 ```
-
-
- -
\ No newline at end of file
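
Reviewer note on the query step: the vignette fills `query['bBox']` from `wrangle.get_bounding_box(aoi_url)`. As a rough illustration of the idea only, and not harmonize-wq's actual implementation, the sketch below derives a bounding box from a small hand-made GeoJSON polygon using just the Python standard library. The helper names `iter_positions` and `bbox_from_geojson`, the sample coordinates, and the comma-separated west,south,east,north ordering of the final string are all assumptions made for this example.

```python
import json


def iter_positions(coords):
    """Yield (lon, lat) pairs from arbitrarily nested GeoJSON coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        # A single position: [lon, lat] (any extra elements are ignored)
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_positions(part)


def bbox_from_geojson(geojson):
    """Return (west, south, east, north) over every feature geometry."""
    lons, lats = [], []
    for feature in geojson["features"]:
        for lon, lat in iter_positions(feature["geometry"]["coordinates"]):
            lons.append(lon)
            lats.append(lat)
    return min(lons), min(lats), max(lons), max(lats)


# Hypothetical area of interest: one rectangular polygon (not the real AOI file)
aoi = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-87.5, 30.3], [-87.0, 30.3], [-87.0, 30.7],
                         [-87.5, 30.7], [-87.5, 30.3]]]
      }
    }
  ]
}
""")

bbox = bbox_from_geojson(aoi)
# Assumed ordering for the query string: west,south,east,north
bbox_param = ",".join(str(value) for value in bbox)
print(bbox_param)
```

In the vignette itself the packaged function should be preferred, since it reads the AOI file directly and may handle coordinate reference systems that this sketch ignores.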