Gage Data USGS USACE
In troute's V3 mode, gage data are preprocessed in the nwm_network_preprocess function, which prepares a network for simulations involving both diffusive and non-diffusive components. The steps below describe how it ingests and uses USGS and USACE gage data, and how the processed gage data appear in its return variables:
- Function call:
connections, param_df, wbody_conn, gages = nnu.build_connections(supernetwork_parameters)
- Usage: nnu.build_connections is a function that constructs the network connections graph and provides a dictionary, gages, that maps gage IDs to segment IDs.
- Data Handling: The gages dictionary contains gage data, which is converted into a DataFrame (link_gage_df) for further use. link_gage_df is a DataFrame mapping segment IDs to gage IDs.
- Configuration Check: break_network_at_gages = False and streamflow_da.get('streamflow_nudging', False)
- Usage: The presence of streamflow data assimilation parameters (streamflow_da) determines whether the network should be broken at gage locations.
- Data Integration: If break_network_at_gages is True, the function incorporates gage locations into the network structure, allowing for more refined simulations.
- USGS and USACE Crosswalks: If the data assimilation parameters specify crosswalk files, these are read to obtain mappings of USGS and USACE gage data.
- Function Call:
waterbody_types_df, usgs_lake_gage_crosswalk, usace_lake_gage_crosswalk = nhd_io.read_reservoir_parameter_file(...)
- Purpose: The crosswalk files facilitate the integration of gage data into the network by providing mappings between reservoir and gage IDs.
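A minimal sketch of the configuration check described above, assuming the nudging flag sits in a streamflow_da sub-dictionary of the data assimilation parameters. The helper name should_break_at_gages is hypothetical and is not part of troute.

```python
def should_break_at_gages(data_assimilation_parameters):
    """Return True only when streamflow nudging is requested.

    Assumption: the flag is stored under
    data_assimilation_parameters['streamflow_da']['streamflow_nudging'].
    """
    # A missing or empty streamflow_da section means no nudging.
    streamflow_da = data_assimilation_parameters.get("streamflow_da") or {}
    return bool(streamflow_da.get("streamflow_nudging", False))


with_nudging = {"streamflow_da": {"streamflow_nudging": True}}
without_da = {}

print(should_break_at_gages(with_nudging))
print(should_break_at_gages(without_da))
```

Defaulting to False keeps the network unbroken whenever the streamflow data assimilation section is absent.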
link_gage_df:
- Data: Contains gage IDs and their corresponding segment IDs.
- Purpose: Provides a reference for segment-to-gage relationships within the network.
usgs_lake_gage_crosswalk:
- Data: Mapping of USGS gage IDs to lake IDs.
- Purpose: Used to cross-reference lake IDs with gage IDs for USGS reservoirs.
usace_lake_gage_crosswalk:
- Data: Similar to the USGS crosswalk but for USACE gage IDs.
- Purpose: Provides cross-references for USACE reservoirs.
waterbody_types_df:
- Data: Contains information about different types of waterbodies, including USGS and USACE types if specified.
- Purpose: Helps categorize waterbodies and integrate them into the network.
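As an illustration of how a gages dictionary could be turned into a segment-indexed DataFrame such as link_gage_df, here is a small pandas sketch. The IDs are placeholders, and the column name "gages" and index name "link" are assumptions made for this example, not confirmed troute conventions.

```python
import pandas as pd

# Placeholder mapping of gage IDs to segment IDs (values are made up).
gages = {"00000001": 101, "00000002": 102}

# Invert the mapping so that the segment ID becomes the index and the
# gage ID becomes the data column.
link_gage_df = pd.DataFrame(
    {"gages": list(gages.keys())},
    index=pd.Index(list(gages.values()), name="link"),
)

# Look up the gage attached to segment 101.
print(link_gage_df.loc[101, "gages"])
```

Indexing by segment ID makes it cheap to ask "which gage sits on this segment?" during network construction.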
In troute's V4 mode, gage data for data assimilation can be supplied either as BMI (Basic Model Interface) arrays or from files, as in V3.
USGS data are ingested in DataAssimilation when the NudgingDA class, which is derived from AbstractDA, is initialized. The class handles reading, storing, and updating datasets used specifically for nudging purposes in data assimilation. Upon initialization, the class sets up several parameters related to data assimilation and prepares member variables to store the data. It determines whether streamflow nudging is enabled and, if so, proceeds to handle USGS data ingestion.
Both branches of the NudgingDA class initialization (BMI versus file input) share the same task as far as gage data are concerned: reading and processing USGS timeslice data (which contain gage observations) and linking these observations to stream segments in the model's network. The resulting dataframe (usgs_df) is used in streamflow nudging and, optionally, for constructing reservoir dataframes. The following parameter sets are relevant for USGS data assimilation:
- data_assimilation_parameters (dict): Contains user-defined parameters related to data assimilation, such as the directory for USGS timeslice files and quality control settings.
- streamflow_da_parameters (dict): Contains streamflow-specific data assimilation parameters, including the gage-segment crosswalk file.
- run_parameters (dict): Contains configuration settings for running the model, such as the timestep (dt) and multiprocessing options (important mainly because of cpu_pool).
- network (object): Represents the hydrological network, including a DataFrame (link_gage_df) that links stream gages to stream segments.
- da_run (list): List of USGS timeslice files that are processed during the data assimilation run.
After NudgingDA, PersistenceDA is initialized, which processes reservoir outflow data for both USGS gages (as applicable) and USACE gages (all of which are found at larger reservoirs). The class handles the ingestion of time series data related to reservoir operations and outflow persistence, formats the data into pandas dataframes, and prepares it for further reservoir persistence analysis. In addition to the parameter sets that are important for NudgingDA setup (data_assimilation_parameters, etc.), the following parameter sets from reservoir_da_parameters control the USGS and USACE reservoir assimilation:
- reservoir_persistence_da (dict): Boolean flag for overall reservoir persistence
- reservoir_persistence_usgs (dict): Boolean flag for USGS reservoir persistence
- reservoir_persistence_usace (dict): Boolean flag for USACE reservoir persistence
- lake_gage_crosswalk = network.usace_lake_gage_crosswalk or network.usgs_lake_gage_crosswalk (dict): USGS/USACE lake-gage crosswalk
In the NudgingDA initialization, if BMI is to be used (not from_files), the BMI array usgs_Array is read, along with the corresponding time and station indexing information (datesSecondsArray_usgs and stationStringLengthArray_usgs, respectively). Subsequently, the library bmi_array2df ("a2df") is used to unflatten the BMI array into _usgs_df, which is a member of NudgingDA, using the following functions:
- a2df._unflatten_array
- a2df._time_retrieve_from_arrays
- a2df._stations_retrieve_from_arrays
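The idea of unflattening a 1D BMI array into a station-by-time dataframe can be sketched as follows. This is not the a2df implementation: the function unflatten_to_df is hypothetical, and the row-major layout (all dates of station 1, then all dates of station 2, and so on) is an assumption made for the example.

```python
import numpy as np
import pandas as pd


def unflatten_to_df(flat_values, n_dates, n_stations,
                    seconds_since_null, date_null, station_ids):
    """Rebuild a stations-by-time DataFrame from a flattened array.

    Assumes row-major storage: one block of n_dates values per station.
    """
    values = np.asarray(flat_values, dtype=float).reshape(n_stations, n_dates)
    # Timestamps are stored as seconds relative to a reference date.
    offsets = pd.to_timedelta(np.asarray(seconds_since_null), unit="s")
    times = pd.Timestamp(date_null) + offsets
    return pd.DataFrame(
        values,
        index=pd.Index(station_ids, name="station"),
        columns=times,
    )


df = unflatten_to_df(
    flat_values=[1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    n_dates=3,
    n_stations=2,
    seconds_since_null=[0, 300, 600],
    date_null="2021-01-01 00:00:00",
    station_ids=["A", "B"],
)
print(df)
```

Passing flat arrays plus their dimensions is what allows the data to cross the BMI boundary, where only simple numeric arrays are exchanged.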
In the event of data ingestion from files, the helper function _create_usgs_df is called, which is responsible for creating a DataFrame (usgs_df) that contains USGS gage observations. After extracting some information such as folder paths from the input parameters, it calls the function get_obs_from_timeslices from the nhd_io library, which reads USGS observation data from timeslice files, processes it, and outputs a dataframe containing the observations linked to the model's network segments or waterbodies. This function plays a crucial role in integrating real-world gage observations into hydrological models for tasks like streamflow data assimilation and model calibration.
Key parameters processed in get_obs_from_timeslices are:
- crosswalk_df (DataFrame): Dataframe containing a crosswalk that maps USGS gage IDs to model destination IDs (e.g., segment IDs or waterbody IDs).
- timeslice_files: A list of file paths to the USGS timeslice files that contain the observation data.
- qc_threshold (int): Quality control threshold; observations with quality flags below this value are considered invalid and are removed.
- interpolation_limit (int): The maximum gap duration (in minutes) over which missing observations can be interpolated.
- frequency_secs (int): The desired frequency (in seconds) at which observations should be resampled and interpolated.
- cpu_pool (int): Number of CPU cores to use for parallel processing.
Timeslice files are read in parallel, using Parallel from the joblib library. For each file in timeslice_files, the function _read_timeslice_file is called in parallel. The _read_timeslice_file function reads an individual timeslice file and returns two dataframes:
- Observation dataframe: contains the actual gage observations.
- Quality dataframe: contains quality flags corresponding to each observation.
After reading, the function checks whether all dataframes are empty (in which case it logs a debug message and returns an empty dataframe). If they are not all empty, the observation and quality dataframes from all timeslice files are concatenated into two dataframes:
- timeslice_obs_df: combined observation data.
- timeslice_qual_df: combined quality flags.
These observation and quality dataframes are then joined with the crosswalk dataframe on the gage ID (converted to strings), resulting in indexing by the crosswalk destination field, excluding non-numeric data (NaN).
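The read, concatenate, and crosswalk steps above can be sketched with in-memory stand-ins for the timeslice files. Note the swaps: this sketch uses the standard library's ThreadPoolExecutor rather than joblib, and read_fake_timeslice, combine_timeslices, FAKE_TIMESLICES, and the "link" column are all hypothetical names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# In-memory stand-ins for timeslice files (all values are made up).
FAKE_TIMESLICES = {
    "t0": {"time": "2021-01-01 00:00",
           "obs": {"G1": 1.0, "G2": 5.0}, "qual": {"G1": 1, "G2": 1}},
    "t1": {"time": "2021-01-01 00:15",
           "obs": {"G1": 1.5, "G2": 5.5}, "qual": {"G1": 1, "G2": 0}},
}


def read_fake_timeslice(name):
    """Return (observation, quality) frames for one timeslice."""
    record = FAKE_TIMESLICES[name]
    stamp = pd.Timestamp(record["time"])
    return (pd.DataFrame({stamp: record["obs"]}),
            pd.DataFrame({stamp: record["qual"]}))


def combine_timeslices(names, crosswalk_df, workers=2):
    """Read timeslices concurrently, concatenate, and relabel gages as links."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(read_fake_timeslice, names))
    obs_frames, qual_frames = zip(*results)
    if all(frame.empty for frame in obs_frames):
        return pd.DataFrame(), pd.DataFrame()
    obs_df = pd.concat(obs_frames, axis=1)
    qual_df = pd.concat(qual_frames, axis=1)

    def to_links(df):
        # Keep only gages present in the crosswalk, then swap gage ID for link ID.
        kept = df.loc[df.index.intersection(crosswalk_df.index)]
        return kept.rename(index=crosswalk_df["link"].to_dict())

    return to_links(obs_df), to_links(qual_df)


crosswalk_df = pd.DataFrame({"link": [101, 102]}, index=["G1", "G2"])
obs, qual = combine_timeslices(["t0", "t1"], crosswalk_df)
print(obs)
```

Each timeslice contributes one time column, so concatenating along columns yields a gage-by-time table before the crosswalk relabels the rows.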
Subsequently, quality control filtering masks negative and out-of-range quality flags (setting the corresponding values to NaN), and the dataframe is then resampled at the given frequency_secs, with gaps filled by the interpolate() function up to the limit set by interpolation_limit. The interpolation is performed with the dataframe transposed to time indexing, and the result is transposed back after resampling.
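A rough pandas sketch of the masking, resampling, and interpolation sequence follows. Several details are assumptions rather than confirmed troute behavior: the valid quality range of 0 to 1, the conversion of interpolation_limit from minutes to resampling steps, and the choice to leave trailing gaps unfilled. The function name qc_and_resample is hypothetical.

```python
import pandas as pd


def qc_and_resample(obs_df, qual_df, frequency_secs=300, interpolation_limit=59):
    """Mask suspect observations, then resample and interpolate in time.

    obs_df and qual_df share the same layout: rows are links, columns are times.
    """
    # Assumed valid range for quality flags: 0 <= flag <= 1.
    valid = (qual_df >= 0) & (qual_df <= 1)
    masked = obs_df.where(valid)

    # Transpose so that time is the index, as resample() requires.
    by_time = masked.T
    by_time.index = pd.to_datetime(by_time.index)

    # Convert the gap limit from minutes to a number of resampling steps.
    max_steps = max(int(interpolation_limit * 60 // frequency_secs), 1)
    resampled = (
        by_time.resample(pd.Timedelta(seconds=frequency_secs)).asfreq()
        .interpolate(method="linear", limit=max_steps, limit_area="inside")
    )
    # Transpose back to links-by-time.
    return resampled.T


times = pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:15"])
obs_df = pd.DataFrame([[1.0, 4.0], [5.0, 9.0]], index=[101, 102], columns=times)
qual_df = pd.DataFrame([[1, 1], [1, -1]], index=[101, 102], columns=times)

out = qc_and_resample(obs_df, qual_df)
print(out)
```

In this example the second observation on link 102 carries a negative flag, so it is masked and, because only interior gaps are filled, the remaining 5-minute steps for that link stay NaN.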
The processed usgs_df containing the ingested USGS data has the following contents and format:
- Index: Gage IDs from the geomodel.
- Columns: Timestamps at the specified frequency (e.g., every 5 minutes).
- Data: Interpolated USGS gage observations. In principle, any observations with quality flags below qc_threshold should have been removed; however, this feature is not implemented at the moment.
An example structure of usgs_df follows:
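The snippet below builds a small frame with the layout described above, purely to illustrate the shape. The identifiers and values are placeholders invented for this example and do not come from any real timeslice file.

```python
import pandas as pd

# Placeholder timestamps at a 5-minute frequency.
times = pd.date_range("2021-01-01 00:00", periods=3, freq="5min")

# Rows are gage IDs, columns are timestamps, cells are observations.
# NaN marks a gap that was too long to interpolate.
usgs_df = pd.DataFrame(
    [[1.20, 1.25, 1.30],
     [8.00, float("nan"), 8.40]],
    index=pd.Index(["gage_A", "gage_B"], name="gage_id"),
    columns=times,
)
print(usgs_df)
```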
After each simulation run, the run_results are processed in update_after_compute to update last_obs_df based on new observations, and usgs_df is updated in update_for_next_loop for the next iteration of the data assimilation loop.
In the PersistenceDA initialization, BMI is used if the from_files flag is set to False, in which case the following BMI array data and index data are read, first for USGS:
- usgs_reservoir_Array: 1D ndarray of USGS reservoir data
- datesSecondsArray_reservoir_usgs: dates in seconds relative to dateNull
- stationArray_reservoir_usgs and stationStringLengthArray_reservoir_usgs: 1D ndarray of the ASCII-encoded station array, along with a key indicating the length of each station ID (also an ndarray) for decoding
- nDates_reservoir_usgs / nStations_reservoir_usgs: dimensions of the resulting USGS reservoir dataframe
And then for USACE:
- usace_reservoir_Array: 1D ndarray of USACE reservoir data
- datesSecondsArray_reservoir_usace: dates in seconds relative to dateNull
- stationArray_reservoir_usace and stationStringLengthArray_reservoir_usace: 1D ndarray of the ASCII-encoded station array, along with a key indicating the length of each station ID (also an ndarray) for decoding
- nDates_reservoir_usace / nStations_reservoir_usace: dimensions of the resulting USACE reservoir dataframe
The imported BMI arrays are processed using the library bmi_array2df ("a2df") to unflatten the BMI arrays into reservoir_usgs_df and reservoir_usace_df, using the following functions:
- a2df._unflatten_array
- a2df._time_retrieve_from_arrays
- a2df._stations_retrieve_from_arrays
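Because station IDs cross the BMI boundary as a flat array of ASCII codes plus a companion array of string lengths, decoding amounts to slicing by length. The sketch below illustrates that idea only; encode_station_ids and decode_station_ids are hypothetical helpers, not the a2df functions.

```python
import numpy as np


def encode_station_ids(station_ids):
    """Flatten station ID strings into ASCII codes plus per-ID lengths."""
    codes = np.array([ord(ch) for sid in station_ids for ch in sid], dtype=int)
    lengths = np.array([len(sid) for sid in station_ids], dtype=int)
    return codes, lengths


def decode_station_ids(codes, lengths):
    """Rebuild station ID strings by slicing the code array by length."""
    station_ids, start = [], 0
    for length in lengths:
        stop = start + int(length)
        station_ids.append("".join(chr(int(code)) for code in codes[start:stop]))
        start = stop
    return station_ids


# Placeholder station IDs of differing lengths.
original = ["00000001", "ABC1"]
codes, lengths = encode_station_ids(original)
decoded = decode_station_ids(codes, lengths)
print(decoded)
```

The length array is what makes variable-length IDs recoverable from a single flat array without padding or delimiters.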
Further, the persistence parameter dataframes reservoir_usgs_param_df and reservoir_usace_param_df are created.
For USGS data: if usgs_df has already been read in, the following dataframes are created from network, as well as by resampling the USGS dataframe created in NudgingDA:
- gage_lake_df: usgs_lake_gage_crosswalk, indexed to USGS gage ID
- gage_link_df: link_gage_df, reindexed to gages
- link_lake_df: crosswalk of segment IDs to lake IDs
- usgs_df_15min: the regular usgs_df resampled to 15 minutes, resulting in reservoir_usgs_df.
The usgs_df dataframe is then subset and re-indexed to the lake IDs, if they are available, instead of network link IDs, and the dataframe reservoir_usgs_param_df, which will eventually hold the persistence parameters, is initialized.
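The resample-and-reindex step can be sketched in pandas as below. The function to_reservoir_df and the "lake_id" column are hypothetical, and picking the value at each 15-minute mark with asfreq() is an assumption; the actual code may aggregate differently.

```python
import pandas as pd


def to_reservoir_df(usgs_df, link_lake_df):
    """Resample a links-by-time frame to 15 minutes and relabel rows by lake ID."""
    by_time = usgs_df.T
    by_time.index = pd.to_datetime(by_time.index)
    # Keep the observation at each 15-minute mark.
    df_15min = by_time.resample("15min").asfreq().T

    # Keep only links that sit on a lake, then swap link IDs for lake IDs.
    lake_links = df_15min.index.intersection(link_lake_df.index)
    out = df_15min.loc[lake_links].rename(index=link_lake_df["lake_id"].to_dict())
    out.index.name = "lake_id"
    return out


# Placeholder 5-minute observations on two links; only link 102 is at a lake.
times = pd.date_range("2021-01-01 00:00", periods=7, freq="5min")
usgs_df = pd.DataFrame(
    [[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6],
     [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]],
    index=[101, 102],
    columns=times,
)
link_lake_df = pd.DataFrame({"lake_id": [9001]}, index=[102])

reservoir_usgs_df = to_reservoir_df(usgs_df, link_lake_df)
print(reservoir_usgs_df)
```

Reusing the already-ingested usgs_df in this way avoids reading the timeslice files a second time for the reservoir module.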
If usgs_df does not exist yet, reservoir_usgs_df and reservoir_usgs_param_df are instead read in by the function _create_reservoir_df, which is a wrapper for the function get_obs_from_timeslices from the nhd_io library. The USGS timeslice list is passed to the latter through da_run as follows:
reservoir_usgs_df, reservoir_usgs_param_df = _create_reservoir_df(data_assimilation_parameters, reservoir_da_parameters, streamflow_da_parameters,
run_parameters, network, da_run, lake_gage_crosswalk = network.usgs_lake_gage_crosswalk, res_source = 'usgs')
The equivalent USACE reservoir dataframes are read in without first checking whether usgs_df already exists; the call to get_obs_from_timeslices through _create_reservoir_df is analogous.
As in NudgingDA, the update_for_next_loop function is called for each iteration of the data assimilation loop, where reservoir_usgs_param_df and reservoir_usace_param_df are updated.