[13pt] Update CatFIM site filtering for non-CONUS regions to pull non-forecast points #1356

Open
EmilyDeardorff opened this issue Dec 4, 2024 · 8 comments · May be fixed by #1379
Labels
CatFIM NWS Flood Categorical HAND FIM enhancement New feature or request

Comments

@EmilyDeardorff
Contributor

EmilyDeardorff commented Dec 4, 2024

There are many sites in non-CONUS regions (AK, PR, HI) where we would like to run CatFIM but they are being excluded because they are not NWM forecast points.

Loosen filters so we are pulling non-forecast points in AK, PR, and HI from the WRDS API and adjust site processing/filtering downstream to prevent duplicate LIDs.

Preliminary research notes from Rob:

This guy is not an easy fix, and it was a bug that existed in at least 4.4.0.0 and 4.5.2.11. It all revolves around a WRDS field, nws_data.rfc_forecast_point, in our generate_categorical_fim_flows.py file.

Note: All metrics below are based on Stage site comparisons between 4.4.0.0, 4.5.2.11 and an earlier full-test version of 4.5.11.1.

Earlier versions (4.4.0.0 and 4.5.2.11) had this code (or similar):

all_meta_lists = [ ]
....
[Image: screenshot of the earlier code]

Notice:

  • First call to WRDS (conus_list)
    • with two key parts of note: selector=['all'] and must_include='nws_data.rfc_forecast_point'. This results in including HI and PR (and would have included "AK" if it had existed at that point).
  • Second call to WRDS (islands_list)
    • with those two key parts now being selector=['HI', 'PR', 'AK'] and must_include=None. It added another field of note in this story: select_by='state'.
  • Append the lists
    • all_meta_lists = conus_list + islands_list
  • A bunch of problems bubbled up with this in previous versions:

  1. All sites in HI and PR were included twice (once from each set), so all_meta_lists held duplicates of most (not all.. hold that thought) HI and PR sites. We never noticed this earlier. Once we added AK, we started to really notice the problem with all of the duplicate AK sites.
  2. The 'state' field used by select_by='state' has proven to be very unstable. Many records, especially in HI and PR (and AK), do not have that field populated. (Remember this fact, as it is important later too.)
  3. Our code saw all of the HI and PR records from the first WRDS call (at least 736 records for 4.5.2.11: 245 PR and 491 HI). The second call then picked up a little less, and not all of the same 736 records.
    • Why did it not pull another 736 in the second (islands) call? Because the second call added the state filter, and some of the HI and PR records didn't have a state value. Also of note: in 4.5.2.11, of the 736 HI and PR sites, none mapped successfully.
    • When we started coding 4.5.11.1 and discovered this, it was tentatively about to add at least another 710 AK records. 54 AK records did have the flag nws_data.rfc_forecast_point == True, of which only 11 mapped successfully.

Net, net.. to this point: we had a huge number of exact duplicate HI and PR records in all_meta_lists. Granted, our code attempted to filter out the dups later, but it proved to be not 100% stable (yes.. strange but true: it wasn't explicitly looking for dups, it just happened to catch most but not all). The fact that it didn't catch all of them is part of the problem, and some WIP files were showing the dup records in them. The details of the accidental dup catch are not relevant at this point; it isn't where we need to fix it.

This explains why we saw such a lopsided ratio of mapped to unmapped sites.
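
To make the duplication concrete, here is a minimal sketch using a made-up in-memory stand-in for WRDS. Everything in it is an assumption for illustration: `fake_get_metadata`, the flat record layout, the LIDs, and the `select_by` value used for the first call are not taken from the real code. It only shows why appending the two lists yields most, but not all, island sites twice.

```python
from collections import Counter

# Made-up records standing in for WRDS metadata (the real payload is nested
# and much richer). One PR site deliberately has no state value.
FAKE_WRDS = [
    {"lid": "aaat2", "state": "TX", "rfc_forecast_point": True},
    {"lid": "bbbh1", "state": "HI", "rfc_forecast_point": True},
    {"lid": "ccch1", "state": "HI", "rfc_forecast_point": False},
    {"lid": "dddp4", "state": None, "rfc_forecast_point": True},
    {"lid": "eeep4", "state": "PR", "rfc_forecast_point": True},
]


def fake_get_metadata(select_by, selector, must_include=None):
    """Hypothetical stand-in for the WRDS call; filters the in-memory list."""
    records = FAKE_WRDS
    if select_by == "state":
        # Records with a missing state value can never match a state selector.
        records = [r for r in records if r["state"] in selector]
    if must_include == "nws_data.rfc_forecast_point":
        records = [r for r in records if r["rfc_forecast_point"]]
    return records


# First call: everything, but only forecast points (HI and PR come along too).
conus_list = fake_get_metadata("nws_lid", ["all"], must_include="nws_data.rfc_forecast_point")
# Second call: selected states, no forecast-point requirement.
islands_list = fake_get_metadata("state", ["HI", "PR", "AK"], must_include=None)

all_meta_lists = conus_list + islands_list

lid_counts = Counter(r["lid"] for r in all_meta_lists)
duplicate_lids = sorted(lid for lid, n in lid_counts.items() if n > 1)
print(duplicate_lids)
```

In this toy data, the island forecast points that have a state value come back from both calls, while the PR site with no state value comes back only from the first call, which mirrors the "most but not all" overlap described above.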


Once I (Rob) discovered this, which showed up massively when we added AK, I researched it heavily and experimented with possible fixes, and none were easy or quick. After testing combinations of WRDS calls in code, reviewing WRDS records directly, and various other analyses, I presented all of this to Carson, Derek and Emily. It was decided to drop the entire second WRDS API call (aka.. only pulling records where nws_data.rfc_forecast_point == True from the first WRDS call). Result? No dups. Byproduct? We now get 54 AK sites (13 mapped), 4 PR sites (none mapped) and 2 HI sites (none mapped). This is what we are releasing for 4.5.11.1. The overall total of sites has naturally dropped significantly. It was decided to review this again with the new CatFIM 2.2, which we are doing now.

===============
None of the fix options are great:

  1. Ask WRDS to give us two new selectors, one for "CONUS" (no AK, HI, PR or other) and another for "NON_CONUS" (AK, PR, HI). Then we can make two calls like we did initially and not get any dups. We also would not need to use the unreliable select_by='state' tag, as that field has proven to be blank for a number of sites, especially in HI and PR.
    • Note: I tried to see if I could make/get a list of all states and call WRDS for multiple states at a time, or for every single state one at a time (i.e., 50-plus calls). Possible, but we need to make/get a state list, and it is hard to maintain. It is also pretty inefficient to make 50-plus WRDS calls, unless we have the two new selectors mentioned here. BUT.. we also lose records that do not have a state value. We might not be able to overcome that problem, and WRDS will likely have the same problem.
  2. We could just get all records as we do in 4.5.11.1 but leave out the nws_data.rfc_forecast_point == True filter. We won't get dups, and it does not attempt to filter by state. Then we iterate through all of them and look at the HUC number to decide which ones we want to keep. AKA: most CONUS records missing that flag we will likely throw out as we iterate, like we have done all along with the part 1 CONUS WRDS call. Not sure how many CONUS records this affects. This is the easiest and quickest way to fix it; not overly pretty, but it will work. We need to confirm that we want to throw out CONUS recs that do not have nws_data.rfc_forecast_point == True. But.. we can make different rules as we pick up the HUC number from each site rec, look at the first two digits and decide what to do. This basically means we do all of the filtering ourselves, which might be good anyway.

  3. Painfully ugly, but we could do the two calls, one for CONUS, one for AK, HI and PR, then do our own dup searches in code.

Remember.. no matter what we choose, we are only going to get a small amount that inundate for PR, HI and AK (only 13 for 4.5.11.1).
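
The HUC-based option (pull everything, then filter ourselves on the first two HUC digits) could look roughly like the sketch below. The function name `keep_site` and the rule it applies are assumptions for illustration, not the project's code. The region codes themselves are the standard HUC2 regions: 19 is Alaska, 20 is Hawaii, and 21 is the Caribbean (Puerto Rico and the U.S. Virgin Islands).

```python
# HUC2 regions outside CONUS: 19 = Alaska, 20 = Hawaii, 21 = Caribbean (PR, USVI).
NON_CONUS_HUC2 = {"19": "AK", "20": "HI", "21": "PR/VI"}


def keep_site(huc8, is_forecast_point):
    """Decide whether to keep a site, based on its HUC8 and forecast-point flag.

    Assumed rule (hypothetical): non-CONUS sites are kept regardless of the
    forecast-point flag; CONUS sites are kept only if they are forecast points.
    """
    if not huc8:
        return False  # no HUC, so we cannot tell which region the site is in
    # HUC8 codes may arrive as ints that lost a leading zero; pad back to 8 digits.
    huc2 = str(huc8).zfill(8)[:2]
    if huc2 in NON_CONUS_HUC2:
        return True
    return bool(is_forecast_point)


# Made-up example sites: (lid, huc8, is_forecast_point)
sites = [
    ("aaaa2", "19020401", False),  # Alaska, not a forecast point: kept
    ("bbbt2", "12040104", False),  # CONUS, not a forecast point: dropped
    ("ccct2", "12040104", True),   # CONUS forecast point: kept
    ("dddh1", 20010000, False),    # Hawaii, HUC stored as an int: kept
    ("eeex9", None, True),         # no HUC at all: dropped
]
kept = [lid for lid, huc8, flag in sites if keep_site(huc8, flag)]
print(kept)
```

Because the rule lives in one small function, different per-region rules (for example, a stricter rule for Alaska) can be swapped in without touching the WRDS call.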

==============
I have more details and stats if needed, but this gives the overview.

@EmilyDeardorff EmilyDeardorff added the CatFIM NWS Flood Categorical HAND FIM label Dec 4, 2024
@EmilyDeardorff EmilyDeardorff changed the title [8pt] Redo CatFIM site filtering for non-CONUS regions. [8pt] Update CatFIM site filtering for non-CONUS regions to pull non-forecast points Dec 4, 2024
@EmilyDeardorff EmilyDeardorff added the enhancement New feature or request label Dec 4, 2024
@EmilyDeardorff EmilyDeardorff changed the title [8pt] Update CatFIM site filtering for non-CONUS regions to pull non-forecast points [13pt] Update CatFIM site filtering for non-CONUS regions to pull non-forecast points Dec 9, 2024
@EmilyDeardorff EmilyDeardorff self-assigned this Dec 9, 2024
@EmilyDeardorff
Contributor Author

Current vs Proposed Metadata API Call Methods

Current (single) API Call Method: A single API call to get metadata for all regions, only selecting forecast points.

Proposed (double) API Call Method:* Two API calls. The first API call gets metadata for all regions, only selecting forecast points. The second API call gets metadata for Alaska, Hawaii, and Puerto Rico (regardless of whether a site is a forecast point). Then, a simple filtering process removes any duplicate points that had already been pulled in the first API call. Points without an NWS LID are also filtered out.

*Note: The double API call (minus the filtering out duplicates) was our method of getting metadata up until recently. This brought us a lot of points but caused some problems because of the duplicate points. This proposed method implements some filtering code to fix the issue of duplicate points.
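
A minimal sketch of the filtering step described above, assuming each metadata record is a dict and that the NWS LID sits at `record["identifiers"]["nws_lid"]` (that key path is an assumption here, so the accessor is passed in rather than hard-coded). `merge_without_duplicates` is a hypothetical helper name, not a function from the codebase.

```python
def merge_without_duplicates(forecast_points, non_conus_points, lid_of):
    """Combine two metadata lists, keeping the first record seen for each LID.

    `lid_of` is a caller-supplied function that extracts the NWS LID from a
    record. Records with no LID are dropped.
    """
    merged = []
    seen = set()
    for record in list(forecast_points) + list(non_conus_points):
        lid = lid_of(record)
        if not lid:
            continue  # drop points without an NWS LID
        key = lid.strip().lower()  # compare LIDs case-insensitively
        if key in seen:
            continue  # already pulled in by the first API call
        seen.add(key)
        merged.append(record)
    return merged


def lid_of(record):
    # Assumed layout: record["identifiers"]["nws_lid"]
    return (record.get("identifiers") or {}).get("nws_lid")


# Made-up records for illustration only.
first_call = [
    {"identifiers": {"nws_lid": "AAAA2"}},
    {"identifiers": {"nws_lid": "BBBH1"}},
]
second_call = [
    {"identifiers": {"nws_lid": "bbbh1"}},  # duplicate of a first-call site
    {"identifiers": {"nws_lid": "CCCH1"}},  # new non-forecast point
    {"identifiers": {"nws_lid": None}},     # no LID: filtered out
]

merged = merge_without_duplicates(first_call, second_call, lid_of)
print([lid_of(r) for r in merged])
```

Keeping the first occurrence means the record from the forecast-point call wins whenever a site appears in both lists.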

Stats by State: Comparing Current and Proposed API Call Methods

| State | Current Code: Sites pulled in single API call (only forecast points) | Proposed Update: Sites pulled in double API call + removing duplicates |
| --- | --- | --- |
| Alaska | 145 | 1950** |
| Hawaii | 2 | 495 |
| Puerto Rico | 5 | 238 |

| State | Current Code: Sites pulled in single API call (only forecast points) | Proposed Update: Sites pulled in double API call + removing duplicates |
| --- | --- | --- |
| Texas (similar size to Alaska) | 380 | 380 |
| Connecticut (similar size to Hawaii and Puerto Rico) | 23 | 23 |

**There are a lot of Alaska points with this filter method, so some additional filtering might be useful. It will be important to test whether this impacts runtime and model readability. An easy option would be to filter out the Alaska HUCs that are not included in the NWM processing. Alternatively, we could decide not to pull the non-forecast points for Alaska (but continue to pull them for Hawaii and Puerto Rico) if we find that it doesn’t add enough sites to warrant the processing time.

The next step is to compare the number of CatFIM sites that this proposed API call + filtering method gets against the CatFIM results that are in production.

@RobHanna-NOAA
Contributor

Yes. I agree that the double call with filtering is the better idea, unless we use a third option, which is to get WRDS to give us two new "selector" options, one for CONUS and one for "AK and others". A minor thing to keep track of with filtering on the CatFIM side: strangely enough, it looks like not all nws_lid records have the "state" value set, so we will not get as many records back from a "second" call to WRDS based on state. Yes.. it is weird to think that not all records have a state value, and this needs to be re-confirmed. If it is true that some lid records are missing "state" values, then we are fundamentally trying to compensate for a data issue. However, early tests around this topic bumped the number of possible AK, HI and PR sites to possibly double the number that came in from the second state-based WRDS call. More research is required.

@RobHanna-NOAA
Contributor

PS.. it appears that the "state" field is unstable from WRDS, but the HUC field is stable. So we could look at HUC values to determine who is who.

@EmilyDeardorff
Contributor Author

Evaluating instability in the metadata state column

Since the proposed filtering method would add a metadata call based on state, it is important to evaluate how reliable the state column(s) are.

In the metadata list pulled in the get_metadata() function, there are four different ‘state’ columns: ['nws_data']['state'], ['usgs_data']['state'], ['nws_preferred']['state'], and ['usgs_preferred']['state'].

I used the current get_metadata() formulation to pull a list of metadata to test this on. This command should pull records (that have forecast points) in spite of any ‘state’ column inconsistencies.

At a glance, the metadata shows that the ['usgs_data']['state'] column has a lot of ‘None’ values, whereas the others don’t seem to have that many.

[Image: state_column_disparity]

Summarizing the metadata dataframe by number of ‘None’ values per column supports this observation. There’s only one site in this metadata set where the nws_preferred_state value and the usgs_preferred_state value are both empty.

[Image: state_column_disparity_summary]

I then pulled the metadata for Alaska, Puerto Rico, and Hawaii using the state selector (and removing the ‘is_forecast_point’ filter that was previously present). This pulls a larger number of points, but only points that are properly connected to a state label.

The metadata in this set has a larger number of ‘None’ values in the nws_data_state and usgs_data_state columns, but no records where the nws_preferred_state or the usgs_preferred_state is null.

[Image: state_column_disparity_statepull_summary]


Essentially, my thoughts are that the get_metadata() function seems to get the metadata based on the ‘preferred’ state columns, which seem to be a lot more stable. The ‘preferred’ state columns also seem to cover for a lot of inconsistencies in the other ‘state’ columns. The records that don’t have a preferred state value seem to be missing a lot of other data, so I don’t think we’re missing out if we don’t pull them in.

Based on all that, I think it should be okay to pull in additional points for select regions using the ‘state’ columns.
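
For anyone wanting to repeat the "None values per state column" summary without the screenshots, a rough sketch is below. It assumes the metadata list is a list of nested dicts with the four group/field pairs named above; `count_missing_states` is a hypothetical helper, and the sample records are made up.

```python
# The four state columns discussed above, as (group, field) pairs.
STATE_COLUMNS = [
    ("nws_data", "state"),
    ("usgs_data", "state"),
    ("nws_preferred", "state"),
    ("usgs_preferred", "state"),
]


def count_missing_states(metadata_list):
    """Count records whose value is None (or absent) for each state column."""
    counts = {f"{group}_{field}": 0 for group, field in STATE_COLUMNS}
    for record in metadata_list:
        for group, field in STATE_COLUMNS:
            # A missing group (or a group set to None) counts as a missing state.
            value = (record.get(group) or {}).get(field)
            if value is None:
                counts[f"{group}_{field}"] += 1
    return counts


# Made-up sample records for illustration only.
sample = [
    {"nws_data": {"state": "HI"}, "usgs_data": {"state": None},
     "nws_preferred": {"state": "HI"}, "usgs_preferred": {"state": "HI"}},
    {"nws_data": {"state": None}, "usgs_data": {"state": None},
     "nws_preferred": {"state": "PR"}, "usgs_preferred": {"state": "PR"}},
    {"nws_data": {"state": "AK"}, "usgs_data": None,
     "nws_preferred": {"state": "AK"}, "usgs_preferred": {"state": "AK"}},
]
missing = count_missing_states(sample)
print(missing)
```

Running the same count on the forecast-point pull and on the state-based pull gives the two summaries side by side.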

@RobHanna-NOAA
Contributor

Oh cool. One test that I did was to see if I could use the state field and pass in a large number of state codes. That failed. The API can only have so many passed in at one time. We know it can handle 3 states at once, but it can't handle 20 states listed. So.. if we do it by state or preferred state, we might have to make 50+ (or maybe 50/3) calls..
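
For reference, the "50/3" idea amounts to batching the state codes. The sketch below is only an illustration: a batch size of 3 is the one size reported to work in this thread, the true API limit is unknown, and `batched` is a hypothetical helper. No WRDS call is made here.

```python
import math


def batched(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# A short made-up list; a real run would need a maintained list of all states.
state_codes = ["AK", "HI", "PR", "TX", "CT", "FL", "GA"]
batches = list(batched(state_codes, 3))
print(batches)

# With 50 state codes and 3 per call, the number of calls would be:
calls_needed = math.ceil(50 / 3)
print(calls_needed)
```

Each batch would become the selector for one state-based call, so 50 state codes would still mean 17 separate requests.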

@EmilyDeardorff
Contributor Author

Oh neat! Yeah, I don't think doing the CONUS calls by state would be necessary or efficient. I like the setup of the two API calls, one for all forecast points and one to pull more points for AK, HI, and PR.

@RobHanna-NOAA
Contributor

yup.. that makes the most sense. Then just figure out how to drop the dup AK, HI and PR from the two lists :)

@EmilyDeardorff
Contributor Author

Testing Results

I ran some tests with the updated API method (two API pulls + filtering out duplicate sites) and summarized the results below. It seems like this update handles the duplicate point issue well while still widening the selection criteria enough to produce more CatFIM results for AK, HI, and PR.

It will be worth separately looking into why HI and PR are not producing any stage-based CatFIM.

| | FIM 4.5.2.11 [Currently Live] | FIM 4.5.11.1 [Upcoming Release] | With API Update |
| --- | --- | --- | --- |
| Metadata Retrieval Method | two API pulls, no duplicate filtering | one API pull (only forecast points) | two API pulls, filters out duplicate LIDs |

Flow-based, Number of sites (Mapped | Unmapped)

| State/Region | FIM 4.5.2.11 [Currently Live] | FIM 4.5.11.1 [Upcoming Release] | With API Update |
| --- | --- | --- | --- |
| Alaska | NA* | 14 \| 39 | 31 \| 614 |
| Puerto Rico | 68 \| 177 | 4 \| 1 | 59 \| 181 |
| Hawaii | 50 \| 441 | 1 \| 1 | 47 \| 442 |

Stage-based, Number of sites (Mapped | Unmapped)

| State/Region | FIM 4.5.2.11 [Currently Live] | FIM 4.5.11.1 [Upcoming Release] | With API Update |
| --- | --- | --- | --- |
| Alaska | NA* | 13 \| 40 | 5 \| 209 |
| Puerto Rico | 0 \| 245 | 0 \| 5 | 0 \| 240 |
| Hawaii | 0 \| 491 | 0 \| 2 | 0 \| 489 |

*FIM 4.5.2.11 has no CatFIM points for Alaska because it was added later, in FIM 4.5.11.1
** Unmapped sites weren’t tallied because the test did not produce any mapped sites for the region.
