As a tester, I want to test the event based evaluation solution implemented in #130 #382

HankHerr-NOAA · 2025-01-21T17:44:10Z

See #130. In this ticket, I'll track testing of the new capability. I'll start by building the latest code and running a very basic example. I'll then work up a test plan with different tests to perform, tracked via checkboxes in the description of this ticket (i.e., below). Note that testing will make use of the standalone, both in-memory and using a database. COWRES testing will come later.

Let me pull the code and make sure I can build it.

Thanks,

Hank

==========

Tests to be performed are below. As I work down the list, in some cases, tests higher in the list will be updated to include the mentioned capability being tested. I'll essentially be throwing the kitchen sink at the capability to see if/when something "breaks". Other tests may be added as I progress through this list.

HankHerr-NOAA · 2025-01-21T18:12:06Z

As my initial evaluation, I used observations for ABRN1 streamflow (part of the HEFS Test A evaluations), and simulations for its NWM feature id (acquired from WRDS), and came up with this:

label: Testing Event Based
observed:
  label: OBS Streamflow
  sources: /home/ISED/wres/wresTestData/issue92087/inputs/ABRN1_QME.xml
  variable: QME
  feature_authority: nws lid
  type: observations
  time_scale:
    function: mean
    period: 24
    unit: hours
predicted:
  label: "19161749 RetroSim CSVs"
  sources:
  - /home/ISED/wres/nwm_3_0_retro_simulations/wfo/OAX/19161749_nwm_3_0_retro_wres.csv.gz
  variable: streamflow
  feature_authority: nwm feature id
  type: simulations
features:
  - {observed: ABRN1, predicted: '19161749'}
time_scale:
  function: mean
  period: 24
  unit: hours

event_detection: observed

It appears as though 64 events were identified with standard statistics output; here is a sampling of the last few pools listed:

2025-01-21T18:05:54.537+0000  [Pool Thread 5] INFO PoolReporter - [60/64] Completed statistics for a pool in feature group 'ABRN1-19161749'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: 1985-09-25T06:00:00Z, Latest valid time: 1985-09-28T06:00:00Z, Earliest lead duration: PT-2562047788015215H-30M-8S, Latest lead duration: PT2562047788015215H30M7.999999999S )
2025-01-21T18:05:54.538+0000  [Pool Thread 2] INFO PoolReporter - [61/64] Completed statistics for a pool in feature group 'ABRN1-19161749'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: 1996-11-18T06:00:00Z, Latest valid time: 1996-12-22T06:00:00Z, Earliest lead duration: PT-2562047788015215H-30M-8S, Latest lead duration: PT2562047788015215H30M7.999999999S )
2025-01-21T18:05:54.552+0000  [Pool Thread 1] INFO PoolReporter - [62/64] Completed statistics for a pool in feature group 'ABRN1-19161749'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: 1987-03-22T06:00:00Z, Latest valid time: 1987-10-03T06:00:00Z, Earliest lead duration: PT-2562047788015215H-30M-8S, Latest lead duration: PT2562047788015215H30M7.999999999S )
2025-01-21T18:05:54.562+0000  [Pool Thread 6] INFO PoolReporter - [63/64] Completed statistics for a pool in feature group 'ABRN1-19161749'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: 1998-11-04T06:00:00Z, Latest valid time: 1998-12-09T06:00:00Z, Earliest lead duration: PT-2562047788015215H-30M-8S, Latest lead duration: PT2562047788015215H30M7.999999999S )
2025-01-21T18:05:54.563+0000  [Pool Thread 4] INFO PoolReporter - [64/64] Completed statistics for a pool in feature group 'ABRN1-19161749'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: 1998-03-21T06:00:00Z, Latest valid time: 1998-09-07T06:00:00Z, Earliest lead duration: PT-2562047788015215H-30M-8S, Latest lead duration: PT2562047788015215H30M7.999999999S )

I don't have a good way to view them graphically at the moment. Let me see if I can spin up a quick-and-dirty spreadsheet to support viewing the XML observations and CSV simulations.

Hank

HankHerr-NOAA · 2025-01-21T18:13:22Z

That 7-month event in 1987 is kind of odd: "Earliest valid time: 1987-03-22T06:00:00Z, Latest valid time: 1987-10-03T06:00:00Z". Again, I need to visualize the time series so that I can understand where the events are coming from.
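Until a plot is available, one quick way to spot unusually long events is to pull the valid-time bounds out of the PoolReporter messages and compute each event's duration. This is only a minimal sketch, assuming the log format shown above; the function name and the shortened sample lines are illustrative and not part of the WRES.

```python
import re
from datetime import datetime

# Matches the valid-time bounds inside a PoolReporter "Completed statistics" message.
WINDOW = re.compile(r"Earliest valid time: (\S+?), Latest valid time: (\S+?),")
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def event_durations(log_text):
    """Return (start, end, whole days) for each pool time window found in the text."""
    events = []
    for start, end in WINDOW.findall(log_text):
        t0 = datetime.strptime(start, TIME_FORMAT)
        t1 = datetime.strptime(end, TIME_FORMAT)
        events.append((t0, t1, (t1 - t0).days))
    return events


# Shortened copies of two of the messages above (only the part the pattern needs).
sample = (
    "Earliest valid time: 1985-09-25T06:00:00Z, Latest valid time: 1985-09-28T06:00:00Z, ...\n"
    "Earliest valid time: 1987-03-22T06:00:00Z, Latest valid time: 1987-10-03T06:00:00Z, ...\n"
)

# List the longest events first.
for t0, t1, days in sorted(event_durations(sample), key=lambda e: -e[2]):
    print(f"{t0:%Y-%m-%d} to {t1:%Y-%m-%d}: {days} days")
```

Sorting by duration puts the multi-month events at the top, which are the ones worth checking against the time series first.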

Hank

HankHerr-NOAA · 2025-01-21T18:52:23Z

Here is a plot of the observations and simulation for ABRN1 streamflow, with the NWM retrosim averaged over the 24 hours ending at the valid time of each 24-hour observation (I believe that's how WRES would rescale it by default; observations are blue):

It's crude, but the spreadsheet should allow me to focus in on individual events identified by the WRES to see if they make sense. I'll start by examining the data for Mar 22, 1987, through Oct 3, 1987, which the WRES identified as one long event.
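For reference, the spreadsheet averaging described above can be sketched in a few lines. This is a minimal sketch, assuming the averaging window is the 24 hours ending at (and including) the observation's valid time and that a window with any missing hour is skipped; whether the WRES treats the window bounds and missing values exactly this way is not confirmed here. The function name and the synthetic values are illustrative only.

```python
from datetime import datetime, timedelta


def mean_ending_at(hourly, end_time, period_hours=24):
    """Average hourly values with valid times in (end_time - period, end_time].

    Returns None when any hour in the window is missing (no lenience).
    """
    times = [end_time - timedelta(hours=h) for h in range(period_hours)]
    if any(t not in hourly for t in times):
        return None
    return sum(hourly[t] for t in times) / period_hours


# Synthetic hourly series covering 48 hours; these are not real NWM values.
start = datetime(1987, 3, 21, 7)
hourly = {start + timedelta(hours=i): float(i) for i in range(48)}

# The 24-hour observations in the log output are valid at 06:00Z.
for obs_time in (datetime(1987, 3, 22, 6), datetime(1987, 3, 23, 6)):
    print(obs_time.isoformat(), mean_ending_at(hourly, obs_time))
```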

Hank

HankHerr-NOAA · 2025-01-21T19:04:41Z

Here is an image zoomed into about 3/22/1987 - 10/3/1987 (plus some buffer around it):

I can't really look at that and understand why the WRES would see this as a single, long event, based on the observations. I'm now going to look at a short event to see if I can make sense of the output.

Hank

james-d-brown · 2025-01-21T19:07:54Z

For what it's worth...

The events themselves won't always make a ton of sense and you will find that they are rather sensitive to the parameters.

It's the algorithm we have for now, but I am pretty certain it's an accurate implementation and it has a decent set of unit tests, including those ported across from python. There is probably no algorithm that produces completely satisfactory results, though.

Spending time looking at the events may lead to a different/better set of default parameter values, but it probably won't.

On the whole, it produces vaguely sensible looking events for synthetic time-series with strong peak signals. It starts to look more questionable for (some) real time-series.

I don't want to sway your UAT but, TBH, I am personally more concerned about the range of interactions between event detection and various other features and whether it produces any/sensible results across all of those various possibilities - there's only so much that can be covered with integration tests.

HankHerr-NOAA · 2025-01-21T19:10:53Z

James:

Thanks. I was about to post the same conclusion that the parameters are probably just not optimal for this location, given the number of multi-week/month events I'm seeing, and I'm not going to spend time trying to optimize them.

Next step is to generate outputs that make some sense to me. I'm going to add graphics to the evaluation to help visualize what the WRES produces.

Hank

HankHerr-NOAA · 2025-01-21T19:12:18Z

Oh, and understood about wanting me to look at the interactions between features. I'll start working on that once I can make sense of a "simple" case using real data.

Hank

james-d-brown · 2025-01-21T19:24:30Z

On the visualization, as I noted in #130, if there were a quick win for visualizing these events, I would've implemented it, but there really isn't. The quickest way would be a sidecar in the event detection package that generated visualizations, but that is pretty yucky as it bypasses our graphics clients. The best way would be to add event detection as a metric as well as a declaration option in itself. That way, you could calculate and visualize the detected events alone using the normal workflow, but that was going to be a lot of work as it would require a new metric/statistics class with output composed of two instants (start and end times). Will have to wait but, honestly, it would probably be better for event detection to be a separate web service.

HankHerr-NOAA · 2025-01-21T19:31:33Z

Understood. That's why I'm trying to come up with something, myself, that I can do quickly enough to perform some basic checks of the output.

Here are the Pearson correlation coefficients for the various events:

Okay. I think I need to just test out some of the different options provided in the wiki to see what happens.

Hank

HankHerr-NOAA · 2025-01-21T19:33:42Z

Oh, and if there are other metrics I should be looking at, let me know. I was thinking the time-to-peak metric probably won't be meaningful until I add forecasts to the evaluation, which I do plan to do. For now, since I'm evaluating simulations, I just looked at a couple of single-valued metrics and shared only the correlation.

Hank

james-d-brown · 2025-01-21T19:38:18Z

The example in the wiki uses simulations. In general, event detection won't work for forecast datasets because they are too short and will only capture partial events, unless it's a long-range forecast, perhaps. Anyway, see the example in the wiki, which is supposed to be exemplary as much as illustrative. I think that is the best/most reasonable application of event detection as it stands, i.e., detecting events for both observations and predictions simultaneously and then comparing the detected events in terms of peak timing error (or possibly other hydrologic signatures that we could add in future). Anything with forecasts is going to be much more dubious, unless you use a non-forecast dataset for detection. Anything with traditional, pool-based, statistics is going to be somewhat dubious too, IMHO.

HankHerr-NOAA · 2025-01-21T19:39:22Z

Can I compute the average correlation across the identified events? I don't know how useful that would be; I just figured I'd give it a shot. Looking at the wiki, I don't think I can. I know you can summarize across feature pools, but I don't think we can summarize across reference date/time pools, right? I'll check the wiki.

For now, I just started an evaluation using the more complex declaration in,

https://github.com/NOAA-OWP/wres/wiki/Event-detection#how-do-i-declare-event-detection

just to see what happens.

Hank

james-d-brown · 2025-01-21T19:44:24Z

No, that would need to be added as an aggregation option for summary statistics. I think I speculated in #130 about this, so probably worth a ticket for a future release. But again, I am personally a bit doubtful about traditional statistics generated for event periods. In most cases, this sort of analysis probably makes more sense with thresholds, like an analysis for flows above a flood threshold.

james-d-brown · 2025-01-21T19:47:09Z

Anyway, I am going to mostly leave you alone now unless you have questions as I have probably already swayed you too much. I just wanted to emphasize that the events are quite sensitive to the parameters and the timing error type analysis probably makes most sense for this new feature, but users will do what they do and you are probably representing their thought process too...

HankHerr-NOAA · 2025-01-21T19:55:23Z

The run using the parameters in the aforementioned section of wiki does yield significantly different events:

So, yeah, sensitive to parameters.

I'm going to try to work up a checklist of things to look at as part of this testing now. I still have questions, but they should be answered as I work through the tests.

Hank

HankHerr-NOAA · 2025-01-21T20:12:43Z

I have a check list as a starting point. I'm sure that some of the items are nonsensical, but I want to see what happens when I combine different options. As I work through the list, I will likely use previously successful evaluations to add the new, specified feature, in order to see how the results are impacted. I'm probably overlooking tests to perform; I'll add those when I discover the oversight.

Thanks,

Hank

HankHerr-NOAA · 2025-01-21T20:17:38Z

FYI... all test declarations will be kept in the standard .../wresTestData/github382 directory. I've also already created a GitHub_382 folder in the WRES Redmine Ticket Large Data Sets folder in Google Drive for sharing data. It has the observed and simulated data sets I've used for testing so far (though they aren't particularly "large").

Hank

HankHerr-NOAA · 2025-01-22T16:47:55Z

I'm not sure I'm going to get to this today, except perhaps during the office hours (if no one is on). I've been working on the QPR slide deck and dealing with the proxy outage.

Hank

HankHerr-NOAA · 2025-01-22T19:49:57Z

Just talked with James some during the office hours. I ran an evaluation of single valued forecasts from WRDS against observations used for HEFS, with default event-based evaluation parameters, and obtained results for the pearson correlation coefficient, time to peak error, and sample size. First, we noticed that the evaluation.csv did not properly convey the reference_dates from the declaration; I need to write a ticket for that.

Second, James explained that each event period will yield a time-to-peak error for each single-valued forecast overlapping that period, and that each such error will be stored in the evaluation.csv with both the value and the forecast's corresponding issued time. It was hard for me to see this when looking at the CSV directly, but it became clearer when I looked at the CSV through Excel; here is a screenshot:

Each time to peak error is presented as an issued time and value pair on two rows. The output image would then look like:

Since I'm zoomed out so far, it appears that the points line up vertically, but that is actually not the case. If you pay close attention to the 0-line at the top, you'll see that they do not exactly overlap.

I think this is reasonable output given the declaration I employed. I'll do a bit more testing, though, before checking the single-valued forecast box.

Hank

HankHerr-NOAA · 2025-01-22T20:01:43Z

I reported #385 for the CSV issue. I'll pick up testing again when I can tomorrow. I'm not making as much progress as I had hoped.

Hank

HankHerr-NOAA · 2025-01-23T19:23:35Z

My single valued example run used ABRN1 RFC forecasts from WRDS. I opened the evaluation.csv in Excel, counted the time to peak errors computed for each identified event, and compared that with what WRDS returns for time series when I constrain the request to be for the time period of the event. They are identical, meaning there is one time to peak error per time series overlapping the event. That is what I expected. Good.
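The count check described above amounts to counting the forecast time series whose valid-time span overlaps the event window. Here is a minimal sketch of that check, assuming closed intervals and a simple "shares at least one instant" definition of overlap; the forecast spans are hypothetical and the exact overlap rule used by the WRES is not confirmed here.

```python
from datetime import datetime


def overlaps(event, series_span):
    """True when two closed (start, end) intervals share at least one instant."""
    return series_span[0] <= event[1] and event[0] <= series_span[1]


# An event window and some hypothetical forecast valid-time spans.
event = (datetime(2014, 6, 5), datetime(2014, 8, 5))
forecast_spans = [
    (datetime(2014, 6, 4, 12), datetime(2014, 6, 9, 12)),  # runs into the event
    (datetime(2014, 6, 4, 18), datetime(2014, 6, 9, 18)),  # runs into the event
    (datetime(2014, 5, 1, 12), datetime(2014, 5, 6, 12)),  # ends before the event
]

# One time-to-peak error is expected per overlapping forecast.
expected_errors = sum(overlaps(event, span) for span in forecast_spans)
print("expected time-to-peak errors for this event:", expected_errors)
```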

As an aside, ABRN1 appears to be one of those forecast points (presumably in ABRFC) where forecasts are only generated when needed. So, for example, WRES identified an event spanning Jun 5 - Aug 5, 2014. The forecasts for that point that overlap all have issued times of Jun 4, meaning that the RFC generated the forecast only when the event was on the horizon. As for the event being two months long, that was likely due to the parameter options as discussed before.

I believe there are summary statistic options for time series metrics. Let me see if I can find those and give it a shot.

Hank

HankHerr-NOAA · 2025-01-23T19:30:21Z

The first thing I found gave me one overall set of summary statistics for the time to peak error instead of one set per event. Let me revisit the event-based wiki to see how I'm supposed to do it.

Hank

HankHerr-NOAA · 2025-01-23T19:34:13Z

No, I did it right. That number is intended to report the average time to peak error across all events. In other words, it answers the question: when an event occurs, what is the average time to peak error I can expect for forecasts of those events? If I want statistics for a single event, then I guess I would modify the declaration to focus on the single event of interest and run it again.

James: If that sounds wrong, please let me know.

I'm checking the single valued forecast evaluation. It works in a simple evaluation, which is the point of the checkbox, 'Evaluating single value forecasts for events (including time series metrics)'. More complicated stuff comes later.

Hank

james-d-brown · 2025-01-23T19:39:00Z

Yeah, there is one "time to peak" for one "peak" aka one "event", so the "raw" numbers of the time-to-peak are the "per event" values and the summary statistics aggregate across all events.
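To make the "per event" versus "summary" distinction concrete, here is a minimal sketch using synthetic data. It assumes the timing error is the predicted peak time minus the observed peak time (the sign convention is an assumption) and that the summary statistic is a plain mean; the helper names and values are illustrative, not the WRES implementation.

```python
from datetime import datetime, timedelta
from statistics import mean


def peak_time(series):
    """Return the valid time of the largest value in a {time: value} series."""
    return max(series, key=series.get)


def time_to_peak_error(observed, predicted):
    """Predicted peak time minus observed peak time, in hours (assumed convention)."""
    return (peak_time(predicted) - peak_time(observed)) / timedelta(hours=1)


t = datetime(2014, 6, 1, 6)
day = timedelta(days=1)

# Two hypothetical events, each an (observed, predicted) pair of daily series.
events = [
    ({t: 1.0, t + day: 5.0, t + 2 * day: 2.0},
     {t: 1.0, t + day: 3.0, t + 2 * day: 4.0}),                    # peak one day late
    ({t + 10 * day: 2.0, t + 11 * day: 3.0, t + 12 * day: 9.0},
     {t + 10 * day: 8.0, t + 11 * day: 4.0, t + 12 * day: 1.0}),   # peak two days early
]

# The "raw" statistics: one timing error per event.
errors = [time_to_peak_error(obs, pred) for obs, pred in events]
print("per-event errors (hours):", errors)

# The summary statistic: an aggregate across all events.
print("mean across events (hours):", mean(errors))
```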

HankHerr-NOAA · 2025-01-23T19:41:04Z

Thanks, James!

The next checkbox is for a basic ensemble forecast test. So I guess I'll point the declaration to the HEFS data for ABRN1 and see what happens.

Hank

HankHerr-NOAA · 2025-01-23T19:44:08Z

Ensemble forecasts don't allow for time to peak error:

- The declared or inferred data 'type' for the 'predicted' dataset is ensemble forecasts, but the following metrics are not currently supported for this data 'type': [TIME TO PEAK ERROR]. Please remove these metrics or change the data 'type'.

Makes sense: there is a different peak for each member. Anyway, I'll use more traditional metrics, even if they aren't really as interesting in this case.

Hank

HankHerr-NOAA · 2025-02-06T19:52:25Z

I didn't even look for the pooling features wiki. Oops. With it referenced now, it should be easier to spot.

Thanks,

Hank

HankHerr-NOAA · 2025-02-06T20:45:51Z

This is going to be tracked with 6.30.

I moved the development ticket to 6.30 for tracking purposes, as well.

Hank

HankHerr-NOAA · 2025-02-07T20:00:20Z

Next week, I'll continue testing some combinations of features as time allows. When the tickets associated with unchecked checkboxes are addressed, I'll test them as well.

Thanks,

Hank

HankHerr-NOAA · 2025-02-11T16:19:41Z

I just ran a few of the evaluations I ran toward the end of last week to double check that they execute using a database correctly. In general, the standard scores look okay.

But, when I ran an evaluation of NWM retrospective simulations using the time-to-peak error, I noted that there were differences between in-memory and database for the graphic, ABRN1_19161749_19161749_RetroSim_CSVs_TIME_TO_PEAK_ERROR.png. In fact, I noticed that the output was not deterministic: that graphic would change with every run whether or not it used a database.

I'm going to post a ticket,

Hank

HankHerr-NOAA · 2025-02-11T16:26:08Z

I think the problem I'm spotting is already covered in #399. The declaration, shared below, uses sampling_uncertainty, and the diagram I'm looking at is that for time to peak error. I think, when combined with thresholds, the issue becomes more obvious. Fix #399 and the outputs using thresholds will likely also be fixed.

Nothing new here,

Hank

==========

label: Testing Event Based
observed:
  label: OBS Streamflow
  sources: /home/ISED/wres/wresTestData/issue92087/inputs/ABRN1_QME.xml
  variable: QME
  feature_authority: nws lid
  type: observations
  time_scale:
    function: mean
    period: 24
    unit: hours

predicted:
  label: "19161749 RetroSim CSVs"
  sources: 
  - /home/ISED/wres/nwm_3_0_retro_simulations/wfo/OAX/19161749_nwm_3_0_retro_wres.csv.gz
  variable: streamflow
  feature_authority: nwm feature id
  type: simulations

features:
  - {observed: ABRN1, predicted: '19161749'}

# Daily (24-hour) mean time scale
time_scale:
  function: mean
  period: 24
  unit: hours

event_detection: observed

sampling_uncertainty:
  sample_size: 1000
  quantiles: [0.05,0.95]

metrics:
  - time to peak error
  - sample size
  - pearson correlation coefficient

threshold_sources:
- uri: https://WRDS/api/location/v3.0/nws_threshold
  operator: greater
  apply_to: observed
  type: value
  parameter: flow
  provider: NWS-NRLDB
  rating_provider: NRLDB
  feature_name_from: observed

output_formats:
  - csv2
  - png
  - pairs

HankHerr-NOAA · 2025-02-11T16:30:55Z

Other than the issue with time to peak error, #399, in-memory and database runs appear identical.
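A comparison like this can be scripted rather than done by eye. The following is a minimal sketch that compares two CSV outputs as unordered collections of rows, so that differences in row order between runs are ignored; the column names and values are hypothetical and do not reflect the actual evaluation.csv layout.

```python
import csv
import io
from collections import Counter


def row_counts(csv_text):
    """Count each distinct row so that row order does not affect the comparison."""
    return Counter(tuple(row) for row in csv.reader(io.StringIO(csv_text)))


def differing_rows(first, second):
    """Return the rows that appear in one output but not in the other."""
    a, b = row_counts(first), row_counts(second)
    return sorted((a - b) + (b - a))


# Hypothetical column names and values, for illustration only.
in_memory = "metric,event,value\nsample size,1,12\npearson,1,0.91\n"
database = "metric,event,value\npearson,1,0.91\nsample size,1,12\n"
changed = "metric,event,value\npearson,1,0.87\nsample size,1,12\n"

print(differing_rows(in_memory, database))  # same rows in a different order
print(differing_rows(in_memory, changed))   # one value differs between runs
```

Running the same comparison on two in-memory runs would also flag non-deterministic output such as the time to peak error issue in #399.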

Hank

james-d-brown · 2025-02-11T17:41:29Z

Please see my recent comments in #397.

HankHerr-NOAA · 2025-02-12T16:17:45Z

Noting here that I plan to continue testing during office hours (unless we have a visitor) and until my workday ends this afternoon. There are some tickets ready for UAT, including #397, and I plan to exercise those both in memory and using a database to check the behavior. James noted in #397 that, "There is, however, a bigger issue, which means this can only be tested effectively using in-memory mode," which I'll take into consideration as I test it.

More this afternoon,

Hank

james-d-brown · 2025-02-12T16:23:13Z

Just pushed a fix for #414, which will remove that constraint and also achieve the requested results for #408, not merely consistency (which is nonetheless the proper criterion for that ticket, hence closed). Will merge in #414 shortly, at which point you can test without any constraints.

HankHerr-NOAA · 2025-02-12T16:27:01Z

Sweet! Thanks,

Hank

HankHerr-NOAA · 2025-02-12T20:33:33Z

I tested #397 and am seeing one oddity in the intersection maximum results. Since tomorrow is filled with 3 meetings, my plan is to continue testing on Friday and look at other tickets related to event detection that James has resolved and put into a UAT state.

Thanks,

Hank

HankHerr-NOAA · 2025-02-14T13:34:55Z

I just ran through the various tickets in UAT for 6.30, and they test out well.

One thing I still should test is the use of USGS NWIS as part of event detection and the resulting evaluation. Up to this point, I've only used HEFS observations and NWM retrospective simulations to identify events. HEFS observations are often times drawn from USGS NWIS, but are daily means, not at the raw USGS NWIS event times. I also want to do more exercising of multiple feature runs, perhaps using more WRDS in the evaluations. I'll try to do this today or early next week, so that (hopefully) we can wrap this up and have it ready to deploy by the end of next week.

I'll come back to this later this morning,

Hank

HankHerr-NOAA · 2025-02-14T14:18:28Z

I love it when I schedule a meeting for git administrators and no git administrators show up. Sigh. Anyway, that just gives me more time to look at this.

I'm going to design an event-based evaluation of NWM retrospective simulations against USGS NWIS observations based on the evaluations we performed on behalf of ISED a few months ago. I'll start with FROV2 as the feature of interest, expand that to a few more points, and then ramp it up to an entire WFO, which I'll let run while I'm at lunch.

Hank

HankHerr-NOAA · 2025-02-14T15:01:36Z

James,

My starting point declaration is below. It's the exact evaluation we ran for FROV2 months ago but with fewer thresholds.

I talked to Jason to ask about his event-based evaluation. He said that he used simulations (probably obvious to you) and added the following comments:

I tend to prefer simpler bias metrics like peak bias, volume bias, or flashiness bias.

Alex, Nels, and Fred often liked NNSE conditioned on events

I like peak, volume, and flashiness because they directly relate to conservation of mass and momentum. So, you can use them as diagnostic tools.

You can also use events as a dichotomous transformer. Then you can assess categorical metrics. For example, the frequency bias of the number of events.

So, obviously the last one above is not possible, except via post-processing, I think. I'll skip that, because I'm not trying to give myself that much work to do.

Looking at the others...

I can compute the NNSE for each event, if I'm understanding what "conditioned on events" means. That is, if I include the NNSE metric, it should give me one number per pool (i.e., per event). But is that what he wants? Any summarizing across events would be post-processing which I'm not going to do.
The biases would be single values computed across each event which would then be summarized (averaged) across all events. Again, if I'm understanding correctly. I don't think I can compute the single values across events unless I specifically design one evaluation per event where the time scale is equal to the time period of the events. That is, I would run the WRES to identify the events, and then need to run it again once per event manually declared to compute the biases. The second part of that is not a test of event detection and is something we already know works (unless there is a bug somewhere not yet identified).

So, taking into consideration what Jason said, most of what he recommends requires post-processing and is beyond the scope of event detection and this test ticket. Perhaps those post-processing steps can be included in a future release of the WRES (would need to be discussed), but I'm not going to do that now since my focus is testing event detection and the metrics the WRES can produce out of the box in such an evaluation.

I'll make sure to add NNSE as a metric in the evaluation, if it isn't already, but I'm otherwise not expecting that the metrics output are going to be useful in a "real" scenario.

Thoughts?

Hank

==========

label: NWM Retro Sim Reproducing Jason Results
observed:
  label: USGS NWIS Streamflow Observations
  sources:
  - interface: usgs nwis
    uri: https://nwis.waterservices.usgs.gov/nwis/iv
  variable:
    name: '00060'
  type: observations
predicted:
  label: MARFC RetroSim CSVs
  sources:
  - uri: /home/ISED/wres/nwm_3_0_retro_simulations/rfc/MARFC/5907079_nwm_3_0_retro_wres.csv.gz
  variable:
    name: streamflow
  feature_authority: nwm feature id
  type: simulations
baseline:
  label: Persistence 1 Step USGS Streamflow
  sources:
  - interface: usgs nwis
    uri: https://nwis.waterservices.usgs.gov/nwis/iv
  variable:
    name: '00060'
  type: observations
  method:
    name: persistence
unit: ft3/s
valid_dates:
  minimum: '1980-01-01T00:00:00Z'
  maximum: '2021-01-01T00:00:00Z'
time_scale:
  function: maximum
  period: 24
  unit: hours
pair_frequency:
  period: 24
  unit: hours
rescale_lenience: none
feature_service:
  uri: https://WRDS/api/location/v3.0/metadata
#  groups:
#  - group: wfo
#    value: LWX
#    pool: false
features:
  - {predicted: '5907079'}
probability_thresholds:
- values:
  - 0.5
  - 0.55
  - 0.6
  - 0.65
  - 0.7
  - 0.75
  - 0.8
  - 0.85
  - 0.9
  - 0.95
  operator: greater
  apply_to: predicted
metrics:
- name: volumetric efficiency
- name: sample size
- name: mean absolute error skill score
- name: probability of detection
- name: false alarm ratio
- name: bias fraction
duration_format: hours
output_formats:
- format: png
- format: pairs
- format: csv
- format: csv

james-d-brown · 2025-02-14T15:30:27Z

I could really only speculate on what they would want, so it may be better to unpack that in detail. I would however note that this is a first implementation of the concept and there are many opportunities for enhancement if our users actually express an interest. Regarding summaries across events, that is the purpose of summary statistics, in general, although I will note that there is a difference between:

Filtering pairs using events as filters and then computing statistics from a single pool of event-conditional pairs; and
Summarizing event-specific statistics with summary statistics, like an average NNSE across events or a standard deviation etc.
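The two formulations generally give different numbers, because pooling weights every pair equally while summarizing weights every event equally. Here is a minimal sketch with made-up pairs and a simple mean error as the statistic; the event names, values and helper are illustrative only.

```python
from statistics import mean


def mean_error(pairs):
    """Mean of (predicted - observed) over a list of (observed, predicted) pairs."""
    return mean(pred - obs for obs, pred in pairs)


# Hypothetical (observed, predicted) pairs grouped by detected event.
event_pairs = {
    "event 1": [(10.0, 12.0), (20.0, 22.0)],                      # short event
    "event 2": [(5.0, 4.0), (6.0, 5.0), (7.0, 6.0), (8.0, 7.0)],  # longer event
}

# Formulation 1: use events as filters, pool all pairs, compute one statistic.
pooled = [pair for pairs in event_pairs.values() for pair in pairs]
pooled_stat = mean_error(pooled)

# Formulation 2: compute one statistic per event, then summarize across events.
per_event = {name: mean_error(pairs) for name, pairs in event_pairs.items()}
summary_stat = mean(per_event.values())

print("pooled, event-conditional pairs:", pooled_stat)
print("per-event statistics:", per_event)
print("average across events:", summary_stat)
```

In the pooled formulation the longer event dominates because it contributes more pairs, whereas the per-event summary treats each event as one sample.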

Currently, our summary statistics (other than the ones for timing errors, which are declared inband to the metrics) are limited to the following dimensions:

      - features
      - feature groups

Naturally, it would be possible to allow for pools and hence "events" to be another summary dimension and I believe this was mentioned in #130, although I don't recall creating a ticket for that. Using events as filters for the evaluation of a single pool that contains event-specific pairs is a different formulation altogether, but also possible. In general, though, I wonder whether this style of evaluation really makes much sense. I think an evaluation conditional on threshold (not event) is probably more appropriate in that case. Conversely, I think event-specific measures are more appropriate for event-shaped pools, like timing errors. I absolutely think there is value in adding more, similar, event-scale measures, like volume error.

In summary, I would UAT what exists and I would create tickets for what doesn't exist that could make sense, but there may need to be some unpacking first. On the whole, I would value things like adding event-scale measures moreso than supporting the use of events as filters to compute event-conditional (traditional) measures, largely because I think thresholds achieve something similar in a more relevant/directed way.

HankHerr-NOAA · 2025-02-14T15:35:48Z

Thanks, James, for the response. I'll read it after I post this comment.

Without event detection, the evaluation gives us one set of numbers for the entire valid_dates period, with one value per threshold.

When I add event detection, I get one set of numbers per event detected. Since not every threshold has results for every event, it looks like a spaghetti plot. For example, looking at the MSE-SS:

That is expected. Here are the same results, but using the NWM v3.0 retrospective simulations to detect events:

There were 54 events detected based on the NWM data and 85 events detected based on USGS NWIS data. Here is the sample size plot for the USGS NWIS based events, just as an example:

I'm going to prepare the full WFO evaluation so that I can start it before I leave for lunch. My lunch is early today.

Thanks,

Hank

HankHerr-NOAA · 2025-02-14T15:42:52Z

I think the gist of your comment is this:

We can add tickets for summarizing results across events and event scale metrics, but we may need to "unpack" it a bit further. (I agree that lumping pairs across events into one pool doesn't make much sense.)
Let's see if users actually want this stuff, first.

Should we post the tickets now or wait for user feedback? I'm willing to create some tickets if it helps or I can ask Jason since he may be better able to describe what is wanted.

Thanks,

Hank

HankHerr-NOAA · 2025-02-14T15:44:14Z

The LWX WFO evaluation is on-going. The first event detection being performed is relative to NWM retrospective simulations, because I forgot to change it back to observed. But I was planning to run both anyway.

More after lunch (I'll be back around Noon),

Hank

james-d-brown · 2025-02-14T16:17:34Z

Right. I think you could add a ticket to summarize statistics across pools, as that may be more generally useful.

HankHerr-NOAA · 2025-02-14T16:57:05Z

Created #422,

Hank

HankHerr-NOAA · 2025-02-14T17:13:25Z

The WFO LWX evaluation resulted in 19,854 pools, which are about halfway through processing.

Events were detected for each feature independently, as expected, and here is a sampling of the log messages:

2025-02-14T16:10:46.615+0000 700747 [main] INFO EventsGenerator - Performing event detection for feature group 01495000-4763630-01495000...
2025-02-14T16:10:46.622+0000 700747 [main] INFO EventsGenerator - Getting time-series data to perform event detection for the following features: [Feature[name=4763630,description=,srid=0,wkt=]]
2025-02-14T16:10:49.277+0000 700747 [main] INFO EventsGenerator - Detected 57 events in the PREDICTED dataset for feature group 01495000-4763630-01495000.
2025-02-14T16:10:49.279+0000 700747 [main] INFO EventsGenerator - Detected 57 events across all datasets for feature group 01495000-4763630-01495000 when forming the UNION.
2025-02-14T16:10:49.281+0000 700747 [main] INFO EventsGenerator - Performing event detection for feature group 01578310-4726595-01578310...
2025-02-14T16:10:49.282+0000 700747 [main] INFO EventsGenerator - Getting time-series data to perform event detection for the following features: [Feature[name=4726595,description=,srid=0,wkt=]]
2025-02-14T16:10:51.562+0000 700747 [main] INFO EventsGenerator - Detected 60 events in the PREDICTED dataset for feature group 01578310-4726595-01578310.
2025-02-14T16:10:51.564+0000 700747 [main] INFO EventsGenerator - Detected 60 events across all datasets for feature group 01578310-4726595-01578310 when forming the UNION.
2025-02-14T16:10:51.566+0000 700747 [main] INFO EventsGenerator - Performing event detection for feature group 01578475-4726273-01578475...
2025-02-14T16:10:51.567+0000 700747 [main] INFO EventsGenerator - Getting time-series data to perform event detection for the following features: [Feature[name=4726273,description=,srid=0,wkt=]]
2025-02-14T16:10:53.624+0000 700747 [main] INFO EventsGenerator - Detected 42 events in the PREDICTED dataset for feature group 01578475-4726273-01578475.
2025-02-14T16:10:53.626+0000 700747 [main] INFO EventsGenerator - Detected 42 events across all datasets for feature group 01578475-4726273-01578475 when forming the UNION.
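
For a large-in-space run like this one, the per-feature event counts can be tallied straight from the log rather than read by eye. The following is a minimal sketch, assuming the message format matches the sample above; the function name `count_events` is hypothetical and not part of WRES:

```python
import re

# Matches the per-dataset detection message seen in the log sample.
# The "across all datasets" UNION message is intentionally not matched.
DETECTED = re.compile(
    r"Detected (?P<count>\d+) events in the (?P<dataset>\w+) dataset "
    r"for feature group (?P<group>[\w-]+)\."
)


def count_events(log_lines):
    """Return {feature_group: {dataset: event_count}} for matching log lines."""
    counts = {}
    for line in log_lines:
        match = DETECTED.search(line)
        if match:
            by_dataset = counts.setdefault(match["group"], {})
            by_dataset[match["dataset"]] = int(match["count"])
    return counts


# Three lines copied from the log sample above.
sample = [
    "2025-02-14T16:10:49.277+0000 700747 [main] INFO EventsGenerator - Detected 57 events in the PREDICTED dataset for feature group 01495000-4763630-01495000.",
    "2025-02-14T16:10:49.279+0000 700747 [main] INFO EventsGenerator - Detected 57 events across all datasets for feature group 01495000-4763630-01495000 when forming the UNION.",
    "2025-02-14T16:10:51.562+0000 700747 [main] INFO EventsGenerator - Detected 60 events in the PREDICTED dataset for feature group 01578310-4726595-01578310.",
]
totals = count_events(sample)
```

Summing the counts across feature groups gives the number of events detected, which can then be compared against the pool count reported for the evaluation.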

I'm scanning a subset of the outputs generated so far,

Hank

HankHerr-NOAA · 2025-02-14T17:18:12Z

Some example images for different features and metrics:

The false alarm ratio is pretty noisy, but the others aren't.

The FROV2 run is not done yet, so I can't confirm that results appear identical. I'll report on that later,

Hank

james-d-brown · 2025-02-14T17:23:05Z

Going to demote `Getting time-series data to perform event detection` to DEBUG as we don't really need that. May need to further curtail logging for large-in-space evaluations.

james-d-brown · 2025-02-14T17:23:56Z

Should probably also demote the aggregation message unless it was actually performed (i.e., several datasets).
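
The rule described here (report the aggregation at INFO only when several datasets were actually combined, otherwise at DEBUG) can be sketched as follows. This is an illustration of the logic only, written in Python with the standard `logging` module; it is not the WRES implementation, and the function names and logger name are hypothetical:

```python
import logging

# Hypothetical logger name, chosen to mirror the log sample in this ticket.
logger = logging.getLogger("EventsGenerator")


def aggregation_log_level(dataset_count):
    """INFO only when more than one dataset contributed events; else DEBUG."""
    return logging.INFO if dataset_count > 1 else logging.DEBUG


def report_aggregation(event_count, dataset_count, feature_group):
    """Log the aggregation message at the chosen level and return that level."""
    level = aggregation_log_level(dataset_count)
    logger.log(
        level,
        "Detected %d events across all datasets for feature group %s.",
        event_count,
        feature_group,
    )
    return level
```

With a single dataset, as in the LWX run above, the message would then only appear when DEBUG logging is enabled.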

HankHerr-NOAA · 2025-02-14T18:08:09Z

The FROV2 output looks reasonable.

I'll wait for the WFO evaluation to complete. Once it does, that will be the end of my testing today. On Tuesday, I'll test out some of the event_detection parameters to judge the effect. I only did very basic tests previously. Unless I find a bug related to the parameters, I anticipate wrapping up this testing on Tuesday.

Note that all of the evaluations I ran today are in `.../ISED/wres/wresTestData/github382/nwm_retrosim_vs_usgs_obs`. The declaration `5907079.marfc_one_feature_template.yml` was used for FROV2 testing and `lwx_run.yml` was for the WFO test. When the WFO test completes, I'll copy the results into that directory. The FROV2-only results already have example outputs copied to subdirectories of that directory. Future UAT could include comparing results through staging COWRES with what I have in those directories.
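
One way such a comparison could be done is a file-by-file check of a baseline results directory against a new one. The sketch below is an assumption about how that might look, not an existing WRES or COWRES tool; the function name `differing_outputs` and the file names in the demo are hypothetical. It compares bytes exactly, which may be too strict if outputs embed timestamps or differ in floating-point formatting:

```python
import filecmp
import tempfile
from pathlib import Path


def differing_outputs(baseline_dir, candidate_dir):
    """Return sorted file names that differ in content or exist in only one directory."""
    names = sorted(
        {p.name for p in Path(baseline_dir).iterdir() if p.is_file()}
        | {p.name for p in Path(candidate_dir).iterdir() if p.is_file()}
    )
    # shallow=False forces a content comparison rather than a stat comparison.
    # Files missing from either side are reported in the "errors" list.
    _, mismatch, errors = filecmp.cmpfiles(
        baseline_dir, candidate_dir, names, shallow=False
    )
    return sorted(mismatch + errors)


# Small demonstration with throwaway directories and made-up file names.
with tempfile.TemporaryDirectory() as base, tempfile.TemporaryDirectory() as cand:
    Path(base, "pools.csv").write_text("a,b\n1,2\n")
    Path(cand, "pools.csv").write_text("a,b\n1,3\n")
    Path(base, "same.csv").write_text("x\n")
    Path(cand, "same.csv").write_text("x\n")
    Path(base, "only_baseline.csv").write_text("y\n")
    demo_result = differing_outputs(base, cand)
```

An empty list would indicate that the two directories hold the same files with identical contents.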

Thanks,

Hank

HankHerr-NOAA · 2025-02-14T18:09:37Z

LWX run completed without errors. Results have been copied to lwx_run_results. Done with this for today,

Hank

HankHerr-NOAA added this to the v6.29 milestone Jan 21, 2025

HankHerr-NOAA self-assigned this Jan 21, 2025

HankHerr-NOAA added the testing Testing of new capabilities label Jan 21, 2025

HankHerr-NOAA mentioned this issue Jan 22, 2025

As a user, I expect the declared reference_dates to be reflected in the evaluation.csv output #385

Closed

HankHerr-NOAA modified the milestones: v6.29, v6.30 Feb 6, 2025

HankHerr-NOAA mentioned this issue Feb 13, 2025

As a developer, I don't expect an evaluation with one source for event detection and combination declared to combine all events into a single aggregated event #420

Closed

HankHerr-NOAA mentioned this issue Feb 14, 2025

As a user, I would like to summarize statistics across pools #422

Open
