Scoring
This page documents the evaluation method used to generate the scores shown in the visualizer. First we describe the targets and what an error means for each of them. Then we describe how the true values are collected and used to score the models.
Before going any further, here are definitions for a few terms we will encounter repeatedly:
- An epiweek is the fundamental unit of time. It is based on MMWR weeks and is uniquely identified by the combination of a year and an MMWR week. For example, 201348 is the epiweek representing MMWR week 48 of year 2013.
- An epidemic season is an ordered set of epiweeks starting at 20xx30 and ending at 20yy29, where 20yy is 20xx + 1. A season is usually represented using both consecutive years, e.g. 2013-2014, or using just the first year, e.g. 2013. Because a year can have either 52 or 53 MMWR weeks, a season can also have either 52 or 53 epiweeks (e.g. season 2014-2015 has 53 weeks since year 2014 has 53 MMWR weeks).
- A target is something that models try to predict at each time step. Targets that specify properties of a season, like the peak week, are seasonal targets. Targets like n weeks ahead, on the other hand, are weekly targets.
- There are 11 geographical regions: the 10 HHS regions and the nation as a whole.
- Weighted influenza-like illness index, wili%, is the metric used in the time series. It is defined as the percentage of outpatient doctor visits for influenza-like illness, weighted by state population. This page on CDC.gov describes it in more detail.
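The epiweek and season definitions above can be sketched in a few lines. This is a minimal illustration, not code from the visualizer; the helper names `seasonOfEpiweek` and `seasonLabel` are hypothetical, and the sketch assumes epiweeks are stored as integers of the form YYYYWW.

```javascript
// Hypothetical helper: return the first year of the season an epiweek belongs to.
// Assumes the epiweek is an integer like 201348 (year 2013, MMWR week 48).
function seasonOfEpiweek(epiweek) {
  const year = Math.floor(epiweek / 100);
  const week = epiweek % 100;
  // Weeks 30 and later start a new season; weeks 1 to 29 finish the previous one.
  return week >= 30 ? year : year - 1;
}

// Hypothetical helper: format the season using both consecutive years.
function seasonLabel(epiweek) {
  const firstYear = seasonOfEpiweek(epiweek);
  return `${firstYear}-${firstYear + 1}`;
}

console.log(seasonLabel(201348)); // "2013-2014"
console.log(seasonLabel(201802)); // "2017-2018"
```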
At each time point, each model provides predictions for the following 7 targets (for each of the 11 regions):
- 1 week ahead wili% value.
- 2 week ahead wili% value.
- 3 week ahead wili% value.
- 4 week ahead wili% value.
- Peak week. The epiweek with the maximum wili% in the season.
- Peak wili%. The wili% value at the peak week.
- Onset week. An onset week for a given season is derived using a baseline wili% value set by the CDC for that season and region. It is defined as the first of the first 3 consecutive weeks with wili% equaling or exceeding the baseline.
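The onset week rule can be sketched as a scan for the first run of 3 consecutive weeks at or above the baseline. This is only an illustration under stated assumptions: `onsetWeek` is a hypothetical helper (not the flusight-csv-tools API), the input is assumed to be an ordered array of `{ epiweek, wili }` objects for one season and region, and the wili% values and baseline in the usage example are made up.

```javascript
// Hypothetical helper: find the onset week for one season and region.
// `weeks` is assumed to be ordered by epiweek; `baseline` is the CDC baseline wili%.
function onsetWeek(weeks, baseline) {
  let run = 0; // length of the current streak of weeks at or above the baseline
  for (let i = 0; i < weeks.length; i++) {
    run = weeks[i].wili >= baseline ? run + 1 : 0;
    if (run === 3) {
      // The onset is the first week of the first qualifying streak.
      return weeks[i - 2].epiweek;
    }
  }
  return null; // no streak of 3 weeks found in the data seen so far
}

// Made-up values for illustration only.
const exampleWeeks = [
  { epiweek: 201347, wili: 1.9 },
  { epiweek: 201348, wili: 2.3 },
  { epiweek: 201349, wili: 2.1 },
  { epiweek: 201350, wili: 2.4 },
  { epiweek: 201351, wili: 2.6 },
  { epiweek: 201352, wili: 3.0 },
];
console.log(onsetWeek(exampleWeeks, 2.2)); // 201350
```

Week 201348 is above the baseline but is not followed by two more such weeks, so it does not count as the onset.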
True values for all the targets and regions are derived from the weekly wili% values for that season. Because wili% values are revised as a season progresses, the final wili% at epiweek 201802 (say) might not be equal to the wili% value for the same epiweek when queried at a different time. Both the seasonal and the weekly targets may therefore vary during a live season (one whose wili% values are not settled).
The Delphi Epidata API provides ways to collect both the final wili% data and the (unsettled) data as observed at a certain time in the season.
Some other subtleties related to seasonal targets follow:
- Since a peak week (and the value) can only be found when we get all the data of a season, it will not be available for a live season.
- A season is allowed to have multiple peak weeks if the corresponding values are close enough.
- Due to the way it is defined, the onset week for a season might be unavailable for a few weeks and then be available for the rest of the season without changing.
- Onset week can also be null for the whole season.
Due to the variety in targets and the values they allow, there are multiple ways to score a target prediction. Scoring used in the visualizer uses the package flusight-csv-tools for collecting truth and calculating scores.
A few points to note:
- The truth is based on the latest revision of data (not the data observed at the time of prediction).
- Scores are calculated using a single true value.
For each target, a model provides both a probability distribution and a point value corresponding to the bin with maximum probability. Therefore, we calculate the following metrics, one based on the point value and the others on the distribution:
The absolute error is the absolute difference between the model's point estimate and the true value. Lower is better.
The single bin log score is the natural log of the probability assigned to the true bin. Higher is better.
As an example, if the bins (for a wili% based target) were the following:
...
[2.0, 2.1) 0.00
[2.1, 2.2) 0.00
[2.2, 2.3) 0.02
[2.3, 2.4) 0.10 // The true bin
[2.4, 2.5) 0.20
[2.5, 2.6) 0.08
[2.6, 2.7) 0.01
...
The single bin log score will be Math.log(0.10) = -2.3025850929940455. Since we take the mean of log scores over many weeks to evaluate a model, we clip Math.log(0) to -10 instead of its real value (-Infinity) so that a model is not penalized too heavily. This also helps in model comparison for cases where the difference between two models is only in the number of infinities they produce.
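The single bin rule and its clipping can be sketched as follows. This is an illustration, not the flusight-csv-tools implementation: the names `singleBinLogScore`, `meanLogScore` and `LOG_SCORE_FLOOR` are hypothetical, bins are assumed to be `{ start, end, prob }` objects covering half-open intervals, and the sketch replaces only a non-finite score with -10, as described above.

```javascript
const LOG_SCORE_FLOOR = -10; // value used in place of Math.log(0)

// Hypothetical helper: log of the probability assigned to the bin containing the truth.
// Each bin covers the half-open interval [start, end).
function singleBinLogScore(bins, truth) {
  const trueBin = bins.find((b) => truth >= b.start && truth < b.end);
  const prob = trueBin ? trueBin.prob : 0;
  const score = Math.log(prob);
  // Math.log(0) is -Infinity; replace it so a mean over many weeks stays finite.
  return Number.isFinite(score) ? score : LOG_SCORE_FLOOR;
}

// Hypothetical helper: mean of log scores over several weeks.
function meanLogScore(scores) {
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

const exampleBins = [
  { start: 2.0, end: 2.1, prob: 0.0 },
  { start: 2.1, end: 2.2, prob: 0.0 },
  { start: 2.2, end: 2.3, prob: 0.02 },
  { start: 2.3, end: 2.4, prob: 0.1 },
  { start: 2.4, end: 2.5, prob: 0.2 },
  { start: 2.5, end: 2.6, prob: 0.08 },
  { start: 2.6, end: 2.7, prob: 0.01 },
];

const hit = singleBinLogScore(exampleBins, 2.35); // Math.log(0.10), about -2.30
const miss = singleBinLogScore(exampleBins, 2.15); // probability 0, clipped to -10
console.log(hit, miss, meanLogScore([hit, miss]));
```

Without the clipping, the single miss would make the mean -Infinity regardless of how well the model did in other weeks.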
The multi bin log score is the natural log of the total probability assigned to the true bin and a few neighbouring bins. Higher is better.
Multi bin means that multiple bins around the truth are considered for scoring instead of just one bin. As an example, suppose the truth is 2.3 and the bins (around the truth) with probabilities are:
...
[2.0, 2.1) 0.00
[2.1, 2.2) 0.00
[2.2, 2.3) 0.02
[2.3, 2.4) 0.10 // The true bin
[2.4, 2.5) 0.20
[2.5, 2.6) 0.08
[2.6, 2.7) 0.01
...
A single bin scoring rule will return a log score of Math.log(0.10) = -2.3025850929940455. If we instead use multi bin scoring with a window of 2 bins on either side of the true bin (effectively 5 true bins), the score is:
Math.log(0.00 + 0.02 + 0.10 + 0.20 + 0.08) = -0.916290731874155
We use a window of 5 bins on either side (11 bins in total) for wili% targets (peak wili% and the week ahead targets) and a window of 1 bin on either side (3 bins in total) for week targets (onset week and peak week).
As in the case of the single bin log score, here also we clip the value of Math.log(0) to -10 (instead of going all the way down to -Infinity).
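The multi bin rule can be sketched as a sum over a window of bins around the true bin. This is a minimal illustration rather than the flusight-csv-tools implementation: `multiBinLogScore` is a hypothetical name, the bins are assumed to be given as an array of probabilities with a known index for the true bin, and truncating the window at the ends of the array is an assumption made here, not something the text above specifies.

```javascript
// Hypothetical helper: log of the total probability in a window around the true bin.
// `window` is the number of bins taken on each side of the true bin.
function multiBinLogScore(probs, trueIndex, window) {
  // Assumption: the window is cut off at the array ends rather than wrapping around.
  const lo = Math.max(0, trueIndex - window);
  const hi = Math.min(probs.length - 1, trueIndex + window);
  let total = 0;
  for (let i = lo; i <= hi; i++) {
    total += probs[i];
  }
  const score = Math.log(total);
  // As with the single bin score, replace -Infinity with -10.
  return Number.isFinite(score) ? score : -10;
}

// Probabilities for the bins [2.0, 2.1) through [2.6, 2.7); the true bin is at index 3.
const exampleProbs = [0.0, 0.0, 0.02, 0.1, 0.2, 0.08, 0.01];

console.log(multiBinLogScore(exampleProbs, 3, 0)); // single bin case, about -2.30
console.log(multiBinLogScore(exampleProbs, 3, 2)); // 5 bins, about -0.92
```

With a window of 0 the function reduces to the single bin score, which makes it easy to compare the two rules on the same forecast.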