Scoring

This page documents the evaluation method used to generate the scores shown in the visualizer. First we describe the targets and what an error means for each of them. Then we describe how the true values are collected and used to score the models.

Before going any further, here are definitions for a few terms we will encounter repeatedly:

An epiweek is the fundamental unit of time. It is based on MMWR weeks and is uniquely identified by the combination of a year and an MMWR week. For example, 201348 is the epiweek representing MMWR week 48 of year 2013.
An epidemic season is an ordered set of epiweeks starting at 20xx30 and ending at 20yy29, where 20yy is 20xx + 1. A season is usually represented using both consecutive years, e.g. 2013-2014, or using just the first year, e.g. 2013. Because a year can have either 52 or 53 MMWR weeks, a season can also have either 52 or 53 epiweeks (e.g. season 2014-2015 has 53 weeks since year 2014 has 53 MMWR weeks).
A target is something that models try to predict at each time step. Targets that specify properties of a season, like the peak week, are seasonal targets. Targets like n weeks ahead, on the other hand, are weekly targets.
There are 11 geographical regions: the 10 HHS regions and the nation as a whole.
Weighted influenza-like illness index, wili%, is the metric used in the time series. It is defined as the percentage of outpatient doctor visits for influenza-like illness, weighted by state population. This page on CDC.gov describes it in more detail.
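
The epiweek and season definitions above can be sketched in a few lines. This is only an illustration, assuming epiweeks are stored as six-digit integers such as 201348. The function names parseEpiweek, seasonOf and seasonLabel are hypothetical and do not come from any package mentioned on this page.

```javascript
// Split an epiweek such as 201348 into its year and MMWR week.
// Assumes the epiweek is a six-digit integer (YYYYWW).
function parseEpiweek(epiweek) {
  return { year: Math.floor(epiweek / 100), week: epiweek % 100 };
}

// Return the first year of the season that contains the epiweek.
// A season runs from week 30 of one year to week 29 of the next.
function seasonOf(epiweek) {
  const { year, week } = parseEpiweek(epiweek);
  return week >= 30 ? year : year - 1;
}

// Format the season using both consecutive years, e.g. "2013-2014".
function seasonLabel(epiweek) {
  const first = seasonOf(epiweek);
  return `${first}-${first + 1}`;
}

console.log(seasonLabel(201348)); // 2013-2014
console.log(seasonLabel(201402)); // 2013-2014
console.log(seasonLabel(201430)); // 2014-2015
```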

Targets

At each time point, each model provides predictions for the following 7 targets (for each of the 11 regions):

1 week ahead wili% value.
2 week ahead wili% value.
3 week ahead wili% value.
4 week ahead wili% value.
Peak week. The epiweek with the maximum wili% in the season.
Peak wili%. The wili% value at the peak week.
Onset week. An onset week for a given season is derived using a baseline wili% value set by the CDC for that season and region. It is defined as the first of the first 3 consecutive weeks with wili% equaling or exceeding the baseline.
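
As a rough illustration of the onset week definition, the sketch below scans a season's weekly wili% values for the first run of 3 consecutive weeks at or above the baseline. The function name onsetWeek, the data layout, the sample values and the baseline of 2.0 are all made up for this example. They are not actual CDC values or part of any package's API.

```javascript
// weeks: array of { epiweek, wili }, ordered by epiweek within one season.
// baseline: the baseline wili% for that season and region.
// Returns the first epiweek of the first run of `runLength` consecutive
// weeks with wili >= baseline, or null if no such run exists.
function onsetWeek(weeks, baseline, runLength = 3) {
  let run = 0; // length of the current run of weeks at or above baseline
  for (let i = 0; i < weeks.length; i++) {
    run = weeks[i].wili >= baseline ? run + 1 : 0;
    if (run === runLength) {
      return weeks[i - runLength + 1].epiweek;
    }
  }
  return null;
}

// Illustrative values only.
const observed = [
  { epiweek: 201346, wili: 1.8 },
  { epiweek: 201347, wili: 2.1 },
  { epiweek: 201348, wili: 1.9 },
  { epiweek: 201349, wili: 2.0 },
  { epiweek: 201350, wili: 2.4 },
  { epiweek: 201351, wili: 3.1 },
];

console.log(onsetWeek(observed, 2.0)); // 201349
console.log(onsetWeek(observed.slice(0, 5), 2.0)); // null
```

The second call drops the last week, so only two consecutive weeks reach the baseline and no onset week can be derived yet.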

Truth

True values for all targets and regions are derived from the weekly wili% values for that season. Because the wili% values are revised as a season progresses, the final wili% at epiweek 201802 (say) might not be equal to the wili% value for the same epiweek when queried at a different time. Both the seasonal and the weekly targets may vary during a live season (one whose wili% values are not settled).

The Delphi Epidata API provides ways to collect both the final wili% data and the (unsettled) data as observed at a certain time in the season.

Some other subtleties related to seasonal targets follow:

Since a peak week (and its value) can only be found once all the data for a season is in, it will not be available for a live season.
A season is allowed to have multiple peak weeks if the corresponding values are close enough.
Due to the way it is defined, the onset week for a season might be unavailable for a few weeks and then be available for the rest of the season without changing.
Onset week can also be null for the whole season.

Scoring

Due to the variety in targets and the values they allow, there are multiple ways to score a target prediction. The visualizer uses the package flusight-csv-tools for collecting truth and calculating scores.

A few points to note:

The truth is based on the latest revision of data (not the data observed at the time of prediction).
Scores are calculated using a single true value.

Scoring metrics

For each target, a model provides both a probability distribution and a point value corresponding to the bin with maximum probability. Therefore, we calculate the following metrics, one based on the point value and two based on the distribution:

1. Absolute error

This is the absolute value of the error between the model's point estimate and the truth value. Lower is better.

2. Log score (single bin)

Natural log of probability assigned to the true bin. Higher is better.

As an example, if (for wili% based target) the bins were the following:

...
[2.0, 2.1) 0.00
[2.1, 2.2) 0.00
[2.2, 2.3) 0.02
[2.3, 2.4) 0.10 // The true bin
[2.4, 2.5) 0.20
[2.5, 2.6) 0.08
[2.6, 2.7) 0.01
...

The single bin log score will be Math.log(0.10) = -2.3025850929940455. Since we take a mean of log scores over many weeks for evaluating a model, we clip Math.log(0) to -10 instead of its real value (-Infinity) so that a model is not penalized heavily. This also helps in model comparison for cases where the difference between two models is only in the number of infinities they produce.
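
The single bin rule and its clipping can be sketched as follows. This is a minimal illustration, not the flusight-csv-tools implementation. The bin layout ({ start, end, prob } with half-open intervals) and the names singleBinLogScore and LOG_SCORE_FLOOR are assumptions made for the example.

```javascript
// Value used in place of Math.log(0), as described in the text.
const LOG_SCORE_FLOOR = -10;

// bins: array of { start, end, prob } covering half-open intervals [start, end).
// truth: the true value of the target.
function singleBinLogScore(bins, truth) {
  const trueBin = bins.find((b) => truth >= b.start && truth < b.end);
  const prob = trueBin ? trueBin.prob : 0;
  // Math.log(0) is -Infinity, so a zero probability is clipped to the floor.
  return prob > 0 ? Math.log(prob) : LOG_SCORE_FLOOR;
}

// The bins from the example above.
const exampleBins = [
  { start: 2.0, end: 2.1, prob: 0.00 },
  { start: 2.1, end: 2.2, prob: 0.00 },
  { start: 2.2, end: 2.3, prob: 0.02 },
  { start: 2.3, end: 2.4, prob: 0.10 },
  { start: 2.4, end: 2.5, prob: 0.20 },
  { start: 2.5, end: 2.6, prob: 0.08 },
  { start: 2.6, end: 2.7, prob: 0.01 },
];

console.log(singleBinLogScore(exampleBins, 2.35)); // Math.log(0.10)
console.log(singleBinLogScore(exampleBins, 2.05)); // -10 (clipped)
```

Because the clipped value is finite, a mean over many weeks stays finite even when a model assigns zero probability to the true bin in some of them.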

3. Log score (multi bin)

This is calculated by finding the natural log of probability assigned to the true and a few neighbouring bins. Higher is better.

Multi bin means that multiple bins around the truth are considered for scoring instead of just one bin. As an example, suppose the truth is 2.3 and the bins (around the truth) with probabilities are:

...
[2.0, 2.1) 0.00
[2.1, 2.2) 0.00
[2.2, 2.3) 0.02
[2.3, 2.4) 0.10 // The true bin
[2.4, 2.5) 0.20
[2.5, 2.6) 0.08
[2.6, 2.7) 0.01
...

A single bin scoring rule will return a log score of Math.log(0.10) = -2.3025850929940455. If we instead use multibin scoring with a window of 2 bins around the truth (effectively 5 true bins), we get the score as:

Math.log(0.00 + 0.02 + 0.10 + 0.20 + 0.08) = -0.916290731874155

We use a window of 5 bins (total of 11 bins) for wili% targets (peak wili% and week ahead targets) and a window of 1 bin (total of 3 bins) for week targets (onset week and peak week).

As in the case of the single bin log score, here too we clip the value of Math.log(0) to -10 instead of using -Infinity.
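
The multi bin rule can be sketched in the same way. This is an illustration under stated assumptions, not the flusight-csv-tools implementation. The name multiBinLogScore is hypothetical, bins are represented only by their probabilities in order, and the window is simply truncated at the ends of the array, which may differ from how the real package treats bins near the edges.

```javascript
// Value used in place of Math.log(0), as described in the text.
const LOG_SCORE_FLOOR = -10;

// probs: probabilities of consecutive bins, in order.
// trueIndex: index of the bin containing the truth.
// window: number of neighbouring bins to include on each side.
function multiBinLogScore(probs, trueIndex, window) {
  // Assumption: the window is cut off at the first and last bin.
  const from = Math.max(0, trueIndex - window);
  const to = Math.min(probs.length - 1, trueIndex + window);
  let total = 0;
  for (let i = from; i <= to; i++) {
    total += probs[i];
  }
  return total > 0 ? Math.log(total) : LOG_SCORE_FLOOR;
}

// Probabilities of the bins from the example above; index 3 is the true bin.
const exampleProbs = [0.00, 0.00, 0.02, 0.10, 0.20, 0.08, 0.01];

console.log(multiBinLogScore(exampleProbs, 3, 2)); // approximately -0.9163
console.log(multiBinLogScore(exampleProbs, 3, 0)); // same as the single bin score
```

With a window of 0 the function reduces to the single bin rule, which makes it easy to compare the two scores on the same forecast.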
