Ground truth data for the COVID-19 Forecast Hub

The data-truth folder contains the "ground truth" data that forecasts are eventually compared to. The main files in this folder contain processed versions of JHU CSSE data, while subfolders contain data from other sources. As of February 20, 2023, we are no longer collecting or analyzing data on COVID-19 cases, and as of March 6, 2023, we are no longer collecting or analyzing data on COVID-19 deaths. As of March 10, 2023, Johns Hopkins University's (JHU) Center for Systems Science and Engineering (CSSE) no longer reports COVID-19 cases or deaths.

Data sources

The COVID-19 Forecast Hub collates daily deaths and confirmed cases from the COVID-19 GitHub repository of the Johns Hopkins University (JHU) Center for Systems Science and Engineering (CSSE) group as the gold standard reference data for deaths in the US.

We also collate case and death data from NYTimes and USAFacts for comparison to JHU.

Hospitalization data are taken from the [HealthData.gov COVID-19 Reported Patient Impact and Hospital Capacity by State Timeseries](https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/g62h-syeh). More details on how these data are used are available in the technical README.

Some of these data are also available programmatically through the EpiData API.

Case and death data

There are several different sources for death data. All forecasts will be compared to the daily reports containing death data from the JHU CSSE group as the gold standard reference data for deaths in the US. Note that there are significant differences (especially in daily incident death data) between the JHU data and another commonly used source, from the New York Times. The team at UTexas-Austin has tracked this issue on a separate GitHub repository.

Data from a variety of sources are available via the COVIDcast Epidata API.

Daily Truth Data

We aggregate and format both Cumulative Death and Incident Death truth data from the JHU CSSE group. Although these CSVs are not explicitly used in the visualization code, they match the "Actual" line in the visualization. This method in the covidHubUtils package creates these truth data CSVs.

There are also corresponding methods in covidHubUtils that download and aggregate truth data from NYTimes and USAFacts. These data are stored in data-truth/nytimes/ and data-truth/usafacts/.

Weekly Truth Data

Weekly cumulative counts are the reported values as of the Saturday of each week. For example, the weekly cumulative count for the week ending Saturday, August 1, 2020 is equal to the reported daily cumulative count for Saturday, August 1, 2020.

Weekly incident counts are calculated as the difference between consecutive weekly cumulative counts. For example, the weekly incident count for the week ending Saturday, August 1, 2020 is the difference between the weekly cumulative count for Saturday, August 1, 2020 and the weekly cumulative count for Saturday, July 25, 2020.
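
The weekly definitions above can be sketched in a few lines of Python. This is only an illustration, not the Hub's actual processing code: the function name `weekly_counts` and the daily values are made up for the example, and the input is assumed to be a mapping from calendar date to the reported daily cumulative count.

```python
from datetime import date, timedelta


def weekly_counts(daily_cumulative):
    """Map each Saturday to (weekly cumulative, weekly incident).

    daily_cumulative: dict of datetime.date -> reported cumulative count.
    The weekly incident count is None when the previous Saturday is missing.
    """
    result = {}
    for day in sorted(daily_cumulative):
        if day.weekday() != 5:  # Monday is 0, so Saturday is 5
            continue
        cumulative = daily_cumulative[day]
        previous = daily_cumulative.get(day - timedelta(days=7))
        incident = None if previous is None else cumulative - previous
        result[day] = (cumulative, incident)
    return result


# Made-up daily cumulative counts from Saturday 2020-07-25 to Saturday 2020-08-01.
start = date(2020, 7, 25)
daily = {start + timedelta(days=i): 100 + 10 * i for i in range(8)}

weekly = weekly_counts(daily)
print(weekly[date(2020, 8, 1)])  # (170, 70)
```

With these illustrative values, the weekly cumulative count for the week ending August 1, 2020 is that Saturday's daily cumulative count (170), and the weekly incident count is 170 minus the July 25 value of 100.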

Aggregation to State and National Level

The cumulative and incident counts at the state level are calculated by summing reported cumulative and incident counts in the JHU data file across all locations with the same value for the Province_State field. This includes some "county-level" records for which we do not request forecasts. These are records with a five-digit FIPS code beginning with 80 or 90, corresponding to "Out of State" or "Unassigned" locations. For this reason, the counts at the state level may in general be larger than the sum of the counts for the counties within a given state.

Special case: DC is recorded in the truth data with both county code 11001 and state code 11. We have made the decision to omit the county level data since it is duplicated by the state level data.

The counts at the national level are calculated as the sum of counts for all locations in the JHU data file. This includes counts for the Diamond Princess cruise ship, and so the counts for the state level again do not sum to the counts for the national level.
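
As a rough sketch of the aggregation rules described above, the following Python snippet sums made-up records by `Province_State` and over all locations. The record layout, the counts, and the helper name `is_requested_county` are assumptions for illustration; they do not reproduce the Hub's scripts or the exact layout of the JHU file.

```python
from collections import defaultdict

# Made-up records: (FIPS, Province_State, count).
records = [
    ("25001", "Massachusetts", 10),
    ("25003", "Massachusetts", 20),
    ("80025", "Massachusetts", 3),    # "Out of State"
    ("90025", "Massachusetts", 2),    # "Unassigned"
    ("36001", "New York", 40),
    ("88888", "Diamond Princess", 5),  # cruise ship, not a state
]

# Illustrative subset of states for this toy data set.
states = {"Massachusetts", "New York"}


def is_requested_county(fips):
    """True for five-digit FIPS codes that do not begin with 80 or 90."""
    return len(fips) == 5 and not fips.startswith(("80", "90"))


# State level: sum every record sharing the same Province_State value.
state_totals = defaultdict(int)
for fips, province_state, count in records:
    state_totals[province_state] += count

# National level: sum every location in the file.
national_total = sum(count for _, _, count in records)

ma_county_sum = sum(
    count
    for fips, province_state, count in records
    if province_state == "Massachusetts" and is_requested_county(fips)
)
sum_of_states = sum(state_totals[s] for s in states)

print(state_totals["Massachusetts"], ma_county_sum)  # 35 30
print(national_total, sum_of_states)                 # 80 75
```

In this toy data the Massachusetts state total (35) exceeds the sum over its requested counties (30) because of the 80 and 90 records, and the national total (80) exceeds the sum over states (75) because of the Diamond Princess record.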

Hospitalization data

In the week of 16 Nov 2020, a proposal was made to use HealthData.gov confirmed hospital admissions as the ground truth for hospitalizations. Prior to this week, no official source for hospitalization ground truth data had been identified. On 1 Dec 2020, a final determination was made to treat this source as official for the Hub, as detailed below.

HealthData.gov Hospitalization Timeseries

The truth data that hospitalization forecasts (inc hosp targets) will be evaluated against are the HealthData.gov COVID-19 Reported Patient Impact and Hospital Capacity by State Timeseries. These data are typically updated daily. An archive of updates is available on this page.

A supplemental data source that provides daily counts but does not include the full time series is HealthData.gov COVID-19 Reported Patient Impact and Hospital Capacity by State.

Resources for Accessing Hospitalization Data

  1. We are working with our collaborators at the Delphi Group at CMU to make these data available through their Delphi Epidata API. The current weekly timeseries of the hospitalization data as well as prior versions of the data are available as the covid_hosp endpoint of the API. This endpoint is also available through the COVIDcast Epidata API.

  2. The Forecast Hub has developed the covidData R package which facilitates downloading and storing HealthData.gov data on hospitalizations (as well as JHU data on cases and deaths). This package requires a bit of set-up with python and make but it does provide tools to access all ground truth data used by the Hub. A vignette showing some basic functionality for the package is available in Rmarkdown (click here to view the HTML vignette).

Data processing

The hospitalization truth data are computed as the sum of the columns previous_day_admission_adult_covid_confirmed and previous_day_admission_pediatric_covid_confirmed, which provide the new daily admissions for adults and children, respectively. (Other columns represent “suspected” COVID-19 hospitalizations; however, because definitions and implementations of suspected cases vary widely, our public health collaborators have recommended using the above columns only.)

Since these admission data are listed as “previous day” admissions in the raw data, the truth data shift values in the date column one day earlier so that inc hosp values align with the date the admissions occurred.

As an example, the following data from HealthData.gov

   date    | previous_day_admission_adult_covid_confirmed | previous_day_admission_pediatric_covid_confirmed
-----------|----------------------------------------------|-------------------------------------------------
2020-10-30 |                  5                           |                       12                        

would turn into the following observed data for incident hospitalizations

   date    | incident_hospitalizations
-----------|----------------------------
2020-10-29 |          17               
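
The same transformation can be written as a small Python function. This is a minimal sketch rather than the Hub's actual code; the function name `to_incident_hosp` is made up for the example, and it takes the report date and the two confirmed-admission columns as plain arguments.

```python
from datetime import date, timedelta


def to_incident_hosp(report_date, adult_confirmed, pediatric_confirmed):
    """Convert one raw row into (admission date, incident hospitalizations).

    The raw columns hold "previous day" admissions, so the date is shifted
    one day earlier and the adult and pediatric counts are added together.
    """
    admission_date = report_date - timedelta(days=1)
    return admission_date, adult_confirmed + pediatric_confirmed


# The row from the example above: 2020-10-30 with 5 adult and 12 pediatric admissions.
observed = to_incident_hosp(date(2020, 10, 30), 5, 12)
print(observed)  # (datetime.date(2020, 10, 29), 17)
```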

National hospitalization data, i.e. data for the US, are constructed from these data by summing across all 50 states, Washington DC (DC), Puerto Rico (PR), and the US Virgin Islands (VI). The HHS data do not include admissions for additional territories.

Additional resources

Here are a few additional resources that describe these hospitalization data:

Accessing truth data

While we go to some pains at the Forecast Hub to create accurate, verified, clean versions of the truth data, all of these should be seen as secondary sources to the original data at the JHU CSSE, HHS, and other sites.

CSV files

A set of comma-separated plain text files is automatically updated every week with the latest observed values for each of the following targets: Cumulative Cases, Cumulative Deaths, Cumulative Hospitalizations, Incident Cases, Incident Deaths, Incident Hospitalizations. For each of these six targets, a corresponding CSV file is created in data-truth/truth-[target name].csv. Details on the scripts that update and validate the contents of these files every week can be found on the Developer Wiki.

covidData R package

The Forecast Hub has developed the covidData R package which facilitates downloading and storing all data used by the Hub. This package requires a bit of set-up with python and make. A vignette showing some basic functionality for the package is available in Rmarkdown (click here to view the HTML vignette).

covidHubUtils R package

The Forecast Hub has developed the covidHubUtils R package to facilitate the basic operations with forecast data, especially downloading, plotting, and scoring forecasts. A vignette showing some basic functionality for the package is available in Rmarkdown (click here to view the HTML vignette).

Where truth data is used

Truth data is used primarily to support the hub in the following tasks:

Visualization Truth Data

The Actual line in the visualization is based on the JHU CSSE group truth data. The visualization uses this Cumulative Death JSON and this Incident Death JSON. This Python script creates these JSONs.

The actual data the visualization uses (Forecasts + Truth Data) is in this folder. These JSONs are created with the commands in 0-init-vis.sh using the truth data when the visualization is built. The file called "season-latest" is the default view, which is also Cumulative Deaths. For each State key in the JSON, there is an Actual object that contains the truth data in the visualization. More on the JSON structure here.

Zoltar Truth Data

The Zoltar truth data is created with this method in covidHubUtils and is stored here.

Reporting anomalies

Some of the data sources documented above are occasionally revised and/or contain outlying observations. We are working to create comprehensive documentation of those instances. You can read about the resources we provide on this in the data-anomalies README file.