Data standards

Susan Paykin edited this page Jun 9, 2022 · 2 revisions

File Names

All final .csv datasets are named using the following convention:

[Theme Abbreviation][2-digit number]_[Spatial Scale].csv

For example, the Policy theme dataset on Prison Incarceration Rates (PS01) at the county level is PS01_C.csv. The same dataset is PS01_S.csv at the state level, PS01_T.csv at the tract level, and PS01_Z.csv at the zip code level.

Theme Abbreviations

  • Policy: PS
  • Health: Health, Access
  • Demographic: DS
  • Economic: EC
  • Physical Environment: BE
  • COVID-19: COVID

Spatial Scales

  • Tract: T
  • Zip/ZCTA: Z
  • County: C
  • State: S
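
The naming convention can be checked mechanically. The following is a minimal sketch, not part of the OEPS tooling: `parse_file_name` is a hypothetical helper, and it assumes the Health theme uses two separate abbreviations (Health and Access), as the list above suggests.

```python
import re

# Abbreviations copied from the lists above. Treating "Health" and "Access"
# as two separate abbreviations is an assumption.
THEMES = {"PS", "Health", "Access", "DS", "EC", "BE", "COVID"}
SCALES = {"T": "Tract", "Z": "Zip/ZCTA", "C": "County", "S": "State"}

# [Theme Abbreviation][2-digit number]_[Spatial Scale].csv
FILE_NAME = re.compile(r"^([A-Za-z]+)(\d{2})_([TZCS])\.csv$")


def parse_file_name(name):
    """Return (theme, number, scale name) or raise ValueError (hypothetical helper)."""
    match = FILE_NAME.match(name)
    if match is None or match.group(1) not in THEMES:
        raise ValueError(f"not a valid dataset file name: {name!r}")
    theme, number, scale = match.groups()
    return theme, number, SCALES[scale]


print(parse_file_name("PS01_C.csv"))  # ('PS', '01', 'County')
```

A name such as PS1_C.csv (one-digit number) or PS01_X.csv (unknown scale) raises a ValueError.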

Geographic Identifiers (GEOIDs)

All datasets have geographic identifiers included as a variable. We use the following labeling convention for each spatial scale:

Variable      | Variable ID | Description
State         | STATEFP     | 2-digit State FIPS code
County        | COUNTYFP    | 5-digit County FIPS code (state + county)
ZIP Code/ZCTA | ZCTA        | 5-digit assigned ZCTA
Census Tract  | GEOID       | 11-digit unique tract ID (state + county + tract)
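
Because the identifiers nest, a tract GEOID can be broken into the state and county codes from the table above. This is a rough sketch under two assumptions: `split_tract_geoid` and the `TRACT` key are hypothetical names, and the sample GEOID is illustrative only.

```python
def split_tract_geoid(geoid):
    """Split an 11-digit tract GEOID into its nested parts (hypothetical helper)."""
    geoid = str(geoid)
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected an 11-digit tract GEOID, got {geoid!r}")
    return {
        "STATEFP": geoid[:2],   # 2-digit state FIPS code
        "COUNTYFP": geoid[:5],  # 5-digit county FIPS code (state + county)
        "TRACT": geoid[5:],     # remaining 6-digit tract code (hypothetical key)
    }


# Illustrative value, not taken from an OEPS dataset.
print(split_tract_geoid("02004000100"))
```

The length check also catches a GEOID that has already lost its leading zero, since it would then be only 10 digits long.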

Data Formatting

  • Watch for leading zeros. Some geographic identifiers for states, counties, zip codes, and tracts start with “0” or “00”. However, spreadsheet programs and many data-import functions treat these identifiers as numbers when a .csv or other text file is opened, which drops the leading zeros. A state FIPS code of “02” then becomes “2”, a county code of “02004” becomes “2004”, a zip code of “07436” becomes “7436”, and so on. If you are merging .csvs with other data by geographic identifier, you will need to add the leading zeros back (or, conversely, drop the leading zeros in the other file) so that the values match. This is particularly important when merging with spatial file formats (.shp, .gpkg, .geojson, etc.), including the geographic boundary files.

  • Keep variable names to 10 characters or fewer for ease of data wrangling with shapefiles and GIS software. Some variable names are therefore shortened or abbreviated from the source data.

  • Numeric data are rounded to two decimal places (the nearest hundredth).

  • Missing data are represented as “NA” or empty, depending on the language or platform you are working with. Do not confuse these with the numeric value “0”.
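
One way to restore leading zeros before a merge is to pad each identifier to the width listed in the Geographic Identifiers table. The sketch below uses only the Python standard library; `pad_geoids` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import csv
import io

# Expected widths, taken from the Geographic Identifiers table.
WIDTHS = {"STATEFP": 2, "COUNTYFP": 5, "ZCTA": 5, "GEOID": 11}


def pad_geoids(rows):
    """Left-pad identifier columns with zeros to their expected width."""
    padded = []
    for row in rows:
        fixed = dict(row)
        for column, width in WIDTHS.items():
            if fixed.get(column):  # skip absent columns and empty values
                fixed[column] = fixed[column].zfill(width)
        padded.append(fixed)
    return padded


# Made-up sample in which the county code "02004" has lost its leading zero.
sample = io.StringIO("COUNTYFP,value\n2004,1.25\n17031,3.5\n")
rows = pad_geoids(csv.DictReader(sample))
print([row["COUNTYFP"] for row in rows])  # ['02004', '17031']
```

The csv module reads every field as text, so no zeros are lost on import. In pandas, a similar effect can be had by reading the identifier columns with `dtype=str` in `read_csv`.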

Key Guidelines

(tl;dr) If you are interested in contributing to the OEPS, please keep in mind the following key guidelines:

  • Variable names should be no more than 10 characters
  • Numeric observations should be rounded to two decimal places (the nearest hundredth)
  • Remove any index columns
  • Remove quotation marks, commas, or other character punctuation
  • Code missing or unavailable data as NA or empty
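
Several of these guidelines can be checked automatically before submitting a dataset. The following is a minimal sketch rather than an official OEPS validator: `check_csv` is a hypothetical helper, the index-column test is a heuristic, and the two-decimal limit is an assumption passed in as a parameter.

```python
import csv
import io

MAX_NAME_LENGTH = 10


def check_csv(text, max_decimals=2):
    """Return a list of guideline problems found in CSV text (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    problems = []
    for name in header:
        if len(name) > MAX_NAME_LENGTH:
            problems.append(
                f"variable name longer than {MAX_NAME_LENGTH} characters: {name}"
            )
    # Heuristic: an empty or "Unnamed" first header often marks an exported index.
    if header[0] == "" or header[0].startswith("Unnamed"):
        problems.append("first column looks like an index column")
    for line_number, row in enumerate(reader, start=2):
        for name, value in zip(header, row):
            if value in ("", "NA"):
                continue  # missing data, which is not the same as 0
            try:
                float(value)
            except ValueError:
                continue  # not numeric, so no rounding check
            decimals = len(value.partition(".")[2])
            if decimals > max_decimals:
                problems.append(
                    f"line {line_number}, {name}: more than {max_decimals} decimal places"
                )
    return problems


sample = "COUNTYFP,VeryLongVariableName,rate\n02004,1.5,NA\n17031,2.25,3.14159\n"
for problem in check_csv(sample):
    print(problem)
```

On the made-up sample, the check reports the overlong variable name and the unrounded value on line 3, and it leaves the NA entry alone.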