Skip to content

Latest commit

 

History

History
323 lines (249 loc) · 11.1 KB

CDC-Wonder.md

File metadata and controls

323 lines (249 loc) · 11.1 KB

Extracting CDC Data

This notebook documents our process for extracting natality data from CDC Wonder web application and other sources, such as data.cdc.gov.

All raw, downloaded CDC data are stored in this project’s /data/ folder.

library(here)
library(dplyr)
library(vroom)
library(tidyr)
library(purrr)

CDC Wonder (Web App)

Expanded Natality data (2016 - 2021)

Here are the latest technical notes on CDC’s Natality data and the CDC data dictionary.

Low-risk deliveries

This dataset is inspired by the Maternal, Infant, and Child Health objective identified by Healthy People 2030 to reduce cesarean births among low-risk women with no prior births to a target of 23.6% nationwide. A low-risk birth is defined as

  • nulliparous: first birth,

  • singleton: a single fetus (not multiple),

  • term: at least 37 weeks of gestation based on obstetric estimate of gestation at delivery, and

  • vertex: not breech / head is facing in a downward position for delivery.

In order to get the totals for low-risk births and low-risk cesarean births we must identify the equivalent criteria in CDC Wonder’s Natality for 2016 - 2021 (expanded) data collection. For low-risk cesarean births, we set the following:

In section 1. “Organize table layout”

  • Group Results by: Delivery characteristics - Year

In section 5. “Select pregnancy history and prenatal care characteristics”

  • Live Birth Order: 1

In section 10. “Select delivery characteristics”

  • Year: All Years

  • Fetal Presentation: Cephalic

  • Delivery Method Expanded: Primary C-section; Repeat C-section; C-section (unknown if previous c-section)

In section 12. “Select infant characteristics”

  • OE Gestational Age Recode 11: 37-38 weeks; 39 weeks; 40 weeks; 41 weeks; 42 weeks or more

  • Plurality: Single

Use the send button to run the query in CDC Wonder. The export of this dataset is saved as the US Natality, 2016-2021 expanded_low-risk cesarean.txt data file in /data/.

We can use vroom to load this data in R and mutate in formatted version of the counts.

low_risk_cesarean_totals <- 
  vroom(
    here("data", "US Natality, 2016-2021 expanded_low-risk cesarean.txt"),
    n_max = 6,
    col_types = cols("c", "i", "i", "i"),
    delim = "\t"
  )  %>% 
  select(Year, Cesarean_births = Births) %>%  
  as.data.frame()

low_risk_cesarean_totals %>%
  mutate(Cesarean_births = prettyNum(Cesarean_births, big.mark = ",")) %>% 
  arrange(desc(Year))
##   Year Cesarean_births
## 1 2021         316,349
## 2 2020         310,303
## 3 2019         314,016
## 4 2018         319,022
## 5 2017         325,086
## 6 2016         329,614

These match the low-risk cesarean totals reported on page 36, Table 17 Jan. 2023 National Vital Statistics Report.

Next, we’ll get total low-risk births so that we can calculate the proportion of low-risk cesareans among these. Note that these totals are not provided in the report linked above but they are the denominator used in the low-risk cesarean rates calculation as stated in footnote 6 of Table 17.

The low-risk birth data is obtained by using the same criteria above with the exception of changing the delivery method to All Methods:

In section 1. “Organize table layout”

  • Group Results by: Delivery characteristics - Year

In section 5. “Select pregnancy history and prenatal care characteristics”

  • Live Birth Order: 1

In section 10. “Select delivery characteristics”

  • Year: All Years

  • Fetal Presentation: Cephalic

  • Delivery Method Expanded: All Methods

In section 12. “Select infant characteristics”

  • OE Gestational Age Recode 11: 37-38 weeks; 39 weeks; 40 weeks; 41 weeks; 42 weeks or more

  • Plurality: Single

The export of this dataset is the US Natality, 2016-2021 expanded_low-risk births.txt data file stored in /data/. The total counts are displayed by year below.

low_risk_all_deliveries_totals <- 
  vroom(
    here("data", "US Natality, 2016-2021 expanded_low-risk births.txt"),
    n_max = 6,
    col_types = cols("c", "i", "i", "i"),
    delim = "\t"
  ) %>% 
  select(Year, Births) %>%  
  as.data.frame()

low_risk_all_deliveries_totals %>%
  mutate(Births = prettyNum(Births, big.mark = ",")) %>% 
  arrange(desc(Year))
##   Year    Births
## 1 2021 1,204,358
## 2 2020 1,198,613
## 3 2019 1,226,476
## 4 2018 1,231,332
## 5 2017 1,250,875
## 6 2016 1,280,607

We join the low-risk cesarean and low-risk births data into a single dataset and calculate the national low-risk rates from 2016 to 2021.

df_low_risk_births <- 
  left_join(
    low_risk_all_deliveries_totals, 
    low_risk_cesarean_totals, 
    by = "Year"
  ) %>% 
  mutate(low_risk_cesarean_rate = Cesarean_births/Births)

df_low_risk_births %>% 
  mutate(
    low_risk_cesarean_rate = scales::percent(low_risk_cesarean_rate, accuracy = .1),
    Births = prettyNum(Births, big.mark = ","),
    Cesarean_births = prettyNum(Cesarean_births, big.mark = ",")
  ) %>% 
  arrange(desc(Year))
##   Year    Births Cesarean_births low_risk_cesarean_rate
## 1 2021 1,204,358         316,349                  26.3%
## 2 2020 1,198,613         310,303                  25.9%
## 3 2019 1,226,476         314,016                  25.6%
## 4 2018 1,231,332         319,022                  25.9%
## 5 2017 1,250,875         325,086                  26.0%
## 6 2016 1,280,607         329,614                  25.7%

The total low-risk Cesarean births and low-risk Cesarean rates match the low-risk cesarean totals and percentages reported in Table 17, page 36 of the Jan. 2023 National Vital Statistics Report.

This processed dataframe is saved as a csv in /publish/ and RDS in /save/ with the filename “US_low_risk_births_2016_to_2021”.

saveRDS(df_low_risk_births, here("save","US_low_risk_births_2016_to_2021.RDS"))
write.csv(df_low_risk_births, here("publish", "US_low_risk_births_2016_to_2021.csv"), row.names = FALSE)

For the latest rates as of 2023, visit the NCHS - VSRR Quarterly provisional estimates for selected birth indicators dataset on Socrata. This does not include totals but has cesarean and low-risk cesarean birth rates at the national level by race/ethnicity.

State-level low-risk deliveries

In CDC Wonder we use the same criteria for low-risk cesarean births and low-risk births defined above, and add on an additional variable to Section 1. “Organize table layout” to

  • Group Results by: Year And By State of Residence

Send this result in CDC Wonder, and the output provides totals for all 50 US states and the District of Columbia from 2016 to 2021. The output files for low-risk births and low-risk cesarean delivery totals are, respectively, the State-level Natality, 2016-2021 expanded_low-risk births.txt and State-level Natality, 2016-2021 expanded_low-risk cesarean.txt data files stored in /data/.

These are combined into a single dataset using vroom and purrr::map2_dfr below. The low-risk cesarean rates for each jurisdiction are calculated in using a mutate.

state_filenames <- c(
  here("data", "State-level Natality, 2016-2021 expanded_low-risk births.txt"),
  here("data", "State-level Natality, 2016-2021 expanded_low-risk cesarean.txt")
)

datasets <- c(
  "All low-risk births",
  "Low-risk cesarean births"
)

df_low_risk_births_by_state <-
  map2_dfr(
    state_filenames, datasets,
    ~ vroom(
        file = .x,
        n_max = 306,
        col_types = cols("c", "i", "i", "c", "i", "i")
      ) %>% 
      drop_na(`State of Residence`) %>% 
      select(-Notes, -`Year Code`) %>% 
      mutate(type = .y)
  ) %>% 
  pivot_wider(
    names_from = type,
    values_from = Births
  ) %>% 
  rename(FIPS = `State of Residence Code`, State = `State of Residence`) %>% 
  mutate(low_risk_cesarean_rate = `Low-risk cesarean births`/`All low-risk births`)

df_low_risk_births_by_state
## # A tibble: 306 × 6
##     Year State                 FIPS `All low-risk births` Low-risk ces…¹ low_r…²
##    <int> <chr>                <int>                 <int>          <int>   <dbl>
##  1  2016 Alabama                  1                 19029           5318   0.279
##  2  2016 Alaska                   2                  3437            656   0.191
##  3  2016 Arizona                  4                 25824           5591   0.217
##  4  2016 Arkansas                 5                 11788           2944   0.250
##  5  2016 California               6                162397          40487   0.249
##  6  2016 Colorado                 8                 22603           4613   0.204
##  7  2016 Connecticut              9                 12425           3623   0.292
##  8  2016 Delaware                10                  3627            862   0.238
##  9  2016 District of Columbia    11                  3882           1051   0.271
## 10  2016 Florida                 12                 76403          23932   0.313
## # … with 296 more rows, and abbreviated variable names
## #   ¹​`Low-risk cesarean births`, ²​low_risk_cesarean_rate

This processed dataframe is saved as a csv in /publish/ and RDS in /save/ with the filename “state_low_risk_births_2016_to_2021”.

saveRDS(df_low_risk_births_by_state, here("save","state_low_risk_births_2016_to_2021.RDS"))
write.csv(df_low_risk_births_by_state, here("publish", "state_low_risk_births_2016_to_2021.csv"), row.names = FALSE)

Other data sources

CDC offers an API for US-level data from CDC Wonder:

https://wonder.cdc.gov/wonder/help/WONDER-API.html

We did not use this service for this project since the API is not setup to provide state-level data (or any granular geographic data). However, we did learn about the wonderapi R package for querying the CDC Wonder API using R. At the time of this write-up (3/6/2023) the wonderapi package had a branch for adding support to query CDC’s Expanded Natality dataset.

Additional sources for CDC vital statistics data include: