srvyrexploR

[Image: hex sticker for the srvyrexploR package. A fake world with hills made out of pie charts, fields that look like matrix plots, and buildings that have error bars. A dragon in the water.]

The srvyrexploR package provides the datasets used in the book Exploring Complex Survey Data Analysis Using R: A Tidy Introduction with {srvyr} and {survey}. These datasets let readers follow along with the examples and work through the exercises.

Installation

To install the development version from GitHub, use:

# install.packages("pak")
pak::pak("tidy-survey-r/srvyrexploR")

To load the package, use:

library(srvyrexploR)

About the data

This package includes data from three surveys: the American National Election Studies (ANES), the National Crime Victimization Survey (NCVS), and the Residential Energy Consumption Survey (RECS).

ANES

The ANES data is based on the publicly available 2020 ANES data, with additional derived variables, and is subset to people who completed both the pre- and post-election interviews. The ANES Time Series Studies collect data on political polling in the United States and have been conducted since 1948. For more information about the 2020 study, see the American National Election Studies website. On the ANES website, you can learn more about the study, see codebooks and methodology reports, and download the data (after registering). We received permission to distribute this data for the purpose of the book. Once the package is loaded, you can use the data immediately as follows:

head(anes_2020)
#> # A tibble: 6 × 65
#>   V200001 CaseID V200002 InterviewMode V200010b Weight V200010c VarUnit V200010d
#>     <dbl>  <dbl> <hvn_l> <fct>            <dbl>  <dbl>    <dbl> <fct>      <dbl>
#> 1  200015 200015       3 Web              1.01   1.01         2 2              9
#> 2  200022 200022       3 Web              1.16   1.16         2 2             26
#> 3  200039 200039       3 Web              0.769  0.769        1 1             41
#> 4  200046 200046       3 Web              0.521  0.521        2 2             29
#> 5  200053 200053       3 Web              0.966  0.966        1 1             23
#> 6  200060 200060       3 Web              0.235  0.235        2 2             37
#> # ℹ 56 more variables: Stratum <fct>, V201006 <hvn_lbll>,
#> #   CampaignInterest <fct>, V201023 <hvn_lbll>, EarlyVote2020 <fct>,
#> #   V201024 <hvn_lbll>, V201025x <hvn_lbll>, V201028 <hvn_lbll>,
#> #   V201029 <hvn_lbll>, V201101 <hvn_lbll>, V201102 <hvn_lbll>,
#> #   VotedPres2016 <fct>, V201103 <hvn_lbll>, VotedPres2016_selection <fct>,
#> #   V201228 <hvn_lbll>, V201229 <hvn_lbll>, V201230 <hvn_lbll>,
#> #   V201231x <hvn_lbll>, PartyID <fct>, V201233 <hvn_lbll>, …

See ?anes_2020 for more information about the data.

The package also includes a Stata version of the ANES data. It contains a subset of the columns and, like anes_2020, is restricted to people who completed both the pre- and post-election interviews. To load this dataset, we recommend using the {haven} package as follows:

anes_stata <- haven::read_dta(system.file("extdata", "anes_2020_stata_example.dta", package = "srvyrexploR"))
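Files read with haven::read_dta() typically keep Stata value labels as labelled columns rather than factors. The following is a minimal sketch of how one might check for such columns and convert them. It assumes the example file contains at least some labelled columns, which this README does not state, and the object names labelled_cols and anes_stata_fct are only illustrative.

```r
library(haven)

# Locate and read the Stata example file shipped with the package
stata_path <- system.file(
  "extdata", "anes_2020_stata_example.dta",
  package = "srvyrexploR"
)
anes_stata <- read_dta(stata_path)

# Names of the columns that were imported as labelled vectors
# (may be empty if the file carries no value labels)
labelled_cols <- names(anes_stata)[vapply(anes_stata, is.labelled, logical(1))]
labelled_cols

# Convert labelled columns to factors; other columns are left untouched
anes_stata_fct <- as_factor(anes_stata, only_labelled = TRUE)
```

Converting to factors is convenient for tabulation, but the original numeric codes remain available in anes_stata if they are needed.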

NCVS

The NCVS data is based on publicly available data from the 2021 NCVS. The NCVS is a survey conducted by the Bureau of Justice Statistics that asks people age 12 and over about their crime victimizations. The study has been conducted continuously since 1992. This package includes three datasets: one for household-level data (ncvs_2021_household), one for person-level data (ncvs_2021_person), and one for incident-level data (ncvs_2021_incident). Each includes a subset of the columns of the full 2021 data available from ICPSR. This data is reproduced here with permission from ICPSR.

head(ncvs_2021_household)
#> # A tibble: 6 × 12
#>   YEARQ IDHH    WGTHHCY V2117 V2118 V2015 V2143 SC214A V2122 V2126B V2127B V2129
#>   <dbl> <chr>     <dbl> <dbl> <dbl> <fct> <fct> <fct>  <fct> <fct>  <fct>  <fct>
#> 1 2021. 171005…      0    139     1 <NA>  3     12     33    0      2      3    
#> 2 2021. 171005…   1072.    63     2 2     2     8      32    17     2      1    
#> 3 2021. 171005…      0    140     1 <NA>  2     5      33    13     2      3    
#> 4 2021. 171005…      0    139     1 <NA>  3     13     33    0      2      3    
#> 5 2021. 171005…   1200.   138     1 1     2     11     29    18     2      1    
#> 6 2021. 171005…   1254.   138     1 1     2     8      24    13     2      2
head(ncvs_2021_person)
#> # A tibble: 6 × 11
#>   YEARQ IDHH           IDPER WGTPERCY V3014 V3015 V3018 V3023A V3024 V3084 V3086
#>   <dbl> <chr>          <chr>    <dbl> <dbl> <fct> <fct> <fct>  <fct> <fct> <fct>
#> 1 2021. 1710051365368… 1710…    1216.    84 3     2     1      2     6     2    
#> 2 2021. 1710053925458… 1710…    1362.    70 5     2     1      2     2     2    
#> 3 2021. 1710053925458… 1710…       0     43 5     1     1      2     <NA>  <NA> 
#> 4 2021. 1710053925458… 1710…       0     15 5     1     1      2     <NA>  <NA> 
#> 5 2021. 1710053965345… 1710…    1422.    89 1     2     1      2     2     2    
#> 6 2021. 1710053965345… 1710…       0     90 1     1     1      2     <NA>  <NA>
head(ncvs_2021_incident)
#> # A tibble: 6 × 60
#>   YEARQ IDHH     IDPER V4012 WGTVICCY V4016 V4017 V4018 V4019 V4021B V4022 V4024
#>   <dbl> <chr>    <chr> <dbl>    <dbl> <dbl> <fct> <fct> <fct> <fct>  <fct> <fct>
#> 1 2021. 1710071… 1710…     1    1780.     1 1     <NA>  <NA>  9      3     6    
#> 2 2021. 1710071… 1710…     1    1990.     2 1     <NA>  <NA>  8      3     7    
#> 3 2021. 1710071… 1710…     2    1990.     2 1     <NA>  <NA>  8      3     7    
#> 4 2021. 1710073… 1710…     1    4653.     1 1     <NA>  <NA>  1      3     5    
#> 5 2021. 1710074… 1710…     1    2302.     1 1     <NA>  <NA>  2      3     21   
#> 6 2021. 1710074… 1710…     1    2308.     1 1     <NA>  <NA>  8      3     5    
#> # ℹ 48 more variables: V4049 <fct>, V4050 <fct>, V4051 <fct>, V4052 <fct>,
#> #   V4053 <fct>, V4054 <fct>, V4055 <fct>, V4056 <fct>, V4057 <fct>,
#> #   V4058 <fct>, V4234 <fct>, V4235 <fct>, V4241 <fct>, V4242 <fct>,
#> #   V4243 <fct>, V4244 <fct>, V4245 <fct>, V4248 <dbl>, V4256 <fct>,
#> #   V4257 <fct>, V4258 <fct>, V4259 <fct>, V4260 <fct>, V4261 <fct>,
#> #   V4262 <fct>, V4263 <fct>, V4264 <fct>, V4265 <fct>, V4266 <fct>,
#> #   V4267 <fct>, V4268 <fct>, V4269 <fct>, V4270 <fct>, V4271 <fct>, …
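The three NCVS tables share identifier columns (YEARQ, IDHH, and, for the person and incident files, IDPER), so they can be linked. The sketch below counts incident records per person and attaches that count to the person file. It assumes {dplyr} is installed and that the combination of YEARQ, IDHH, and IDPER identifies a person record. The counts are unweighted record counts, not survey estimates, and the names incidents_per_person and person_with_counts are illustrative.

```r
library(dplyr)
library(srvyrexploR)

# Number of incident records for each person (unweighted)
incidents_per_person <- ncvs_2021_incident %>%
  count(YEARQ, IDHH, IDPER, name = "n_incidents")

# Attach the counts to the person file; people with no incident records get 0
person_with_counts <- ncvs_2021_person %>%
  left_join(incidents_per_person, by = c("YEARQ", "IDHH", "IDPER")) %>%
  mutate(n_incidents = coalesce(n_incidents, 0L))

# Distribution of incident records per person
person_with_counts %>%
  count(n_incidents)
```

Because the summary table has one row per key, the left join keeps exactly one row per person record.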

RECS

Three RECS datasets are included: the 2015 data with some derived variables created for the book (recs_2015), the 2020 data with some derived variables created for the book (recs_2020), and the 2020 data with the original variables (recs_2020_raw). RECS is a survey about energy consumption and expenditure among residential households in the United States and has been conducted since 1979 by the Energy Information Administration. More information about the original data is available at the RECS website.

head(recs_2015)
#> # A tibble: 6 × 141
#>   DOEID REGIONC Region    Division MSAStatus Urbanicity HousingUnitType YearMade
#>   <dbl>   <dbl> <fct>     <fct>    <fct>     <fct>      <fct>           <ord>   
#> 1 10001       4 West      Pacific  Metropol… Urban Area Single-family … 2000-20…
#> 2 10002       3 South     West So… None      Rural      Single-family … 1980-19…
#> 3 10003       3 South     East So… Metropol… Urban Area Single-family … 1970-19…
#> 4 10004       2 Midwest   West No… Micropol… Urban Clu… Single-family … 1950-19…
#> 5 10005       1 Northeast Middle … Metropol… Urban Area Single-family … 1970-19…
#> 6 10006       1 Northeast New Eng… None      Urban Clu… Apartment: 5 o… 1980-19…
#> # ℹ 133 more variables: SpaceHeatingUsed <lgl>, HeatingBehavior <fct>,
#> #   WinterTempDay <dbl>, WinterTempAway <dbl>, WinterTempNight <dbl>,
#> #   ACUsed <lgl>, ACBehavior <fct>, SummerTempDay <dbl>, SummerTempAway <dbl>,
#> #   SummerTempNight <dbl>, TOTCSQFT <dbl>, TOTHSQFT <dbl>, TOTSQFT_EN <dbl>,
#> #   TOTUCSQFT <dbl>, TOTUSQFT <dbl>, NWEIGHT <dbl>, BRRWT1 <dbl>, BRRWT2 <dbl>,
#> #   BRRWT3 <dbl>, BRRWT4 <dbl>, BRRWT5 <dbl>, BRRWT6 <dbl>, BRRWT7 <dbl>,
#> #   BRRWT8 <dbl>, BRRWT9 <dbl>, BRRWT10 <dbl>, BRRWT11 <dbl>, BRRWT12 <dbl>, …
head(recs_2020)
#> # A tibble: 6 × 100
#>    DOEID ClimateRegion_BA Urbanicity Region    REGIONC   Division     STATE_FIPS
#>    <dbl> <fct>            <fct>      <fct>     <chr>     <fct>        <chr>     
#> 1 100001 Mixed-Dry        Urban Area West      WEST      Mountain So… 35        
#> 2 100002 Mixed-Humid      Urban Area South     SOUTH     West South … 05        
#> 3 100003 Mixed-Dry        Urban Area West      WEST      Mountain So… 35        
#> 4 100004 Mixed-Humid      Urban Area South     SOUTH     South Atlan… 45        
#> 5 100005 Mixed-Humid      Urban Area Northeast NORTHEAST Middle Atla… 34        
#> 6 100006 Hot-Humid        Urban Area South     SOUTH     West South … 48        
#> # ℹ 93 more variables: state_postal <fct>, state_name <fct>, HDD65 <dbl>,
#> #   CDD65 <dbl>, HDD30YR <dbl>, CDD30YR <dbl>, HousingUnitType <fct>,
#> #   YearMade <ord>, TOTSQFT_EN <dbl>, TOTHSQFT <dbl>, TOTCSQFT <dbl>,
#> #   SpaceHeatingUsed <lgl>, ACUsed <lgl>, HeatingBehavior <fct>,
#> #   WinterTempDay <dbl>, WinterTempAway <dbl>, WinterTempNight <dbl>,
#> #   ACBehavior <fct>, SummerTempDay <dbl>, SummerTempAway <dbl>,
#> #   SummerTempNight <dbl>, NWEIGHT <dbl>, NWEIGHT1 <dbl>, NWEIGHT2 <dbl>, …
head(recs_2020_raw)
#> # A tibble: 6 × 789
#>    DOEID REGIONC   DIVISION        STATE_FIPS state_postal state_name BA_climate
#>    <dbl> <chr>     <chr>           <chr>      <chr>        <chr>      <chr>     
#> 1 100001 WEST      Mountain South  35         NM           New Mexico Mixed-Dry 
#> 2 100002 SOUTH     West South Cen… 05         AR           Arkansas   Mixed-Hum…
#> 3 100003 WEST      Mountain South  35         NM           New Mexico Mixed-Dry 
#> 4 100004 SOUTH     South Atlantic  45         SC           South Car… Mixed-Hum…
#> 5 100005 NORTHEAST Middle Atlantic 34         NJ           New Jersey Mixed-Hum…
#> 6 100006 SOUTH     West South Cen… 48         TX           Texas      Hot-Humid 
#> # ℹ 782 more variables: IECC_climate_code <chr>, UATYP10 <chr>, HDD65 <dbl>,
#> #   CDD65 <dbl>, HDD30YR_PUB <dbl>, CDD30YR_PUB <dbl>, TYPEHUQ <dbl>,
#> #   CELLAR <dbl>, CRAWL <dbl>, CONCRETE <dbl>, BASEOTH <dbl>, BASEFIN <dbl>,
#> #   ATTIC <dbl>, ATTICFIN <dbl>, STORIES <dbl>, PRKGPLC1 <dbl>,
#> #   SIZEOFGARAGE <dbl>, KOWNRENT <dbl>, YEARMADERANGE <dbl>, BEDROOMS <dbl>,
#> #   NCOMBATH <dbl>, NHAFBATH <dbl>, OTHROOMS <dbl>, TOTROOMS <dbl>,
#> #   STUDIO <dbl>, WALLTYPE <dbl>, ROOFTYPE <dbl>, HIGHCEIL <dbl>, …
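To see how recs_2020 relates to recs_2020_raw, one option is to compare column names. This is a rough sketch only: a name that appears in recs_2020 but not in recs_2020_raw may be a derived variable or simply a renamed one, so the result is a starting point for reading ?recs_2020 rather than a definitive list. The object names derived_or_renamed and shared_vars are illustrative.

```r
library(srvyrexploR)

# Columns in the book version that have no same-named column in the raw file
derived_or_renamed <- setdiff(names(recs_2020), names(recs_2020_raw))
derived_or_renamed

# Columns carried over from the raw file under the same name
shared_vars <- intersect(names(recs_2020), names(recs_2020_raw))
length(shared_vars)
```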

Examples

To analyze the survey data, we recommend using the {srvyr} package as follows:

# install.packages("pak")
pak::pak("gergness/srvyr")
library(srvyr)

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT, repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1", scale = 59 / 60, mse = TRUE,
    variables = c(ACUsed, Region)
  )

recs_des
#> Call: Called via srvyr
#> Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
#> Sampling variables:
#>   - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
#>     NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
#>     NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 + NWEIGHT17 +
#>     NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 + NWEIGHT22 + NWEIGHT23 +
#>     NWEIGHT24 + NWEIGHT25 + NWEIGHT26 + NWEIGHT27 + NWEIGHT28 + NWEIGHT29 +
#>     NWEIGHT30 + NWEIGHT31 + NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 +
#>     NWEIGHT36 + NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
#>     NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 + NWEIGHT47 +
#>     NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 + NWEIGHT52 + NWEIGHT53 +
#>     NWEIGHT54 + NWEIGHT55 + NWEIGHT56 + NWEIGHT57 + NWEIGHT58 + NWEIGHT59 +
#>     NWEIGHT60` 
#>   - weights: NWEIGHT 
#> Data variables: 
#>   - ACUsed (lgl), Region (fct)

recs_des %>%
  group_by(Region) %>%
  summarize(
    p = survey_mean(ACUsed, vartype = "ci", proportion = TRUE, prop_method = "logit")
  )
#> # A tibble: 4 × 4
#>   Region        p p_low p_upp
#>   <fct>     <dbl> <dbl> <dbl>
#> 1 Northeast 0.890 0.877 0.901
#> 2 Midwest   0.933 0.922 0.943
#> 3 South     0.942 0.936 0.947
#> 4 West      0.745 0.729 0.760

The above example estimates the proportion of residential households that use air-conditioning by region with a 95% confidence interval.
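The same replicate-weight design can also be set up directly with the {survey} package. The sketch below mirrors the arguments used in the {srvyr} call above and assumes {survey} is installed. The regular expression used to pick out the replicate weight columns and the names recs_des_svy and ac_by_region are illustrative choices. The output reports means with standard errors rather than logit confidence intervals, so it is not a drop-in reproduction of the table above.

```r
library(survey)
library(srvyrexploR)

# Jackknife (JK1) replicate-weight design with the same settings as above;
# the regex matches NWEIGHT1 to NWEIGHT60 but not the main weight NWEIGHT
recs_des_svy <- svrepdesign(
  data = as.data.frame(recs_2020),
  weights = ~NWEIGHT,
  repweights = "^NWEIGHT[0-9]+$",
  type = "JK1",
  scale = 59 / 60,
  mse = TRUE
)

# Proportion of households with and without air-conditioning, by region
ac_by_region <- svyby(~ACUsed, ~Region, recs_des_svy, svymean)
ac_by_region
```

Because ACUsed is logical, the result has one column for the share of FALSE and one for the share of TRUE, each with its standard error.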

License

Data are available under the CC BY 4.0 license. Additionally, redistributing the ANES or NCVS datasets is subject to their respective policies.

Additional data use information

Anyone interested in redistributing the NCVS data should refer to ICPSR: Requests for Permission to Redistribute ICPSR Data.

Anyone interested in redistributing the ANES data should refer to the ANES FAQ - disseminate.

References

Data citations:

ANES:

  • American National Election Studies. 2021. ANES 2020 Time Series Study Full Release [dataset and documentation]. July 19, 2021 version. https://www.electionstudies.org

NCVS:

  • United States. Bureau of Justice Statistics. National Crime Victimization Survey, [United States], 2021. Inter-university Consortium for Political and Social Research [distributor], 2022-09-19. https://doi.org/10.3886/ICPSR38429.v1

RECS: