% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/extract_DCMS_sectors.R
\name{extract_DCMS_sectors}
\alias{extract_DCMS_sectors}
\title{Extract list of DCMS sectors from ONS working file spreadsheet}
\usage{
extract_DCMS_sectors(x, sheet_name = "Working File", skip = 7,
sectors = c("creative", "digital", "culture", "telecoms", "gambling",
"sport", "tourism", "all_dcms"))
}
\arguments{
\item{x}{Location of the input spreadsheet file. Named something like
"working_file_dcms_VXX.xlsm".}
\item{sheet_name}{The name of the worksheet in which the data are stored.
Defaults to \code{"Working File"}.}
\item{skip}{Number of rows to skip when reading the worksheet; has the same
meaning as the \code{skip} argument of \code{readxl::read_excel}. Defaults
to \code{7}.}
\item{sectors}{A character vector of the sectors for which DCMS is
responsible, currently: \code{c('creative', 'digital', 'culture',
'telecoms', 'gambling', 'sport', 'tourism', 'all_dcms')}.}
}
\value{
The function returns nothing, but saves the extracted dataset to
\code{file.path(output_path, 'DCMS_sectors.Rds')}. This is an R data
object, which retains the column types which would be lost if converted to
a flat format like CSV.
}
\description{
The data underlying the DCMS Sectors Economic Estimates are typically
provided to DCMS as a spreadsheet by the Office for National Statistics.
This function extracts the list of sectors for which DCMS is responsible
from the working_file_dcms_V13.xlsm spreadsheet. This
information is also recorded in the methodology note which accompanies the
publication at
\url{https://www.gov.uk/government/publications/dcms-sectors-economic-estimates-methodology}
and a version correct at the time of the 2016 release is included in the
package as \code{eesectors::DCMS_sectors}. Hence, it is not necessary to
run this function every time - only if changes to the DCMS sectors are
made.
}
\details{
The best way to understand what happens when you run this function
is to look at the source code, which is available at
\url{https://github.com/ukgovdatascience/eesectors/blob/master/R/}. A brief
explanation of what the function does follows:
\enumerate{
\item The function calls \code{readxl::read_excel} to load the appropriate
worksheet from the underlying spreadsheet.
\item The column names are cleaned to remove extraneous characters and are
converted to lower case.
\item The dataframe is limited to the columns \code{'sic'},
\code{'description'}, and those named in the \code{sectors} argument, by
default \code{c('creative', 'digital', 'culture', 'telecoms', 'gambling',
'sport', 'tourism', 'all_dcms')}.
\item The data are pivoted into long form using \code{tidyr::gather_}. This
converts the wide dataframe into a long one, with \code{'sector'} as the key
column and \code{'present'} as the value column (i.e. present in DCMS
sector?). The result is a much longer dataframe which is much easier to
subset.
\item For consistency with later steps, the \code{'sic'} column is renamed to
\code{'SIC'}.
\item The asterisks used in the spreadsheet to denote presence in a DCMS
sector are replaced by a binary variable with \code{TRUE} and \code{FALSE}
in place of \code{*} and \code{NA}.
\item \code{NA} values created by the previous step are removed from the
dataframe.
\item The tourism entry, which is formatted differently in the 'working file'
worksheet in working_file_dcms_V13.xlsm, is fixed to ensure that it has both
a description and a SIC (where previously it just had a SIC), and 'tourism'
is labelled as a DCMS sector under tourism and all_dcms.
\item The data are saved out to a .Rds file, and a check is run to ensure that
the file exists. The size of the new file is reported in bytes.
}
}
\examples{
\dontrun{
library(eesectors)
extract_DCMS_sectors(
x = 'OFFICIAL_working_file_dcms_V13.xlsm',
sheet_name = 'Working File',
output_path = '../OFFICIAL/'
)
}
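
# A minimal sketch of the column clean-up, long-form pivot, SIC rename and
# asterisk recoding described in Details, run on a toy table with base R only.
# The toy values and the object names (wide, long, sector_cols) are
# illustrative assumptions. The function itself relies on readxl and tidyr,
# so this is not its actual implementation.
wide <- data.frame(
  SIC = c("58.1", "61.1", "92.0"),
  Description = c("Publishing", "Wired telecoms", "Gambling"),
  Creative = c("*", NA, NA),
  Telecoms = c(NA, "*", NA),
  Gambling = c(NA, NA, "*"),
  stringsAsFactors = FALSE
)

# Make the column names lower case
names(wide) <- tolower(names(wide))

# Stack the sector columns into long form: one row per sic/sector pair,
# with 'sector' as the key column and 'present' as the value column
sector_cols <- c("creative", "telecoms", "gambling")
long <- data.frame(
  sic = rep(wide$sic, times = length(sector_cols)),
  description = rep(wide$description, times = length(sector_cols)),
  sector = rep(sector_cols, each = nrow(wide)),
  present = unlist(wide[sector_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Rename 'sic' to 'SIC'
names(long)[names(long) == "sic"] <- "SIC"

# Recode asterisks to TRUE, then drop the rows that are still NA
# (assumption: this is how the recoding and NA removal combine)
long$present <- long$present == "*"
long <- long[!is.na(long$present), ]
long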
}