# Introduction to DBMS queries {#chapter_dbms-queries-intro}
> This chapter demonstrates how to:
>
> * Download all or part of a table from the DBMS, including different kinds of subsets
> * See how `dplyr` code is translated into `SQL` commands and how they can be mixed
> * Get acquainted with some useful functions and packages for investigating a single table
> * Begin thinking about how to divide the work between your local R session and the DBMS
## Setup
The following packages are used in this chapter:
```{r package list, echo=TRUE, message=FALSE, warning=FALSE}
library(tidyverse)
library(DBI)
library(RPostgres)
library(dbplyr)
require(knitr)
library(bookdown)
library(sqlpetr)
library(skimr)
library(connections)
sleep_default <- 3
```
Assume that the Docker container with PostgreSQL and the `adventureworks` database is ready to go. If not, go back to [Chapter 6](#chapter_setup-adventureworks-db).
```{r check on adventureworks}
sqlpetr::sp_docker_start("adventureworks")
Sys.sleep(sleep_default)
```
Connect to the database:
```{r connect to postgresql}
# con <- connection_open( # use in an interactive session
con <- dbConnect( # use in other settings
RPostgres::Postgres(),
# without the previous and next lines, some functions fail with bigint data
# so change int64 to integer
bigint = "integer",
host = "localhost",
user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
dbname = "adventureworks",
port = 5432
)
```
## Methods for downloading a single table
For the moment, assume you know something about the database and specifically what table you need to retrieve. We return to the topic of investigating the whole database later on.
```{r}
dbExecute(con, "set search_path to sales, humanresources;")
```
### Read the entire table
There are several methods for getting data out of a DBMS, and we'll explore ways of controlling each of them.
`DBI::dbReadTable` will download an entire table into an R [tibble](https://tibble.tidyverse.org/).
```{r}
salesorderheader_tibble <- DBI::dbReadTable(con, "salesorderheader")
str(salesorderheader_tibble)
```
That's very simple, but if the table is very large it may be a problem, since R keeps the entire table in memory. The tables found in an enterprise database such as `adventureworks` can be large, but they are most often records kept by people. That somewhat limits their size (relative to data generated by machines) but expands the possibilities for human error.
Note that the first line of the `str()` output reports the total number of observations.
Later on we'll use this tibble to demonstrate several packages and functions, but for simplicity we use only the first 13 columns.
```{r}
salesorderheader_tibble <- salesorderheader_tibble[,1:13]
```
### Create a pointer to a table that can be reused
The `dplyr::tbl` function gives us more control over access to a table by enabling control over which columns and rows to download. It creates an object that might **look** like a data frame, but it's actually a list object that `dplyr` uses for constructing queries and retrieving data from the DBMS.
```{r}
salesorderheader_table <- dplyr::tbl(con, "salesorderheader")
class(salesorderheader_table)
```
### Controlling the number of rows returned with `collect()`
The `collect` function triggers the creation of a tibble and controls the number of rows that the DBMS sends to R. For more complex queries, the `dplyr::collect()` function provides a mechanism to indicate what's processed on the DBMS server and what's processed by R on the local machine. The chapter on [Lazy Evaluation and Execution Environment](#chapter_lazy-evaluation-and-timing) discusses this issue in detail.
```{r}
salesorderheader_table %>% dplyr::collect(n = 3) %>% dim()
salesorderheader_table %>% dplyr::collect(n = 500) %>% dim()
```
### Retrieving random rows from the DBMS
When the DBMS contains many rows, a sample of the data may be plenty for your purposes. Although `dplyr` has nice functions to sample a data frame that's already in R (e.g., the `sample_n` and `sample_frac` functions), to get a sample from the DBMS we have to use `dbGetQuery` to send native SQL to the database. To peek ahead, here is one example of a query that retrieves 20 rows from a 1% sample:
```{r}
one_percent_sample <- DBI::dbGetQuery(
con,
"SELECT orderdate, subtotal, taxamt, freight, totaldue
  FROM salesorderheader TABLESAMPLE BERNOULLI(1) LIMIT 20;
"
)
one_percent_sample
```
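`TABLESAMPLE BERNOULLI` scans the whole table and keeps each row with the given probability. PostgreSQL (9.5 and later) also offers the `SYSTEM` sampling method, which selects whole storage pages rather than individual rows: it is much faster on big tables, but less uniformly random, because rows that share a page are sampled together. A sketch of the same query using `SYSTEM`:

```{r}
# SYSTEM sampling picks whole pages rather than individual rows:
# faster on big tables, but rows from the same page come back together
system_sample <- DBI::dbGetQuery(
  con,
  "SELECT orderdate, subtotal, taxamt, freight, totaldue
     FROM salesorderheader TABLESAMPLE SYSTEM(1) LIMIT 20;"
)
system_sample
```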
**Exact sample of 10 records**
This technique depends on knowing the range of a record index, such as `salesorderid` in the `salesorderheader` table of our `adventureworks` database.
Start by finding the min and max values.
```{r}
DBI::dbListFields(con, "salesorderheader")
salesorderheader_df <- DBI::dbReadTable(con, "salesorderheader")
(max_id <- max(salesorderheader_df$salesorderid))
(min_id <- min(salesorderheader_df$salesorderid))
```
Set the random number seed and draw the sample.
```{r}
set.seed(123)
# sample from the full range of ids; ids missing from the table simply won't match
sample_rows <- sample(min_id:max_id, 10)
salesorderheader_table <- dplyr::tbl(con, "salesorderheader")
```
Run query with the filter verb listing the randomly sampled rows to be retrieved:
```{r}
salesorderheader_sample <- salesorderheader_table %>%
dplyr::filter(salesorderid %in% sample_rows) %>%
dplyr::collect()
str(salesorderheader_sample)
```
### Sub-setting variables
A table in the DBMS may not only have many more rows than you want, but also many more columns. The `select` command controls which columns are retrieved.
```{r}
salesorderheader_table %>% dplyr::select(orderdate, subtotal, taxamt, freight, totaldue) %>%
head()
```
That's exactly equivalent to submitting the following SQL commands directly:
```{r}
DBI::dbGetQuery(
con,
'SELECT "orderdate", "subtotal", "taxamt", "freight", "totaldue"
FROM "salesorderheader"
LIMIT 6')
```
We won't discuss `dplyr` methods for sub-setting variables, deriving new ones, or sub-setting rows based on the values found in the table, because they are covered well in other places, including:
* Comprehensive reference: [https://dplyr.tidyverse.org/](https://dplyr.tidyverse.org/)
* Good tutorial: [https://suzan.rbind.io/tags/dplyr/](https://suzan.rbind.io/tags/dplyr/)
In practice we find that **renaming variables** is often quite important, because the names in an SQL database might not meet your needs as an analyst. In the wild, you will find names that are ambiguous or overly specific, that contain spaces, and that have other problems making them difficult to use in R. It is good practice to do whatever renaming you are going to do in a predictable place, like the top of your code. The names in the `adventureworks` database are simple and clear, but if they were not, you might rename them for subsequent use in this way:
```{r}
tbl(con, "salesorderheader") %>%
dplyr::rename(order_date = orderdate, sub_total_amount = subtotal,
tax_amount = taxamt, freight_amount = freight, total_due_amount = totaldue) %>%
dplyr::select(order_date, sub_total_amount, tax_amount, freight_amount, total_due_amount ) %>%
show_query()
```
That's equivalent to the following SQL code:
```{r}
DBI::dbGetQuery(
con,
'SELECT "orderdate" AS "order_date",
"subtotal" AS "sub_total_amount",
"taxamt" AS "tax_amount",
"freight" AS "freight_amount",
"totaldue" AS "total_due_amount"
FROM "salesorderheader"' ) %>%
head()
```
The one difference is that the `SQL` code returns a regular data frame while the `dplyr` code returns a `tibble`. Notice that the seconds are grayed out in the `tibble` display.
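You can verify that difference by checking the class of each result. A minimal sketch, reusing the connection and the `salesorderheader_table` object defined above:

```{r}
# dbGetQuery() returns a plain data frame
df_result <- DBI::dbGetQuery(
  con,
  'SELECT "orderdate", "totaldue" FROM "salesorderheader" LIMIT 3'
)

# collect() on a tbl object returns a tibble
tbl_result <- salesorderheader_table %>%
  dplyr::select(orderdate, totaldue) %>%
  head(3) %>%
  dplyr::collect()

class(df_result)
class(tbl_result)
```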
## Translating `dplyr` code to `SQL` queries
Where did the translations we've shown above come from? The `show_query` function shows how `dplyr` is translating your query to the dialect of the target DBMS.
> The `show_query()` function shows you what dplyr is sending to the DBMS. It can be handy for inspecting what dplyr is doing, or for showing your code to someone who is more SQL- than R-literate. We have used the function extensively while writing this book, but in the final product we use it only when something in the SQL or the translation process needs to be explained.
```{r}
salesorderheader_table %>%
dplyr::tally() %>%
dplyr::show_query()
```
Here is an extensive discussion of how `dplyr` code is translated into SQL:
* [https://dbplyr.tidyverse.org/articles/sql-translation.html](https://dbplyr.tidyverse.org/articles/sql-translation.html)
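You can also experiment with the translation rules without a live connection: `dbplyr::translate_sql()` translates a single R expression, and `dbplyr::simulate_postgres()` supplies the PostgreSQL dialect. A minimal sketch:

```{r}
# translate individual R expressions into PostgreSQL SQL -- no connection needed
dbplyr::translate_sql(mean(subtotal, na.rm = TRUE), con = dbplyr::simulate_postgres())
dbplyr::translate_sql(substr(orderdate, 1, 4), con = dbplyr::simulate_postgres())
```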
If you prefer to use SQL directly, rather than `dplyr`, you can submit SQL code to the DBMS through the `DBI::dbGetQuery` function:
```{r}
DBI::dbGetQuery(
con,
'SELECT COUNT(*) AS "n"
FROM "salesorderheader" '
)
```
When you create a report to run repeatedly, you might want to put that query into R Markdown. There you can execute the SQL directly in a chunk with the following header:
`{sql, connection=con, output.var = "query_results"}`
```{sql, connection=con, output.var = "query_results"}
SELECT COUNT(*) AS "n"
FROM "salesorderheader";
```
R markdown stores that query result in a tibble which can be printed by referring to it:
```{r}
query_results
```
## Mixing dplyr and SQL
When `dplyr` finds code that it does not know how to translate into SQL, it simply passes it along to the DBMS. Therefore you can interleave native functions that your DBMS understands in the middle of `dplyr` code. Consider this example, derived from [@Ruiz2019]:
```{r}
salesorderheader_table %>%
dplyr::select_at(vars(subtotal, contains("date"))) %>%
dplyr::mutate(today = now()) %>%
dplyr::show_query()
```
That is native to PostgreSQL, not [ANSI standard](https://en.wikipedia.org/wiki/SQL#Interoperability_and_standardization) SQL.
Verify that it works:
```{r}
salesorderheader_table %>%
dplyr::select_at(vars(subtotal, contains("date"))) %>%
head() %>%
dplyr::mutate(today = now()) %>%
dplyr::collect()
```
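The same pass-through mechanism works for any function your DBMS understands. For example, `date_part` is a PostgreSQL function with no matching R translation, so `dplyr` sends it to the server verbatim. A sketch (in PostgreSQL, `date_part('dow', ...)` extracts the day of the week):

```{r}
# date_part() is unknown to dplyr, so it is passed through to PostgreSQL as-is
salesorderheader_table %>%
  dplyr::select(orderdate, totaldue) %>%
  dplyr::mutate(order_dow = date_part("dow", orderdate)) %>%
  head() %>%
  dplyr::collect()
```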
## Examining a single table with R
Dealing with a large, complex database highlights the utility of specific tools in R. We include brief examples that we find to be handy:
+ Base R structure: `str`
+ Printing out some of the data: `datatable`, `kable`, and `View`
+ Summary statistics: `summary`
+ `glimpse` in the `tibble` package, which is included in the `tidyverse`
+ `skim` in the `skimr` package
### `str` - a base package workhorse
`str` is a workhorse function that lists each variable, its type, and a sample of its first few values.
```{r}
str(salesorderheader_tibble)
```
### Always **look** at your data with `head`, `View`, or `kable`
There is no substitute for looking at your data, and R provides several ways to browse it. The `head` function controls the number of rows displayed. Note that `tail` does not work on a database object. In everyday practice you would look at more than the default 6 rows; here we wrap `head` around the data frame:
```{r}
sqlpetr::sp_print_df(head(salesorderheader_tibble))
```
### The `summary` function in `base`
The `base` package's `summary` function provides basic statistics that serve a real diagnostic purpose at this stage. For example, the output below shows the range of `salesorderid`, which lets you check whether the identifier runs sequentially and whether any values are missing from the sequence, and it reports the number of NA's in each column — often the first clue about missing or unusual data.
```{r}
summary(salesorderheader_tibble)
```
So the `summary` function is surprisingly useful as we first start to look at the table contents.
### The `glimpse` function in the `tibble` package
The `tibble` package's `glimpse` function is a more compact version of `str`:
```{r}
tibble::glimpse(salesorderheader_tibble)
```
### The `skim` function in the `skimr` package
The `skimr` package has several functions that make it easy to examine an unknown data frame and assess what it contains. It is also extensible.
```{r}
skimr::skim(salesorderheader_tibble)
skimr::skim_to_wide(salesorderheader_tibble) # deprecated in skimr >= 2.0; use skim() instead
```
## Disconnect from the database and stop Docker
```{r}
dbDisconnect(con)
# or if using the connections package, use:
# connection_close(con)
sp_docker_stop("adventureworks")
```
## Additional reading
* [@Wickham2018]
* [@Baumer2018]