-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy path090-lazy-evalutation-intro.Rmd
261 lines (188 loc) · 12.7 KB
/
090-lazy-evalutation-intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
# Lazy Evaluation and Lazy Queries {#chapter_lazy-evaluation-queries}
> This chapter:
>
> * Reviews lazy loading, lazy evaluation and lazy query execution
> * Demonstrates how `dplyr` code gets executed (and how R determines what is translated to SQL and what is processed locally by R)
> * Offers some further resources on lazy loading, evaluation, execution, etc.
>
## Setup
The following packages are used in this chapter:
```{r chapter package list, echo=TRUE, message=FALSE, warning=FALSE}
library(tidyverse)
library(DBI)
library(RPostgres)
library(dbplyr)
require(knitr)
library(bookdown)
library(sqlpetr)
library(connections)
sleep_default <- 3
```
Start your `adventureworks` container:
```{r check on adventureworks}
sqlpetr::sp_docker_start("adventureworks")
Sys.sleep(sleep_default)
```
Connect to the database:
```{r connect to postgresql}
# con <- connection_open( # use in an interactive session
con <- dbConnect( # use in other settings
RPostgres::Postgres(),
# without the previous and next lines, some functions fail with bigint data
# so change int64 to integer
bigint = "integer",
user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
dbname = "adventureworks",
host = "localhost",
port = 5432)
```
## R is lazy and comes with guardrails
By design, R is both a language and an interactive development environment (IDE). As a language, R tries to be as efficient as possible. As an IDE, R creates some guardrails to make it easy and safe to work with your data. For example `getOption("max.print")` prevents R from printing more rows of data than you want to handle in an interactive session, with a default of `r getOption("max.print")` lines, which may or may not suit you.
On the other hand SQL is a *"Structured Query Language (SQL): a standard computer language for relational database management and data manipulation."* ^[https://www.techopedia.com/definition/1245/structured-query-language-sql]. SQL has various database-specific Interactive Development Environments (IDEs), such as [pgAdmin](https://www.pgadmin.org/) for PostgreSQL. Roger Peng explains in [R Programming for Data Science](https://bookdown.org/rdpeng/rprogdatascience/history-and-overview-of-r.html#basic-features-of-r) that:
> R has maintained the original S philosophy, which is that it provides a language that is both useful for interactive work, but contains a powerful programming language for developing new tools.
This is complicated when R interacts with SQL. In a [vignette for dbplyr](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html) Hadley Wickham explains:
> The most important difference between ordinary data frames and remote database queries is that your R code is translated into SQL and executed in the database on the remote server, not in R on your local machine. When working with databases, dplyr tries to be as lazy as possible:
>
> * It never pulls data into R unless you explicitly ask for it.
>
> * It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.
>
Exactly when, which, and how much data is returned from the dbms is the topic of this chapter. Exactly how the data is represented in the dbms and then translated to a data frame is discussed in the [DBI specification](https://cran.r-project.org/web/packages/DBI/vignettes/spec.html#_fetch_records_from_a_previously_executed_query_).
Eventually, if you are interacting with a dbms from R you will need to understand the differences between lazy loading, lazy evaluation, and lazy queries.
### Lazy loading
"*Lazy loading is always used for code in packages but is optional (selected by the package maintainer) for datasets in packages.*"^[https://cran.r-project.org/doc/manuals/r-release/R-ints.html#Lazy-loading] Lazy loading means that the code for a particular function doesn't actually get loaded into memory until the last minute -- when it's actually being used.
### Lazy evaluation
Essentially "Lazy evaluation is a programming strategy that allows a symbol to be evaluated only when needed." ^[https://colinfay.me/lazyeval/] That means that lazy evaluation is about **symbols** such as function arguments ^[http://adv-r.had.co.nz/Functions.html#function-arguments] when they are evaluated. Tidy evaluation complicates lazy evaluation. ^[https://colinfay.me/tidyeval-1/]
### Lazy Queries
"*When you create a "lazy" query, you're creating a pointer to a set of conditions on the database, but the query isn't actually run and the data isn't actually loaded until you call "next" or some similar method to actually fetch the data and load it into an object.*" ^[https://www.quora.com/What-is-a-lazy-query]
## Lazy evaluation and lazy queries
When does a lazy query trigger data retrieval? It depends on a lot of factors, as we explore below:
### Create a black box query for experimentation
Define the three tables discussed in the previous chapter to build a _black box_ query:
```{r}
sales_person_table <- tbl(con, in_schema("sales", "salesperson")) %>%
select(-rowguid) %>%
rename(sale_info_updated = modifieddate)
employee_table <- tbl(con, in_schema("humanresources", "employee")) %>%
select(-modifieddate, -rowguid)
person_table <- tbl(con, in_schema("person", "person")) %>%
select(-modifieddate, -rowguid)
```
Here is a typical string of `dplyr` verbs strung together with the magrittr `%>%` pipe command that will be used to tease out the several different behaviors that a lazy query has when passed to different R functions. This query joins three connection objects into a query we'll call `Q`:
```{r}
Q <- sales_person_table %>%
dplyr::left_join(employee_table, by = c("businessentityid" = "businessentityid")) %>%
dplyr::left_join(person_table , by = c("businessentityid" = "businessentityid")) %>%
dplyr::select(firstname, lastname, salesytd, birthdate)
```
The `str` function gives us a hint at how R is collecting information that can be used to construct and execute a query later on:
```{r}
str(Q, max.level = 2)
```
### Experiment overview
Think of `Q` as a black box for the moment. The following examples will show how `Q` is interpreted differently by different functions. It's important to remember in the following discussion that the "**and then**" operator (`%>%`) actually wraps the subsequent code inside the preceding code so that `Q %>% print()` is equivalent to `print(Q)`.
**Notation**
> ![](screenshots/green-check.png) A single green check indicates that some rows are returned. <br>
> ![](screenshots/green-check.png) ![](screenshots/green-check.png) Two green checks indicate that all the rows are returned. <br>
> ![](screenshots/red-x.png) The red X indicates that no rows are returned.
>
> R code | Result
> -------| --------------
> [`Q %>% print()`](#lazy_q_print) | ![](screenshots/green-check.png) Prints x rows; same as just entering `Q`
> [`Q %>% dplyr::as_tibble()`](#Q-as-tibble) | ![](screenshots/green-check.png)![](screenshots/green-check.png) Forces `Q` to be a tibble
> [`Q %>% head()`](#lazy_q_head) | ![](screenshots/green-check.png) Prints the first 6 rows
> [`Q %>% tail()`](#lazy_q_tail) | ![](screenshots/red-x.png) Error: tail() is not supported by sql sources
> [`Q %>% length()`](#lazy_q_length) | ![](screenshots/red-x.png) Counts the rows in `Q`
> [`Q %>% str()`](#lazy_q_str) | ![](screenshots/red-x.png)Shows the top 3 levels of the **object** `Q`
> [`Q %>% nrow()`](#lazy_q_nrow) | ![](screenshots/red-x.png) **Attempts** to determine the number of rows
> [`Q %>% dplyr::tally()`](#lazy_q_tally) | ![](screenshots/green-check.png) ![](screenshots/green-check.png) Counts all the rows -- on the dbms side
> [`Q %>% dplyr::collect(n = 20)`](#lazy_q_collect) | ![](screenshots/green-check.png) Prints 20 rows
> [`Q %>% dplyr::collect(n = 20) %>% head()`](#lazy_q_collect) | ![](screenshots/green-check.png) Prints 6 rows
> [`Q %>% ggplot`](#lazy_q_plot-categories) | ![](screenshots/green-check.png) ![](screenshots/green-check.png) Plots a barchart
> [`Q %>% dplyr::show_query()`](#lazy-q-show-query) | ![](screenshots/red-x.png) **Translates** the lazy query object into SQL
>
The next chapter will discuss how to build queries and how to explore intermediate steps. But first, the following subsections provide a more detailed discussion of each row in the preceding table.
### Q %>% print(){#lazy_q_print}
Remember that `Q %>% print()` is equivalent to `print(Q)` and the same as just entering `Q` on the command line. We use the magrittr pipe operator here, because chaining functions highlights how the same object behaves differently in each use.
```{r}
Q %>% print()
```
![](screenshots/green-check.png) R retrieves 10 observations and 3 columns. In its role as IDE, R has provided nicely formatted output that is similar to what it prints for a tibble, with descriptive information about the dataset and each column:
>
> # Source: lazy query [?? x 4]
> # Database: postgres
> # [postgres@localhost:5432/adventureworks]
> firstname lastname salesytd birthdate
R has not determined how many rows are left to retrieve as it shows with `[?? x 4]` and `... with more rows` in the data summary.
### Q %>% dplyr::as_tibble() {#lazy_q_as-tibble}
![](screenshots/green-check.png) ![](screenshots/green-check.png) In contrast to `print()`, the `as_tibble()` function causes R to download the whole table, using tibble's default of displaying only the first 10 rows.
```{r}
Q %>% dplyr::as_tibble()
```
### Q %>% head() {#lazy_q_head}
![](screenshots/green-check.png) The `head()` function is very similar to print but has a different "`max.print`" value.
```{r}
Q %>% head()
```
### Q %>% tail() {#lazy_q_tail}
![](screenshots/red-x.png) Produces an error, because `Q` does not hold all of the data, so it is not possible to list the last few items from the table:
```{r}
try(
Q %>% tail(),
silent = FALSE,
outFile = stdout()
)
```
### Q %>% length() {#lazy_q_length}
![](screenshots/red-x.png) Because the `Q` object is relatively complex, using `str()` on it prints many lines. You can glimpse what's going on with `length()`:
```{r}
Q %>% length()
```
### Q %>% str() {#lazy_q_str}
![](screenshots/red-x.png) Looking inside shows some of what's going on (three levels deep):
```{r}
Q %>% str(max.level = 3)
```
### Q %>% nrow() {#lazy_q_nrow}
![](screenshots/red-x.png) Notice the difference between `nrow()` and `tally()`. The `nrow` functions returns `NA` and does not execute a query:
```{r}
Q %>% nrow()
```
### Q %>% dplyr::tally() {#lazy_q_tally}
![](screenshots/green-check.png) The `tally` function actually counts all the rows.
```{r}
Q %>% dplyr::tally()
```
The `nrow()` function knows that `Q` is a list. On the other hand, the `tally()` function tells SQL to go count all the rows. Notice that `Q` results in 1,000 rows -- the same number of rows as `film`.
### Q %>% dplyr::collect(){#lazy_q_collect}
![](screenshots/green-check.png) The dplyr::[collect](https://dplyr.tidyverse.org/reference/compute.html) function triggers a call to the `DBI:dbFetch()` function behind the scenes, which forces R to download a specified number of rows:
```{r}
Q %>% dplyr::collect(n = 20)
Q %>% dplyr::collect(n = 20) %>% head()
```
The `dplyr::collect` function triggers the creation of a tibble and controls the number of rows that the DBMS sends to R. Notice that `head` only prints 6 of the 20 rows that R has retrieved.
If you do not provide a value for the `n` argument, _all_ of the rows will be retrieved into your R workspace.
### Q %>% ggplot {#lazy_q_plot-categories}
Passing the `Q` object to `ggplot` executes the query and plots the result.
```{r}
Q %>% ggplot2::ggplot(aes(birthdate, salesytd)) + geom_point()
```
* Rewrite previous query and this comment with adventureworks in mind.
Comment on the plot...
### Q %>% dplyr::show_query() {#lazy_q_show-query}
```{r}
Q %>% dplyr::show_query()
```
Hand-written SQL code to do the same job will probably look a lot nicer and could be more efficient, but functionally `dplyr` does the job.
## Disconnect from the database and stop Docker
```{r}
dbDisconnect(con)
# or if using the connections package, use:
# connection_close(con)
sp_docker_stop("adventureworks")
```
## Other resources
* Benjamin S. Baumer. 2017. A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data. [https://arxiv.org/abs/1708.07073](https://arxiv.org/abs/1708.07073)
* dplyr Reference documentation: Remote tables. [https://dplyr.tidyverse.org/reference/index.html#section-remote-tables](https://dplyr.tidyverse.org/reference/index.html#section-remote-tables)
* Data Carpentry. SQL Databases and R. [https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html](https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html)