# Introduction to DBMS queries {#chapter_dbms-queries-intro}
> This chapter demonstrates how to:
>
> * Download all or part of a table from the DBMS, including different kinds of subsets
> * See how `dplyr` code is translated into `SQL` commands and how they can be mixed
> * Get acquainted with some useful functions and packages for investigating a single table
> * Begin thinking about how to divide the work between your local R session and the DBMS
## Setup
The following packages are used in this chapter:
```{r package list, echo=TRUE, message=FALSE, warning=FALSE}
library(tidyverse)
library(DBI)
library(RPostgres)
library(dbplyr)
require(knitr)
library(bookdown)
library(sqlpetr)
library(skimr)
library(connections)
sleep_default <- 3
```
Assume that the Docker container with PostgreSQL and the `adventureworks` database is ready to go. If not, go back to [Chapter 6](#chapter_setup-adventureworks-db).
```{r check on adventureworks}
sqlpetr::sp_docker_start("adventureworks")
Sys.sleep(sleep_default)
```
Connect to the database:
```{r connect to postgresql}
# con <- connection_open( # use in an interactive session
con <- dbConnect( # use in other settings
RPostgres::Postgres(),
# without the previous and next lines, some functions fail with bigint data
# so change int64 to integer
bigint = "integer",
host = "localhost",
user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
dbname = "adventureworks",
port = 5432
)
```
## Methods for downloading a single table
For the moment, assume you know something about the database and specifically what table you need to retrieve. We return to the topic of investigating the whole database later on.
```{r}
dbExecute(con, "set search_path to sales, humanresources;")
```
### Read the entire table
There are several methods for getting data out of a DBMS, and we'll explore ways of controlling each of them.
`DBI::dbReadTable` will download an entire table into an R [tibble](https://tibble.tidyverse.org/).
```{r}
salesorderheader_tibble <- DBI::dbReadTable(con, "salesorderheader")
str(salesorderheader_tibble)
```
That's very simple, but if the table is very large it may be a problem, since R keeps the entire table in memory. The tables found in an enterprise database such as `adventureworks` can be large, but they are most often records kept by people. That somewhat limits their size (relative to data generated by machines) but expands the possibilities for human error.
Note that the first line of the `str()` output reports the total number of observations.
Later on we'll use this tibble to demonstrate several packages and functions, but for simplicity we use only the first 13 columns.
```{r}
salesorderheader_tibble <- salesorderheader_tibble[,1:13]
```
### Create a pointer to a table that can be reused
The `dplyr::tbl` function gives us more control over access to a table by enabling control over which columns and rows to download. It creates an object that might **look** like a data frame, but it's actually a list object that `dplyr` uses for constructing queries and retrieving data from the DBMS.
```{r}
salesorderheader_table <- dplyr::tbl(con, "salesorderheader")
class(salesorderheader_table)
```
### Controlling the number of rows returned with `collect()`
The `collect` function triggers the creation of a tibble and controls the number of rows that the DBMS sends to R. For more complex queries, the `dplyr::collect()` function provides a mechanism to indicate what's processed on the DBMS server and what's processed by R on the local machine. The chapter on [Lazy Evaluation and Execution Environment](#chapter_lazy-evaluation-and-timing) discusses this issue in detail.
```{r}
salesorderheader_table %>% dplyr::collect(n = 3) %>% dim()
salesorderheader_table %>% dplyr::collect(n = 500) %>% dim()
```
### Retrieving random rows from the DBMS
When the DBMS contains many rows, a sample of the data may be plenty for your purposes. Although `dplyr` has nice functions to sample a data frame that's already in R (e.g., the `sample_n` and `sample_frac` functions), to get a sample from the DBMS we have to use `dbGetQuery` to send native SQL to the database. To peek ahead, here is one example of a query that retrieves 20 rows from a 1% sample:
```{r}
one_percent_sample <- DBI::dbGetQuery(
con,
"SELECT orderdate, subtotal, taxamt, freight, totaldue
  FROM salesorderheader TABLESAMPLE BERNOULLI(1) LIMIT 20;
"
)
one_percent_sample
```
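`TABLESAMPLE BERNOULLI` scans the whole table and keeps each row with the given probability. PostgreSQL (9.5 and later) also offers the `SYSTEM` sampling method, which selects whole storage pages rather than individual rows: it is much faster on big tables, but less uniformly random, because rows that share a page are sampled together. A sketch of the same query using `SYSTEM`:

```{r}
# SYSTEM sampling picks whole pages rather than individual rows:
# faster on big tables, but rows from the same page come back together
system_sample <- DBI::dbGetQuery(
  con,
  "SELECT orderdate, subtotal, taxamt, freight, totaldue
     FROM salesorderheader TABLESAMPLE SYSTEM(1) LIMIT 20;"
)
system_sample
```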
**Exact sample of 10 records**
This technique depends on knowing the range of a record index, such as `salesorderid` in the `salesorderheader` table of our `adventureworks` database.
Start by finding the min and max values.
```{r}
DBI::dbListFields(con, "salesorderheader")
salesorderheader_df <- DBI::dbReadTable(con, "salesorderheader")
(max_id <- max(salesorderheader_df$salesorderid))
(min_id <- min(salesorderheader_df$salesorderid))
```
Set the random number seed and draw the sample.
```{r}
set.seed(123)
# sample from the full range of ids; ids missing from the table simply won't match
sample_rows <- sample(min_id:max_id, 10)
salesorderheader_table <- dplyr::tbl(con, "salesorderheader")
```
Run query with the filter verb listing the randomly sampled rows to be retrieved:
```{r}
salesorderheader_sample <- salesorderheader_table %>%
dplyr::filter(salesorderid %in% sample_rows) %>%
dplyr::collect()
str(salesorderheader_sample)
```
### Sub-setting variables
A table in the DBMS may not only have many more rows than you want, but also many more columns. The `select` command controls which columns are retrieved.
```{r}
salesorderheader_table %>% dplyr::select(orderdate, subtotal, taxamt, freight, totaldue) %>%
head()
```
That's exactly equivalent to submitting the following SQL commands directly:
```{r}
DBI::dbGetQuery(
con,
'SELECT "orderdate", "subtotal", "taxamt", "freight", "totaldue"
FROM "salesorderheader"
LIMIT 6')
```
We won't discuss `dplyr` methods for sub-setting variables, deriving new ones, or sub-setting rows based on the values found in the table, because they are covered well in other places, including:
* Comprehensive reference: [https://dplyr.tidyverse.org/](https://dplyr.tidyverse.org/)
* Good tutorial: [https://suzan.rbind.io/tags/dplyr/](https://suzan.rbind.io/tags/dplyr/)
In practice we find that **renaming variables** is often quite important, because the names in an SQL database might not meet your needs as an analyst. In the wild, you will find names that are ambiguous or overly specific, that contain spaces, and that have other problems making them difficult to use in R. It is good practice to do whatever renaming you are going to do in a predictable place, like the top of your code. The names in the `adventureworks` database are simple and clear, but if they were not, you might rename them for subsequent use in this way:
```{r}
tbl(con, "salesorderheader") %>%
dplyr::rename(order_date = orderdate, sub_total_amount = subtotal,
tax_amount = taxamt, freight_amount = freight, total_due_amount = totaldue) %>%
dplyr::select(order_date, sub_total_amount, tax_amount, freight_amount, total_due_amount ) %>%
show_query()
```
That's equivalent to the following SQL code:
```{r}
DBI::dbGetQuery(
con,
'SELECT "orderdate" AS "order_date",
"subtotal" AS "sub_total_amount",
"taxamt" AS "tax_amount",
"freight" AS "freight_amount",
"totaldue" AS "total_due_amount"
FROM "salesorderheader"' ) %>%
head()
```
The one difference is that the `SQL` code returns a regular data frame while the `dplyr` code returns a `tibble`. Notice that the seconds are grayed out in the `tibble` display.
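You can verify that difference by checking the class of each result. A minimal sketch, reusing the connection and the `salesorderheader_table` object defined above:

```{r}
# dbGetQuery() returns a plain data frame
df_result <- DBI::dbGetQuery(
  con,
  'SELECT "orderdate", "totaldue" FROM "salesorderheader" LIMIT 3'
)

# collect() on a tbl object returns a tibble
tbl_result <- salesorderheader_table %>%
  dplyr::select(orderdate, totaldue) %>%
  head(3) %>%
  dplyr::collect()

class(df_result)
class(tbl_result)
```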
## Translating `dplyr` code to `SQL` queries
Where did the translations we've shown above come from? The `show_query` function shows how `dplyr` is translating your query to the dialect of the target DBMS.
> The `show_query()` function shows you what dplyr is sending to the DBMS. It can be handy for inspecting what dplyr is doing, or for showing your code to someone who is more SQL- than R-literate. We have used the function extensively while writing this book, but in the final product we use it only when something in the SQL or the translation process needs to be explained.
```{r}
salesorderheader_table %>%
dplyr::tally() %>%
dplyr::show_query()
```
Here is an extensive discussion of how `dplyr` code is translated into SQL:
* [https://dbplyr.tidyverse.org/articles/sql-translation.html](https://dbplyr.tidyverse.org/articles/sql-translation.html)
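You can also experiment with the translation rules without a live connection: `dbplyr::translate_sql()` translates a single R expression, and `dbplyr::simulate_postgres()` supplies the PostgreSQL dialect. A minimal sketch:

```{r}
# translate individual R expressions into PostgreSQL SQL -- no connection needed
dbplyr::translate_sql(mean(subtotal, na.rm = TRUE), con = dbplyr::simulate_postgres())
dbplyr::translate_sql(substr(orderdate, 1, 4), con = dbplyr::simulate_postgres())
```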
If you prefer to use SQL directly, rather than `dplyr`, you can submit SQL code to the DBMS through the `DBI::dbGetQuery` function:
```{r}
DBI::dbGetQuery(
con,
'SELECT COUNT(*) AS "n"
FROM "salesorderheader" '
)
```
When you create a report to run repeatedly, you might want to put that query into R Markdown. There you can execute the SQL directly in a chunk with the following header:
`{sql, connection=con, output.var = "query_results"}`
```{sql, connection=con, output.var = "query_results"}
SELECT COUNT(*) AS "n"
FROM "salesorderheader";
```
R markdown stores that query result in a tibble which can be printed by referring to it:
```{r}
query_results
```
## Mixing dplyr and SQL
When `dplyr` finds code that it does not know how to translate into SQL, it simply passes it along to the DBMS. Therefore you can interleave native functions that your DBMS understands in the middle of `dplyr` code. Consider this example, derived from [@Ruiz2019]:
```{r}
salesorderheader_table %>%
dplyr::select_at(vars(subtotal, contains("date"))) %>%
dplyr::mutate(today = now()) %>%
dplyr::show_query()
```
That is native to PostgreSQL, not [ANSI standard](https://en.wikipedia.org/wiki/SQL#Interoperability_and_standardization) SQL.
Verify that it works:
```{r}
salesorderheader_table %>%
dplyr::select_at(vars(subtotal, contains("date"))) %>%
head() %>%
dplyr::mutate(today = now()) %>%
dplyr::collect()
```
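The same pass-through mechanism works for any function your DBMS understands. For example, `date_part` is a PostgreSQL function with no matching R translation, so `dplyr` sends it to the server verbatim. A sketch (in PostgreSQL, `date_part('dow', ...)` extracts the day of the week):

```{r}
# date_part() is unknown to dplyr, so it is passed through to PostgreSQL as-is
salesorderheader_table %>%
  dplyr::select(orderdate, totaldue) %>%
  dplyr::mutate(order_dow = date_part("dow", orderdate)) %>%
  head() %>%
  dplyr::collect()
```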
## Examining a single table with R
Dealing with a large, complex database highlights the utility of specific tools in R. We include brief examples that we find to be handy:
+ Base R structure: `str`
+ Printing out some of the data: `datatable`, `kable`, and `View`
+ Summary statistics: `summary`
+ `glimpse` in the `tibble` package, which is included in the `tidyverse`
+ `skim` in the `skimr` package
### `str` - a base package workhorse
`str` is a workhorse function that lists each variable, its type, and a sample of its first few values.
```{r}
str(salesorderheader_tibble)
```
### Always **look** at your data with `head`, `View`, or `kable`
There is no substitute for looking at your data, and R provides several ways to browse it. The `head` function controls the number of rows displayed. Note that `tail` does not work on a database object. In everyday practice you would look at more than the default 6 rows; here we wrap `head` around the data frame:
```{r}
sqlpetr::sp_print_df(head(salesorderheader_tibble))
```
### The `summary` function in `base`
The `base` package's `summary` function provides basic statistics that serve a real diagnostic purpose at this stage. For example, the output below shows the range of `salesorderid`, which lets you check whether the identifier runs sequentially and whether any values are missing from the sequence, and it reports the number of NA's in each column — often the first clue about missing or unusual data.
```{r}
summary(salesorderheader_tibble)
```
So the `summary` function is surprisingly useful as we first start to look at the table contents.
### The `glimpse` function in the `tibble` package
The `tibble` package's `glimpse` function is a more compact version of `str`:
```{r}
tibble::glimpse(salesorderheader_tibble)
```
### The `skim` function in the `skimr` package
The `skimr` package has several functions that make it easy to examine an unknown data frame and assess what it contains. It is also extensible.
```{r}
skimr::skim(salesorderheader_tibble)
skimr::skim_to_wide(salesorderheader_tibble) # deprecated in skimr >= 2.0; use skim() instead
```
## Disconnect from the database and stop Docker
```{r}
dbDisconnect(con)
# or if using the connections package, use:
# connection_close(con)
sp_docker_stop("adventureworks")
```
## Additional reading
* [@Wickham2018]
* [@Baumer2018]