Recreate the "Get Started" vignette #1364

Draft
wants to merge 26 commits into
base: main
13 changes: 13 additions & 0 deletions inst/extdata/deaths.csv
@@ -0,0 +1,13 @@
Date file created:,02/12/2022,,,,
File created by:,Sally,Brown,,,
Name,Profession,Age,Has kids,Date of birth,Date of death
David Bowie,musician,69,TRUE,1/8/1947,1/10/2016
Carrie Fisher,actor,60,TRUE,10/21/1956,12/27/2016
Chuck Berry,musician,90,TRUE,10/18/1926,3/18/2017
Bill Paxton,actor,61,TRUE,5/17/1955,2/25/2017
Prince,musician,57,TRUE,6/7/1958,4/21/2016
Alan Rickman,actor,69,FALSE,2/21/1946,1/14/2016
Florence Henderson,actor,82,TRUE,2/14/1934,11/24/2016
Harper Lee,author,89,FALSE,4/28/1926,2/19/2016
Zsa Zsa Gábor,actor,99,TRUE,2/6/1917,12/18/2016
George Michael,musician,53,FALSE,6/25/1963,12/25/2016
4 changes: 4 additions & 0 deletions inst/extdata/funny_quotes.csv
@@ -0,0 +1,4 @@
Size, Cost
'Small', $5.00
'Medium, Large, Extra Large',$10.00

7 changes: 7 additions & 0 deletions inst/extdata/interpret-nas.csv
@@ -0,0 +1,7 @@
x,y
1,45
2,-
3,66
4,30
5,-

6 changes: 6 additions & 0 deletions inst/extdata/mini_gap_Europe.csv
@@ -0,0 +1,6 @@
country,continent,year,lifeExp,pop,gdpPercap
Albania,Europe,1952,55.23,1282697,1601.056136
Austria,Europe,1952,66.8,6927772,6137.076492
Belgium,Europe,1952,68,8730405,8343.105127
Bosnia and Herzegovina,Europe,1952,53.82,2791000,973.5331948
Bulgaria,Europe,1952,59.6,7274900,2444.286648
6 changes: 6 additions & 0 deletions inst/extdata/mini_gap_Oceania.csv
@@ -0,0 +1,6 @@
country,continent,year,lifeExp,pop,gdpPercap
Australia,Oceania,1952,69.12,8691212,10039.59564
New Zealand,Oceania,1952,69.39,1994794,10556.57566
Australia,Oceania,1957,70.33,9712569,10949.64959
New Zealand,Oceania,1957,70.26,2229407,12247.39532
Australia,Oceania,1962,70.93,10794968,12217.22686
25 changes: 25 additions & 0 deletions inst/extdata/norway.csv
@@ -0,0 +1,25 @@
date,value
1 januar 2022,"1,23"
15 januar 2022,"3,21"
30 januar 2022,"4,55"
1 februar 2022,"10,67"
15 februar 2022,"333,23"
28 februar 2022,"343,33"
1 mars 2022,"11,32"
15 mars 2022,"3,98"
30 mars 2022,"44,12"
1 april 2022,"1,22"
15 april 2022,"9,04"
30 April 2022,"45,05"
1 mai 2022,"788,02"
15 mai 2022,"65,21"
30 mai 2022,"67,55"
1 juni 2022,"127.897,23"
15 juni 2022,"322.222,21"
30 juni 2022,"240,55"
1 juli 2022,"170,23"
15 juli 2022,"3,09"
30 juli 2022,"42,06"
1 august 2022,"89,43"
15 august 2022,"7,21"
30 august 2022,"1,55"
8 changes: 8 additions & 0 deletions inst/extdata/problem_numbers.csv
@@ -0,0 +1,8 @@
value
1
2
3
4
5.00
6
7
138 changes: 138 additions & 0 deletions vignettes/column-types.Rmd
@@ -14,6 +14,143 @@ knitr::opts_chunk$set(
)
```

The key problem that readr solves is __parsing__ a flat file into a tibble.
Parsing is the process of taking a text file and turning it into a rectangular tibble where each column has the appropriate type.
Parsing takes place in three basic stages:

1. The flat file is parsed into a rectangular matrix of
strings.

1. The type of each column is determined.

1. Each column of strings is parsed into a vector of a
more specific type.
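
As a rough sketch of these three stages, the snippet below performs each one by hand. The inline data and the variable names are made up for illustration, and passing literal data via `I()` assumes readr 2.0 or later. In normal use readr does all of this internally in a single `read_csv()` call.

```r
library(readr)

# Hypothetical inline data; I() marks the string as literal data, not a path.
csv_text <- I("x,y\n1,45\n2,66\n")

# Stage 1: read the file into a rectangle of strings by forcing
# every column to be character.
strings <- read_csv(csv_text, col_types = cols(.default = col_character()))

# Stage 2: determine the type of each column from its strings.
guesses <- vapply(strings, guess_parser, character(1))
guesses

# Stage 3: parse each column of strings into a more specific type.
typed <- type_convert(strings)
typed
```

Here `strings` holds only character columns, while `typed` holds the same values converted to the guessed types.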

## Vector parsers

It's easiest to learn the vector parsers using `parse_` functions.
These all take a character vector and some options.
They return a new vector the same length as the old, along with an attribute describing any problems.

First, let's load the readr package:

```{r}
library(readr)
```

### Atomic vectors

`parse_logical()`, `parse_integer()`, `parse_double()`, and `parse_character()` are straightforward parsers that produce the corresponding atomic vector.

```{r}
parse_integer(c("1", "2", "3"))
parse_double(c("1.56", "2.34", "3.56"))
parse_logical(c("true", "false"))
```
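
When a value can't be parsed, the result contains `NA` at that position and the details are recorded in the problems attribute. A minimal sketch of inspecting it with `problems()`, using a made-up input vector:

```r
library(readr)

# "abc" is not an integer, so readr warns about a parsing failure
# and returns NA in that position.
x <- parse_integer(c("1", "abc", "3"))
x

# problems() returns a tibble with one row per parsing failure.
problems(x)
```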

By default, readr expects `.` as the decimal mark and `,` as the grouping mark.
You can override this default using `locale()`, as described in `vignette("locales")`.
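
For instance, a locale that swaps the two marks might be used as sketched below. The object name `eu` and the input strings are illustrative only.

```r
library(readr)

# Assumption: a European-style convention where "," is the decimal mark
# and "." is the grouping mark.
eu <- locale(decimal_mark = ",", grouping_mark = ".")

# A strict parser only needs to know the decimal mark.
eu_double <- parse_double("1,23", locale = eu)

# parse_number() additionally drops the grouping mark.
eu_number <- parse_number("127.897,23", locale = eu)

eu_double
eu_number
```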

### Flexible numeric parser

`parse_integer()` and `parse_double()` are strict: the input string must be a single number with no leading or trailing characters.
`parse_number()` is more flexible: it ignores non-numeric prefixes and suffixes, and knows how to deal with grouping marks.
This makes it suitable for reading currencies and percentages:

```{r}
parse_number(c("0%", "10%", "150%"))
parse_number(c("$1,234.5", "$12.45"))
```

### Date/times

readr supports three types of date/time data:

* dates: number of days since 1970-01-01.
* times: number of seconds since midnight.
* datetimes: number of seconds since midnight 1970-01-01.

```{r}
parse_datetime("2010-10-01 21:45")
parse_date("2010-10-01")
parse_time("1:00pm")
```

Each function takes a `format` argument which describes the format of the string.
If not specified, it uses a default value:

* `parse_datetime()` recognises
[ISO8601](https://en.wikipedia.org/wiki/ISO_8601)
datetimes.

* `parse_date()` uses the `date_format` specified by
the `locale()`. The default value is `%AD` which uses
an automatic date parser that recognises dates of the
format `Y-m-d` or `Y/m/d`.

* `parse_time()` uses the `time_format` specified by
the `locale()`. The default value is `%AT` which uses
an automatic time parser that recognises times of the
form `H:M` optionally followed by seconds and am/pm.

In most cases, you will need to supply a `format`, as documented in `parse_datetime()`:

```{r}
parse_datetime("1 January, 2010", "%d %B, %Y")
parse_datetime("02/02/15", "%m/%d/%y")
```
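
An explicit `format` matters most when a string is ambiguous. As a small illustration with a made-up input, the same text yields three different dates depending on the format supplied:

```r
library(readr)

ambiguous <- "01/02/15"

mdy <- parse_date(ambiguous, "%m/%d/%y") # month/day/year: 2 January 2015
dmy <- parse_date(ambiguous, "%d/%m/%y") # day/month/year: 1 February 2015
ymd <- parse_date(ambiguous, "%y/%m/%d") # year/month/day: 15 February 2001

mdy
dmy
ymd
```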

### Factors

When reading a column that has a known set of values, you can read directly into a factor.
`parse_factor()` will generate a warning if a value is not in the supplied levels.

```{r}
parse_factor(c("a", "b", "a"), levels = c("a", "b", "c"))
parse_factor(c("a", "b", "d"), levels = c("a", "b", "c"))
```


## Column specification

readr uses some heuristics to guess the type of each column. You can access these heuristics directly with `guess_parser()`:

```{r}
guess_parser(c("a", "b", "c"))
guess_parser(c("1", "2", "3"))
guess_parser(c("1,000", "2,000", "3,000"))
guess_parser(c("2001/10/10"))
```

The guessing policies are described in the documentation for the individual functions.
Guesses are fairly strict.
For example, we don't guess that currencies are numbers, even though we can parse them:

```{r}
guess_parser("$1,234")
parse_number("$1,234")
```

There are two parsers that will never be guessed: `col_skip()` and `col_factor()`.
You will always need to supply these explicitly.
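
A minimal sketch of supplying both explicitly through `col_types`. The inline data is a simplified, hypothetical excerpt, and passing literal data via `I()` assumes readr 2.0 or later:

```r
library(readr)

# Hypothetical inline data; I() marks the string as literal data, not a path.
csv_text <- I("name,profession,age\nDavid Bowie,musician,69\nCarrie Fisher,actor,60\n")

people <- read_csv(
  csv_text,
  col_types = cols(
    name = col_character(),
    # Never guessed: the levels must be supplied (or inferred) on request.
    profession = col_factor(levels = c("actor", "musician")),
    # Never guessed: drops the column from the result entirely.
    age = col_skip()
  )
)

people
```

The result keeps `name` and `profession` only, with `profession` stored as a factor.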

You can see the specification that readr would generate for a given file by using `spec_csv()`, `spec_tsv()` and so on:

```{r}
x <- spec_csv(readr_example("challenge.csv"))
```

For bigger files, you can often make the specification simpler by changing the default column type using `cols_condense()`:

```{r}
mtcars_spec <- spec_csv(readr_example("mtcars.csv"))
mtcars_spec

cols_condense(mtcars_spec)
```

## Column type guessing

readr will guess column types from the data if the user does not specify the types.
The `guess_max` parameter controls how many rows of the input file are used to form these guesses.
Ideally, the column types would be completely obvious from the first non-header row and we could use `guess_max = 1`.
@@ -162,3 +299,4 @@ Clean up the temporary tricky csv file.
```{r}
file.remove(tfile)
```
