-
Notifications
You must be signed in to change notification settings - Fork 3
/
article.Rmd
527 lines (374 loc) · 21 KB
/
article.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
---
title: "Elasticsearch and R"
output:
html_document:
toc: True
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
if(packageVersion("knitr") < 1.20){
stop("Please update the package 'knitr' before you run this script. You'll need at least 1.20 for this.")
}
if(packageVersion("uptasticsearch") < 0.3){
stop("Please update the package 'uptasticsearch' before you run this script. You'll need at least 0.3.0 for this.")
}
```
## Introduction
Many of us who work in the data science space struggle with the issues of storing data. How should we organize large datasets so that they are easily navigable, downloadable, and usable from our preferred analytics workspace? Elasticsearch is a great option for people with very large data that needs to be searchable fast, and has the advantage of being very accessible to start small with and combine with the analytics tool you prefer.
In this notebook, I'll walk you through the process of creating, populating, and using an Elasticsearch database using exclusively R code. My company, Uptake, created and maintains an open source package on CRAN called `uptasticsearch` that is designed to make querying data from Elasticsearch painless for a wide range of users. I'll also use a package called `elastic` here, which includes some other functionality useful for populating the database as well.
Before we begin, a note:
* The data being used in this tutorial is from [data.world](data.world), and comes out of the hard work done by Annie Millerbernd of the San Antonio Express-News. You can learn more about it and see the original dataset here: https://data.world/amillerbernd/ut-system-post-grad-earnings
### Prerequisites
* Experience with the specific topic: None required
* Professional experience: None required
* Familiarity with R language recommended
* Familiarity with some form of data storage recommended
* R packages required: `uptasticsearch`, `feather`, `DT`, and `elastic`
* Software required: `Docker` (https://www.docker.com/community-edition) and `RStudio`.
***
## Environment
As will be familiar to many R users, we need to begin by getting the packages that our project requires loaded. In this case, `uptasticsearch`, `feather`, and `elastic` are needed, and you can acquire these using `install.packages()`. I am also employing the package `DT` just to make our data samples render attractively in HTML.
It's also important to make sure you are in the correct working directory, so that our data and other materials can be easily found.
```{r}
library(uptasticsearch)
library(feather)
library(elastic)
library(DT)
getwd()
# If you aren't in the source folder, make sure you get there.
```
***
## Data
As I mentioned above, the data we are using comes from the San Antonio Express-News. This dataset examines the earnings of graduates of the University of Texas system according to several metrics and categorizations, including campus, degree level, and major.
Here we load the data into R, and take a quick look at it. You can see that we have cohort level summaries (not individual level data) but they are pretty granular. When earnings are not listed, the data is likely to have too few individuals. When only a small number (less than 30 in the case of this data) people are in a group, it is best practice to redact sensitive data like earnings information for the purposes of privacy. `Cellcount` tells you the number of individuals in the grouping, and -1 indicates no one is being shown for that group because it is less than 30 individuals.
```{r}
dataset <- feather::read_feather("supporting_materials/ut_grad_earn.feather")
DT::datatable(tail(dataset)
,options = list(
scrollX = TRUE,
searching = FALSE,
pageLength = 5))
names(dataset)
```
***
## Database Preparation
At this point, we are ready to actually build the Elasticsearch database that we will be pushing this data to. However, we need to pay attention to a few things first when we are setting the database up.
### Elasticsearch
Elasticsearch is a system of storing and retrieving data that is optimized for searching, rather than other methods of interaction. In Elasticsearch, what you might think of as a single database table will actually be an object called an index, which contains a self contained pool of data. An index is not tabular - don't think of this like a relational database or SQL! Elasticsearch is structured more like nested JSON documents. (For an R user, this might remind you of nested lists.) The JSON document contains your data and its traits, as you'll see a little later.
There are some excellent tutorials out there that will teach you about the theoretical concepts behind a database like this, and whole classes you could take if you are really interested in the deep dive, so for our purposes I won't go into a ton of that detail, but here are a few links to get you started.
#### For Further Reference
* [The Official Elasticsearch Reference](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/index.html) You might find this reference a bit dense, but for figuring out what your query is missing, or how to specify a very particular criteria, it's invaluable.
* [Elastic.co Practical Introduction](https://www.elastic.co/blog/a-practical-introduction-to-elasticsearch) This documentation gives you a bit more insight on how Elasticsearch works under the hood, without getting too in the weeds. It's particularly useful for interpreting responses from your database.
* [Video on the architecture](https://www.youtube.com/watch?v=YsYUgZu9-Y4) Understanding how distributed storage and "shards" work is helpful for understanding why you might want to use Elasticsearch instead of a relational database option.
Much of the real advantage of using Elasticsearch over some other kind of data storage is achieved when you start working with really "big" data - for this tutorial, and for most non-enterprise users, the amount of data we are using doesn't require the power that Elasticsearch has to offer. But that doesn't mean it's not a cool and useful way to store your data!
(Be advised: there are some other good resources out there, but many them are trying to sell you something, so be on your guard. There is no reason you should have to pay anybody to have your own Elasticsearch database.)
***
### Mapping
One thing I do want to spend a little more time discussing is mappings. This is honestly not all that different on the surface level from any other kind of database; you need to tell the database what sort of data to expect, what fields and types it will be receiving. To do this in Elasticsearch, you'll construct a JSON file. I have already built that for this tutorial, and packaged it with the rest of these documents.
I'm not going to display the entire mapping here, but just give you a few snippets to illustrate what it looks like.
```{r, eval=FALSE}
'{
"mappings": {
"_default_": {
"properties": {
"institution_id": {
"type": "integer"
},
"institution_name": {
"type": "string",
"fielddata": true
},
"deglevl_code": {
"type": "integer"
},
"deglevel": {
"type": "string",
"fielddata": true
},
"degcip_4dig": {
"type": "integer"
},
"ciptitle": {
"type": "string",
"fields": {
"raw" : {
"type": "string",
"index": "not_analyzed"
}
}
},'
```
For users with familiarity with database fields, this likely seems intuitive. The field is named, and given a type. However, for some, `fielddata` is shown- what's that? It is informing the database that this is a field it should allow querying on later. This is only really needed for string or character fields - if you don't plan to query on a field, you can leave this out, and the database will not have to parse all that data, saving time and processing energy.
Let's also take a look at that `ciptitle` field. We have a lot more specifications in place there! In short, these specifications allow us to use the major for searching by name.
***
## Starting Up
Now, with our mappings in order, we can get our database running! To make all this less complicated for you, I have set up the script that creates the local Elasticsearch database and applies the mapping we want. This needs to be run from the terminal, but in R we can use the `system()` call to do that right here in our notebook.
This will take a moment, so wait until the next code chunk is fully complete before you move on.
(The `5.5` at the end of the call is telling our database to use Elasticsearch version 5.5.)
```{r}
system("./supporting_materials/setup_texas.sh 5.5")
```
You probably noticed that the `system` call above didn't return anything - this is normal! Once the chunk is done you're all set. The database is running in the background in docker.
```{r}
# Find out what fields our index has available, and types
uptasticsearch::get_fields(es_host = "http://localhost:9200", es_indices = "utexas")
```
### Load Data
As you can see, that looks like the mappings and types we asked for. Notice `ciptitle.raw` - this is what is produced by that special mapping I applied to `ciptitle` to make it searchable in the way we want. But, this database is empty. The next job is to load our data in - we will format the dataframe with the `elastic` package and then pass it with a system call.
```{r}
elastic::docs_bulk_prep(dataset, index = "utexas", type = "data", path = "ut_data.json", chunk_size = 100000)
system("curl -X POST 'http://localhost:9200/utexas/_bulk' -H 'Content-Type: application/json' --data-binary @ut_data.json")
```
Notice how these preparation and upload commands are working with `ut_data.json`. What happened to our feather format? You'll remember from earlier how we discussed the fact that Elasticsearch is essentially a set of nested JSON files - in order to push data to an Elasticsearch database, it must be in JSON format.
Let's look at our database again and see if anything changed.
```{r}
uptasticsearch::get_fields(es_host = "http://localhost:9200", es_indices = "utexas")
```
Now we have the type `data` as well as the type `_default_`, which is what we want. Our `data` rows have the correct types that we assigned in the beginning in our mapping. Our data is ready to go!
***
## Querying
A database is only useful if we can get out our data and use it when we need it. With `uptasticsearch`, you have very robust and flexible tools for querying your Elasticsearch database.
To start, we will query all fields, unfiltered, and get 10 records. This is something we can do because this dataset is small- beware of doing this for a dataset with many, many fields!
(Note: records are known as "documents" in the Elasticsearch parlance- not "rows".)
```{r}
query_string <- '{"query": {"match_all":{}}}'
DT::datatable(
uptasticsearch::es_search(
es_host = "http://localhost:9200"
, es_index = "utexas"
, query_body = query_string
, max_hits = 10
, size = 10)
, options = list(
scrollX = TRUE,
searching = FALSE,
pageLength = 5))
```
Let's talk a little about the anatomy of the query and function calls above. You may be able to tell that we're using JSON styling for the query, with all the curly braces this implies. Querying Elasticsearch can be very complex, but it's very powerful. As we continue here, I'll give you some more useful examples of how to make different types of query, and there are lots of resources out there for writing more complex queries if you need them.
What are all those options in the `es_search` function call? That is a great question for the help docs!
```{r}
?uptasticsearch::es_search
```
When you run this command in your own console, you'll see an assortment of handy information.
![](supporting_materials/help_docs.png)
`Uptasticsearch` has excellent and robust documentation that will help you learn to use the tools - and if you have other questions, visit us on github and ask for help!
***
### Filtering and Criteria
Now we will begin to filter our data, using the `institution_id` field. What do the records look like from University of Texas Rio Grande Valley campus? We can see from above that the ID for that campus is 3599. We'll also limit the number of records, to be responsible.
```{r}
query_string <- '{"query": {"match": {"institution_id": "3599"}}}'
DT::datatable(
uptasticsearch::es_search(
es_host = "http://localhost:9200"
, es_index = "utexas"
, query_body = query_string
, max_hits = 10
, size = 10)
,options = list(
scrollX = TRUE,
searching = FALSE,
pageLength = 5))
```
This shows us just records of cohorts' post-graduate earnings if they graduated from the Rio Grande Valley campus.
### By Major
There are lots of ways you might want to filter your data. One is by using text fields. In this case, we're going to take a look at cohorts who majored in "Area Studies". Because of the way we set the field parameters for `ciptitle.raw` earlier in this tutorial, we have the ability to search for that text string below and it will return precise matches.
```{r}
# Look for a major - set up explanation of text field stuff
query_string <- '{"query": {"match": {"ciptitle.raw": "AREA STUDIES"}}}'
DT::datatable(uptasticsearch::es_search(
es_host = "http://localhost:9200"
, es_index = "utexas"
, query_body = query_string
, max_hits = 10
, size = 10)
,options = list(
scrollX = TRUE,
searching = FALSE,
pageLength = 5))
```
### Major and Population Size
Now that you're familiar with a few types of filter, we will put multiple filters together. Below, in the query syntax, you'll notice that we use two distinct `must` clauses. This is a tricky part of querying Elasticsearch that you will want to remember. It handles different kinds of search in special ways. You can't string together many different kinds of filters in one big `must` - for example, a range and a match. You have to break them apart. But, you CAN use more than one match type of search in the same `must` - you'll see that below.
```{r}
query_string <- '{
"query": {
"bool" : {
"must" : {
"match": { "ciptitle.raw": "AREA STUDIES" }
},
"must" : {
"range" : {
"cellcount" : { "gte" : 0 }
}
}
}
}
}'
DT::datatable(uptasticsearch::es_search(
es_host = "http://localhost:9200"
, es_index = "utexas"
, query_body = query_string
, max_hits = 10
, size = 10)
,options = list(
scrollX = TRUE,
searching = FALSE,
pageLength = 5))
```
### Major, Population Size, and Campus
Now we'll add complexity and make this query three criteria. We are looking for just cohorts who majored in "Communications and Media Studies" at the Rio Grande Valley campus, and omitting records where the values will be NA due to the small number of students. Notice we have two `must` clauses - one for our matches, and one for the range search. If you find it confusing to put more than one match in the same `must`, you can also just choose to make a separate `must` clause for every search term and it will also work fine.
(Remember, to do a clause with multiple matches, we have to enclose all our match arguments in brackets.)
```{r}
query_string <- '{
"query": {
"bool" : {
"must" : [
{"match": { "ciptitle.raw": "COMMUNICATION AND MEDIA STUDIES" }},
{"match": { "institution_id": "3599" }}
],
"must" : {
"range" : {
"cellcount" : { "gte" : 0 }
}
}
}
}
}'
DT::datatable(uptasticsearch::es_search(
es_host = "http://localhost:9200"
, es_index = "utexas"
, query_body = query_string
, max_hits = 10
, size = 10)
,options = list(
scrollX = TRUE,
searching = FALSE,
pageLength = 5))
```
## Summarizing with Queries
Finally, let's talk a little about the summary capabilities of Elasticsearch. Users of SQL will be familiar with this concept - using the query space to not only extract records, but to group them and then give back details about the groups. Elasticsearch can do this too! The query below will look for our cohorts of graduates from the Rio Grande Valley campus again, and summarize them according to their major.
```{r}
# Summarize with queries- finish explaining the .raw stuff
query_string <- '{
"query":
{"match": {"institution_id": "3599" }}
, "aggs" : {
"common_majors" : {
"terms" : {
"field" : "ciptitle.raw"
}
}
}
}'
DT::datatable(uptasticsearch::es_search(
es_host = "http://localhost:9200"
, es_index = "utexas"
, query_body = query_string)
, options = list(
scrollX = TRUE,
searching = FALSE,
pageLength = 5))
```
### Summarize Earnings by Major with Filters
But you might encounter a case where you don't want to just count the cohorts, but want to measure other attributes of the group. Let's do this with an average of the median earnings for those graduates, again by major.
```{r}
# Summarize by the earnings amount per major
query_string <- '{
"query": { "match": { "institution_id": "3599" } },
"aggs": {
"group_by_major": {
"terms": {
"field": "ciptitle.raw"
},
"aggs": {
"average_50th_pctile": {
"avg": {
"field": "p50_earnings"
}
}
}
}
}}'
DT::datatable(uptasticsearch::es_search(
es_host = "http://localhost:9200"
, es_index = "utexas"
, query_body = query_string)
,options = list(
scrollX = TRUE,
searching = FALSE,
pageLength = 5))
```
### Omit Redacted Groups
Now, we'll combine this with what we did on filtering above. We can filter by the campus again, and we can also now add the filter that omits groups where the sample is so small they are redacted anyway. You'll notice a difference in the summary values! Always watch out for those NA records that might be hiding in your dataset.
```{r}
query_string <- '{
"query": { "bool" : {
"must" : {
"match": { "institution_id": "3599" }
},
"must" : {
"range" : {
"cellcount" : { "gte" : 0 }
}
}
}},
"aggs": {
"group_by_major": {
"terms": {
"field": "ciptitle.raw"
},
"aggs": {
"average_50th_pctile": {
"avg": {
"field": "p50_earnings"
}
}
}
}
}}'
DT::datatable(uptasticsearch::es_search(
es_host = "http://localhost:9200"
, es_index = "utexas"
, query_body = query_string)
,options = list(
scrollX = TRUE,
searching = FALSE,
pageLength = 5))
```
### Sort the Results
The last functionality we're going to look at is the ability to sort results of your queries. This works with summary data or with individual records being returned. Here we are sorting by the summary value, so we can see which majors had the highest median earnings for their graduates.
```{r}
query_string <- '{"query":
{ "match": { "institution_id": "3599" } }
, "aggs": {
"group_by_major": {
"terms": {
"field": "ciptitle.raw",
"order": {"average_50th_pctile": "desc"}
},
"aggs": {
"average_50th_pctile": {
"avg": {
"field": "p50_earnings"
}
}
}
}
}
}'
DT::datatable(uptasticsearch::es_search(
es_host = "http://localhost:9200"
, es_index = "utexas"
, query_body = query_string)
,options = list(
scrollX = TRUE,
searching = FALSE,
pageLength = 5))
```
***
## Cleaning Up
Now we can shut down our database, remove temporary files, and clean up our workspace. I have provided a script that takes care of this for you, but you can open it up and see how it works if you like. Remember, the database will cease to exist when you run this script, so be careful!
```{r}
system("./supporting_materials/cleanup_local.sh")
```
Thanks for taking the time to work through this tutorial. I hope that you will find it helpful when you use your own Elasticsearch databases with R in the future!
To learn more about Uptake, please visit us at https://www.uptake.com/. You can also find me at www.stephaniekirmer.com, [email protected], or @data_stephanie on Twitter.