-
Notifications
You must be signed in to change notification settings - Fork 7
/
README.Rmd
executable file
·468 lines (322 loc) · 20.8 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
---
title: Scimeetr
output:
github_document:
toc: true
toc_depth: 4
---
# Install
`scimeetr` can be installed directly from the R console using the following lines :
>if (!require("devtools")) install.packages("devtools")
>devtools::install_github("MaximeRivest/scimeetr")
#Introduction
Scimeetr helps explore the scholarly literature. It contains a suit of function that let someone:
- load bibliometric data into R
- make a map of peer reviewed papers by creating various networks
- find research community
- characterise the research communities
- generate reading list
This tutorial is composed of two self-contained section. The first section show case the whole process with all the default parameters. The second section describes each function in more detail by presenting the rational for the function, the algorithms used and the options.
<a href="#top">Back to top</a>
#From data to reading list
You can automatically generate a reading list of seminal papers in a research litterature by using only those three functions: `ìmport_wos_files`, `scimap`, and `scilist`. This first section describes this process in more details.
<a href="#top">Back to top</a>
##loading and exploring bibliometric data
The first step in exploring the literature is to retrieve bibliometric data from the *Web of Science* or *Scopus*. In this first tutorial I use a dataset from the *Web of Science* about ecological networks.
```{r, include=FALSE}
library(scimeetr)
```
```{r, eval=FALSE}
library(scimeetr)
scimeetr_list <- import_wos_files("path/to/folder/")
```
Then,`summary` can be used to get a quick characterisation of the data.
```{r}
summary(scimeetr_list)
```
From this summary, we see that there is 396 papers in my data set which overal cites 16567 different elements. On average, each paper cites 53 elements.
Than we learn that, in this research community, 25% of the papers are cited less than 2 times, 50% are cited less than 9 times and 75% are cited less than ~23 times. There are papers that are cited up to 1333 times. The average citation per paper is ~25. This is much higher than the median (9), thus most paper are cited only a few times and a few papers are profusely cited. When correcting for the age of the paper, we learn that papers are cited 2 times per year on average.
By looking at the most frequent keyword and journals, we learn that this community of research is about biodiversity, agriculture, ecosystem services and policy. Keyword and journal frequency tables efficiently reveal the theme of a scientific community.
<a href="#top">Back to top</a>
##Mapping scientific community
The previous characterisation is great, but it is limited if your dataset contains many different scientific communities. By detecting the scientific communities present within a dataset a map of science can be drawn and each cluster can be characterised on its own. The function `scimap` can be used for this task.
```{r, error=FALSE}
scimap_result <- scimap(scimeetr_list)
```
The function returns all the data that scimeetr_list contained and more. For example communities have been identified and now if the function `summary` is used on scim_result. In addition of the previous information. The descriminant keywords of each communities constituating the main community are listed.
```{r}
summary(scimap_result)
```
Except for the last tables, all of the output is identical to the `summary` output above. Those last tables now reveals that the papers in our database can be clustered in two communities. One that is about x and the other that is about y.
The function `plot` can be used on the output of the function summary for a graphical representation of the sub-communities.
```{r, fig.height = 7, fig.width=7}
plot(summary(scimap_result, com_size = 30))
```
<a href="#top">Back to top</a>
##Automatically generating a reading list of seminal papers
Now that we have characterise the main community and seen of which community it is constituted, we can decide if it is the community that we wish to join / review. If it is, we use the function `scilist` to get reading lists. The defaul readin list will find the seminal papers of each communitiy.
```{r, eval=FALSE}
reading_list <- scilist(scimap_result)
reading_list$com1
```
```{r, echo = FALSE}
reading_list <- scilist(scimap_result)
knitr::kable(reading_list$com1)
```
<a href="#top">Back to top</a>
# In depth description of each steps
##How to get bibliometric data?
Biliometric data can be obtained from either *Scopus* or the *Web of Science*. Most university library have access to either one and some have access to both.
<a href="#top">Back to top</a>
### Retrieving data from Scopus
![](./vignettes/scopus.png)
*Scopus home page.*
![](./vignettes/scopus1.png)
*Select all and export*
![](./vignettes/scopus2.png)
*Export as CSV file and select all fields for exportation*
Following the previous steps will get you one or several .csv files. Then, to import this/these file(s) in `R`, you need to put it/them in a **new folder which contains only the files to import into `R`**
<a href="#top">Back to top</a>
### Retrieving data from Web of Science
![Web of Science home page. Make sure that Select a database corresponds to Web of Science Core Collection](./vignettes/wos.png)
*Web of Science home page. Make sure that Select a database corresponds to Web of Science Core Collection*
![Save to Other Files Formats](./vignettes/wos1.png)
*Save to Other Files Formats*
![You can download only 500 items at a time. You should select Full Record and Cited References. And select the Tab-delimeted (UTF-8) as file format.](./vignettes/wos2.png)
*You can download only 500 items at a time. You should select Full Record and Cited References. And select the Tab-delimeted (UTF-8) as file format.*
Following the previous steps will get you one or several .txt files. Then, to import this/these file(s) in `R`, you need to put it/them in a **new folder which contains only the files to import into `R`**
<a href="#top">Back to top</a>
## How to upload bibliometric data into R
The bibliometric data obtained from Scopus or Web of science are either in .csv or .txt format. These are standard file formats and you most likely know them. There are built in function in `R` that let you import .csv and .txt files. So why does scimeetr provide you with `import_scopus_files` and `import_wos_files`? There are four reasons. The main one is that bibliometric data contains author names from around the world, which means that all alphabets are used and this leads to problems with file encoding. Scimeetr's import functions solves that problem. Second, Scopus do not provide standard, uniform and consisten cited reference list. Thus, `import_scopus_files` has to standardize it at import. This explains the additional time required to load scopus files versus wos files. Third, Scopus and Web of Science do not use the same column names so they have to be homogenized at import. Finally, the data can be transformed into a scimeetr object so that `summary`, `plot` and `print` will know what to do with it.
```{r, eval=FALSE}
scimeetr_list <- import_wos_files(files_directory = "/path/to/folder/")
scimeetr_list <- import_scopus_files(files_directory = "/path/to/folder/")
```
Do not forget that this function take in a path to a folder not a file. Thus, it need a slash at the end of the folder path.
<a href="#top">Back to top</a>
## Exploring scimeetr data
###Printing and summary
Printing `scimeetr_list` that we just created will provide some informations about it, but `summary` will provide more.
```{r}
scimeetr_list
summary(scimeetr_list)
```
###Characterizing the corpus of papers contained by a scimeetr object
Within the scimeetr package there are several function that help us characterize our corpus of papers.
The corpus of papers can be characterized be:
* keywords with `characterize_kw()`
* title-words with `characterize_ti()`
* astract-words with `characterize_ab()`
* journals with `characterize_jo()`
* authors with `characterize_au()`
* universities with `characterize_un()`
* countries with `characterize_co()`
####Characterize the corpus with keywords
To get even more information about the corpus of papers contained within `scimeetr_list` we can use `characterize_kw`. This function will generate a list of data frames, one data frame per communities within `scimeetr_list`. The first column of any of these data frames will contain the keywords themselves. The second column contains the frequency of the keywords (i.e. the number of papers that mentions this keyword).
```{r, eval=FALSE}
kw <- characterize_kw(scimeetr_list)
head(kw$com1)
```
```{r, echo=F}
kw <- characterize_kw(scimeetr_list)
knitr::kable(head(kw$com1), format = "markdown")
```
<a href="#top">Back to top</a>
####Characterize the corpus with journals
We can also use `characterize_jo`. Just like `characterize_kw`, this function will generate a list of data frames, one data frame per communities within `scimeetr_list`. The first column of any of these data frames will contain the journals' names themselves. The other columns contains several journal based metrics.
```{r, eval=FALSE}
jo <- characterize_jo(scimeetr_list)
head(jo$com1)
```
```{r, echo=F}
jo <- characterize_jo(scimeetr_list)
knitr::kable(head(jo$com1), format = "markdown")
```
<a href="#top">Back to top</a>
####Characterize the corpus with authors
We can also use `characterize_au`. The first column of any of these data frames will contain the authors' names themselves. The other columns contains several author based metrics.
```{r, eval=FALSE}
au <- characterize_au(scimeetr_list)
head(au$com1)
```
```{r, echo=F}
au <- characterize_au(scimeetr_list)
knitr::kable(head(au$com1), format = "markdown")
```
<a href="#top">Back to top</a>
### Scimeetr object structure and navigation
A scimeetr object such as `scimeetr_list` contains more data than what can be seen with `print` and `summary`. A scimeetr object is in fact a list of communities list which are themselves list of up to 9 elements. Each communities contain a data.frame called `dfsci`. This dataframe contains all the bibliometric data that was importedinto `R`.
```{r, eval=FALSE}
scimeetr_list$com1$dfsci
```
```{r, echo=FALSE}
knitr::kable(subset(scimeetr_list$com1$dfsci[1,],select = c(-AB, -CR, -FX)), format = "markdown")
```
<a href="#top">Back to top</a>
## Making reading lists
If we are confident that the papers contained in `scimeetr_list` are those for which we want a reading list we can used the function `scilist` to find various lists of papers. The default list given by `scilist` contains the seminal papers for the community analysed. That is, it rank the paper by the number of times they were cited by all the papers and list them by citation frequency.
```{r, eval=FALSE}
scilist(scimeetr_list)
```
```{r, echo=FALSE}
knitr::kable(scilist(scimeetr_list), format = "markdown")
```
With the parameter `k`, we can control the length of the reading list.
```{r, eval=FALSE}
scilist(scimeetr_list, k = 3)
```
```{r, echo=FALSE}
knitr::kable(scilist(scimeetr_list, k = 3), format = "markdown")
```
With the parameter `reading_list`, we can get any of the following 12 reading lists that fits into three categories:
* Core
- core_papers
- core_yr
- core_residual
* Experts
- by_expert_LC
- by_expert_TC
- group_of_experts_TC
- group_of_experts_LC
* Centrality
- cite_most_others
- betweeness
- closeness
- connectness
- page_rank
The default reading list is `core_papers`.
<a href="#top">Back to top</a>
### Core
I categorise the reading lists as **core** because they are reading lists of core papers as they are all a variation of the number of times papers within our community of interest refers to the paper listed. Although the number of citation is not a perfect measure of a papers importance for a community it should be a good proxy. A weekness of the number of citation as a measure of papers importance is that not all citations are equal. For example, sometimes a paper is cited because it is criticized or because it contrasts with other findings. This as been realised by others before and some have attempted to fix it by creating the concept of influential citation. Influential citation is a great concept by to be calculated it requires advance text processing and access to the full text of each papers. As it is notoriosly time consuming to get full text and even harder to get it in the right format, we are left with citation count.
#### Most cited per year
Using `scilist` with `reading_list = "core_yr"` will list the most cited paper for each year from three years before present to ten years before present. The parameter `k` controls the number of paper per year to list.
```{r, eval=FALSE}
scilist(scimeetr_list, reading_list = "core_yr", k = 2)
```
```{r, echo=FALSE}
knitr::kable(scilist(scimeetr_list, reading_list = "core_yr", k = 2), format = "markdown")
```
<a href="#top">Back to top</a>
#### More cited than expected
Using `scilist` with `reading_list = "core_residual"` will list the papers that diverge most from the expected number of citation for this particular paper. This can be visualised in the figure below. The point that have the biggest difference between their frequency value and the fitted blue lines are listed in the `core_residual` reading list.
```{r, message=FALSE, echo=FALSE}
splt_cr <- split_cr(scimeetr_list)
cr_df <- dplyr::inner_join(splt_cr, scimeetr_list$com1$cr, by = 'ID')
library(dplyr, quietly = T)
x <- cr_df %>%
mutate(age=as.integer(stringr::str_extract(Sys.Date(), '^[0-9]{4}')) - as.integer(as.character(cr_df$year))) %>%
filter(age <= 40 & age > 2) %>%
group_by(age) %>%
top_n(n = 10, wt = Frequency.x) %>%
select(record, Frequency.x, age) %>%
arrange(age, desc(Frequency.x)) %>%
ungroup()
library(ggplot2)
ggplot(x, aes(x = age, y = Frequency.x)) +
geom_point()+
geom_smooth(method = "gam", formula = y ~ s(x, k=10), se = FALSE)
```
Here is an example of the code and its result.
```{r, eval=FALSE}
scilist(scimeetr_list, reading_list = "core_residual", k = 3)
```
```{r, echo=FALSE}
knitr::kable(scilist(scimeetr_list, reading_list = "core_residual", k = 3), format = "markdown")
```
<a href="#top">Back to top</a>
### Experts
The reading lists that I categorise as **expert** are built from authors information. Experts within a community are identified based on the number of papers they published and the number of times each of their papers are cited.
#### Recent paper of a few experts
Using `scilist` with `reading_list = "by_expert_LC"` we will get a list of recent papers by one or a few experts in the community. For the option `by_expert_LC`, authors are ranked based on their *harmonic local H-index*. The H-index is a measure of an other productivity and impact. An author with an H-index of 10 means that he has published at least 10 papers with 10 or more citation each. A local H-index means that only citations from other papers in the community are counted. A harmonic local H-index means that authors do not get the full credit for each citation their paper received. It is corrected depending on the authos position in the authors list. First authors gets most of the credit, then the last author gets the second most, and the authors gets credit as a proportion of their position. Once the authors harmonic-local-H-index is found they are ranked and the `m` most recent publication of the `k` most 'expert' authors are listed as a reading list.
```{r, eval=FALSE}
scilist(scimeetr_list, reading_list = "by_expert_LC", k = 2, m = 2)
```
```{r, echo=FALSE}
knitr::kable(scilist(scimeetr_list, reading_list = "by_expert_LC", k = 2, m = 2), format = "markdown")
```
Using `scilist` with `reading_list = "by_expert_TC"` instead of `reading_list = "by_expert_LC"`, notice the `_TC` instead of the `_LC` will based the ranking calculation on **t**otal citation of it's publications instead of only the **l**ocal citations.
<a href="#top">Back to top</a>
#### Paper of experts group
Using `scilist` with `reading_list = "group_of_experts_LC"` we will get a list of papers for which many authors are experts in the community. For this option, authors are assigned a *harmonic local H-index* like described in the previous section. But this time, a weighted sum of the harmonic-local-H-index of each authors of a paper is calculated.
```{r, eval=FALSE}
scilist(scimeetr_list, reading_list = "group_of_experts_LC", k = 5)
```
```{r, echo=FALSE}
knitr::kable(scilist(scimeetr_list, reading_list = "group_of_experts_LC", k = 5), format = "markdown")
```
Using `scilist` with `reading_list = "group_of_experts_TC"` instead of `reading_list = "group_of_experts_LC"`, notice the `_TC` instead of the `_LC` will based the ranking calculation on **t**otal citation of it's publications instead of only the **l**ocal citations.
<a href="#top">Back to top</a>
### Centrality
Their are several measures of [nodes centrality in graph theory](https://en.wikipedia.org/wiki/Centrality). The most central papers of a community of papers can be found with `scilist`.
#### Betweeness
Betweeness measures the importance of a paper in connecting two clusters of papers. Papers with a high betweeness would therefore be a paper that tend to be more interdisciplinary.
```{r, eval=FALSE}
scilist(scimeetr_list, reading_list = "betweeness", k = 5)
```
```{r, echo=FALSE}
knitr::kable(scilist(scimeetr_list, reading_list = "betweeness", k = 5), format = "markdown")
```
<a href="#top">Back to top</a>
#### Closeness
Closeness measures the average number of link between a paper and all other papers. Papers with a high closeness would therefore be a paper that tend to have a large and wide list of citations.
```{r, eval=FALSE}
scilist(scimeetr_list, reading_list = "closeness", k = 5)
```
```{r, echo=FALSE}
knitr::kable(scilist(scimeetr_list, reading_list = "closeness", k = 5), format = "markdown")
```
<a href="#top">Back to top</a>
#### Connectness
Connectness measures the number of links a paper has. Papers with a high connectness would therefore be a paper that tend to have cited what most other studies cited.
```{r, eval=FALSE}
scilist(scimeetr_list, reading_list = "connectness", k = 5)
```
```{r, echo=FALSE}
knitr::kable(scilist(scimeetr_list, reading_list = "connectness", k = 5), format = "markdown")
```
<a href="#top">Back to top</a>
#### Page rank
[Page rank](https://en.wikipedia.org/wiki/PageRank) was developped by Larry Page at google and it's a way to measure web page importance. The algorithm was applied to directed graph, so I am not sure of the consequence of applying it on the undirected graph that we have here.
```{r, eval=FALSE}
scilist(scimeetr_list, reading_list = "page_rank", k = 5)
```
```{r, echo=FALSE}
knitr::kable(scilist(scimeetr_list, reading_list = "page_rank", k = 5), format = "markdown")
```
<a href="#top">Back to top</a>
#### Cite most others
With the option `cite_most_others`, the papers that cite most other papers of the community can be found. This is not a centrality measure but it is also based on papers connection to each other. It should tend to find litterature review and recent papers that have an especially good grasp on the community.
```{r, eval=FALSE}
scilist(scimeetr_list, reading_list = "cite_most_others", k = 5)
```
```{r, echo=FALSE}
knitr::kable(scilist(scimeetr_list, reading_list = "cite_most_others", k = 5), format = "markdown")
```
<a href="#top">Back to top</a>
## Finding the main communities of research
In the previous sections we have looked at only the main research community. But, splitting the main community in sub-communities can provide a more detail picture of the litterature. It can also help identify and then remove irrelevant sub-communities. To achieve any of this, the sub-communities have to be identified and characterized. The function `scimap`, as in science map, was developped for this task. By default, the graph use bibliographic coupling to calculate connections between papers, but coupling can also be done based on abstract words (abc), title words (tic) or keywords (kec).
```{r}
summary(scimap(scimeetr_list, coupling_by = 'bic', community_algorithm = 'louvain', min_com_size = 100))
summary(scimap(scimeetr_list, coupling_by = 'abc', community_algorithm = 'louvain', min_com_size = 100))
summary(scimap(scimeetr_list, coupling_by = 'tic', community_algorithm = 'louvain', min_com_size = 100))
summary(scimap(scimeetr_list, coupling_by = 'kec', community_algorithm = 'louvain', min_com_size = 100))
```
<a href="#top">Back to top</a>
## Focusing on a sub-community
With the function `focus_on`, it is possible to change focus on a sub-community.
```{r}
scil <- scimap(scimeetr_list)
scil
subscil <- focus_on(scil, grab = 'com1_1')
subscil
```
<a href="#top">Back to top</a>
## Dive to a sub-community
With the function `dive_to`, it is possible to move down to a sub-community and keep it's sub-communities.
```{r}
scil <- scimap(scimap(scimeetr_list))
scil
subscil <- dive_to(scil, aim_at = 'com1_1')
subscil
```
<a href="#top">Back to top</a>