textstem is a tool-set for stemming and lemmatizing words. Stemming is a process that removes affixes. Lemmatization is the process of grouping inflected forms together as a single base form.
The main functions, their task categories, and descriptions are summarized in the table below:
Function | Task | Description |
---|---|---|
stem_words | stemming | Stem words |
stem_strings | stemming | Stem strings |
lemmatize_words | lemmatizing | Lemmatize words |
lemmatize_strings | lemmatizing | Lemmatize strings |
make_lemma_dictionary | lemmatizing | Generate a dictionary of lemmas for a text |
To download the development version of textstem, download the zip ball or tar ball, decompress it, and run R CMD INSTALL on it, or use the pacman package to install the development version:
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textstem")
You are welcome to:
- submit suggestions and bug-reports at: https://github.com/trinker/textstem/issues
- send a pull request on: https://github.com/trinker/textstem/
- compose a friendly e-mail to: [email protected]
The following examples demonstrate some of the functionality of textstem.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(textstem, dplyr)
data(presidential_debates_2012)
Before moving into the meat of these examples, let's highlight the difference between stemming and lemmatizing.
dw <- c('driver', 'drive', 'drove', 'driven', 'drives', 'driving')
stem_words(dw)
## [1] "driver" "drive" "drove" "driven" "drive" "drive"
lemmatize_words(dw)
## [1] "driver" "drive" "drive" "drive" "drive" "drive"
bw <- c('are', 'am', 'being', 'been', 'be')
stem_words(bw)
## [1] "ar" "am" "be" "been" "be"
lemmatize_words(bw)
## [1] "be" "be" "be" "be" "be"
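To make the contrast concrete, here is a toy base-R sketch. It is not how textstem works internally: `toy_stem`, `toy_lemmas`, and `toy_lemmatize` are hypothetical names invented for illustration, and the suffix rule is far cruder than a real stemming algorithm. It only shows why rule-based suffix stripping cannot reach irregular forms such as "drove", while a lookup table can.

```r
# Toy rule-based stemmer (an assumption for illustration, NOT a real
# stemming algorithm): strip a trailing "ing" or "s".
toy_stem <- function(words) sub("(ing|s)$", "", words)

# Toy lemma lookup table (hypothetical): irregular forms are listed explicitly.
toy_lemmas <- c(drove = "drive", driven = "drive",
                drives = "drive", driving = "drive")

# Replace words found in the lookup table; leave all other words unchanged.
toy_lemmatize <- function(words) {
  hit <- words %in% names(toy_lemmas)
  words[hit] <- unname(toy_lemmas[words[hit]])
  words
}

dw <- c("drive", "drove", "driven", "drives", "driving")

# "drives" becomes "drive" and "driving" becomes "driv";
# "drove" and "driven" have no matching suffix and are left as they are.
toy_stem(dw)

# Every form maps to "drive" because each one is listed in the table.
toy_lemmatize(dw)
```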
Stemming is the act of removing inflections from a word; the resulting stem is not necessarily "identical to the morphological root of the word" (Wikipedia). Below I show stemming of several small strings.
y <- c(
'the dirtier dog has eaten the pies',
'that shameful pooch is tricky and sneaky',
"He opened and then reopened the food bag",
'There are skies of blue and red roses too!',
NA,
"The doggies, well they aren't joyfully running.",
"The daddies are coming over...",
"This is 34.546 above"
)
stem_strings(y)
## [1] "the dirtier dog ha eaten the pi"
## [2] "that shame pooch i tricki and sneaki"
## [3] "He open and then reopen the food bag"
## [4] "There ar ski of blue and red rose too!"
## [5] NA
## [6] "The doggi, well thei aren't joyfulli run."
## [7] "The daddi ar come over..."
## [8] "Thi i 34.546 abov"
Lemmatizing is the "grouping together the inflected forms of a word so they can be analysed as a single item" (Wikipedia). In the example below I reduce the strings to their lemma form. lemmatize_strings uses a lookup dictionary. The default uses Mechura's (2016) English lemmatization list available from the lexicon package. The make_lemma_dictionary function contains two additional engines for generating a lemma lookup table for use in lemmatize_strings.
y <- c(
'the dirtier dog has eaten the pies',
'that shameful pooch is tricky and sneaky',
"He opened and then reopened the food bag",
'There are skies of blue and red roses too!',
NA,
"The doggies, well they aren't joyfully running.",
"The daddies are coming over...",
"This is 34.546 above"
)
lemmatize_strings(y)
## [1] "the dirty dog have eat the pie"
## [2] "that shameful pooch be tricky and sneaky"
## [3] "He open and then reopen the food bag"
## [4] "There be sky of blue and red rose too!"
## [5] NA
## [6] "The doggy, good they aren't joyfully run."
## [7] "The daddy be come over..."
## [8] "This be 34.546 above"
This lemmatization uses the hunspell package to generate lemmas.
lemma_dictionary_hs <- make_lemma_dictionary(y, engine = 'hunspell')
lemmatize_strings(y, dictionary = lemma_dictionary_hs)
## [1] "the dirty dog ha eat the pie"
## [2] "that shameful pooch i tricky and sneaky"
## [3] "He open and then reopen the food bag"
## [4] "There are sky of blue and re rose too!"
## [5] NA
## [6] "The doggy, well they aren't joyful running."
## [7] "The daddy are come over..."
## [8] "This i 34.546 above"
This lemmatization uses the koRpus package and the TreeTagger program to generate lemmas. You'll have to get TreeTagger set up, preferably in your machine's root directory.
lemma_dictionary_tt <- make_lemma_dictionary(y, engine = 'treetagger')
lemmatize_strings(y, lemma_dictionary_tt)
## [1] "the dirty dog have eat the pie"
## [2] "that shameful pooch be tricky and sneaky"
## [3] "He open and then reopen the food bag"
## [4] "There be sky of blue and red rose too!"
## [5] NA
## [6] "The doggy, well they aren't joyfully run."
## [7] "The daddy be come over..."
## [8] "This be 34.546 above"
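Because lemmatize_strings works from a lookup table, it should also be possible to pass in a small hand-built dictionary. The sketch below assumes the dictionary argument accepts any two-column data frame with the lower-case token in the first column and its lemma in the second, as the dictionaries built above appear to be. The object name `my_dictionary` and its column names are hypothetical, and words missing from the table are expected to pass through unchanged.

```r
library(textstem)

# Hypothetical hand-built dictionary (an assumption for illustration):
# first column = lower-case token, second column = replacement lemma.
my_dictionary <- data.frame(
  token = c("dirtier", "has", "eaten", "pies"),
  lemma = c("dirty", "have", "eat", "pie"),
  stringsAsFactors = FALSE
)

# Lemmatize with the custom table instead of the default lexicon list.
result <- lemmatize_strings(
  "the dirtier dog has eaten the pies",
  dictionary = my_dictionary
)
result
```

A table like this can be handy for domain-specific vocabulary that a general-purpose lemma list does not cover.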
It's pretty fast too. Observe:
tic <- Sys.time()
presidential_debates_2012$dialogue %>%
    lemmatize_strings() %>%
    head()
## [1] "We'll talk about specifically about health care in a moment."
## [2] "But what do you support the voucher system, Governor?"
## [3] "What I support be no change for current retiree and near retiree to Medicare."
## [4] "And the president support take dollar seven hundred sixteen billion out of that program."
## [5] "And what about the voucher?"
## [6] "So that's that's numb one."
(toc <- Sys.time() - tic)
## Time difference of 0.8516021 secs
That's 2,912 rows of text, or 42,708 words, in 0.85 seconds.
This example shows how stemming/lemmatizing might be complemented by other text tools such as replace_contraction from the textclean package.
library(textclean)
'aren\'t' %>%
    lemmatize_strings()
## [1] "aren't"
'aren\'t' %>%
    textclean::replace_contraction() %>%
    lemmatize_strings()
## [1] "be not"