oolong - Create Validation Tests for Automated Content Analysis

Description

oolong creates standard human-in-the-loop validity tests for typical automated content analysis methods such as topic modeling and dictionary-based approaches. The package offers a standard workflow with functions to prepare, administer, and evaluate a human-in-the-loop validity test. It provides functions for validating topic models using word intrusion, topic intrusion (Chang et al. 2009, https://papers.nips.cc/paper/3700-reading-tea-leaves-how-humans-interpret-topic-models), and word set intrusion (Ying et al. 2021, doi:10.1017/pan.2021.33) tests. It also provides functions for generating gold-standard data, which are useful for validating dictionary-based methods. The default settings of all generated tests match those suggested in Chang et al. (2009) and Song et al. (2020, doi:10.1080/10584609.2020.1723752).

Keywords

  • Validity
  • Text Analysis
  • Topic Model

Science Usecase(s)

This package has been used in the literature to validate topic models and prediction models trained on text data, e.g. Rauchfleisch et al. (2023), Rothut et al. (2023), and Eisele et al. (2023).

Repository structure

This repository follows the standard structure of an R package.

Environment Setup

With R installed:

install.packages("oolong")

Input Data

The input data has to be a topic model or a prediction model trained on text data. For example, one can train a topic model from the text data (tweets from Donald Trump) included in the package:

library(oolong)     # provides the trump2k sample tweets
library(seededlda)
library(quanteda)

trump_corpus <- corpus(trump2k)
trump_toks <- tokens(trump_corpus, remove_punct = TRUE, remove_numbers = TRUE,
                     remove_symbols = TRUE, split_hyphens = TRUE, remove_url = TRUE) %>%
    tokens_tolower() %>%
    tokens_remove(stopwords("en")) %>%
    tokens_remove("@*")

model <- textmodel_lda(x = dfm(trump_toks), k = 8, verbose = TRUE)
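
The trained model can then be passed to one of oolong's test generators, e.g. the word intrusion test described in the How to Use section below. A minimal sketch; the coder name is illustrative:

# create a word intrusion test from the model trained above
# ("coder1" is an illustrative coder id)
oolong_test <- wi(model, userid = "coder1")
oolong_test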

Sample Input and Output Data

A sample input is a model trained on text data, e.g. the abstracts_seededlda topic model included in the package:

library(oolong)
library(seededlda)
abstracts_seededlda
Call:
lda(x = x, k = k, label = label, max_iter = max_iter, alpha = alpha, 
    beta = beta, seeds = seeds, words = NULL, verbose = verbose)

10 topics; 2,500 documents; 3,908 features.

The sample output is an oolong R6 object.
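
Because the returned test object is R6, its fields and methods are accessed with $. A minimal sketch of what this implies for multi-coder setups, assuming the object is cloneable as R6 objects usually are:

library(oolong)
oolong_test <- wi(abstracts_seededlda, userid = "Hadley")

# being R6, the object can be deep-cloned, e.g. to give a second coder an
# independent copy of the same test
oolong_test2 <- oolong_test$clone(deep = TRUE)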

How to Use

Please refer to the overview of this package for a comprehensive introduction to all test types.

Suppose there is a topic model trained on some text data called abstracts_seededlda, which is included in the package.

library(oolong)
abstracts_seededlda
Call:
lda(x = x, k = k, label = label, max_iter = max_iter, alpha = alpha, 
    beta = beta, seeds = seeds, words = NULL, verbose = verbose)

10 topics; 2,500 documents; 3,908 features.

Suppose one would like to conduct a word intrusion test (Chang et al. 2009) to validate this topic model. This test can be generated by the wi() function.

oolong_test <- wi(abstracts_seededlda, userid = "Hadley")
oolong_test
── oolong (topic model) ────────────────────────────────────────────────────────

✔ WI ✖ TI ✖ WSI

☺ Hadley

ℹ WI: k = 10, 0 coded.

── Methods ──

• <$do_word_intrusion_test()>: do word intrusion test

• <$lock()>: finalize and see the results

One can then conduct the test by following the instructions displayed, i.e. oolong_test$do_word_intrusion_test().

oolong_test$do_word_intrusion_test()

A graphical coding interface will then be displayed, in which one can conduct the test.

After completing the test, one can finalize it by locking the test object.

oolong_test$lock()

The result of the test can then be obtained. For example:

oolong_test
── oolong (topic model) ────────────────────────────────────────────────────────

✔ WI ✖ TI ✖ WSI

☺ Hadley

ℹ WI: k = 10, 10 coded.

── Results: ──

ℹ 90%  precision
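
The same workflow applies to the other test types shown in the printed object (TI and WSI) and to the gold-standard generation mentioned in the Description. A minimal sketch, assuming ti() additionally takes the original documents (here abstracts$text, from the abstracts data bundled with the package) and gs() takes a character vector of documents to be coded:

library(oolong)

# word set intrusion test (Ying et al. 2021), needs only the model
wsi_test <- wsi(abstracts_seededlda, userid = "Hadley")

# topic intrusion test (Chang et al. 2009); assumes abstracts$text holds the
# original documents used to train abstracts_seededlda
ti_test <- ti(abstracts_seededlda, abstracts$text, userid = "Hadley")

# gold-standard generation for validating dictionary-based methods; assumes
# gs() accepts a character vector of documents
gs_test <- gs(input_corpus = trump2k, userid = "Hadley")

Each of these objects is then administered and locked in the same way as the word intrusion test above.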

Contact Details

Maintainer: Chung-hong Chan [email protected]

Issue Tracker: https://github.com/gesistsa/oolong/issues

Publication

  1. Chan, C. H., & Sältzer, M. (2020). oolong: An R package for validating automated content analysis tools. Journal of Open Source Software, 5(55), 2461. https://doi.org/10.21105/joss.02461