dashboard project

Peter Broadwell edited this page Nov 14, 2017 · 6 revisions

Concept: An interface for additively building a corpus, in which every work, whether selected or not, contributes a description of itself to interactive summary visualizations of the selected corpus and of its position relative to the full set of available texts. In other words, a "canon shopping cart." Researchers can then be more confident that they have selected the works most relevant to their research agenda, rather than relying on accepted canons, hearsay, or possibly incomplete knowledge.

Required metadata for each work

  • Author name, ideally associated with a unique ID and/or external authority records
  • Author gender
  • Primary language of work
  • Year of first publication (may need to be estimated, at first, as the midpoint of the author's birth and death years)
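
As a rough illustration of the required fields, a per-work record might look like the sketch below. The class name `Work`, its field names, and the `publication_year` helper are hypothetical and not part of any existing schema; the fallback simply takes the midpoint of the author's birth and death years and flags the result as an estimate.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Work:
    # Required metadata (field names are illustrative assumptions)
    title: str
    author: str
    language: str
    author_id: Optional[str] = None       # e.g. an external authority record ID
    author_gender: Optional[str] = None
    first_published: Optional[int] = None
    author_birth: Optional[int] = None
    author_death: Optional[int] = None

    def publication_year(self) -> Tuple[Optional[int], bool]:
        """Return (year, is_estimate).

        Uses the recorded year when present; otherwise falls back to the
        midpoint of the author's birth and death years. Returns
        (None, False) when neither is available.
        """
        if self.first_published is not None:
            return self.first_published, False
        if self.author_birth is not None and self.author_death is not None:
            return (self.author_birth + self.author_death) // 2, True
        return None, False
```

Keeping the `is_estimate` flag alongside the year would let a visualization mark estimated dates differently from recorded ones.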

Preferred extra metadata for each work -- may be computationally derived

  • Some indication of orthographic anomalies present in the text
  • Some basic description of genre and/or format
  • Place of writing

Computationally derived metadata

  • Word frequencies -- "function" words, primary vocabulary words, all words(?)
  • LDA topic models -- may need to be curated, however
  • Shannon entropy of text
  • Other stylometric features
  • Document length
  • Text reuse analysis via "fuzzy" matching (somewhat computationally expensive)
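
Several of the derived features above can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not a proposed implementation: tokenization is a naive regular expression, the `FUNCTION_WORDS` set is a tiny English-only placeholder, entropy is computed over word tokens (a character-level variant is equally plausible), and "fuzzy" matching is approximated with `difflib.SequenceMatcher`, which is slow on long texts. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher

# Placeholder list; a real project would use a per-language function-word list.
FUNCTION_WORDS = {"the", "of", "and", "a", "in", "to"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive)."""
    return re.findall(r"\w+", text.lower())


def shannon_entropy(tokens):
    """Shannon entropy, in bits, of the word-token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def derive_metadata(text):
    """Compute a few of the derived features for one work."""
    tokens = tokenize(text)
    freqs = Counter(tokens)
    return {
        "length_tokens": len(tokens),
        "function_word_freqs": {w: freqs[w] for w in FUNCTION_WORDS if w in freqs},
        "top_words": freqs.most_common(10),
        "entropy_bits": shannon_entropy(tokens),
    }


def fuzzy_similarity(passage_a, passage_b):
    """Similarity ratio in [0, 1] between two passages.

    Comparing every pair of passages across a corpus this way grows
    quickly in cost, which is one reason text reuse analysis is expensive.
    """
    return SequenceMatcher(None, passage_a, passage_b).ratio()
```

Topic models and richer stylometric features would need additional libraries and are left out of this sketch.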

Other ideas:

  • The interface could also recommend works to add to the corpus based on basic or computationally derived metadata they share with works that are already selected.
  • Reversing the usual process of faceted browsing, researchers could begin with a single keyword (or set of terms) they find interesting and build their research corpus from there based on relational data provided via the interface.
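
One possible way to prototype both ideas is to represent each work as a set of feature strings and rank unselected works by their overlap with the current selection. The sketch below uses Jaccard similarity for this; that choice, the `"facet:value"` feature encoding, and the names `recommend` and `seed_from_terms` are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(selected, candidates, features, top_n=5):
    """Rank unselected candidates by feature overlap with the selection.

    `features` maps a work ID to a set of strings such as "lang:la".
    Works sharing no features with the selection are not recommended.
    """
    selected_features = set().union(*(features[w] for w in selected))
    scored = [
        (jaccard(features[w], selected_features), w)
        for w in candidates
        if w not in selected
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for score, w in scored[:top_n] if score > 0]


def seed_from_terms(terms, vocabulary):
    """Start a corpus from keywords: return works containing any term.

    `vocabulary` maps a work ID to the set of lowercase terms it contains.
    """
    wanted = {t.lower() for t in terms}
    return sorted(w for w, words in vocabulary.items() if wanted & words)
```

A researcher could call `seed_from_terms` once to obtain a starting selection and then call `recommend` repeatedly as works are added to or removed from the cart.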