dashboard project

Concept: An interface for building a corpus incrementally. Each work, whether selected or not, contributes its own metadata to interactive summary visualizations that show the selected corpus and how it is positioned relative to the full set of available texts. In other words, a "canon shopping cart." Researchers can then be more confident that they have selected the works most relevant to their research agenda, rather than relying on accepted canons, hearsay, or possibly incomplete knowledge.

Required metadata for each work

Author name, ideally associated with a unique ID and/or external authority records
Author gender
Primary language of work
Year of first publication (initially, this may need to be estimated as the midpoint of the author's birth and death years)
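
As a rough illustration of the fallback estimate for the publication year, the sketch below takes the midpoint (mean) of the author's birth and death years. The function name `estimate_publication_year` and the choice to return `None` when a date is missing are assumptions made for this example, not part of the project.

```python
from typing import Optional


def estimate_publication_year(
    birth_year: Optional[int], death_year: Optional[int]
) -> Optional[int]:
    """Placeholder estimate: midpoint of the author's life span.

    Hypothetical helper; returns None when either date is unknown.
    """
    if birth_year is None or death_year is None:
        return None  # not enough information to estimate
    if death_year < birth_year:
        raise ValueError("death_year precedes birth_year")
    # Integer midpoint (rounded down) of the two years.
    return (birth_year + death_year) // 2
```

An estimate produced this way is only a stand-in, so it would probably be worth flagging such values as estimated until a real date is recorded.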

Preferred extra metadata for each work -- may be computationally derived

Some indication of orthographic anomalies present in the text
Some basic description of genre and/or format
Place of writing
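
One possible way to hold the required and preferred metadata together is a small record type. This is only a sketch: the class name `WorkRecord`, every field name, and the `year_is_estimated` flag are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkRecord:
    # Required metadata (field names are hypothetical)
    author_name: str
    author_gender: str
    language: str          # primary language of the work
    first_published: int   # year of first publication
    year_is_estimated: bool = False   # True if the year is a placeholder
    author_id: Optional[str] = None   # unique ID or authority record, if any

    # Preferred extra metadata; may be absent or computationally derived
    orthographic_notes: Optional[str] = None
    genre: Optional[str] = None
    place_of_writing: Optional[str] = None
```

Leaving the preferred fields optional lets a work enter the interface with only the required metadata and be enriched later.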

Computationally derived metadata

Word frequencies -- "function" words, primary vocabulary words, all words(?)
LDA topic models -- may need to be curated, however
Shannon entropy of text
Other stylometric features
Document length
Text reuse analysis via "fuzzy" matching (somewhat computationally expensive)
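
A few of these features are cheap to compute. The sketch below derives document length, relative frequencies of function words, and Shannon entropy from a plain-text string. It rests on several assumptions: entropy is computed over word tokens rather than characters, tokenization is a simple letters-only regex, and `FUNCTION_WORDS` is a deliberately tiny, hypothetical list standing in for a real one.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list for illustration only.
FUNCTION_WORDS = {"the", "of", "and", "a", "in", "to"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def shannon_entropy(tokens):
    """Shannon entropy, in bits, of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def derive_features(text):
    """Return document length, entropy, and function-word frequencies."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    n = len(tokens)
    return {
        "length": n,  # document length in word tokens
        "entropy_bits": shannon_entropy(tokens),
        # Relative frequency of each function word (empty for empty text).
        "function_word_freq": (
            {w: counts[w] / n for w in sorted(FUNCTION_WORDS)} if n else {}
        ),
    }
```

Entropy measured this way depends on document length, so comparing works of very different sizes would likely need normalization or fixed-size sampling.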

Other ideas:

The interface could also recommend works to add to the corpus based on basic or computationally derived metadata that they share with works already selected.
Reversing the usual process of faceted browsing, researchers could begin with a single keyword (or set of terms) they find interesting and build their research corpus from there based on relational data provided via the interface.
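
The recommendation idea could start out as simple overlap counting. The sketch below scores each unselected work by how many of its metadata values already occur among the selected works, then returns the highest-scoring ones. The function name `recommend`, the dictionary layout, and the default fields are all assumptions for this example; a real version might weight fields or use the derived features instead.

```python
from collections import Counter


def recommend(works, selected_ids, fields=("author", "language", "genre"), top_n=3):
    """Rank unselected works by metadata values shared with selected works.

    `works` is a list of dicts, each with an "id" key plus the given fields.
    """
    selected = [w for w in works if w["id"] in selected_ids]
    # Count how often each (field, value) pair occurs in the selection.
    profile = Counter(
        (f, w[f]) for w in selected for f in fields if w.get(f) is not None
    )
    scored = []
    for w in works:
        if w["id"] in selected_ids:
            continue
        # A work scores one point per selected work sharing each value.
        score = sum(profile[(f, w[f])] for f in fields if w.get(f) is not None)
        if score > 0:
            scored.append((score, w["id"]))
    # Highest score first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [work_id for _, work_id in scored[:top_n]]
```

Works that share nothing with the selection are left out entirely, which keeps the suggestions tied to what the researcher has already chosen.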
