vt-passim

This repository contains helper scripts developed by the Viral Texts Project to import data and process results.

Installation

These scripts rely on Apache Spark to manage parallel computations, either on a single machine or a cluster. If you don't already have access to Spark, we recommend downloading a precompiled, binary distribution of Spark, unpacking it, and adding the bin subdirectory to your PATH.

To compile the code in this repository, we use sbt. Run the following command in the top level:

$ sbt package

This should produce a runnable .jar in target/scala_*/vtpassim*.jar. (The exact filename will depend on the version of Scala and passim that you have.)

The bin subdirectory of the vt-passim distribution contains executable shell scripts such as vtrun. We recommend adding this subdirectory to your PATH.

Format Conversions with `vtrun`

The vt-passim repository contains several Spark applications to convert data from various format into the record structure required by passim. These conversion scripts are mostly invoked with the vtrun executable script.

The passim software for text reuse analysis takes as input a series of records representing individual documents. Minimally, each record requires a unique identifier (id) and textual content (text). In JSON format this looks like:

{"id": "d1", "text": "This is text."}

You may change the names of these fields when invoking passim.

To convert, for instance, a directory of .hocr files to passim's record format using the HOCR class, do:

$ vtrun HOCR <input> <output>

By default, this will convert the data into records in the binary parquet format used by Apache Spark. To request that Spark output data in the JSON format, append the following option to the SPARK_SUBMIT_ARGS environment variable:

$ SPARK_SUBMIT_ARGS='--conf spark.sql.sources.default=json' vtrun HOCR <input> <output>

Some useful format conversion classes are:

Class	Description
AltoPages	ALTO format output by many OCR systems
ChronAm	.tar file containing multiple .xml files in ALTO format as used by Chronicling America
DjVu	Deja Vu format as used by the Internet Archive
HOCR	hOCR format output by many OCR systems
TCPPages	TEI XML as used by the Text Creation Partnership

Many of these formats allow multiple pages per file. These conversion classes output one record per page.

Name		Name	Last commit message	Last commit date
Latest commit History 574 Commits
bin		bin
data		data
etc		etc
project		project
scripts		scripts
src/main		src/main
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build.sbt		build.sbt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

vt-passim

Installation

Format Conversions with `vtrun`

About

Releases

Packages

Languages

License

ViralTexts/vt-passim

Folders and files

Latest commit

History

Repository files navigation

vt-passim

Installation

Format Conversions with vtrun

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Format Conversions with `vtrun`

Packages