This directory contains recipes (use cases) that supplement The Unicode Cookbook for Linguists. Each recipe is in its own subdirectory:
- Basics: basics of grapheme segmentation and text tokenization in the Python and R programming languages
- ASJP: tokenize ASJP wordlists with R
- Dogon: tokenize the Dogon comparative wordlist and create an orthography profile in Python with Pandas
- Dutch: create an orthography profile for tokenizing Dutch orthography with R
- JIPA: tokenize text in the International Phonetic Alphabet (IPA) with Python or R
To install the Python segments package from the Python Package Index (PyPI), run:
pip install segments
on the command line. This gives you access to the command-line interface (CLI) as well as the programmatic functionality available when you import the segments library in Python scripts.
You can also install the segments package from the GitHub repository:
git clone https://github.com/cldf/segments.git
cd segments
python setup.py develop
To install the qlcData R package and its accompanying data, run:
install.packages("devtools")
devtools::install_github("cysouw/qlcData", build_vignettes = T)
and then load the library:
library(qlcData)
To access help, call:
help(qlcData)
To access the vignette, call:
vignette("orthography_processing")
Each recipe contains a short use case with accompanying code. The directory structure is typically as follows:
└── Recipe name
    ├── recipe files
    ├── data
    │   └── orthography profiles
    ├── sources
    │   └── input data
    └── sandbox
        └── where the output is written