Skip to content

Python bindings for the Unitex/GramLab corpus processor

License

Notifications You must be signed in to change notification settings

lokdoc/python-unitex

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Python bindings for Unitex/GramLab

This package provides access to the Unitex C++ Library.

Main features

  • Unitex as a library
  • Persistence of linguistic resources
  • File virtualization

Installation

The library has been tested on MacOSX (with Python from the MacPorts project) and Linux (Debian) for versions 2.7 and 3.5 of Python. The installation requires the Python header files and the setuptools module. If you plan to use the configuration system, you will also need the yaml module.

# Run as root
# On MacOSX (MacPorts)
# NOTE: the Python header files are installed by default
#       (cf. https://trac.macports.org/wiki/Python)
# Python 2.7
port install py27-setuptools
port install py27-yaml

# Python 3.5
port install py35-setuptools
port install py35-yaml

# On Linux (distributions based on Debian)
# Python 2.7
apt-get install python-dev
apt-get install python-setuptools
apt-get install python-yaml

Once you have met these requirements and downloaded the Unitex source distribution (for Unitex >= 3.1), run (as root):

UNITEX_INC=/path/to/unitex/Src/C++ python setup.py install

Installation testing

If you want to test your installation, open a terminal, move in the tests directory and run (as user):

pat@lucy /home/dev/projects/python-unitex/tests [2]$ python 01_test_tools.py
............
----------------------------------------------------------------------
Ran 12 tests in 0.023s

OK
pat@lucy /home/dev/projects/python-unitex/tests [3]$ python 02_test_resources.py
......
----------------------------------------------------------------------
Ran 6 tests in 0.003s

OK
pat@lucy /home/dev/projects/python-unitex/tests [4]$ python 03_test_io.py
....................
----------------------------------------------------------------------
Ran 20 tests in 0.004s

OK
pat@lucy /home/dev/projects/python-unitex/tests [5]$ python 04_test_processor.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.045s

OK

Documentation

The documentation is accessible from any Python interactive shell using the command help().

In [1]: import _unitex

In [2]: help(_unitex.unitex_tool)
Help on built-in function unitex_tool in module _unitex:

unitex_tool(...)
    This function launches an Unitex command.

    *Positional arguments (length: 1):*

    - **0 [str]** -- the Unitex command.

    *Return [bool]:*

      **True** if the command succeeds, **False** otherwise.

If you want to generate HTML documentation from the source files, you can also use the Sphinx Documentation Generator.

# Run as root
# On MacOSX (MacPorts)
# Python 2.7
port install py27-sphinx

# Python 3.5
port install py35-sphinx

# On Linux (distributions based on Debian)
# Python 2.7
apt-get install python-sphinx

Once the tool is installed, open a terminal, move in the documentation directory and run (as user):

# If necessary, clean previous build
make clean
make html

After the compilation process, the resulting files are located in _build/html/.

Getting started

NOTE: The texts must be encoded in UTF-8. There is so far no support for UTF-16-(LE|BE) or any other encoding.

There are three ways to use the bindings (from low to high-level):

  1. The _unitex C++ extension.
  2. The Unitex basic commands and features.
  3. The high-level Processor class.

The following sections give some sample code for each of these methods.

The _unitex C++ extension

This works like the JNI interface which is part of the Unitex distribution and provides conversion of arguments (into C++ types) and returned values (into Python objects).

from _unitex import unitex_load_persistent_alphabet,\
					unitex_free_persistent_alphabet,\
					unitex_tool

alphabet = unitex_load_persistent_alphabet("/path/to/Alphabet.txt")
grammar = "/path/to/grammar.grf"

command = "UnitexTool Grf2Fst2 --no_loop_check --alphabet=%s %s -qutf8-no-bom" % (alphabet, grammar)

ret = unitex_tool(command)

unitex_free_persistent_alphabet(alphabet)

The Unitex basic commands and features

This part of the bindings is just an abstraction layer in front of the C++ extension. It provides a basic logging system and a number of checks (on arguments). There is also the possibility of storing the different resources and (tools) options in a configuration file which offers more flexibility.

import yaml

from unitex.tools import grf2fst2
from unitex.config import UnitexConfig
from unitex.resources import load_persistent_alphabet, free_persistent_alphabet

grammar = "/path/to/grammar.grf"

config = None
with open("/path/to/unitex.yaml", "r") as f:
    config = yaml.load(f)

options = UnitexConfig(config)

alphabet = options["resources"]["alphabet"]
if options["persistence"] is True:
	alphabet = load_persistent_alphabet(alphabet)

kwargs = options["tools"]["grf2fst2"]

ret = grf2fst2(grammar, alphabet, **kwargs)

if options["persistence"] is True:
	free_persistent_alphabet(alphabet)

The high-level Processor class

This class hides most of the Unitex (pre-)processing procedures in order to facilitate its usage.

import os

from unitex.resources import load_persistent_fst2, free_persistent_fst2
from unitex.processor import UnitexProcessor

files = [ ... ]
grammar = load_persistent_fst2("/path/to/grammar.fst2")

options = None
with open("/path/to/unitex.yaml", "r") as f:
    options = yaml.load(f)

config = UnitexConfig()
config.load(options)

# Persistence is achieved during object initialization
processor = UnitexProcessor(config)

kwargs = {}
kwargs["xml"] = False

for f in files:
    path, _ = os.path.splitext(f)
    output = "%s.xml" % path

    # mode: 's': segment (apply Sentence.fst2)
    #       'r': replace (apply Replace.fst2)
    #       't': tokenize
    #       'l': lexicalize (apply dictionaries)
    processor.open(f, mode="srtl", tagged=False)
    processor.tag(grammar, output, **kwargs)

# 'clean': suppress the files produced by Unitex
# 'free': unload all the persisted resources
processor.close(clean=True, free=True)

free_persistent_fst2(grammar)

In the examples directory, you will find two scripts you can use to achieve two simple tasks.

  1. build-config-file.py: this script builds, for a given language, a default YAML configuration file adapted to your local Unitex installation. This configuration file allows you to define the different parameters required by Unitex and by the bindings.
  2. do-concord.py: this script can be used to perform a real Unitex process: the extraction of concordances from a corpus on the basis of a grammar. It also illustrates in a more detailed way of using the package for other tasks.

To visualize the different options and arguments of these scripts, open a terminal, move in the examples directory and run (as user) each script with the --help option as shown below:

pat@lucy /home/dev/projects/python-unitex/examples [1]$ python build-config-file.py --help
Build Config File -- build the (default) config file for a given language

  $ build-config-file [OPTIONS] <Unitex YAML config template>

Options:
  [ -h, --help      = this help message                                      ]
    -o, --output    = the resulting config filename
    -l, --language  = the language name
    -d, --directory = the original resources directory for the language
                      (i.e. the language directory from Unitex distribution)

Example:
  $ build-config-file -l fr -d /path/to/French -o unitex-fr.yaml unitex.yaml
pat@lucy /home/dev/projects/python-unitex/examples [2]$ python do-concord.py --help
Do Concord -- A simple script to illustrate the Unitex Python bindings

  $ do-concord [OPTIONS] <file1(, file2, ...)>

Options:
  [ -h, --help    = this help message       ]
    -c, --config  = the Unitex config file
    -g, --grammar = the fst2 grammar to use

Example:
  $ do-concord -c unitex-fr.yaml -g grammar.fst2 *.txt

Useful links

About

Python bindings for the Unitex/GramLab corpus processor

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 88.9%
  • C++ 11.1%