Data loader/common data structures and other tools
Most of the code are Python2/3 compatible. For the version of python for specific modules, please check the second line of each source file.
Install using pip will get you the latest tested version of natlang
.
> pip install natlang
Alternatively, you can also install from source using the following command:
> python setup.py install
If you want to load up a dataset, then just do this:
> import natlang as nl
> data = nl.load(filePattern, format=ChoosenFormat)
> # ChoosenFormat here can be an actual imported format or string.
> # Alternatively, you can also pass a loader func in using nl.load(filePatttern, loader=func)
For parallel datasets:
> import natlang as nl
> data = nl.biload(srcPattern, tgtPattern, srcFormat, tgtFormat)
> # Loader option similar to nl.load also applies here. src stands for source, tgt stands for target.
All supported formats are placed under src/format
.
Currently the following formats are tested:
-
txt
: simple text format. Sentences are separated by\n
, tokens/words are separated by whitespace. -
tree
: constituency tree format. Runpython -i format/tree.py
to play around. -
semanticFrame
: Propbank/Nombank frame loader. Returns bundles of frames for analysis. -
AMR
: Abstract Meaning Representation. Runpython -i format/AMR.py
to play around. -
conll
: General CoNLL format loader. Default is CoNLL_U. Runpython -i format/conll.py
to play around.
For formats supporting being loaded from a file, one should implement a load
function in the format file (see 2.1).
For formats supporting being exported, each instance of that format should have
an export
method that outputs a string.
Each format has its own loader.
It is defined as format.FORMAT.load
.
The load
function has the following interface:
def load(file, linesToLoad=sys.maxsize)
At test time, the load
function would be expected to parse the file
description and read from it.
It will return the first linesToLoad
entries as a list.
For example, if one wishes to use load a file in constituency tree format (see
example in tests/sampleTree.txt
), one could do the following:
from datatool.format import tree
x = tree.load("datatool/tests/sampleTree.txt")
This class allows one to load parallel corpora (L1, L2) in any format. You can specify the format for L1 and L2 side separately.
from datatool.loader import ParallelDataLoader
loader = ParallelDataLoader(srcFormat='txtOrTree', tgtFormat='txtOrTree')
Here, 'txtOrTree'
is the default value for srcFormat
and tgtFormat
.
Note that under the format
folder, except for data structures for specific
formats, there are also mere loaders and 'txtOrTree'
is one that can handle
both tree
and txt
.
After initialising the loader, one can just go ahead and run:
loader.load(fFile, eFile, linesToLoad)
The loader will automatically align the parallel text and output a list of
tuples, each containing a single entry in L1 and L2.
Entries with either L1 or L2 being None
or of length 0 will be omitted.
Usage:
from datatool.exporter import exportToFile, RealtimeExporter
Export a txt
format dataset or tree
format dataset (not single entry, but
rather a dataset) to file.
The code is pretty self-explanatory. If the export function of a specific format takes quite a bit of time, this method is recommended.