Query about definitions of tokenizer schemes

I am working on an Arabic NLP project where I need to find out the number of morphemes in texts. That's how I came across CAMeL Tools, which I have installed on an Amazon EC2 instance. Thank you for all the work that has gone into making CAMeL Tools!

I want to know more about the MorphologicalTokenizer schemes (such as d3tok, atbtok, bwtok, etc.). In particular, I would like to know how they are defined. Is this documented somewhere? For example, when I specify atbtok, what can I expect in terms of tokenisation output?
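For context, here is a minimal sketch of how the schemes could be compared by segment count. It assumes that `MorphologicalTokenizer` with `split=False` joins the morphological tokens of each word with an underscore, as the API documentation describes. The helper names `count_segments` and `compare_schemes` are hypothetical, and the CAMeL Tools calls are left as comments because they need the pretrained data to be installed. Note that the segments a scheme produces are not necessarily one-to-one with morphemes.

```python
from typing import Callable, Dict, Iterable, List


def count_segments(tokens: Iterable[str], delimiter: str = "_") -> int:
    """Count delimiter-separated segments across all tokens.

    Assumes each token is one word whose morphological tokens are joined
    by `delimiter` (the unsplit tokenizer output format is an assumption).
    """
    return sum(len([part for part in tok.split(delimiter) if part])
               for tok in tokens)


def compare_schemes(
    words: List[str],
    tokenizers: Dict[str, Callable[[List[str]], List[str]]],
) -> Dict[str, int]:
    """Run each tokenizer callable on the same words and count segments."""
    return {name: count_segments(tokenize(words))
            for name, tokenize in tokenizers.items()}


# Possible wiring with CAMeL Tools (not executed here; requires its data):
#
# from camel_tools.disambig.mle import MLEDisambiguator
# from camel_tools.tokenizers.morphological import MorphologicalTokenizer
# from camel_tools.tokenizers.word import simple_word_tokenize
#
# mle = MLEDisambiguator.pretrained()
# tokenizers = {
#     scheme: MorphologicalTokenizer(mle, scheme=scheme, split=False).tokenize
#     for scheme in ("d3tok", "atbtok", "bwtok")
# }
# print(compare_schemes(simple_word_tokenize(text), tokenizers))
```

Comparing the counts this way only shows that the schemes differ, not why, which is the reason documentation of the scheme definitions would be more useful.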

I'm guessing ATB stands for Arabic Treebank, but there are several of these. And BW presumably stands for Buckwalter? I would be grateful if you could point me in the direction of some documentation. This would really help me work out which scheme to use, rather than relying on trial and error :-)

Many thanks in advance.

Query about definitions of tokenizer schemes #152

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Query about definitions of tokenizer schemes #152

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions