Description
I am working on an Arabic NLP project where I need to find out the number of morphemes in texts. That's how I came across CAMeL Tools, which I have installed on an Amazon EC2 instance. Thank you for all the work that has gone into making CAMeL Tools!
I want to know more about the MorphologicalTokenizer schemes (d3tok, atbtok, bwtok, etc.). In particular, I would like to know how they are defined. Is this documented somewhere? For example, when I specify atbtok, what can I expect in terms of tokenisation output?
I'm guessing ATB stands for Arabic Treebank, but there are several of these. And BW presumably stands for Buckwalter? I would be grateful if you could point me to some documentation. This would really help me work out which scheme to use, rather than relying on trial and error :-)
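For context, here is a minimal sketch of the counting step I have in mind. It assumes the tokenizer output is a list of strings in which the segments of each word are joined by an underscore (as I understand is the case when MorphologicalTokenizer is used with split=False). The helper names and the sample strings are hypothetical and are not real tokenizer output; the count is also only a segment count under whichever scheme is chosen, which may not match a linguistic morpheme count.

```python
from typing import Iterable, List


def split_morphemes(token: str, delimiter: str = "_") -> List[str]:
    """Split one tokenized word into its segments.

    Assumption: segments within a word are joined by `delimiter`.
    Empty pieces (e.g. from a doubled delimiter) are dropped.
    """
    return [piece for piece in token.split(delimiter) if piece]


def count_morphemes(tokens: Iterable[str], delimiter: str = "_") -> int:
    """Count the total number of segments across all tokenized words."""
    return sum(len(split_morphemes(token, delimiter)) for token in tokens)


# Hypothetical, hand-written sample in a Buckwalter-like transliteration;
# it only illustrates the assumed format.
sample = ["w+_b+_ktAb", "jdyd"]
total = count_morphemes(sample)
print(total)
```

Because the count depends entirely on how each scheme segments a word, the same text could give different totals under d3tok, atbtok and bwtok, which is why I would like to see the scheme definitions.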
Many thanks, in advance.