Query about definitions of tokenizer schemes #152

@abdulmaalik2025

I am working on an Arabic NLP project where I need to count the number of morphemes in texts. That's how I came across CAMeL Tools, which I have installed on an Amazon EC2 instance. Thank you for all the work that has gone into making CAMeL Tools!

I want to know more about the MorphologicalTokenizer schemes (such as d3tok, atbtok, bwtok, etc.). In particular, I would like to know how they are defined. Is this documented somewhere? For example, when I specify atbtok, what can I expect in terms of tokenisation output?

I'm guessing ATB stands for Arabic Treebank, but there are several of these. And BW must stand for Buckwalter? I would be grateful if you could point me in the direction of some documentation. This would really help me work out which scheme to use, rather than relying on trial and error :-)
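
For context, here is a minimal sketch of the kind of side-by-side comparison and morpheme counting I have in mind. It assumes the MorphologicalTokenizer is built on the pretrained MLEDisambiguator, that the listed schemes are available in the default database, and that unsplit output joins segments with an underscore. The helper names `count_morphemes` and `compare_schemes` are my own, not part of CAMeL Tools.

```python
from typing import Dict, List


def count_morphemes(tokens: List[str], delimiter: str = "_") -> int:
    """Count morphological segments in tokenizer output.

    Assumption: each item is either one segment (split=True) or several
    segments joined by `delimiter` (split=False). Empty pieces are ignored.
    """
    return sum(
        len([seg for seg in tok.split(delimiter) if seg]) for tok in tokens
    )


def compare_schemes(sentence: str, schemes: List[str]) -> Dict[str, List[str]]:
    """Tokenize one sentence under each scheme and return the segments.

    Hypothetical helper: requires CAMeL Tools and its data to be installed.
    """
    # Imported here so count_morphemes can be used without CAMeL Tools.
    from camel_tools.disambig.mle import MLEDisambiguator
    from camel_tools.tokenizers.morphological import MorphologicalTokenizer
    from camel_tools.tokenizers.word import simple_word_tokenize

    words = simple_word_tokenize(sentence)
    mle = MLEDisambiguator.pretrained()

    results = {}
    for scheme in schemes:
        # split=True returns each segment as its own string.
        tokenizer = MorphologicalTokenizer(mle, scheme=scheme, split=True)
        results[scheme] = tokenizer.tokenize(words)
    return results
```

With this, something like `compare_schemes(text, ["d3tok", "atbtok", "bwtok"])` followed by `count_morphemes` on each result would show how the counts differ per scheme, but that is still trial and error without knowing how each scheme is defined.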

Many thanks, in advance.
