[BUG] When running the tagger, some words are missing features like 'lex' and 'diac' for lev and glf pretrained models #142

@fadhleryani


For lev, take a word like 'حقوق', for example. When I run the following:

BERTUnfactoredDisambiguator.pretrained(model_name='glf').tag_sentence('يشسي'.split())

it returns an analysis with no 'lex' or 'diac' keys:

[{'pos': 'noun',
  'prc3': '0',
  'prc2': '0',
  'prc1': '0',
  'prc0': 'Al_det',
  'per': 'na',
  'asp': 'na',
  'vox': 'na',
  'mod': 'no',
  'form_gen': 'm',
  'form_num': 's',
  'stt': 'no',
  'cas': 'no',
  'enc0': '0',
  'enc1': '0',
  'enc2': '0'}]

Even for words with no analysis, the expected behavior is to back off to the original word, right? So this is definitely a bug, correct?

For glf, try the word 'شئ' and you'll get something without lex and diac.
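Until this is fixed, a possible workaround on the caller's side is to patch the tagger output after the fact. The sketch below is only an illustration of the backoff behavior I expected: `missing_features` and `backoff_to_surface` are hypothetical helpers, not part of CAMeL Tools, and it assumes `tag_sentence` returns one dict per input word, as in the output above.

```python
# Hypothetical post-processing helpers (not part of CAMeL Tools).
# Assumption: the tagger returns one analysis dict per input word.

REQUIRED_FEATURES = ("lex", "diac")


def missing_features(analysis, required=REQUIRED_FEATURES):
    """Return the required feature names that are absent from an analysis."""
    return [feat for feat in required if feat not in analysis]


def backoff_to_surface(words, analyses, required=REQUIRED_FEATURES):
    """Return copies of the analyses where each missing required feature
    is filled in with the original surface word."""
    if len(words) != len(analyses):
        raise ValueError("expected one analysis per word")
    patched = []
    for word, analysis in zip(words, analyses):
        fixed = dict(analysis)  # copy so the tagger output is not mutated
        for feat in missing_features(fixed, required):
            fixed[feat] = word
        patched.append(fixed)
    return patched


# Usage with an abridged version of the analysis shown above.
words = ["شئ"]
analyses = [{"pos": "noun", "prc0": "Al_det", "enc0": "0"}]
patched = backoff_to_surface(words, analyses)
print(missing_features(analyses[0]))
print(patched[0]["lex"], patched[0]["diac"])
```

This only hides the symptom: the surface word is not a real lemma or diacritization, so the underlying bug in the pretrained models still needs a fix.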

Desktop (please complete the following information):

  • macOS 14.1.1
  • Python version 3.10.14
  • CAMeL Tools version 1.5.2 from pip
