download_missing_files seems not to provide all missing files, in particular *_mwt_expander.pt (dutch) #80

Open
tcbrouwer opened this issue Feb 8, 2024 · 1 comment

After training on a separate machine we got some promising results, and we are now looking to move our model into production. However, we have run into an issue when downloading the missing files and verifying the model like this:

# First we download any missing files and verify the pipeline

import trankit

# Download any missing files
trankit.download_missing_files(
    category='customized-mwt-ner',
    save_dir='./trankit_model',
    embedding_name='xlm-roberta-base',
    language='dutch'
)

# Verify the pipeline
trankit.verify_customized_pipeline(
    category='customized-mwt-ner', # pipeline category
    save_dir='./trankit_model', # directory used for saving models in previous steps
    embedding_name='xlm-roberta-base' # embedding version that we use for training our customized pipeline, by default, it is `xlm-roberta-base`
)

Leads to the following output and error:

Missing ./trankit_model/xlm-roberta-base/customized-mwt-ner/customized-mwt-ner_mwt_expander.pt
Missing ./trankit_model/xlm-roberta-base/customized-mwt-ner/customized-mwt-ner_lemmatizer.pt
Missing ./trankit_model/xlm-roberta-base/customized-mwt-ner/customized-mwt-ner.ner.mdl
Missing ./trankit_model/xlm-roberta-base/customized-mwt-ner/customized-mwt-ner.ner-vocab.json
http://nlp.uoregon.edu/download/trankit/v1.0.0/xlm-roberta-base/dutch.zip
Downloading: 100%|██████████| 46.3M/46.3M [01:07<00:00, 682kiB/s] 
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[7], line 6
      3 import trankit
      5 # Download any missing files
----> 6 trankit.download_missing_files(
      7 	category='customized-mwt-ner', 
      8 	save_dir='./trankit_model', 
      9 	embedding_name='xlm-roberta-base', 
     10 	language='dutch'
     11 )
     13 # Verify the pipeline
     14 trankit.verify_customized_pipeline(
     15     category='customized-mwt-ner', # pipeline category
     16     save_dir='./trankit_model', # directory used for saving models in previous steps
     17     embedding_name='xlm-roberta-base' # embedding version that we use for training our customized pipeline, by default, it is `xlm-roberta-base`
     18 )

File ~/Projects/UDParserEvaluation/venv/lib/python3.10/site-packages/trankit/__init__.py:71, in download_missing_files(category, save_dir, embedding_name, language)
     69 tgt_dir = os.path.join(save_dir, embedding_name, category)
     70 for fname in missing_filenamess:
---> 71     copyfile(os.path.join(src_dir, fname.format(language)), os.path.join(tgt_dir, fname.format(category)))
     72     print('Copying {} to {}'.format(
     73         os.path.join(src_dir, fname.format(language)),
     74         os.path.join(tgt_dir, fname.format(category))
     75     ))
     76 remove_with_path(src_dir)

File /usr/lib/python3.10/shutil.py:254, in copyfile(src, dst, follow_symlinks)
    252     os.symlink(os.readlink(src), dst)
    253 else:
--> 254     with open(src, 'rb') as fsrc:
    255         try:
    256             with open(dst, 'wb') as fdst:
    257                 # macOS

FileNotFoundError: [Errno 2] No such file or directory: './trankit_model/xlm-roberta-base/dutch/dutch_mwt_expander.pt'

No file named *_mwt_expander.pt seems to be present.

I tried downloading a few zips from http://nlp.uoregon.edu/download/trankit/ and its subfolders, but could not find an mwt_expander file in any of them.
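A minimal sketch (not part of trankit) for checking which of the files reported as missing are actually present in an extracted download directory. The filename templates are an assumption inferred from the "Missing ..." lines and the `fname.format(language)` call in the traceback above; `absent_source_files` is a hypothetical helper name.

```python
import os

# Filename templates inferred from the "Missing ..." output above.
# Assumption: the "{}" slot is filled with the language name inside the
# extracted download directory (e.g. "dutch_mwt_expander.pt").
TEMPLATES = [
    "{}_mwt_expander.pt",
    "{}_lemmatizer.pt",
    "{}.ner.mdl",
    "{}.ner-vocab.json",
]


def absent_source_files(src_dir, language, templates=TEMPLATES):
    """Return the expected source paths that do not exist in src_dir."""
    expected = [os.path.join(src_dir, t.format(language)) for t in templates]
    return [path for path in expected if not os.path.isfile(path)]
```

Calling `absent_source_files('./trankit_model/xlm-roberta-base/dutch', 'dutch')` on the directory named in the error message would list every file that the copy step expects but cannot find.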

Am I missing something?

The model was trained like this:

import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
        'category': 'customized-mwt-ner',  # pipeline category
        'task': 'posdep',  # task name
        'save_dir': './trankit_model',  # directory for saving trained model
        'train_conllu_fpath': './corpus/split-conllu/train.conllu',  # annotations file in CONLLU format for training
        'dev_conllu_fpath': './corpus/split-conllu/dev.conllu'  # annotations file in CONLLU format for development
    }
)

# start training
trainer.train()

tcbrouwer commented Feb 9, 2024

For now, we have chosen to run a model of the "customized" type instead of the "customized-mwt-ner" type. For "customized" all missing files seem to be downloaded correctly.

https://trankit.readthedocs.io/en/latest/training.html
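A rough sketch of what the workaround looks like, reusing the two calls from the original report with the category switched to "customized". It assumes the pipeline was also trained and saved with 'category': 'customized'; `prepare_customized_pipeline` is a hypothetical wrapper name, not a trankit function.

```python
def prepare_customized_pipeline(save_dir, language="dutch",
                                embedding_name="xlm-roberta-base"):
    """Download missing files for a 'customized' pipeline, then verify it."""
    # Imported lazily so the function can be defined without trankit installed.
    import trankit

    # Same call as in the original report, but with category='customized'.
    trankit.download_missing_files(
        category="customized",
        save_dir=save_dir,
        embedding_name=embedding_name,
        language=language,
    )
    # Verification takes the same category, save_dir and embedding_name.
    trankit.verify_customized_pipeline(
        category="customized",
        save_dir=save_dir,
        embedding_name=embedding_name,
    )

# Example call (requires trankit and network access):
# prepare_customized_pipeline("./trankit_model")
```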
