Add community contribution guidelines #387

Closed
eu9ene opened this issue Jan 23, 2024 · 1 comment · Fixed by #973
Labels
documentation Improvements or additions to documentation

Comments

@eu9ene
Collaborator

eu9ene commented Jan 23, 2024

People keep asking how to help add another language.

  1. A good first step would be helping to research datasets. To estimate the feasibility of training, we need statistics on how much data is available, including monolingual datasets.

  2. Contributing datasets that are not on OPUS or mtdata. A good example is when folks provided data for Catalan, and now @gregtatum is experimenting with it.

  3. Helping tune cleaning rules. We have only just started looking into OpusCleaner ourselves. In the future we could provide a guide on how to run the UI, tune rules for a language pair, and contribute configs to the repo.

  4. For those looking to train a language pair themselves, helping to maintain Snakemake would be handy.

  5. We might have simple issues to take care of as part of the training pipeline.

We can set up a workflow on GitHub by creating an issue for each language (ideally from a template), adding all the stats, and discussing anything related to that language there.

We should add a doc with clear guidelines on all this.
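As a rough illustration of the kind of statistics mentioned in step 1, here is a minimal sketch that counts aligned sentence pairs and tokens for one parallel corpus. It assumes the data is stored as two line-aligned text files (optionally gzip-compressed), and it uses whitespace splitting as a crude token count. The function names and file layout are hypothetical and are not part of the training pipeline.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading as UTF-8."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def corpus_stats(src_path, trg_path):
    """Count aligned sentence pairs and whitespace-separated tokens per side.

    Hypothetical helper: assumes line N of src_path is the translation
    of line N of trg_path.
    """
    pairs = src_tokens = trg_tokens = 0
    with open_text(src_path) as src, open_text(trg_path) as trg:
        for src_line, trg_line in zip(src, trg):
            src_line, trg_line = src_line.strip(), trg_line.strip()
            if not src_line or not trg_line:
                continue  # skip pairs where either side is empty
            pairs += 1
            src_tokens += len(src_line.split())
            trg_tokens += len(trg_line.split())
    return {"pairs": pairs, "src_tokens": src_tokens, "trg_tokens": trg_tokens}
```

Calling `corpus_stats("corpus.en", "corpus.ca.gz")` on a downloaded dataset (file names are placeholders) gives numbers that could be pasted into a per-language issue. The same function can be pointed at a monolingual file twice to get a line and token count for monolingual data.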

@eu9ene added the community and documentation labels Jan 23, 2024
@marco-c
Collaborator

marco-c commented Jan 23, 2024

> Helping tune cleaning rules. We have only just started looking into OpusCleaner ourselves. In the future we could provide a guide on how to run the UI, tune rules for a language pair, and contribute configs to the repo.

I think something like hplt-project/OpusCleaner#148 (comment) would be ideal here.


2 participants