
Commit 08ef862

Docker (#11)
* add dockerfile
* use CLI
* elaborate README.md
* pretty print table
1 parent ce1eed2 commit 08ef862

1,060 files changed: +533 -21,103 lines


.dockerignore (new file, +4)

**/__pycache__/**/*
**/__pycache__
**/.cachier/**/*
**/.cachier

.gitignore (+1)

@@ -3,6 +3,7 @@ checkpoints
 .ipynb_checkpoints
 .idea
 .vscode
+.cachier
 group*-shard*
 models
 final_model

Dockerfile (new file, +23)

FROM tensorflow/tensorflow:latest-gpu

#ENV VIRTUAL_ENV=/opt/venv
#RUN python -m venv $VIRTUAL_ENV
#ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN pip install --upgrade pip

WORKDIR /app
COPY README.md README.md
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

COPY hebrew_diacritized hebrew_diacritized
COPY tests tests
COPY models models
COPY nakdimon nakdimon

RUN chown -R 1000:1000 .
RUN chmod -R 755 .

#RUN nohup python nakdimon server &

#ENTRYPOINT ["python", "nakdimon"]
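The commit does not show a build step for this image; a minimal sketch, assuming the image is built from the repository root and tagged `nakdimon-gpu` to match the `docker run` example in the README below:

```
# hypothetical build command; the nakdimon-gpu tag matches the README's run example
$ docker build -t nakdimon-gpu .
```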

README.md (+55 -5)

@@ -1,15 +1,65 @@
-# Nakdimon - a simple Hebrew Diacritizer
+## Running the Docker container
+```
+$ docker run --rm --gpus all --user 1000:1000 -it nakdimon-gpu
+```
+
+The `--gpus all` flag is required to run the container with GPU support.
 
-Repository for the paper [Restoring Hebrew Diacritics Without a Dictionary](https://arxiv.org/abs/2105.05209) by Elazar Gershuni and Yuval Pinter.
+## Training and evaluating
+To train, test and evaluate the system, run the following commands:
+```
+> python nakdimon train --model=models/Nakdimon.h5
+> python nakdimon run_test --test_set=tests/new --model=models/Nakdimon.h5
+> python nakdimon results --test_set=tests/new --systems Snopi Morfix Dicta MajAllWithDicta Nakdimon
+```
+The first step trains the model and creates a file named `Nakdimon.h5` in the `models` directory.
+By default, the model is the one described in the paper: `models/Nakdimon.h5`.
+If the model already exists, you may skip this step.
 
-Demo: http://www.nakdimon.org/
+The second step asks the Nakdimon server to predict the diacritics for the test set.
+A folder for the results is created in the chosen test folder, with the same name as the model; in this case, `tests/new/NakdimonNew`.
+By default, the test set is the one used in the paper (`tests/new`); you can use `tests/dicta` instead.
+If the test results already exist, you may skip this step. If you are not sure, use the `--skip-existing` flag.
+
+The third step calculates and prints the results (the DEC, CHA, WOR and VOC metrics, as well as OOV_WOR and OOV_VOC).
+By default, the systems evaluated are the folders in the chosen test folder.
+For the Dicta test set (`tests/dicta`), use `MajAllNoDicta` instead of `MajAllWithDicta`; otherwise the vocabulary for the Majority baseline would include the test set itself.
+
+## Diacritizing a single file
+```
+> python nakdimon predict input_file.txt output_file.txt
+```
 
-Citation (until NAACL 2022 prceedings are available):
+## Using other systems
+You can use the `run_test` command to run the test set through other systems, such as Dicta:
+```
+> python nakdimon run_test --test_set=tests/new --system=Dicta
+```
+This creates a folder named `Dicta` for the results in the `tests/new` folder.
+Note that `Morfix` cannot be used in this manner, as its license prohibits automatic use.
+
+## Running ablation tests
+You can use the `--ablation` flag to train different models for the ablation tests and other experiments:
+```
+> python nakdimon train --model=models/SingleLayer.h5 --ablation=SingleLayer
+```
+See the file `ablations.py` for the list of available ablation parameters.
+
+## Important folders
+* `hebrew_diacritized` is the training set.
+* `tests` contains three test sets: `new`, `dicta` and `validation`.
+  Each test set has an `expected` folder that contains the ground truth.
+  The results of `python nakdimon run_test` are stored in a sibling folder, named after the model.
+* `models` contains the trained models.
+* `nakdimon` holds the source code.
+
+## Citation
+(until the NAACL 2022 proceedings are available):
 ```
 @article{gershuni2021restoring,
   title={Restoring Hebrew Diacritics Without a Dictionary},
   author={Gershuni, Elazar and Pinter, Yuval},
   journal={arXiv preprint arXiv:2105.05209},
   year={2021}
 }
-```
+```
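With the Dockerfile's `ENTRYPOINT` commented out, the container does not start Nakdimon on its own; a usage sketch, assuming the image was built as `nakdimon-gpu` and that the input file is mounted into the container (the `-v` mount and the `/data` paths are illustrative assumptions, not part of the commit):

```
# run a one-off diacritization inside the container; /data is a hypothetical mount point
$ docker run --rm --gpus all --user 1000:1000 -v "$PWD:/data" -it nakdimon-gpu \
    python nakdimon predict /data/input_file.txt /data/output_file.txt
```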

hebrew_diacritized

Submodule hebrew_diacritized updated 90 files (the 8 shown were renamed without changes).

nakdimon/__main__.py (new file, +97)

import argparse
import sys
import os
import logging


def do_train(**kwargs):
    import train
    train.main(**kwargs)


def do_run_test(**kwargs):
    import run_test
    run_test.main(**kwargs)


def do_metrics(**kwargs):
    import metrics
    metrics.main(**kwargs)


def do_predict(**kwargs):
    import predict
    predict.main(**kwargs)


def do_server(**kwargs):
    # Replace the current process with the flask server module.
    import os
    import sys
    import pkgutil
    package = pkgutil.get_loader("server")
    assert package is not None
    logging.info("Executing flask server...")
    os.execv(sys.executable, [sys.executable, package.get_filename()])
    exit(1)  # unreachable unless execv fails


if __name__ == '__main__':

    logging.basicConfig(level=logging.INFO, format='%(levelname)s - %(message)s')

    # Parse command line arguments
    parser = argparse.ArgumentParser(
        description="""Train and evaluate Nakdimon and other diacritizers. Reproduce the Nakdimon paper.""",
    )
    parser.add_argument('-q', '--quiet', action='store_true', help='suppress info logging.', default=False)

    subparsers = parser.add_subparsers(help='sub-command help', dest="command", required=True)

    parser_train = subparsers.add_parser('train', help='train Nakdimon')
    parser_train.add_argument('--wandb', action='store_true', help='use wandb.', default=False)
    parser_train.add_argument('--model', help='path to output model (.h5 file)', default='models/Full.h5', dest='model_path')
    parser_train.add_argument('--ablation', help='ablation test', default=None, dest='ablation_name')
    parser_train.set_defaults(func=do_train)

    test_systems = ['Snopi', 'Morfix', 'Dicta', 'Nakdimon', 'MajMod', 'MajAllWithDicta', 'MajAllWithoutDicta']
    # iterate over folders to find available options:
    available_tests = [f'tests/{folder}' for folder in os.listdir('tests/')
                       if os.path.isdir(f'tests/{folder}') and os.path.isdir(f'tests/{folder}/expected')]

    parser_test = subparsers.add_parser('run_test', help='diacritize a test set')
    parser_test.add_argument('--test_set', choices=available_tests, help='choose test set', default='tests/new')
    parser_test.add_argument('--system', choices=test_systems, help='diacritization system to use', default='Nakdimon')
    parser_test.add_argument('--model', help='path to model (.h5 file)', default='models/Nakdimon.h5', dest='model_path')
    parser_test.add_argument('--skip-existing', action='store_true', help='skip existing files')
    parser_test.set_defaults(func=do_run_test)

    parser_predict = subparsers.add_parser('predict', help='diacritize a text file')
    parser_predict.add_argument('input_path', help='input file')
    parser_predict.add_argument('output_path', help='output file')
    parser_predict.set_defaults(func=do_predict)

    # parser_daemon = subparsers.add_parser('server', help='run Nakdimon server as a daemon')
    # parser_daemon.set_defaults(func=do_server)

    parser_eval = subparsers.add_parser('results', help='evaluate the results of a test run')
    parser_eval.add_argument('--test_set', choices=available_tests, help='choose test set', default='tests/new')
    # Peek at the arguments first, so the choices for --systems reflect the chosen test set:
    partial_result, _ = parser.parse_known_args()
    if partial_result.command == 'results':
        test_systems = [folder for folder in os.listdir(partial_result.test_set)
                        if os.path.isdir(f'{partial_result.test_set}/{folder}') and folder != 'expected']
    parser_eval.add_argument('--systems', choices=test_systems, nargs='+', help='list of systems to evaluate',
                             default=test_systems)
    parser_eval.set_defaults(func=do_metrics)

    args = parser.parse_args()

    if args.quiet:
        logging.disable(logging.INFO)
    del args.quiet

    # Dispatch to the selected subcommand with the remaining arguments:
    kwargs = vars(args).copy()
    del kwargs['command']
    del kwargs['func']
    args.func(**kwargs)

    sys.exit(0)
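The new entry point makes each workflow step a subcommand; a few illustrative invocations built from the flags defined above (run from the repository root, since `tests/` is scanned at startup, and note the top-level `--quiet` flag precedes the subcommand):

```
# train with Weights & Biases logging enabled
$ python nakdimon train --model=models/Nakdimon.h5 --wandb

# diacritize the paper's test set, skipping files that already have results
$ python nakdimon --quiet run_test --test_set=tests/new --skip-existing

# evaluate every system folder found under tests/new
$ python nakdimon results --test_set=tests/new
```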

nakdimon/ablations.py (+3 -15)

@@ -1,8 +1,6 @@
-import keras
 import tensorflow as tf
 
-import train
-from train import TrainingParams, train_ablation
+from train import TrainingParams
 import schedulers
 
 
@@ -97,7 +95,7 @@ def epoch_params(self, data):
 class Chunk(TrainingParams):
     def __init__(self, maxlen):
         super().__init__()
-        self.maxlen = maxlen
+        self.maxlen = int(maxlen)
 
     @property
     def name(self):
@@ -107,7 +105,7 @@ def name(self):
 class Batch(TrainingParams):
     def __init__(self, batch_size):
         super().__init__()
-        self.batch_size = batch_size
+        self.batch_size = int(batch_size)
 
     @property
     def name(self):
@@ -211,13 +209,3 @@ def epoch_params(self, data):
         yield ('automatic', len(lrs1), tf.keras.callbacks.LearningRateScheduler(lambda epoch, lr: lrs1[epoch-1-1]))
         lrs2 = [10e-4, 10e-4, 3e-4]
         yield ('modern', len(lrs2), tf.keras.callbacks.LearningRateScheduler(lambda epoch, lr: lrs2[epoch-len(lrs1)-1-1]))
-
-
-if __name__ == '__main__':
-    # units = 400
-    print(train.Full().build_model().count_params())
-    # for cls in [train.TwoLevelLSTM]:
-    #     for i in range(1):
-    #         print(cls(units).build_model().count_params())
-    #         train_ablation(cls(units), group=f"{cls.__name__}:2022")
-

nakdimon/app.py (deleted, -27)

This file was deleted.

nakdimon/dataset.py (-5)

@@ -3,8 +3,6 @@
 import random
 import numpy as np
 
-from cachier import cachier
-
 import hebrew
 import utils
 
@@ -128,7 +126,6 @@ def read_corpora(base_paths) -> tuple[tuple[str, list[hebrew.HebrewItem]], ...]:
     return tuple([(filename, list(hebrew.iterate_file(filename))) for filename in utils.iterate_files(base_paths)])
 
 
-@cachier()
 def load_data(base_paths, maxlen: int) -> Data:
     corpora = read_corpora(base_paths)
     corpus = [(filename, Data.from_text(heb_items, maxlen)) for (filename, heb_items) in corpora]
@@ -145,5 +142,3 @@ def load_data(base_paths, maxlen: int) -> Data:
     # print(res)
     print_tables()
     print(letters_table.to_ids(["שלום"]))
-
-# load_data.clear_cache()
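With the `@cachier()` decorator removed, `load_data` no longer reads or writes `.cachier` caches, which is why both ignore files now list them; a hedged cleanup sketch for stale caches left by earlier runs (the in-tree location is an assumption, since cachier can also cache under the home directory):

```
# hypothetical cleanup of leftover cachier caches in the working tree
$ find . -type d -name ".cachier" -prune -exec rm -rf {} +
```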
Binary file not shown (-24.3 KB).
