Commit 89ef02f

andrenatal and CircleCI evaluation job authored
Adding Catalan to English models (#72)
* Adding Catalan to English models
* Update evaluation results [skip ci]
* Update model registry [skip ci]

---------

Co-authored-by: CircleCI evaluation job <ci-models-evaluation@firefox-translations>
1 parent c652b86 commit 89ef02f

75 files changed: +1262 -1038 lines changed

evaluation/dev/bleu-results.md

Lines changed: 65 additions & 54 deletions
@@ -56,57 +56,68 @@ Both absolute and relative differences in BLEU scores between Bergamot and other
 
 ## avg
 
-| Translator/Dataset | en-ru | ru-en | en-nl | fa-en | uk-en | en-fa | is-en | nl-en | en-uk |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| bergamot | 29.44 | 33.69 | 27.30 | 28.70 | 35.93 | 17.30 | 23.40 | 29.65 | 26.30 |
-| google | 34.49 (+5.05, +17.15%) | 38.20 (+4.51, +13.38%) | 29.30 (+2.00, +7.33%) | 40.85 (+12.15, +42.33%) | 42.43 (+6.50, +18.09%) | 27.80 (+10.50, +60.69%) | 38.90 (+15.50, +66.24%) | 33.05 (+3.40, +11.47%) | 32.63 (+6.33, +24.08%) |
-| microsoft | 33.62 (+4.18, +14.21%) | 38.38 (+4.68, +13.90%) | 28.80 (+1.50, +5.49%) | 36.15 (+7.45, +25.96%) | 42.30 (+6.37, +17.72%) | 20.50 (+3.20, +18.50%) | 38.17 (+14.77, +63.11%) | 32.60 (+2.95, +9.95%) | 32.03 (+5.73, +21.80%) |
+| Translator/Dataset | ru-en | en-nl | en-ru | en-fa | nl-en | uk-en | fa-en | ca-en | en-uk | is-en |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| bergamot | 33.69 | 27.30 | 29.44 | 17.30 | 29.65 | 35.93 | 28.70 | 38.00 | 26.30 | 23.40 |
+| google | 38.20 (+4.51, +13.38%) | 29.30 (+2.00, +7.33%) | 34.49 (+5.05, +17.15%) | 27.80 (+10.50, +60.69%) | 33.05 (+3.40, +11.47%) | 42.43 (+6.50, +18.09%) | 40.85 (+12.15, +42.33%) | 48.95 (+10.95, +28.82%) | 32.63 (+6.33, +24.08%) | 38.90 (+15.50, +66.24%) |
+| microsoft | 38.38 (+4.68, +13.90%) | 28.80 (+1.50, +5.49%) | 33.62 (+4.18, +14.21%) | 20.50 (+3.20, +18.50%) | 32.60 (+2.95, +9.95%) | 42.30 (+6.37, +17.72%) | 36.15 (+7.45, +25.96%) | 46.50 (+8.50, +22.37%) | 32.03 (+5.73, +21.80%) | 38.17 (+14.77, +63.11%) |
 
 ![Results](img/avg-bleu.png)
 ---
 
-## en-ru
-
-| Translator/Dataset | wmt20 | wmt13 | flores-test | flores-dev | wmt21 | wmt19 | wmt17 | wmt16 | wmt15 | wmt14 | wmt22 | wmt18 |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| bergamot | 22.00 | 26.20 | 29.20 | 29.90 | 25.50 | 31.40 | 33.60 | 30.90 | 31.40 | 38.20 | 26.50 | 28.50 |
-| google | 27.20 (+5.20, +23.64%) | 28.00 (+1.80, +6.87%) | 34.40 (+5.20, +17.81%) | 34.90 (+5.00, +16.72%) | 30.00 (+4.50, +17.65%) | 32.90 (+1.50, +4.78%) | 38.90 (+5.30, +15.77%) | 35.00 (+4.10, +13.27%) | 36.90 (+5.50, +17.52%) | 45.70 (+7.50, +19.63%) | 35.00 (+8.50, +32.08%) | 35.00 (+6.50, +22.81%) |
-| microsoft | 26.30 (+4.30, +19.55%) | 27.30 (+1.10, +4.20%) | 33.60 (+4.40, +15.07%) | 33.50 (+3.60, +12.04%) | 29.20 (+3.70, +14.51%) | 33.20 (+1.80, +5.73%) | 38.60 (+5.00, +14.88%) | 34.20 (+3.30, +10.68%) | 36.10 (+4.70, +14.97%) | 44.70 (+6.50, +17.02%) | 33.10 (+6.60, +24.91%) | 33.70 (+5.20, +18.25%) |
-
-![Results](img/en-ru-bleu.png)
----
-
 ## ru-en
 
-| Translator/Dataset | flores-dev | mtedx_test | wmt18 | wmt20 | wmt19 | wmt15 | wmt17 | wmt14 | wmt16 | wmt22 | wmt13 | flores-test | wmt21 |
+| Translator/Dataset | mtedx_test | wmt19 | wmt17 | flores-dev | wmt22 | flores-test | wmt14 | wmt15 | wmt16 | wmt13 | wmt18 | wmt21 | wmt20 |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| bergamot | 31.90 | 24.00 | 31.90 | 35.00 | 39.10 | 33.50 | 37.60 | 37.80 | 33.00 | 38.50 | 29.30 | 31.00 | 35.40 |
-| google | 38.40 (+6.50, +20.38%) | 25.10 (+1.10, +4.58%) | 37.30 (+5.40, +16.93%) | 38.40 (+3.40, +9.71%) | 42.80 (+3.70, +9.46%) | 38.60 (+5.10, +15.22%) | 42.70 (+5.10, +13.56%) | 42.70 (+4.90, +12.96%) | 37.60 (+4.60, +13.94%) | 43.70 (+5.20, +13.51%) | 32.20 (+2.90, +9.90%) | 37.30 (+6.30, +20.32%) | 39.80 (+4.40, +12.43%) |
-| microsoft | 36.50 (+4.60, +14.42%) | 26.20 (+2.20, +9.17%) | 37.40 (+5.50, +17.24%) | 38.80 (+3.80, +10.86%) | 43.80 (+4.70, +12.02%) | 38.50 (+5.00, +14.93%) | 43.70 (+6.10, +16.22%) | 44.10 (+6.30, +16.67%) | 38.40 (+5.40, +16.36%) | 43.90 (+5.40, +14.03%) | 32.50 (+3.20, +10.92%) | 36.10 (+5.10, +16.45%) | 39.00 (+3.60, +10.17%) |
+| bergamot | 24.00 | 39.10 | 37.60 | 31.90 | 38.50 | 31.00 | 37.80 | 33.50 | 33.00 | 29.30 | 31.90 | 35.40 | 35.00 |
+| google | 25.10 (+1.10, +4.58%) | 42.80 (+3.70, +9.46%) | 42.70 (+5.10, +13.56%) | 38.40 (+6.50, +20.38%) | 43.70 (+5.20, +13.51%) | 37.30 (+6.30, +20.32%) | 42.70 (+4.90, +12.96%) | 38.60 (+5.10, +15.22%) | 37.60 (+4.60, +13.94%) | 32.20 (+2.90, +9.90%) | 37.30 (+5.40, +16.93%) | 39.80 (+4.40, +12.43%) | 38.40 (+3.40, +9.71%) |
+| microsoft | 26.20 (+2.20, +9.17%) | 43.80 (+4.70, +12.02%) | 43.70 (+6.10, +16.22%) | 36.50 (+4.60, +14.42%) | 43.90 (+5.40, +14.03%) | 36.10 (+5.10, +16.45%) | 44.10 (+6.30, +16.67%) | 38.50 (+5.00, +14.93%) | 38.40 (+5.40, +16.36%) | 32.50 (+3.20, +10.92%) | 37.40 (+5.50, +17.24%) | 39.00 (+3.60, +10.17%) | 38.80 (+3.80, +10.86%) |
 
 ![Results](img/ru-en-bleu.png)
 ---
 
 ## en-nl
 
-| Translator/Dataset | flores-test | flores-dev |
+| Translator/Dataset | flores-dev | flores-test |
 | --- | --- | --- |
-| bergamot | 27.00 | 27.60 |
-| google | 29.20 (+2.20, +8.15%) | 29.40 (+1.80, +6.52%) |
-| microsoft | 28.60 (+1.60, +5.93%) | 29.00 (+1.40, +5.07%) |
+| bergamot | 27.60 | 27.00 |
+| google | 29.40 (+1.80, +6.52%) | 29.20 (+2.20, +8.15%) |
+| microsoft | 29.00 (+1.40, +5.07%) | 28.60 (+1.60, +5.93%) |
 
 ![Results](img/en-nl-bleu.png)
 ---
 
-## fa-en
+## en-ru
+
+| Translator/Dataset | wmt16 | wmt15 | flores-dev | wmt22 | wmt18 | wmt14 | wmt17 | wmt20 | wmt13 | wmt21 | wmt19 | flores-test |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| bergamot | 30.90 | 31.40 | 29.90 | 26.50 | 28.50 | 38.20 | 33.60 | 22.00 | 26.20 | 25.50 | 31.40 | 29.20 |
+| google | 35.00 (+4.10, +13.27%) | 36.90 (+5.50, +17.52%) | 34.90 (+5.00, +16.72%) | 35.00 (+8.50, +32.08%) | 35.00 (+6.50, +22.81%) | 45.70 (+7.50, +19.63%) | 38.90 (+5.30, +15.77%) | 27.20 (+5.20, +23.64%) | 28.00 (+1.80, +6.87%) | 30.00 (+4.50, +17.65%) | 32.90 (+1.50, +4.78%) | 34.40 (+5.20, +17.81%) |
+| microsoft | 34.20 (+3.30, +10.68%) | 36.10 (+4.70, +14.97%) | 33.50 (+3.60, +12.04%) | 33.10 (+6.60, +24.91%) | 33.70 (+5.20, +18.25%) | 44.70 (+6.50, +17.02%) | 38.60 (+5.00, +14.88%) | 26.30 (+4.30, +19.55%) | 27.30 (+1.10, +4.20%) | 29.20 (+3.70, +14.51%) | 33.20 (+1.80, +5.73%) | 33.60 (+4.40, +15.07%) |
+
+![Results](img/en-ru-bleu.png)
+---
+
+## en-fa
+
+| Translator/Dataset | flores-test | flores-dev |
+| --- | --- | --- |
+| bergamot | 17.40 | 17.20 |
+| google | 28.40 (+11.00, +63.22%) | 27.20 (+10.00, +58.14%) |
+| microsoft | 21.10 (+3.70, +21.26%) | 19.90 (+2.70, +15.70%) |
+
+![Results](img/en-fa-bleu.png)
+---
+
+## nl-en
 
 | Translator/Dataset | flores-dev | flores-test |
 | --- | --- | --- |
-| bergamot | 29.10 | 28.30 |
-| google | 42.00 (+12.90, +44.33%) | 39.70 (+11.40, +40.28%) |
-| microsoft | 36.50 (+7.40, +25.43%) | 35.80 (+7.50, +26.50%) |
+| bergamot | 29.70 | 29.60 |
+| google | 33.00 (+3.30, +11.11%) | 33.10 (+3.50, +11.82%) |
+| microsoft | 32.40 (+2.70, +9.09%) | 32.80 (+3.20, +10.81%) |
 
-![Results](img/fa-en-bleu.png)
+![Results](img/nl-en-bleu.png)
 ---
 
 ## uk-en
@@ -120,46 +131,46 @@ Both absolute and relative differences in BLEU scores between Bergamot and other
 ![Results](img/uk-en-bleu.png)
 ---
 
-## en-fa
+## fa-en
 
 | Translator/Dataset | flores-dev | flores-test |
 | --- | --- | --- |
-| bergamot | 17.20 | 17.40 |
-| google | 27.20 (+10.00, +58.14%) | 28.40 (+11.00, +63.22%) |
-| microsoft | 19.90 (+2.70, +15.70%) | 21.10 (+3.70, +21.26%) |
-
-![Results](img/en-fa-bleu.png)
----
-
-## is-en
-
-| Translator/Dataset | flores-dev | flores-test | wmt21 |
-| --- | --- | --- | --- |
-| bergamot | 23.60 | 23.40 | 23.20 |
-| google | 39.40 (+15.80, +66.95%) | 38.60 (+15.20, +64.96%) | 38.70 (+15.50, +66.81%) |
-| microsoft | 37.30 (+13.70, +58.05%) | 36.70 (+13.30, +56.84%) | 40.50 (+17.30, +74.57%) |
+| bergamot | 29.10 | 28.30 |
+| google | 42.00 (+12.90, +44.33%) | 39.70 (+11.40, +40.28%) |
+| microsoft | 36.50 (+7.40, +25.43%) | 35.80 (+7.50, +26.50%) |
 
-![Results](img/is-en-bleu.png)
+![Results](img/fa-en-bleu.png)
 ---
 
-## nl-en
+## ca-en
 
 | Translator/Dataset | flores-dev | flores-test |
 | --- | --- | --- |
-| bergamot | 29.70 | 29.60 |
-| google | 33.00 (+3.30, +11.11%) | 33.10 (+3.50, +11.82%) |
-| microsoft | 32.40 (+2.70, +9.09%) | 32.80 (+3.20, +10.81%) |
+| bergamot | 38.70 | 37.30 |
+| google | 49.60 (+10.90, +28.17%) | 48.30 (+11.00, +29.49%) |
+| microsoft | 46.80 (+8.10, +20.93%) | 46.20 (+8.90, +23.86%) |
 
-![Results](img/nl-en-bleu.png)
+![Results](img/ca-en-bleu.png)
 ---
 
 ## en-uk
 
-| Translator/Dataset | flores-test | wmt22 | flores-dev |
+| Translator/Dataset | flores-dev | flores-test | wmt22 |
 | --- | --- | --- | --- |
-| bergamot | 28.20 | 22.80 | 27.90 |
-| google | 33.10 (+4.90, +17.38%) | 32.00 (+9.20, +40.35%) | 32.80 (+4.90, +17.56%) |
-| microsoft | 33.50 (+5.30, +18.79%) | 30.40 (+7.60, +33.33%) | 32.20 (+4.30, +15.41%) |
+| bergamot | 27.90 | 28.20 | 22.80 |
+| google | 32.80 (+4.90, +17.56%) | 33.10 (+4.90, +17.38%) | 32.00 (+9.20, +40.35%) |
+| microsoft | 32.20 (+4.30, +15.41%) | 33.50 (+5.30, +18.79%) | 30.40 (+7.60, +33.33%) |
 
 ![Results](img/en-uk-bleu.png)
+---
+
+## is-en
+
+| Translator/Dataset | flores-dev | flores-test | wmt21 |
+| --- | --- | --- | --- |
+| bergamot | 23.60 | 23.40 | 23.20 |
+| google | 39.40 (+15.80, +66.95%) | 38.60 (+15.20, +64.96%) | 38.70 (+15.50, +66.81%) |
+| microsoft | 37.30 (+13.70, +58.05%) | 36.70 (+13.30, +56.84%) | 40.50 (+17.30, +74.57%) |
+
+![Results](img/is-en-bleu.png)
 ---
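
Each Google and Microsoft cell in the BLEU tables has the form `score (absolute difference, relative difference%)`, with both differences measured against the Bergamot score in the same column. The sketch below reproduces that formatting under the assumption that the relative difference is the absolute difference divided by the Bergamot score, which is consistent with the ca-en rows. `format_cell` is a hypothetical helper name for illustration and is not taken from the repository.

```python
def format_cell(bergamot_bleu: float, other_bleu: float) -> str:
    """Format another system's BLEU score with its gap to Bergamot.

    Assumption: the relative difference is (other - bergamot) / bergamot,
    expressed as a percentage. This matches the published ca-en numbers
    but is inferred from the tables, not from the evaluation script.
    """
    abs_diff = other_bleu - bergamot_bleu
    rel_diff = abs_diff / bergamot_bleu * 100.0
    # "+" in the format spec forces an explicit sign, as in the tables.
    return f"{other_bleu:.2f} ({abs_diff:+.2f}, {rel_diff:+.2f}%)"


# ca-en, flores-dev: bergamot 38.70 vs google 49.60
print(format_cell(38.70, 49.60))  # -> 49.60 (+10.90, +28.17%)
```

The same call with the ca-en flores-test scores (37.30 and 46.20) yields the Microsoft cell shown in that table.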
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+38.7
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+0.6699
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+==========================
+x_name: flores-dev.bergamot.en
+y_name: flores-dev.microsoft.en
+
+Bootstrap Resampling Results:
+x-mean: 0.6700
+y-mean: 0.7980
+ties (%): 0.0000
+x_wins (%): 0.0000
+y_wins (%): 1.0000
+
+Paired T-Test Results:
+statistic: -18.8769
+p_value: 0.0000
+Null hypothesis rejected according to t-test.
+Scores differ significantly across samples.
+flores-dev.microsoft.en outperforms flores-dev.bergamot.en.
+==========================
+x_name: flores-dev.bergamot.en
+y_name: flores-dev.google.en
+
+Bootstrap Resampling Results:
+x-mean: 0.6700
+y-mean: 0.8228
+ties (%): 0.0000
+x_wins (%): 0.0000
+y_wins (%): 1.0000
+
+Paired T-Test Results:
+statistic: -21.5915
+p_value: 0.0000
+Null hypothesis rejected according to t-test.
+Scores differ significantly across samples.
+flores-dev.google.en outperforms flores-dev.bergamot.en.
+==========================
+x_name: flores-dev.microsoft.en
+y_name: flores-dev.google.en
+
+Bootstrap Resampling Results:
+x-mean: 0.7980
+y-mean: 0.8228
+ties (%): 0.0000
+x_wins (%): 0.0000
+y_wins (%): 1.0000
+
+Paired T-Test Results:
+statistic: -6.7390
+p_value: 0.0000
+Null hypothesis rejected according to t-test.
+Scores differ significantly across samples.
+flores-dev.google.en outperforms flores-dev.microsoft.en.
+
+Summary
+If system_x is better than system_y then:
+Null hypothesis rejected according to t-test with p_value=0.05.
+Scores differ significantly across samples.
+system_x \ system_y      flores-dev.bergamot.en    flores-dev.microsoft.en    flores-dev.google.en
+-----------------------  ------------------------  -------------------------  ----------------------
+flores-dev.bergamot.en                             False                      False
+flores-dev.microsoft.en  True                                                 False
+flores-dev.google.en     True                      True
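
The comparison report pairs two systems' per-segment scores and summarizes them with bootstrap resampling and a paired t-test. The sketch below illustrates both ideas with the Python standard library only. It is not the tool that produced the report: the function names, the number of resamples, and the seed are assumptions made for illustration. Wins are returned as fractions in [0, 1], which appears to match the `1.0000` values in the report despite their `(%)` label.

```python
import math
import random
import statistics


def paired_bootstrap(x_scores, y_scores, num_samples=300, seed=0):
    """Return (x_wins, y_wins, ties) as fractions of bootstrap resamples.

    Each resample draws segment indices with replacement and applies the
    same indices to both systems, so the comparison stays paired.
    num_samples and seed are illustrative defaults (assumptions).
    """
    if len(x_scores) != len(y_scores):
        raise ValueError("both systems must score the same segments")
    rng = random.Random(seed)
    n = len(x_scores)
    x_wins = y_wins = ties = 0
    for _ in range(num_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        x_mean = sum(x_scores[i] for i in idx) / n
        y_mean = sum(y_scores[i] for i in idx) / n
        if x_mean > y_mean:
            x_wins += 1
        elif y_mean > x_mean:
            y_wins += 1
        else:
            ties += 1
    return x_wins / num_samples, y_wins / num_samples, ties / num_samples


def paired_t_statistic(x_scores, y_scores):
    """Paired t statistic: mean difference over its standard error."""
    diffs = [x - y for x, y in zip(x_scores, y_scores)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err


# Toy per-segment scores (made up for illustration, not from the commit).
x = [0.60, 0.65, 0.70, 0.72]
y = [0.80, 0.78, 0.85, 0.83]
print(paired_bootstrap(x, y))
print(paired_t_statistic(x, y))
```

A negative t statistic means system y scores higher on average, which is the sign convention the report's `statistic` values follow when y outperforms x.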
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+49.6
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+0.8218
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+46.8
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+0.7979
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+37.3
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+0.6381