schema: >-
  https://github.com/tboenig/gt-metadata/tree/master/schema/2022-03-15/schema.json
title: reichsanzeiger-gt
url: https://github.com/UB-Mannheim/reichsanzeiger-gt
authors:
- name: Jan
  surname: Kamlah
  roles:
  - project-manager
  - support
- name: Thomas
  surname: Schmidt
  roles:
  - project-manager
  - quality-control
- name: Franziska
  surname: Bartz
  roles:
  - transcriber
- name: André
  surname: Costa de Sousa
  roles:
  - transcriber
- name: Sergej
  surname: Ochkovskij
  roles:
  - transcriber
- name: Dilara
  surname: Öztürk
  roles:
  - transcriber
description: >-
  Reichsanzeiger-gt provides ground truth for the historical German newspaper
  "Deutscher Reichsanzeiger und Preußischer Staatsanzeiger" (German Imperial
  Gazette and Prussian Official Gazette), which was published under changing
  names from 1819 to 1945. The ground truth dataset provides PAGE-XMLs and URLs
  for the corresponding newspaper scans/images. The dataset contains 117 single
  pages (119,435 lines) of ground truth.
project-name: >-
  OCR-D "Workflow for work-specific training based on generic models with
  OCR-D and ground truth enhancement"
project-website: >-
  https://www.bib.uni-mannheim.de/en/about/projects-of-the-university-library/ocr-d-modelltraining/
language:
- eng
- fra
- deu
- por
- lat
- ita
script:
- Latn
- Latf
script-type: print
time:
  notBefore: '1820'
  notAfter: '1939'
hands:
  count: more-than-3
  level: levelmix
license:
- name: CC0 1.0
  url: https://creativecommons.org/licenses/zero/1.0/
gtTyp: data_structure_and_text
format: Page-XML
sources:
- reference: ''
  link: https://digi.bib.uni-mannheim.de/periodika/reichsanzeiger/
volume:
- metric: lines
  count: 119435
- metric: pages
  count: 197
transcription-guidelines: >-
  The transcription rules are based on the OCR-D transcription guidelines Level
  2, with some exceptions (see No. 2 below):
    1) Special characters:
      - Long s (ſ)
      - Currency symbols: German Mark (ℳ) and Pfennig (₰), $, £
      - Fractions (¼ ½ ¾ ⅐ ⅑ ⅒ ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞)
      - R rotunda (ꝛ)
      - Combining Latin Small Letter E for old German Umlaut ( ͤ )
      - Dagger (†)
      - Black Right Pointing Index (☛)
      - Black Left Pointing Index (☚)
      - Superscript Numbers 0-9 (⁰ ¹ ² ³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹)
      - White square (□)
    2) Additional characters transcribed true to the original (contrary to OCR-D Level 2):
      - Double oblique hyphen (⸗)
      - Em dash (—) instead of En dash (–)
      - Asterisk (*) used for both standard asterisk (*) and tear-drop asterisk (✽)