Commit 4e50d4e

TLDR-446 annotations doc (#322)
* Annotations information added
* Line types information added
1 parent 79fb6e3 commit 4e50d4e

5 files changed

+273
-18
lines changed

.github/workflows/release.yaml

+1 -1
@@ -41,7 +41,7 @@ jobs:
 - name: Push to dockerhub
   if: ${{ success() }}
   run: |
-    docker build -f docker/Dockerfile -t dedocproject/dedoc:$GITHUB_REF_NAME .
+    docker build -f Dockerfile -t dedocproject/dedoc:$GITHUB_REF_NAME .
     docker login -u ${{ secrets.DOCKERHUB_USERNAME }} -p ${{ secrets.DOCKERHUB_PASSWORD }}
     docker tag dedocproject/dedoc:$GITHUB_REF_NAME dedocproject/dedoc:latest
     docker push dedocproject/dedoc:$GITHUB_REF_NAME

README.md

+39 -17
@@ -4,10 +4,17 @@

 ![Dedoc](https://github.com/ispras/dedoc/raw/master/dedoc_logo.png)

-Dedoc is an open universal system for converting documents to a unified output format. It extracts a document’s logical structure and content, its tables, text formatting and metadata. The document’s content is represented as a tree storing headings and lists of any level. Dedoc can be integrated in a document contents and structure analysis system as a separate module.
+Dedoc is an open universal system for converting documents to a unified output format.
+It extracts a document’s logical structure and content, its tables, text formatting and metadata.
+The document’s content is represented as a tree storing headings and lists of any level.
+Dedoc can be integrated in a document contents and structure analysis system as a separate module.
+
+Relevant documentation of the dedoc is available [here](https://dedoc.readthedocs.io).

 ## Features and advantages
-Dedoc is implemented in Python and works with semi-structured data formats (DOC/DOCX, ODT, XLS/XLSX, CSV, TXT, JSON) and none-structured data formats like images (PNG, JPG etc.), archives (ZIP, RAR etc.), PDF and HTML formats. Document structure extraction is fully automatic regardless of input data type. Metadata and text formatting is also extracted automatically.
+Dedoc is implemented in Python and works with semi-structured data formats (DOC/DOCX, ODT, XLS/XLSX, CSV, TXT, JSON) and none-structured data formats like images (PNG, JPG etc.), archives (ZIP, RAR etc.), PDF and HTML formats.
+Document structure extraction is fully automatic regardless of input data type.
+Metadata and text formatting are also extracted automatically.

 In 2022, the system won a grant to support the development of promising AI projects from the [Innovation Assistance Foundation (Фонд содействия инновациям)](https://fasie.ru/).

@@ -16,36 +23,35 @@ In 2022, the system won a grant to support the development of promising AI proje
 * Support for extracting document structure out of nested documents having different formats.
 * Extracting various text formatting features (indentation, font type, size, style etc.).
 * Working with documents of various origin (statements of work, legal documents, technical reports, scientific papers) allowing flexible tuning for new domains.
-* Working with PDF documents containinng a text layer:
-* Support to automatically determine the correctness of the text layer in PDF documents;
-* Extract containing and formatting from PDF-documents with a text layer using the developed interpreter of the virtual stack machine for printing graphics according to the format specification.
-Extracting table data from DOC/DOCX, PDF, HTML, CSV and image formats:
+* Working with PDF documents containing a textual layer:
+* Support to automatically determine the correctness of the textual layer in PDF documents;
+* Extract containing and formatting from PDF-documents with a textual layer using the developed interpreter of the virtual stack machine for printing graphics according to the format specification.
+* Extracting table data from DOC/DOCX, PDF, HTML, CSV and image formats:
 * Recognizing a physical structure and a cell text for complex multipage tables having explicit borders with the help of contour analysis.
 * Working with scanned documents (image formats and PDF without text layer):
 * Using Tesseract, an actively developed OCR engine from Google, together with image preprocessing methods.
 * Utilizing modern machine learning approaches for detecting a document orientation, detecting single/multicolumn document page, detecting bold text and extracting hierarchical structure based on the classification of features extracted from document images.


-This project may be useful as a first step of automatic document analysis pipeline (e.g. before the NLP part)
+This project may be useful as a first step of automatic document analysis pipeline (e.g. before the NLP part).

-This project has REST Api and you can run it in Docker container
-To read full Dedoc documentation run the project and go to localhost:1231.
+This project has REST Api and you can run it in Docker container.
+Also, dedoc can be installed as a library via `pip`.
+To read full Dedoc documentation go [here](https://dedoc.readthedocs.io).


 ## Run the project
-How to build and run the project

-Ensure you have Git and Docker installed
-
+### Install and run dedoc using docker
+
 Clone the project
 ```bash
 git clone https://github.com/ispras/dedoc.git
-
-cd dedoc/
+cd dedoc
 ```

 Ensure you have Docker installed.
-Start 'Dedoc' on the port 1231:
+Start `dedoc` on the port `1231`:
 ```bash
 docker-compose up --build
 ```
@@ -55,6 +61,22 @@ Start Dedoc with tests:
 test="true" docker-compose up --build
 ```

-Now you can go to the localhost:1231 and look at the docs and examples.
+Now you can go to the `localhost:1231` and look at the docs and examples.
+You can change the port and host in the config file `dedoc/config.py`.
+
+### Install dedoc using pip
+
+One may install the dedoc library via `pip`.
+To fulfil all the library requirements, you should have `torch~=1.11.0` and `torchvision~=0.12.0` installed.
+You can install suitable for you versions of these libraries and install dedoc using `pip` command:
+
+```bash
+pip install dedoc
+```
+
+Or you can install dedoc with torch and torchvision included:
+```bash
+pip install "dedoc[torch]"
+```

-You can change the port and host in the config file 'dedoc/config.py'
+Go [here](https://dedoc.readthedocs.io/en/latest/getting_started/installation.html) to get more details about dedoc installation.
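
The `~=` in the requirements added to the README is a compatible-release specifier: `torch~=1.11.0` accepts 1.11.x releases at or above 1.11.0 and rejects 1.12. The sketch below illustrates that rule for plain numeric versions only (pre-release and post-release tags are not handled). The helper names are hypothetical and are not part of dedoc; in real projects the `packaging` library is the more usual choice for this check.

```python
def _parse(version: str) -> tuple:
    """Turn '1.11.0+cu113' into (1, 11, 0); a local '+...' suffix is ignored."""
    public = version.split("+", 1)[0]
    return tuple(int(part) for part in public.split("."))


def satisfies_compatible_release(version: str, spec: str) -> bool:
    """Check a plain numeric version against a '~=X.Y[.Z]' specifier.

    Illustrative helper (an assumption, not a dedoc or pip API).
    '~=1.11.0' is read as '>=1.11.0' combined with '==1.11.*'.
    """
    if not spec.startswith("~="):
        raise ValueError("only '~=' specifiers are supported in this sketch")
    base = _parse(spec[2:])
    if len(base) < 2:
        raise ValueError("'~=' needs at least two version components")
    candidate = _parse(version)

    # Pad both versions with zeros so that '1.11' compares equal to '1.11.0'.
    width = max(len(base), len(candidate))
    base_padded = base + (0,) * (width - len(base))
    candidate_padded = candidate + (0,) * (width - len(candidate))

    # Every component except the last one of the specifier must match exactly.
    prefix = base[:-1]
    return candidate_padded >= base_padded and candidate_padded[: len(prefix)] == prefix


if __name__ == "__main__":
    for installed in ("1.11.0", "1.11.0+cu113", "1.12.0", "1.10.2"):
        print(installed, satisfies_compatible_release(installed, "~=1.11.0"))
```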

docs/source/index.rst

+9
@@ -50,6 +50,8 @@ Reading documents using dedoc

 Dedoc allows to get the common intermediate representation for the documents of various formats.
 The resulting output of any reader is a class :class:`~dedoc.data_structures.UnstructuredDocument`.
+See :ref:`readers' annotations <readers_annotations>` and :ref:`readers' line types <readers_line_types>`
+to get more details about information that can be extracted by each available reader.

 .. _table_formats:

@@ -220,6 +222,13 @@ For a document of unknown or unsupported domain there is an option to use defaul
    dedoc_api_usage/return_format


+.. toctree::
+   :maxdepth: 1
+   :caption: Readers output
+
+   readers_output/annotations
+   readers_output/line_types
+
 .. toctree::
    :maxdepth: 1
    :caption: Structure types
+159
@@ -0,0 +1,159 @@
+.. _readers_annotations:
+
+Text annotations
+================
+
+Below the readers are enlisted that can return non-empty list of annotations for document lines:
+
+* `+` means that the reader can return the annotation.
+* `-` means that the reader doesn't return the annotation due to complexity of the task or lack of information provided by the format.
+
+.. _table_annotations:
+
+.. list-table:: Annotations returned by each reader
+   :widths: 20 10 10 10 10 10 10
+   :class: tight-table
+
+   * - **Annotation**
+     - :class:`~dedoc.readers.DocxReader`
+     - :class:`~dedoc.readers.HtmlReader`, :class:`~dedoc.readers.MhtmlReader`, :class:`~dedoc.readers.EmailReader`
+     - :class:`~dedoc.readers.RawTextReader`
+     - :class:`~dedoc.readers.PdfImageReader`
+     - :class:`~dedoc.readers.PdfTabbyReader`
+     - :class:`~dedoc.readers.PdfTxtlayerReader`
+
+   * - :class:`~dedoc.data_structures.AttachAnnotation`
+     - `+`
+     - `-`
+     - `-`
+     - `-`
+     - `-`
+     - `+`
+
+   * - :class:`~dedoc.data_structures.TableAnnotation`
+     - `+`
+     - `-`
+     - `-`
+     - `+`
+     - `+`
+     - `+`
+
+   * - :class:`~dedoc.data_structures.LinkedTextAnnotation`
+     - `+`
+     - `+`
+     - `-`
+     - `-`
+     - `+`
+     - `+`
+
+   * - :class:`~dedoc.data_structures.BBoxAnnotation`
+     - `-`
+     - `-`
+     - `-`
+     - `+`
+     - `+`
+     - `+`
+
+   * - :class:`~dedoc.data_structures.AlignmentAnnotation`
+     - `+`
+     - `+`
+     - `-`
+     - `-`
+     - `-`
+     - `-`
+
+   * - :class:`~dedoc.data_structures.IndentationAnnotation`
+     - `+`
+     - `-`
+     - `+`
+     - `+`
+     - `+`
+     - `+`
+
+   * - :class:`~dedoc.data_structures.SpacingAnnotation`
+     - `+`
+     - `-`
+     - `+`
+     - `+`
+     - `+`
+     - `+`
+
+   * - :class:`~dedoc.data_structures.BoldAnnotation`
+     - `+`
+     - `+`
+     - `-`
+     - `+`
+     - `+`
+     - `+`
+
+   * - :class:`~dedoc.data_structures.ItalicAnnotation`
+     - `+`
+     - `+`
+     - `-`
+     - `-`
+     - `+`
+     - `+`
+
+   * - :class:`~dedoc.data_structures.UnderlinedAnnotation`
+     - `+`
+     - `+`
+     - `-`
+     - `-`
+     - `-`
+     - `-`
+
+   * - :class:`~dedoc.data_structures.StrikeAnnotation`
+     - `+`
+     - `+`
+     - `-`
+     - `-`
+     - `-`
+     - `-`
+
+   * - :class:`~dedoc.data_structures.SubscriptAnnotation`
+     - `+`
+     - `+`
+     - `-`
+     - `-`
+     - `-`
+     - `-`
+
+   * - :class:`~dedoc.data_structures.SuperscriptAnnotation`
+     - `+`
+     - `+`
+     - `-`
+     - `-`
+     - `-`
+     - `-`
+
+   * - :class:`~dedoc.data_structures.ColorAnnotation`
+     - `-`
+     - `-`
+     - `-`
+     - `+`
+     - `-`
+     - `+`
+
+   * - :class:`~dedoc.data_structures.SizeAnnotation`
+     - `+`
+     - `+`
+     - `-`
+     - `+`
+     - `+`
+     - `+`
+
+   * - :class:`~dedoc.data_structures.StyleAnnotation`
+     - `+`
+     - `+`
+     - `-`
+     - `-`
+     - `+`
+     - `+`
+
+   * - :class:`~dedoc.data_structures.ConfidenceAnnotation`
+     - `-`
+     - `-`
+     - `-`
+     - `+`
+     - `-`
+     - `-`
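
The support matrix above can also be read programmatically. The following is a minimal sketch that transcribes the table into a plain dictionary and answers two questions: which readers return a given annotation, and which annotations a given reader returns. It does not import dedoc; the constant and function names are illustrative assumptions, and only the `+`/`-` flags come from the table.

```python
# Column order of the table; the HTML-family readers share one column.
READERS = (
    "DocxReader",
    "HtmlReader/MhtmlReader/EmailReader",
    "RawTextReader",
    "PdfImageReader",
    "PdfTabbyReader",
    "PdfTxtlayerReader",
)

# One flag per reader column, transcribed row by row from the table.
SUPPORT = {
    "AttachAnnotation":      "+----+",
    "TableAnnotation":       "+--+++",
    "LinkedTextAnnotation":  "++--++",
    "BBoxAnnotation":        "---+++",
    "AlignmentAnnotation":   "++----",
    "IndentationAnnotation": "+-++++",
    "SpacingAnnotation":     "+-++++",
    "BoldAnnotation":        "++-+++",
    "ItalicAnnotation":      "++--++",
    "UnderlinedAnnotation":  "++----",
    "StrikeAnnotation":      "++----",
    "SubscriptAnnotation":   "++----",
    "SuperscriptAnnotation": "++----",
    "ColorAnnotation":       "---+-+",
    "SizeAnnotation":        "++-+++",
    "StyleAnnotation":       "++--++",
    "ConfidenceAnnotation":  "---+--",
}


def readers_for(annotation: str) -> list:
    """Readers (table columns) marked `+` for the given annotation."""
    return [reader for reader, flag in zip(READERS, SUPPORT[annotation]) if flag == "+"]


def annotations_for(reader: str) -> list:
    """Annotations (table rows) marked `+` for the given reader column."""
    column = READERS.index(reader)
    return [name for name, flags in SUPPORT.items() if flags[column] == "+"]


if __name__ == "__main__":
    print(readers_for("ColorAnnotation"))
    print(annotations_for("RawTextReader"))
```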
+65
@@ -0,0 +1,65 @@
+.. _readers_line_types:
+
+Types of textual lines
+======================
+
+Each reader returns :class:`~dedoc.data_structures.UnstructuredDocument` with textual lines.
+Readers don't fill `hierarchy_level` metadata field (structure extractors do this), but they can fill `hierarchy_level_tag` with information about line types.
+Below the readers are enlisted that can return non-empty `hierarchy_level_tag` in document lines metadata:
+
+* `+` means that the reader can return lines of this type.
+* `-` means that the reader doesn't return lines of this type due to complexity of the task or lack of information provided by the format.
+
+.. _table_line_types:
+
+.. list-table:: Line types returned by each reader
+   :widths: 20 20 20 20 20
+   :class: tight-table
+
+   * - **Reader**
+     - **header**
+     - **list_item**
+     - **raw_text, unknown**
+     - **key**
+
+   * - :class:`~dedoc.readers.DocxReader`
+     - `+`
+     - `+`
+     - `+`
+     - `-`
+
+   * - :class:`~dedoc.readers.HtmlReader`, :class:`~dedoc.readers.MhtmlReader`, :class:`~dedoc.readers.EmailReader`
+     - `+`
+     - `+`
+     - `+`
+     - `-`
+
+   * - :class:`~dedoc.readers.RawTextReader`
+     - `-`
+     - `+`
+     - `+`
+     - `-`
+
+   * - :class:`~dedoc.readers.JsonReader`
+     - `-`
+     - `+`
+     - `+`
+     - `+`
+
+   * - :class:`~dedoc.readers.PdfImageReader`
+     - `-`
+     - `+`
+     - `+`
+     - `-`
+
+   * - :class:`~dedoc.readers.PdfTabbyReader`
+     - `+`
+     - `+`
+     - `+`
+     - `-`
+
+   * - :class:`~dedoc.readers.PdfTxtlayerReader`
+     - `-`
+     - `+`
+     - `+`
+     - `-`
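
As with the annotations table, the line-type matrix above can be transcribed into a small lookup structure. This is a sketch under the assumption that only the `+`/`-` flags matter; it does not call dedoc, and the names `LINE_TYPES`, `LINE_TYPE_SUPPORT`, `readers_returning` and `line_types_of` are made up for illustration.

```python
# Column order of the table; "raw_text, unknown" is a single shared column.
LINE_TYPES = ("header", "list_item", "raw_text, unknown", "key")

# One flag per line-type column, transcribed row by row from the table.
LINE_TYPE_SUPPORT = {
    "DocxReader":                         "+++-",
    "HtmlReader/MhtmlReader/EmailReader": "+++-",
    "RawTextReader":                      "-++-",
    "JsonReader":                         "-+++",
    "PdfImageReader":                     "-++-",
    "PdfTabbyReader":                     "+++-",
    "PdfTxtlayerReader":                  "-++-",
}


def readers_returning(line_type: str) -> list:
    """Readers (table rows) marked `+` for the given line-type column."""
    column = LINE_TYPES.index(line_type)
    return [reader for reader, flags in LINE_TYPE_SUPPORT.items() if flags[column] == "+"]


def line_types_of(reader: str) -> list:
    """Line-type columns marked `+` for the given reader row."""
    flags = LINE_TYPE_SUPPORT[reader]
    return [line_type for line_type, flag in zip(LINE_TYPES, flags) if flag == "+"]


if __name__ == "__main__":
    print(readers_returning("header"))
    print(line_types_of("JsonReader"))
```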
