-Dedoc is an open universal system for converting documents to a unified output format. It extracts a document’s logical structure and content, its tables, text formatting and metadata. The document’s content is represented as a tree storing headings and lists of any level. Dedoc can be integrated in a document contents and structure analysis system as a separate module.
+Dedoc is an open universal system for converting documents to a unified output format.
+It extracts a document’s logical structure and content, its tables, text formatting and metadata.
+The document’s content is represented as a tree storing headings and lists of any level.
+Dedoc can be integrated into a document content and structure analysis system as a separate module.
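To make the "tree storing headings and lists of any level" more concrete, here is a minimal sketch of such a structure. The `Node` class, its field names and the node type labels are hypothetical and chosen only for illustration; they are not dedoc's actual output classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node (illustration only, not a dedoc class)."""
    text: str
    node_type: str  # assumed labels: "root", "header", "list", "list_item", "raw_text"
    children: List["Node"] = field(default_factory=list)


def outline(node: Node, depth: int = 0) -> List[str]:
    """Flatten the tree depth-first into lines indented by nesting level."""
    lines = ["  " * depth + f"[{node.node_type}] {node.text}"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# A small document: one heading that contains a paragraph and a two-item list.
doc = Node("Report", "root", [
    Node("1. Introduction", "header", [
        Node("Scope of the work", "raw_text"),
        Node("Goals", "list", [
            Node("a) speed", "list_item"),
            Node("b) accuracy", "list_item"),
        ]),
    ]),
])

print("\n".join(outline(doc)))
```

Because every node may hold further nodes, headings and lists can nest to any depth without changing the data type.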
+
+The current documentation for dedoc is available [here](https://dedoc.readthedocs.io).

 ## Features and advantages
-Dedoc is implemented in Python and works with semi-structured data formats (DOC/DOCX, ODT, XLS/XLSX, CSV, TXT, JSON) and none-structured data formats like images (PNG, JPG etc.), archives (ZIP, RAR etc.), PDF and HTML formats. Document structure extraction is fully automatic regardless of input data type. Metadata and text formatting is also extracted automatically.
+Dedoc is implemented in Python and works with semi-structured data formats (DOC/DOCX, ODT, XLS/XLSX, CSV, TXT, JSON) and unstructured data formats such as images (PNG, JPG etc.), archives (ZIP, RAR etc.), PDF and HTML.
+Document structure extraction is fully automatic regardless of input data type.
+Metadata and text formatting are also extracted automatically.

 In 2022, the system won a grant to support the development of promising AI projects from the [Innovation Assistance Foundation (Фонд содействия инновациям)](https://fasie.ru/).

@@ -16,36 +23,35 @@ In 2022, the system won a grant to support the development of promising AI proje
 * Support for extracting document structure out of nested documents having different formats.
 * Extracting various text formatting features (indentation, font type, size, style etc.).
 * Working with documents of various origin (statements of work, legal documents, technical reports, scientific papers) allowing flexible tuning for new domains.
-* Working with PDF documents containinng a text layer:
-    * Support to automatically determine the correctness of the text layer in PDF documents;
-    * Extract containing and formatting from PDF-documents with a text layer using the developed interpreter of the virtual stack machine for printing graphics according to the format specification.
-Extracting table data from DOC/DOCX, PDF, HTML, CSV and image formats:
+* Working with PDF documents containing a textual layer:
+    * Automatically determining whether the textual layer of a PDF document is correct;
+    * Extracting content and formatting from PDF documents with a textual layer using a custom interpreter of the virtual stack machine for printing graphics, written according to the format specification.
+* Extracting table data from DOC/DOCX, PDF, HTML, CSV and image formats:
     * Recognizing a physical structure and a cell text for complex multipage tables having explicit borders with the help of contour analysis.
 * Working with scanned documents (image formats and PDF without a text layer):
     * Using Tesseract, an actively developed OCR engine from Google, together with image preprocessing methods.
     * Utilizing modern machine learning approaches for detecting document orientation, detecting single-column or multi-column pages, detecting bold text and extracting hierarchical structure based on the classification of features extracted from document images.


-This project may be useful as a first step of automatic document analysis pipeline (e.g. before the NLP part)
+This project may be useful as the first step of an automatic document analysis pipeline (e.g. before the NLP part).

-This project has REST Api and you can run it in Docker container
-To read full Dedoc documentation run the project and go to localhost:1231.
+This project has a REST API and you can run it in a Docker container.
+Also, dedoc can be installed as a library via `pip`.
+To read the full Dedoc documentation, go [here](https://dedoc.readthedocs.io).


 ## Run the project
-How to build and run the project

-Ensure you have Git and Docker installed
-
+### Install and run dedoc using Docker
+
 Clone the project
 ```bash
 git clone https://github.com/ispras/dedoc.git
-
-cd dedoc/
+cd dedoc
 ```

 Ensure you have Docker installed.
-Start 'Dedoc' on the port 1231:
+Start `dedoc` on port `1231`:
 ```bash
 docker-compose up --build
 ```
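Once the container is running, it can be handy to check from a script that the service answers on port `1231` before sending documents to it. The sketch below uses only the Python standard library. The helper names `base_url` and `is_service_up` are hypothetical, and the check assumes that the root page (where the README says the docs and examples are served) responds with HTTP 200.

```python
import urllib.request


def base_url(host: str = "localhost", port: int = 1231) -> str:
    """Build the address of the running service (defaults taken from the README)."""
    return f"http://{host}:{port}"


def is_service_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the page at `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Refused connections, timeouts and HTTP error statuses all end up here:
        # urllib.error.URLError and HTTPError derive from OSError.
        return False
```

For example, `is_service_up(base_url())` should return `True` after `docker-compose up --build` has finished starting the service, and `False` while nothing is listening on that port.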
@@ -55,6 +61,22 @@ Start Dedoc with tests:
 test="true" docker-compose up --build
 ```

-Now you can go to the localhost:1231 and look at the docs and examples.
+Now you can go to `localhost:1231` and look at the docs and examples.
+You can change the port and host in the config file `dedoc/config.py`.
+
+### Install dedoc using pip
+
+You can also install the dedoc library via `pip`.
+To satisfy all the library requirements, you need `torch~=1.11.0` and `torchvision~=0.12.0` installed.
+You can install versions of these libraries that suit your environment and then install dedoc with `pip`:
+
+```bash
+pip install dedoc
+```
+
+Or you can install dedoc with torch and torchvision included:
+```bash
+pip install "dedoc[torch]"
+```
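The `~=` in `torch~=1.11.0` is pip's "compatible release" operator: it accepts `1.11.0` and later patch releases, but not `1.12`. The sketch below shows that rule for plain numeric versions, which may help when deciding whether an already installed torch is acceptable. The helpers `parse` and `is_compatible` are hypothetical, and the parsing is deliberately simplified: it drops local suffixes such as `+cu113` and does not handle pre-release tags.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn '1.11.0' into (1, 11, 0); a local suffix like '+cu113' is dropped."""
    return tuple(int(part) for part in version.split("+")[0].split("."))


def is_compatible(installed: str, required: str) -> bool:
    """Check `installed ~= required` for plain numeric versions."""
    have, want = parse(installed), parse(required)
    # `~=X.Y.Z` means: at least X.Y.Z, with the same X.Y prefix.
    return have >= want and have[: len(want) - 1] == want[:-1]


print(is_compatible("1.11.0+cu113", "1.11.0"))
print(is_compatible("1.12.1", "1.11.0"))
```

The installed version string can be read with `importlib.metadata.version("torch")` on Python 3.8 or newer and passed to `is_compatible` as the first argument.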

-You can change the port and host in the config file 'dedoc/config.py'
+Go [here](https://dedoc.readthedocs.io/en/latest/getting_started/installation.html) for more details about installing dedoc.
+Each reader returns :class:`~dedoc.data_structures.UnstructuredDocument` with textual lines.
+Readers don't fill the `hierarchy_level` metadata field (structure extractors do this), but they can fill `hierarchy_level_tag` with information about line types.
+The readers listed below can return a non-empty `hierarchy_level_tag` in the metadata of document lines:
+
+* `+` means that the reader can return lines of this type.
+* `-` means that the reader doesn't return lines of this type due to the complexity of the task or the lack of information provided by the format.
+
+.. _table_line_types:
+
+.. list-table:: Line types returned by each reader