Commit ff26829

NastyBoget, sunveil, Travvy88, Nikita Shevtsov, and alexander1999-hub authored
update master (#352)
* TLDR-405 remove is_one_column_document_list (#332)
  * TLDR-405 remove is_one_column_document_list
  * TLDR-405 fix tests
  * TLDR-405 review fix
* TLDR-448-Fix draw coordinates bug (#330)
  * Fix draw coordinates bug
  * Fix draw coordinates conversion
* TLDR-451 tutorial new doc type (#331)
  * docs added
  * add code testing
  * some fixes
  * some fixes
  * add tabula and some fixes
  * add python-djvulibre
  * delete python-djvulibre and add djvulibre-bin
  * add poppler-utils
  * add tesseract
  * some fixes
  * flake8 stylefix
  * fix docs after flake8
  * update last part of adding_new_doc_type_tutorial
  * rewrite dedoc_add_new_doc_type_tutorial
  * minor fixes
  * minor fixes
  * minor fixes
  * some fixes
  * add more code examples
  * some fixes
  ---------
  Co-authored-by: Nikita Shevtsov <[email protected]>
  Co-authored-by: Nasty <[email protected]>
* updated txt layer correctness classifier (#334)
  Co-authored-by: Alexander Golodkov <[email protected]>
* Esl 137 added boxes into table (#333)
  * ESL-137 added box extraction skeleton into scan table extraction
  * ESL-138 ESL-137 a lot of table changes
    - added CellWithMeta
    - change output table structure, remove CellProperies in output
    - change logic bbox extraction from image tables after debugging
    - change output in CSV, HTML, TABBY, PDF, SCAN readers
    - change all tests with tables
    - fixed styles
  * ESL-137 chnaged draw table script
  * ESL-148 added script of table word boxes drawing
  * TLDR-471 added angle rotation from PdfImageReader and Tables
  * ESL-137 fixed unit-tests
  * ESL-137 fixed after review; removing some unused functions
    - fixed after review
    - removing some unused functions
  * ESL-137 update docs
  * ESL-137 after review
* Updated columns orientation classifier (#335)
  * updated txt columns orientation classifier
  * deleted "no_lines" parameter
  ---------
  Co-authored-by: Alexander Golodkov <[email protected]>
* fix pdf reader (#337)
  Co-authored-by: Nikita Shevtsov <[email protected]>
* TLDR-472 add flake8-fill-one-line and flake8-multiline-containers and fix lint (#336)
  * add flake8-fill-one-line and flake8-multiline-containers and fix lint
  * update precommit hook
* TLDR-475 fix table documentation (#338)
  * TLDR-475 fix table documentation
  * Small fixes
* TLDR-474 remove insert_table parameter (#339)
  * TLDR-474 remove insert_table parameter
  * TLDR-474 remove is_inserted attribute
* ESL-470 fixed rotation operation of table word boxes (#341)
  rotates a table image and saving image.shape during rotation. It is important for word bounding box extraction
* TLDR-478 docx table refactoring (#342)
  * TLDR-478 docx table refactoring
  * Small fixes
* TLDR-483 fixed box extraction from cropped cells (#343)
* TLDR-473 add dedoc utils (#340)
  * use dedoc utils BBox class
  * use AdaptiveBinarizer from dedoc-utils
  * use SkewCorrector from dedoc-utils
  * fix style
  * fix rotated angle error
  * delete BBox from docs
  * fix angles
  * delete print
  * fix dedocutils
  * dedocutils set ver. 0.3.5
  * fix mistakes and names
  ---------
  Co-authored-by: Nikita Shevtsov <[email protected]>
* TLDR-481 html refactoring (#344)
  * delete unused files
  * Delete unused files, refactor html
  * Refactor query parameters
  * Fix tests
  * Refactor train dataset api
  * Fix style
  * Change python version in tests
  * Review fixes
* TLDR-490 changed uuid1 on uuid4; fixed bug in tabby's table uuid (#345)
  * TLDR-490 changed uuid1 on uuid4; fixed bug in tabby's table uuid
  * TLDR-490 fixes after review
* Added running API examples instruction (#346)
* added linewithmeta comparison operator (#347)
  Co-authored-by: Alexander Golodkov <[email protected]>
* ESL-156 fix pdfminer boxes output (#348)
  * ESL-156 fix pdfminer boxes output
  * ESL-156 after review
* ESL-159 fixed extract boxes from pdfminer reader (#350)
* new version 1.0 (#351)

---------
Co-authored-by: Andrey Mikhailov <[email protected]>
Co-authored-by: Nikita Shevtsov <[email protected]>
Co-authored-by: Nikita Shevtsov <[email protected]>
Co-authored-by: Alexander Golodkov <[email protected]>
Co-authored-by: Alexander Golodkov <[email protected]>
Co-authored-by: Oksana Belyaeva <[email protected]>
Co-authored-by: Andrew Perminov <[email protected]>
1 parent 79f4cb5 commit ff26829
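
One of the squashed changes (TLDR-490) switches from `uuid1` to `uuid4`. The commit message does not say why, so the following is only a minimal sketch of the general difference between the two functions in Python's standard `uuid` module, not a description of dedoc's code:

```python
import uuid

# uuid1: generated from a host ID, a sequence number, and the current time
time_based = uuid.uuid1()

# uuid4: generated from random bits only
random_based = uuid.uuid4()

# The `version` attribute reports which scheme produced the identifier
print(time_based.version, random_based.version)
```

Both calls return `uuid.UUID` objects, so code that only stores or compares the identifiers should not need further changes when one is swapped for the other.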

264 files changed: +5260 -5122 lines changed

.flake8

-1
@@ -16,7 +16,6 @@ exclude =
     resources,
     dedoc/scripts,
     examples,
-    docs,
     venv,
     build,
     dedoc.egg-info

.github/workflows/docs.yaml

+2 -1
@@ -19,7 +19,7 @@ jobs:
 
       - name: Install dependencies
         run: |
-          sudo apt-get install -y libreoffice
+          sudo apt-get install -y libreoffice djvulibre-bin poppler-utils tesseract-ocr libtesseract-dev tesseract-ocr-rus tesseract-ocr-eng
           python -m pip install --upgrade --no-cache-dir pip setuptools
           python -m pip install --exists-action=w --no-cache-dir -r requirements.txt
           python -m pip install --upgrade --upgrade-strategy eager --no-cache-dir .[torch,docs]
@@ -30,3 +30,4 @@ jobs:
           python -m sphinx -T -E -W -b html -d docs/_build/doctrees -D language=en docs/source docs/_build
           cd docs/source/_static/code_examples
           python dedoc_usage_tutorial.py
+          python dedoc_add_new_doc_type_tutorial.py

.github/workflows/test_on_push.yaml

+1 -1
@@ -28,7 +28,7 @@ jobs:
       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v2
         with:
-          python-version: '3.8'
+          python-version: '3.9'
       - name: Run lint
         run: |
           python3 -m pip install --upgrade pip

.pre-commit-config.yaml

+2
@@ -11,7 +11,9 @@ repos:
           flake8-annotations==2.9.1,
           flake8-bugbear==23.3.12,
           flake8-builtins==2.1.0,
+          flake8-fill-one-line>=0.4.0,
           flake8-import-order==0.18.2,
+          flake8-multiline-containers==0.0.19,
           flake8-print==5.0.0,
           flake8-quotes==3.3.2,
           flake8-use-fstring==1.4,

Dockerfile

+5 -2
@@ -8,11 +8,14 @@ ADD requirements.txt .
 RUN pip3 install --no-cache-dir -r requirements.txt
 
 RUN mkdir /dedoc_root
+RUN mkdir /dedoc_root/dedoc
+ADD dedoc/config.py /dedoc_root/dedoc/config.py
+ADD dedoc/download_models.py /dedoc_root/dedoc/download_models.py
+RUN python3 /dedoc_root/dedoc/download_models.py
+
 ADD dedoc /dedoc_root/dedoc
 ADD VERSION /dedoc_root
-
 RUN echo "__version__ = \"$(cat /dedoc_root/VERSION)\"" > /dedoc_root/dedoc/version.py
-RUN python3 /dedoc_root/dedoc/download_models.py
 
 ADD tests /dedoc_root/tests
 ADD resources /dedoc_root/resources

README.md

+22 -22
@@ -47,19 +47,19 @@ There are two ways to install and run dedoc as a web application or a library th
 
 ## Install and run dedoc using docker
 
-You should have [`git`] (https://git-scm.com) and [`docker`](https://www.docker.com) installed for running dedoc by this method.
+You should have [`git`](https://git-scm.com) and [`docker`](https://www.docker.com) installed for running dedoc by this method.
 This method is more flexible because it doesn't depend on the operating system and other user's limitations,
 still, the docker application should be installed and configured properly.
 
 If you don't need to change the application configuration, you may use the built docker image as well.
 
 ### 1. Pull the image
-```bash
+```shell
 docker pull dedocproject/dedoc
 ```
 
 ### 2. Run the container
-```bash
+```shell
 docker run -p 1231:1231 --rm dedocproject/dedoc python3 /dedoc_root/dedoc/main.py
 ```
 
@@ -69,22 +69,22 @@ If you need to change some application settings, you may update `config.py` acco
 You can build and run image:
 
 ### 1. Clone the repository
-```bash
+```shell
 git clone https://github.com/ispras/dedoc
 ```
 
 ### 2. Go to the `dedoc` directory
-```bash
+```shell
 cd dedoc
 ```
 
 ### 3. Build the image and run the application
-```bash
+```shell
 docker-compose up --build
 ```
 
 ### 4. Run container with tests
-```bash
+```shell
 test="true" docker-compose up --build
 ```
 
@@ -99,7 +99,7 @@ there may be not enough machine's resources for its work.
 You should have `python` (`python3.8`, `python3.9` are recommended) and `pip` installed.
 
 ### 1. Install necessary packages:
-```bash
+```shell
 sudo apt-get install -y libreoffice djvulibre-bin unzip unrar
 ```
 
@@ -112,14 +112,14 @@ You can try any tutorial for this purpose or look [`here`](https://github.com/is
 to get the example of Tesseract installing for dedoc container or use next commands for building Tesseract OCR 5 from sources:
 
 #### 2.1. Install compilers and libraries required by the Tesseract OCR:
-```bash
+```shell
 sudo apt-get update
 sudo apt-get install -y automake binutils-dev build-essential ca-certificates clang g++ g++-multilib gcc-multilib libcairo2 libffi-dev \
 libgdk-pixbuf2.0-0 libglib2.0-dev libjpeg-dev libleptonica-dev libpango-1.0-0 libpango1.0-dev libpangocairo-1.0-0 libpng-dev libsm6 \
 libtesseract-dev libtool libxext6 make pkg-config poppler-utils pstotext shared-mime-info software-properties-common swig zlib1g-dev
 ```
 #### 2.2. Build Tesseract from sources:
-```bash
+```shell
 sudo add-apt-repository -y ppa:alex-p/tesseract-ocr-devel
 sudo apt-get update --allow-releaseinfo-change
 sudo apt-get install -y tesseract-ocr tesseract-ocr-rus
@@ -130,24 +130,24 @@ export TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata/
 
 ## Install the dedoc library via pip.
 
-You need torch~=1.11.0 and torchvision~=0.12.0 installed. If you already have torch and torchvision in your environment:
+You need `torch~=1.11.0` and `torchvision~=0.12.0` installed. If you already have torch and torchvision in your environment:
 
-```bash
+```shell
 pip install dedoc
 ```
 
 Or you can install dedoc with torch and torchvision included:
 
-```bash
+```shell
 pip install "dedoc[torch]"
 ```
 
 ## Install and run dedoc from sources
 
-If you want to run dedoc as a service from sources. it's possible to run dedoc locally.
-However, it isn't suitable for any operating system (Ubuntu 20+ is recommended) and
+If you want to run dedoc as a service from sources, it's possible to run dedoc locally.
+However, it is suitable not for all operating systems (`Ubuntu 20+` is recommended) and
 there may be not enough machine's resources for its work.
-You should have `python` (python3.8, python3.9 are recommended) and `pip` installed.
+You should have `python` (`python3.8`, `python3.9` are recommended) and `pip` installed.
 
 ### 1. Install necessary packages: according to instructions [install necessary packages](#1-Install-necessary-packages)
 
@@ -157,7 +157,7 @@ You should have `python` (python3.8, python3.9 are recommended) and `pip` instal
 
 Below are the instructions for installing the package `virtualenvwrapper`:
 
-```bash
+```shell
 sudo pip3 install virtualenv virtualenvwrapper
 mkdir ~/.virtualenvs
 export WORKON_HOME=~/.virtualenvs
@@ -169,7 +169,7 @@ mkvirtualenv dedoc_env
 
 ### 4. Install python's requirements and launch dedoc service on default port `1231`:
 
-```bash
+```shell
 # clone dedoc project
 git clone https://github.com/ispras/dedoc.git
 cd dedoc
@@ -183,14 +183,14 @@ python dedoc/main.py -c ./dedoc/config.py
 Now you can go to the `localhost:1231` and look at the docs and examples.
 
 ## Option: You can change the port of service:
-you need to change environment DOCREADER_PORT
+You need to change environment `DOCREADER_PORT`
 
-1. For local service launching on your_port (1166 example). [Install instruction from sources](#Install-and-run-dedoc-from-sources) and launch with environment:
-```bash
+1. For local service launching on `your_port` (e.g. `1166`). Install ([installation instruction](#Install-and-run-dedoc-from-sources)) and launch with environment:
+```shell
 DOCREADER_PORT=1166 python dedoc/main.py -c ./dedoc/config.py
 ```
 
-2. For service launching in docker-container you need to change port value in DOCREADER_PORT env and field 'ports' in docker-compose.yml file:
+2. For service launching in docker-container you need to change port value in `DOCREADER_PORT` env and field `ports` in `docker-compose.yml` file:
 ```yaml
 ...
 dedoc:

VERSION

+1 -1
@@ -1 +1 @@
-0.11.2
+1.0

dedoc/api/api_args.py

+54 -103
@@ -1,103 +1,54 @@
-from typing import Any, Optional
-
-from fastapi import Body
-from pydantic import BaseModel
-
-
-class QueryParameters(BaseModel):
-    document_type: Optional[str]
-    structure_type: Optional[str]
-    return_format: Optional[str]
-
-    with_attachments: Optional[str]
-    need_content_analysis: Optional[str]
-    recursion_deep_attachments: Optional[str]
-    return_base64: Optional[str]
-    attachments_dir: Optional[str]
-
-    insert_table: Optional[str]
-    need_pdf_table_analysis: Optional[str]
-    table_type: Optional[str]
-    orient_analysis_cells: Optional[str]
-    orient_cell_angle: Optional[str]
-
-    pdf_with_text_layer: Optional[str]
-    language: Optional[str]
-    pages: Optional[str]
-    is_one_column_document: Optional[str]
-    document_orientation: Optional[str]
-    need_header_footer_analysis: Optional[str]
-    need_binarization: Optional[str]
-
-    delimiter: Optional[str]
-    encoding: Optional[str]
-    html_fields: Optional[str]
-    handle_invisible_table: Optional[str]
-
-    def __init__(self,
-                 # type of document structure parsing
-                 document_type: Optional[str] = Body(description="a document type. Default: ''", enum=["", "law", "tz", "diploma"], default=None), # noqa
-                 structure_type: Optional[str] = Body(description="output structure type (linear or tree). Default: 'tree'", enum=["linear", "tree"], default=None), # noqa
-                 return_format: Optional[str] = Body(description="an option for returning a response in html form, json, pretty_json or tree. Assume that one should use json in all cases, all other formats are used for debug porpoises only. Default: 'json'", default=None), # noqa
-
-                 # attachments handling
-                 with_attachments: Optional[str] = Body(description="an option to enable the analysis of attached files. Default: 'false'", default=None), # noqa
-                 need_content_analysis: Optional[str] = Body(description="turn on if you need parse the contents of the document attachments. Default: 'false'", default=None), # noqa
-                 recursion_deep_attachments: Optional[str] = Body(description="the depth on which nested attachments will be parsed if need_content_analysis=true. Default: '10'", default=None), # noqa
-                 return_base64: Optional[str] = Body(description="returns images in base64 format. Default: 'false'", default=None), # noqa
-                 attachments_dir: Optional[str] = Body(description="path to the directory where to save files' attachments", default=None), # noqa
-
-                 # tables handling
-                 insert_table: Optional[str] = Body(description="Insert table into the result tree's content or not. Default: 'false'", default=None), # noqa
-                 need_pdf_table_analysis: Optional[str] = Body(description="include a table analysis into pdfs. Default: 'true'", default=None), # noqa
-                 table_type: Optional[str] = Body(description="a pipeline mode for a table recognition. Default: ''", default=None), # noqa
-                 orient_analysis_cells: Optional[str] = Body(description="a table recognition option enables analysis of rotated cells in table headers. Default: 'false'", default=None), # noqa
-                 orient_cell_angle: Optional[str] = Body(description="an option to set orientation of cells in table headers. \"270\" - cells are rotated 90 degrees clockwise, \"90\" - cells are rotated 90 degrees counterclockwise (or 270 clockwise)", default=None), # noqa
-
-                 # pdf handling
-                 pdf_with_text_layer: Optional[str] = Body(description="an option to extract text from a text layer to PDF or using OCR methods for image-documents. Default: 'auto_tabby'", enum=["true", "false", "auto", "auto_tabby", "tabby"], default=None), # noqa
-                 language: Optional[str] = Body(description="a recognition language. Default: 'rus+eng'", enum=["rus+eng", "rus", "eng"], default=None), # noqa
-                 pages: Optional[str] = Body(description="an option to limit page numbers in pdf, archives with images. left:right, read pages from left to right. Default: ':'", default=None), # noqa
-                 is_one_column_document: Optional[str] = Body(description="an option to set one or multiple column document. \"auto\" - system predict number of columns in document pages, \"true\" - is one column documents, \"false\" - is multiple column documents. Default: 'auto'", default=None), # noqa
-                 document_orientation: Optional[str] = Body(description="an option to set vertical orientation of the document without using an orientation classifier \"auto\" - system predict angle (0, 90, 180, 270) and rotate document, \"no_change\" - do not predict orientation. Default: 'auto'", enum=["auto", "no_change"], default=None), # noqa
-                 need_header_footer_analysis: Optional[str] = Body(description="include header-footer analysis into pdf with text layer. Default: 'false'", default=None), # noqa
-                 need_binarization: Optional[str] = Body(description="include an adaptive binarization into pdf without a text layer. Default: 'false'", default=None), # noqa
-
-                 # other formats handling
-                 delimiter: Optional[str] = Body(description="a column separator for csv-files", default=None), # noqa
-                 encoding: Optional[str] = Body(description="a document encoding", default=None), # noqa
-                 html_fields: Optional[str] = Body(description="a list of fields for JSON documents to be parsed as HTML documents. It is written as a json string of a list, where each list item is a list of keys to get the field. Default: ''", default=None), # noqa
-                 handle_invisible_table: Optional[str] = Body(description="handle table without visible borders as tables in html. Default: 'false'", default=None), # noqa
-
-
-                 **data: Any) -> None: # noqa
-
-        super().__init__(**data)
-        self.document_type: str = document_type or ""
-        self.structure_type: str = structure_type or "tree"
-        self.return_format: str = return_format or "json"
-
-        self.with_attachments: str = with_attachments or "false"
-        self.need_content_analysis: str = need_content_analysis or "false"
-        self.recursion_deep_attachments: str = recursion_deep_attachments or "10"
-        self.return_base64: str = return_base64 or "false"
-        self.attachments_dir: str = attachments_dir
-
-        self.insert_table: str = insert_table or "false"
-        self.need_pdf_table_analysis: str = need_pdf_table_analysis or "true"
-        self.table_type: str = table_type or ""
-        self.orient_analysis_cells: str = orient_analysis_cells or "false"
-        self.orient_cell_angle: str = orient_cell_angle or "90"
-
-        self.pdf_with_text_layer: str = pdf_with_text_layer or "auto_tabby"
-        self.language: str = language or "rus+eng"
-        self.pages: str = pages or ":"
-        self.is_one_column_document: str = is_one_column_document or "auto"
-        self.document_orientation: str = document_orientation or "auto"
-        self.need_header_footer_analysis: str = need_header_footer_analysis or "false"
-        self.need_binarization: str = need_binarization or "false"
-
-        self.delimiter: str = delimiter
-        self.encoding: str = encoding
-        self.html_fields: str = html_fields or ""
-        self.handle_invisible_table: str = handle_invisible_table or "false"
+from dataclasses import asdict, dataclass
+from typing import Optional
+
+from fastapi import Form
+
+
+@dataclass
+class QueryParameters:
+    # type of document structure parsing
+    document_type: str = Form("", enum=["", "law", "tz", "diploma"], description="Document domain")
+    structure_type: str = Form("tree", enum=["linear", "tree"], description="Output structure type")
+    return_format: str = Form("json", enum=["json", "html", "plain_text", "tree", "collapsed_tree", "ujson", "pretty_json"],
+                              description="Response representation, most types (except json) are used for debug purposes only")
+
+    # attachments handling
+    with_attachments: str = Form("false", enum=["true", "false"], description="Enable attached files extraction")
+    need_content_analysis: str = Form("false", enum=["true", "false"], description="Enable parsing contents of the attached files")
+    recursion_deep_attachments: str = Form("10", description="Depth on which nested attachments will be parsed if need_content_analysis=true")
+    return_base64: str = Form("false", enum=["true", "false"], description="Save attached images to the document metadata in base64 format")
+    attachments_dir: Optional[str] = Form(None, description="Path to the directory where to save files' attachments")
+
+    # tables handling
+    need_pdf_table_analysis: str = Form("true", enum=["true", "false"], description="Enable table recognition for pdf")
+    table_type: str = Form("", description="Pipeline mode for table recognition")
+    orient_analysis_cells: str = Form("false", enum=["true", "false"], description="Enable analysis of rotated cells in table headers")
+    orient_cell_angle: str = Form("90", enum=["90", "270"],
+                                  description='Set cells orientation in table headers, "90" means 90 degrees counterclockwise cells rotation')
+
+    # pdf handling
+    pdf_with_text_layer: str = Form("auto_tabby", enum=["true", "false", "auto", "auto_tabby", "tabby"],
+                                    description="Extract text from a text layer of PDF or using OCR methods for image-like documents")
+    language: str = Form("rus+eng", enum=["rus+eng", "rus", "eng"], description="Recognition language")
+    pages: str = Form(":", description='Page numbers range for reading PDF or images, "left:right" means read pages from left to right')
+    is_one_column_document: str = Form("auto", enum=["auto", "true", "false"],
+                                       description='One or multiple column document, "auto" - predict number of page columns automatically')
+    document_orientation: str = Form("auto", enum=["auto", "no_change"],
+                                     description='Orientation of the document pages, "auto" - predict orientation (0, 90, 180, 270 degrees), '
+                                                 '"no_change" - set vertical orientation of the document without using an orientation classifier')
+    need_header_footer_analysis: str = Form("false", enum=["true", "false"], description="Exclude headers and footers from PDF parsing result")
+    need_binarization: str = Form("false", enum=["true", "false"], description="Binarize document pages (for images or PDF without a textual layer)")
+
+    # other formats handling
+    delimiter: Optional[str] = Form(None, description="Column separator for CSV files")
+    encoding: Optional[str] = Form(None, description="Document encoding")
+    html_fields: str = Form("", description="List of fields for JSON documents to be parsed as HTML documents")
+    handle_invisible_table: str = Form("false", enum=["true", "false"], description="Handle tables without visible borders as tables in HTML")
+
+    def to_dict(self) -> dict:
+        parameters = {}
+
+        for parameter_name, parameter_value in asdict(self).items():
+            parameters[parameter_name] = getattr(parameter_value, "default", parameter_value)
+
+        return parameters
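
The new `to_dict` method unwraps every field with `getattr(parameter_value, "default", parameter_value)`. A plausible reading is that when `QueryParameters()` is constructed outside of a request, its fields still hold the `Form(...)` objects declared as defaults, and the plain value has to be taken from their `default` attribute. The sketch below illustrates that unwrapping only. `FormStub` and `DemoParameters` are hypothetical stand-ins introduced here so the example runs without FastAPI; they are not part of dedoc.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


class FormStub:
    """Hypothetical stand-in for the object returned by fastapi.Form(...)."""

    def __init__(self, default: Any, description: str = "") -> None:
        self.default = default
        self.description = description


@dataclass
class DemoParameters:
    # Field defaults are FormStub objects, mirroring `Form("tree", ...)` in the diff
    structure_type: str = FormStub("tree", description="Output structure type")
    language: str = FormStub("rus+eng", description="Recognition language")
    delimiter: Optional[str] = FormStub(None, description="Column separator")

    def to_dict(self) -> dict:
        parameters = {}
        for name, value in asdict(self).items():
            # A FormStub exposes `.default`; a plain string does not and is kept as-is
            parameters[name] = getattr(value, "default", value)
        return parameters


# No arguments: every field is still a FormStub, so its default is extracted
print(DemoParameters().to_dict())

# Explicit value: the string has no `default` attribute and passes through unchanged
print(DemoParameters(structure_type="linear").to_dict())
```

Under these assumptions, the same method handles both cases: parameters filled in from a request and parameters left at their declared defaults.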
