Commit 59cf81b

TLDR-853 after review
1 parent 7edaa1b commit 59cf81b

8 files changed: +29 -24 lines changed

dedoc/readers/pdf_reader/pdf_image_reader/pdf_image_reader.py

+4
@@ -3,6 +3,7 @@
 
 from numpy import ndarray
 
+from dedoc.data_structures.unstructured_document import UnstructuredDocument
 from dedoc.readers.pdf_reader.data_classes.line_with_location import LineWithLocation
 from dedoc.readers.pdf_reader.data_classes.pdf_image_attachment import PdfImageAttachment
 from dedoc.readers.pdf_reader.data_classes.tables.scantable import ScanTable
@@ -53,6 +54,9 @@ def __init__(self, *, config: Optional[dict] = None) -> None:
         self.binarizer = AdaptiveBinarizer()
         self.ocr = OCRLineExtractor(config=self.config)
 
+    def read(self, file_path: str, parameters: Optional[dict] = None) -> UnstructuredDocument:
+        return super().read(file_path, parameters)
+
     def _process_one_page(self,
                           image: ndarray,
                           parameters: ParametersForParseDoc,

dedoc/readers/pdf_reader/pdf_txtlayer_reader/pdf_txtlayer_reader.py

+4
@@ -3,6 +3,7 @@
 from dedocutils.data_structures import BBox
 from numpy import ndarray
 
+from dedoc.data_structures.unstructured_document import UnstructuredDocument
 from dedoc.readers.pdf_reader.data_classes.line_with_location import LineWithLocation
 from dedoc.readers.pdf_reader.data_classes.pdf_image_attachment import PdfImageAttachment
 from dedoc.readers.pdf_reader.data_classes.tables.scantable import ScanTable
@@ -37,6 +38,9 @@ def can_read(self, file_path: Optional[str] = None, mime: Optional[str] = None,
         from dedoc.utils.parameter_utils import get_param_pdf_with_txt_layer
         return super().can_read(file_path=file_path, mime=mime, extension=extension) and get_param_pdf_with_txt_layer(parameters) == "true"
 
+    def read(self, file_path: str, parameters: Optional[dict] = None) -> UnstructuredDocument:
+        return super().read(file_path, parameters)
+
     def _process_one_page(self,
                           image: ndarray,
                           parameters: ParametersForParseDoc,
Binary file not shown.

docs/source/parameters/gost_frame_handling.rst

+9-11
@@ -18,28 +18,28 @@ GOST frame handling
 - True, False
 - False
 - * :meth:`dedoc.DedocManager.parse`
-* method :meth:`~dedoc.readers.BaseReader.read` of inheritors of :class:`~dedoc.readers.BaseReader`
-* :meth:`dedoc.readers.PdfTabbyReader.read`
+* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfTabbyReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
+* :meth:`dedoc.readers.ReaderComposition.read`
 - This option is used to enable GOST (Russian government standard "ГОСТ Р 21.1101") frame recognition for PDF documents or images.
 
 
 The content of each page of some technical documents is placed in special GOST frames. An example of GOST frames is shown in the example below (:ref:`example_gost_frame`).
-Such frames contain meta-information and are not part of the text content of the document.Based on this, we have implemented the functionality for ignoring GOST frames in documents, which works for:
+Such frames contain meta-information and are not part of the text content of the document. Based on this, we have implemented the functionality for ignoring GOST frames in documents, which works for:
 
-* Copyable and non-copyable PDF documents (:class:`dedoc.readers.PdfTxtlayerReader` and :class:`dedoc.readers.PdfTabbyReader`);
-* Images (:class:`dedoc.readers.PdfImageReader`).
+* Copyable PDF documents (:class:`dedoc.readers.PdfTxtlayerReader` and :class:`dedoc.readers.PdfTabbyReader`);
+* Non-copyable PDF documents and Images (:class:`dedoc.readers.PdfImageReader`).
 
 If parameter ``need_gost_frame_analysis=True``, the GOST frame itself is ignored and only the contents inside the frame are extracted.
 
 .. _example_gost_frame:
 
 Examples of GOST frame
 ----------------------
-For example your send PDF-document with two pages:
+For example, your send PDF-document with two pages :download:`PDF-document with two pages <../_static/gost_frame_data/document_with_gost_frame.pdf>`:
 
-.. image:: ../_static/page_with_gost_frame_1.png
+.. image:: ../_static/gost_frame_data/page_with_gost_frame_1.png
 :width: 30%
-.. image:: ../_static/page_with_gost_frame_2.png
+.. image:: ../_static/gost_frame_data/page_with_gost_frame_2.png
 :width: 30%
 
 Parameter's usage
@@ -62,7 +62,5 @@ Parameter's usage
 Request's result
 ----------------
 
-.. image:: ../_static/result_gost_frame.png
+.. image:: ../_static/gost_frame_data/result_gost_frame.png
 :width: 50%
-
-

docs/source/parameters/pdf_handling.rst

+12-13
@@ -62,7 +62,7 @@ PDF and images handling
 - rus, eng, rus+eng, fra, spa
 - rus+eng
 - * :meth:`dedoc.DedocManager.parse`
-* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
+* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
 * :meth:`dedoc.readers.ReaderComposition.read`
 * :meth:`dedoc.structure_extractors.FintocStructureExtractor.extract`
 - Language of the document without a textual layer. The following values are available:
@@ -77,7 +77,7 @@ PDF and images handling
 - :, start:, :end, start:end
 - :
 - * :meth:`dedoc.DedocManager.parse`
-* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`, :meth:`dedoc.readers.PdfTabbyReader.read`
+* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfImageReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`, :meth:`dedoc.readers.PdfTabbyReader.read`
 * :meth:`dedoc.readers.ReaderComposition.read`
 - If you need to read a part of the PDF document, you can use page slice to define the reading range.
 If the range is set like ``start_page:end_page``, document will be processed from ``start_page`` to ``end_page``
@@ -96,7 +96,7 @@ PDF and images handling
 - true, false, auto
 - auto
 - * :meth:`dedoc.DedocManager.parse`
-* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
+* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfImageReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`
 * :meth:`dedoc.readers.ReaderComposition.read`
 - This option is used to set the number of columns if the PDF document is without a textual layer in case it's known beforehand.
 The following values are available:
@@ -111,7 +111,7 @@ PDF and images handling
 - auto, no_change
 - auto
 - * :meth:`dedoc.DedocManager.parse`
-* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
+* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
 * :meth:`dedoc.readers.ReaderComposition.read`
 - This option is used to control document orientation analysis for PDF documents without a textual layer.
 The following values are available:
@@ -125,7 +125,7 @@ PDF and images handling
 - True, False
 - False
 - * :meth:`dedoc.DedocManager.parse`
-* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
+* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfImageReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`
 * :meth:`dedoc.readers.ReaderComposition.read`
 - This option is used to **remove** headers and footers of PDF documents from the output result.
 If ``need_header_footer_analysis=False``, header and footer lines will present in the output as well as all other document lines.
@@ -134,7 +134,7 @@ PDF and images handling
 - True, False
 - False
 - * :meth:`dedoc.DedocManager.parse`
-* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
+* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
 * :meth:`dedoc.readers.ReaderComposition.read`
 - This option is used to clean background (binarize) for pages of PDF documents without a textual layer.
 If the document's background is heterogeneous, this option may help to improve the result of document text recognition.
@@ -144,7 +144,7 @@ PDF and images handling
 - True, False
 - True
 - * :meth:`dedoc.DedocManager.parse`
-* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
+* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
 * :meth:`dedoc.readers.ReaderComposition.read`
 - This option is used to enable table recognition for PDF documents or images.
 The table recognition method is used in :class:`dedoc.readers.PdfImageReader` and :class:`dedoc.readers.PdfTxtlayerReader`.
@@ -155,18 +155,17 @@ PDF and images handling
 - True, False
 - False
 - * :meth:`dedoc.DedocManager.parse`
-* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
+* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfTabbyReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
 * :meth:`dedoc.readers.ReaderComposition.read`
 - This option is used to enable GOST (Russian government standard) frame recognition for PDF documents or images.
-The GOST frame recognizer is used in :meth:`dedoc.readers.PdfBaseReader.read`. Its main function is to recognize and
-ignore the GOST frame on the document. It allows :class:`dedoc.readers.PdfImageReader`, :class:`dedoc.readers.PdfTxtlayerReader`
-and :class:`dedoc.readers.PdfTabbyReader` to properly process the content of the document containing GOST frame, see :ref:`gost_frame_handling` for more details
+It allows :class:`dedoc.readers.PdfImageReader`, :class:`dedoc.readers.PdfTxtlayerReader` and :class:`dedoc.readers.PdfTabbyReader`
+to properly process the content of the document containing GOST frame, see :ref:`gost_frame_handling` for more details.
 
 * - orient_analysis_cells
 - True, False
 - False
 - * :meth:`dedoc.DedocManager.parse`
-* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
+* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
 * :meth:`dedoc.readers.ReaderComposition.read`
 - This option is used for a table recognition for PDF documents or images.
 It is ignored when ``need_pdf_table_analysis=False``.
@@ -177,7 +176,7 @@ PDF and images handling
 - 90, 270
 - 90
 - * :meth:`dedoc.DedocManager.parse`
-* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfBaseReader.read`
+* :meth:`dedoc.readers.PdfAutoReader.read`, :meth:`dedoc.readers.PdfTxtlayerReader.read`, :meth:`dedoc.readers.PdfImageReader.read`
 * :meth:`dedoc.readers.ReaderComposition.read`
 - This option is used for a table recognition for PDF documents or images.
 It is ignored when ``need_pdf_table_analysis=False`` or ``orient_analysis_cells=False``.
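
Usage note: a minimal sketch, assuming dedoc is installed, of calling the `read` methods that this commit makes explicit on `PdfTxtlayerReader` and `PdfImageReader`. The helper names `build_parameters` and `read_ignoring_gost_frame` are hypothetical, and passing the flag as the string "True" is an assumption rather than something this commit confirms.

```python
from typing import Optional


def build_parameters(need_gost_frame_analysis: bool = True) -> dict:
    """Build a parameters dict for a dedoc PDF reader.

    Assumption: the flag is passed as a string, mirroring the
    "True, False" values listed in the parameter tables.
    """
    return {"need_gost_frame_analysis": str(need_gost_frame_analysis)}


def read_ignoring_gost_frame(file_path: str, copyable: bool = True, parameters: Optional[dict] = None):
    """Read a PDF with GOST frame analysis enabled (hypothetical helper).

    Copyable PDFs go to PdfTxtlayerReader; non-copyable PDFs and images
    go to PdfImageReader, following the split described in the docs.
    """
    # Imported lazily so build_parameters works without dedoc installed.
    from dedoc.readers import PdfImageReader, PdfTxtlayerReader

    reader = PdfTxtlayerReader() if copyable else PdfImageReader()
    # read(file_path, parameters) matches the signature added in this commit.
    return reader.read(file_path, parameters=parameters or build_parameters())
```

Calling `read_ignoring_gost_frame("document_with_gost_frame.pdf")` would then return an `UnstructuredDocument`, per the return annotation in the diff.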
