TLDR-853 added info about GOST frame processing into docs #506

Merged
merged 2 commits into from
Nov 19, 2024
2 changes: 1 addition & 1 deletion dedoc/api/web/index.html
@@ -122,7 +122,7 @@ <h4>Tables handling </h4>

<div class="parameters">
<h4>PDF handling</h4>
<details><summary>pdf_with_text_layer, fast_textual_layer_detection, language, pages, is_one_column_document, document_orientation, need_header_footer_analysis, need_binarization</summary>
<details><summary>pdf_with_text_layer, fast_textual_layer_detection, language, pages, is_one_column_document, document_orientation, need_header_footer_analysis, need_binarization, need_gost_frame_analysis</summary>
<br>
<p>
<label>
Binary file modified docs/source/_static/code_examples/test_dir/example.docx
Binary file not shown.
Binary file added docs/source/_static/page_with_gost_frame_1.png
Binary file added docs/source/_static/page_with_gost_frame_2.png
Binary file added docs/source/_static/result_gost_frame.png
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -38,6 +38,7 @@
("py:class", "abc.ABC"),
("py:class", "pydantic.main.BaseModel"),
("py:class", "scipy.stats._multivariate.dirichlet_multinomial_gen.cov"),
("py:class", "scipy.stats._multivariate.random_table_gen.rvs"),
("py:class", "pandas.core.series.Series"),
("py:class", "numpy.ndarray"),
("py:class", "pandas.core.frame.DataFrame"),
68 changes: 68 additions & 0 deletions docs/source/parameters/gost_frame_handling.rst
@@ -0,0 +1,68 @@
.. _gost_frame_handling:

GOST frame handling
====================

.. flat-table:: Parameters for GOST frame handling
    :widths: 5 5 3 15 72
    :header-rows: 1
    :class: tight-table

    * - Parameter
      - Possible values
      - Default value
      - Where can be used
      - Description

    * - need_gost_frame_analysis
      - True, False
      - False
      - * :meth:`dedoc.DedocManager.parse`
        * method :meth:`~dedoc.readers.BaseReader.read` of inheritors of :class:`~dedoc.readers.BaseReader`
        * :meth:`dedoc.readers.PdfTabbyReader.read`
      - This option is used to enable GOST (Russian government standard "ГОСТ Р 21.1101") frame recognition for PDF documents or images.


In some technical documents, the content of each page is placed inside a special GOST frame (see :ref:`example_gost_frame`).
Such frames contain meta-information and are not part of the text content of the document. For this reason, we have implemented functionality for ignoring GOST frames, which works for:

* Copyable and non-copyable PDF documents (:class:`dedoc.readers.PdfTxtlayerReader` and :class:`dedoc.readers.PdfTabbyReader`);
* Images (:class:`dedoc.readers.PdfImageReader`).

If the parameter ``need_gost_frame_analysis=True`` is set, the GOST frame itself is ignored and only the content inside the frame is extracted.
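
The same parameter can also be passed when dedoc is used as a Python library. The following is a minimal sketch, not a definitive recipe: ``gost_parameters`` and ``parse_with_gost_frame`` are hypothetical helper names introduced only for this illustration, and the sketch assumes that parameter values are passed as strings, as in the HTTP request shown in this document.

```python
from typing import Dict


def gost_parameters(enable: bool = True, pdf_with_text_layer: str = "auto_tabby") -> Dict[str, str]:
    """Build a parameters dictionary (hypothetical helper, not part of dedoc)."""
    return {
        "pdf_with_text_layer": pdf_with_text_layer,
        # Assumption: boolean options are passed as the strings "true" / "false"
        "need_gost_frame_analysis": str(enable).lower(),
    }


def parse_with_gost_frame(file_path: str):
    """Parse a document while ignoring its GOST frames (requires dedoc to be installed)."""
    # Imported lazily so that gost_parameters() can be used without dedoc
    from dedoc import DedocManager

    manager = DedocManager()
    return manager.parse(file_path=file_path, parameters=gost_parameters())
```

Calling ``parse_with_gost_frame`` with the path to a PDF file should return the parsed document with only the content inside the frames.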

.. _example_gost_frame:

Examples of GOST frames
-----------------------
For example, suppose you send a PDF document with two pages:

.. image:: ../_static/page_with_gost_frame_1.png
    :width: 30%
.. image:: ../_static/page_with_gost_frame_2.png
    :width: 30%

Parameter's usage
-----------------

.. code-block:: python

    import requests

    filename = "example.pdf"  # placeholder: path to the PDF document to send

    # Request parameters: enable GOST frame analysis and ask for HTML output
    data = {
        "pdf_with_text_layer": "auto_tabby",
        "need_gost_frame_analysis": "true",
        "return_format": "html"
    }
    with open(filename, "rb") as file:
        files = {"file": (filename, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=data)
        result = r.content.decode("utf-8")

Request's result
----------------

.. image:: ../_static/result_gost_frame.png
    :width: 50%
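
To look at such a result locally, the HTML string returned by the request can be written to a file and opened in a browser. The sketch below uses only the standard library; ``save_html``, the output file name, and the sample string are assumptions made for illustration and stand in for the real ``result`` value.

```python
import tempfile
from pathlib import Path


def save_html(result: str, path: str) -> Path:
    """Write the HTML returned by the request to a UTF-8 encoded file."""
    out = Path(path)
    out.write_text(result, encoding="utf-8")
    return out


# Stand-in for the `result` string produced by the request above
result = "<html><body><p>Text inside the GOST frame</p></body></html>"
saved = save_html(result, str(Path(tempfile.gettempdir()) / "gost_frame_result.html"))
print(f"Saved to {saved}")
```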


10 changes: 8 additions & 2 deletions docs/source/parameters/pdf_handling.rst
@@ -159,8 +159,8 @@ PDF and images handling
* :meth:`dedoc.readers.ReaderComposition.read`
- This option is used to enable GOST (Russian government standard) frame recognition for PDF documents or images.
The GOST frame recognizer is used in :meth:`dedoc.readers.PdfBaseReader.read`. Its main function is to recognize and
ignore the GOST frame on the document. It allows :class:`dedoc.readers.PdfImageReader` and :class:`dedoc.readers.PdfTxtlayerReader`
to properly process the content of the document containing GOST frame.
ignore the GOST frame on the document. It allows :class:`dedoc.readers.PdfImageReader`, :class:`dedoc.readers.PdfTxtlayerReader`
and :class:`dedoc.readers.PdfTabbyReader` to properly process the content of a document containing a GOST frame. See :ref:`gost_frame_handling` for more details.

* - orient_analysis_cells
- True, False
@@ -185,3 +185,9 @@ PDF and images handling

* **270** -- cells are rotated 90 degrees clockwise;
* **90** -- cells are rotated 90 degrees counterclockwise (or 270 clockwise).


.. toctree::
    :maxdepth: 1

    gost_frame_handling