Releases: ispras/dedoc
v2.3.1
Release note: v2.3.1
- Fixed bug with bold lines in DocxReader (see issue 479).
- Upgraded requirements.txt (beautifulsoup4 to version 4.12.3).
- Added support for external grobid (added the Authorization parameter).
- Added GOST (Russian government standard) frame recognition in PdfTabbyReader (need_gost_frame_analysis parameter).
- Updated documentation (added GOST frame recognition).
- Added multi-page table handling to PdfTabbyReader.
v2.3
- Dedoc telegram chat created.
- Added patterns parameter for configuring default structure type.
- Added notebooks with Dedoc usage (see issue 484).
- Fixed bug OutOfMemoryError: Java heap space in PdfTabbyReader (see issue 489).
- Fixed bug with numbering in DocxReader (see issue 494).
- Added GOST (Russian government standard) frame recognition in PdfImageReader and PdfTxtlayerReader (need_gost_frame_analysis parameter).
v2.2.7
- Fixed bugs with start, end of BBoxAnnotation in PdfTabbyReader.
- Improved columns classification and orientation detection for PDF and images (is_one_column_document and document_orientation parameters).
- Upgraded docker: docker-compose is no longer supported, use docker compose instead.
- Fixed bug of tables parsing in DocxReader (see issue).
- Added simple textual layer detection in PdfAutoReader (fast_textual_layer_detection parameter).
- Improved paragraph extraction from PDF documents and images.
- Retrained a classifier for diplomas (document_type="diploma") on a new dataset.
v2.2.6
- Upgraded dependencies: numpy<2.0 and dedoc-utils==0.3.7.
v2.2.5
v2.2.4
- Show page division and page numbers in the HTML output representation (API usage, return_format="html").
- Make imports from dedoc library faster.
- Added a tutorial on how to add a new language to dedoc (not finished entirely).
- Added additional page_id metadata for multi-page nodes (structure_type="tree" in API, TreeConstructor in the library).
- Updated OCR and orientation/columns classification benchmarks.
- Minor edits of README.md.
- Fixed empty cells handling in CSVReader.
- Fixed bounding boxes extraction for text in tables for PdfTabbyReader.
v2.2.3
- Show attached images and added ability to download attached files in the HTML output representation (API usage, return_format="html").
- Added hierarchy level information and annotations to PptxReader.
v2.2.2
- Added images extraction to ArticleReader.
- Added attachments and references to them in the HTML output representation (return_format="html").
- Fixed functionality of parameter need_content_analysis.
- Fixed CSVReader (exclude BOM character from the output).
- Added handling of files with a wrong extension or without an extension to DedocManager (detect file type by its content).
- Updated README.md.
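The CSVReader fix and the content-based file type detection listed for v2.2.2 can be illustrated with a small standalone sketch. It uses only the Python standard library and is not dedoc's implementation: the helper names `sniff_file_type` and `read_csv_rows` and the signature table are hypothetical, chosen for illustration only.

```python
import csv
import io
from typing import List

# Illustrative magic-byte table (an assumption, not dedoc's detection logic).
MAGIC_SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip-based container (e.g. docx, pptx, xlsx)",
}


def sniff_file_type(data: bytes) -> str:
    """Guess a file type from its leading bytes, ignoring the file extension."""
    for signature, file_type in MAGIC_SIGNATURES.items():
        if data.startswith(signature):
            return file_type
    return "unknown"


def read_csv_rows(raw: bytes) -> List[List[str]]:
    """Parse CSV bytes, dropping a leading UTF-8 BOM if one is present."""
    # The "utf-8-sig" codec decodes UTF-8 and skips an initial BOM,
    # so the first cell does not start with "\ufeff".
    text = raw.decode("utf-8-sig")
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    print(sniff_file_type(b"%PDF-1.7\n"))
    print(read_csv_rows(b"\xef\xbb\xbfname,value\r\na,\r\n"))
```

Decoding with `utf-8-sig` is one common way to keep the BOM out of the first cell. Checking leading bytes is likewise only one approach to handling a missing or wrong extension, and a real detector would need many more signatures.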
v2.2.1
- Added fintoc structure type for parsing financial prospectuses according to the FinTOC 2022 Shared task (FintocStructureExtractor).
- Fixed small bugs in ArticleReader: colspan for tables, keywords, sections numbering, etc.
- Added references to nodes and fixed small bugs in the HTML output representation (return_format="html").
- Removed other_fields from LineMetadata and DocumentMetadata.
- Updated README.md.
v2.2
- PdfTabbyReader improved: bug fixes, speed increase of partial PDF extraction (with parameter pages).
- Added benchmarks for evaluation of PDF readers performance.
- Added ReferenceAnnotation class.
- Fixed bug in can_read method for all readers.
- Added article structure type for parsing scientific articles using GROBID (ArticleReader, ArticleStructureExtractor).