Releases: ispras/dedoc
v2.1.1
v2.1
- Custom loggers deleted (the common logger is used for all dedoc classes).
- Do not change the document image if it has a correct orientation (orientation correction function changed).
- Use only PdfTabbyReader during detection of a textual layer in PDF files.
- Code related to the labeling mode refactored and removed from the library package (it is now located in a separate directory).
- Added BoldAnnotation for words in PdfImageReader.
- More benchmarks added: parsing of table images, postprocessing of Tesseract OCR.
- Fixed several issues in the Dedoc web form.
- Added a tutorial on how to add a new structure type to Dedoc.
- Parsing of EML and HTML files fixed.
v2.0
- Fix table extraction from PDF using empty config (see issue)
- Add more benchmarks for Tesseract
- Fix extension extraction for file names with several dots
- Change names of some methods and their parameters for all main classes (attachments extractors, converters, readers, metadata extractors, structure extractors, structure constructors). See the Package reference section of the documentation for more details
- Add AttachAnnotation and TableAnnotation to PPTX (see discussion)
- Fix bugs in DOCX handling (see issues 378, 379)
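
The extension fix listed above concerns file names that contain several dots. As a rough illustration only (this is not dedoc's actual code), the sketch below uses Python's standard pathlib to treat just the final suffix as the extension. The helper name split_name_extension and the lowercasing of the extension are assumptions made for the example.

```python
from pathlib import Path
from typing import Tuple


def split_name_extension(file_name: str) -> Tuple[str, str]:
    """Split a file name into (name, extension), using only the last suffix.

    Hypothetical helper for illustration; it is not part of dedoc.
    """
    path = Path(file_name)
    # Path.suffix is the last dot-separated component (including the dot),
    # or an empty string when the name has no extension.
    return path.stem, path.suffix.lower()


if __name__ == "__main__":
    for name in ("report.docx", "annual.report.2023.PDF", "README"):
        print(name, "->", split_name_extension(name))
```

Taking only the last suffix keeps names such as annual.report.2023.PDF intact apart from the final extension, which is probably what a format detector wants; compound extensions such as .tar.gz would need separate handling.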
v1.1.1
- Use an older pydantic version to improve compatibility with other libraries.
- Add support for RTF format.
- Fix bug in handling file names with dots and spaces.
- Fix bug with non-integer text formatting values in DocxReader.
- Add support for the on_gpu parameter in config.
- Add attached images extraction for PdfTabbyReader.
- Fix partial file reading for PdfTabbyReader.
- Add a tutorial on how to create dedoc's basic data structures.
- Fix attachments_dir parameter for readers and attachments extractors.
v1.1.0
- Add BBoxAnnotation to table cells for PdfTabbyReader.
- Fix swagger, add API schema classes, remove to_dict method from ParsedDocument.
- Improve PDF parsing by PdfTxtlayerReader, add benchmarks.
- Fix BBoxAnnotation extraction for tables in PdfImageReader using table_type=split_last_column parameter.
- Change base method of metadata extractors, rename it to extract_metadata.
- Unify BBoxAnnotation extraction for all PDF readers: return only word bboxes.
- Increase timeout value for all converters.
v1.0
- Remove is_one_column_document_list parameter.
- Add tutorial about support for a new document type to the documentation.
- Improve textual layer correctness classifier.
- Improve orientation and columns classifier.
- Change table output structure: use CellWithMeta instead of a textual string.
- Add BBoxAnnotation to table cells for PdfTxtlayerReader and PdfImageReader.
- Add ConfidenceAnnotation to table cells for PdfImageReader.
- Remove insert_table parameter.
- Add information about table and page rotation to the table and document metadata respectively.
- Use dedoc-utils library for document images preprocessing.
- Change web interface, fix online-examples of document processing.
- Add comparison operator to LineWithMeta.
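
The table change above replaces plain strings with cell objects that carry metadata. The sketch below shows the general idea with a hypothetical SketchCell dataclass; it is not dedoc's CellWithMeta, and the field names (text, rowspan, colspan, annotations) are assumptions chosen for illustration. It also shows how a caller that still wants the old plain-string view could recover it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchCell:
    """Hypothetical table cell with metadata; NOT dedoc's CellWithMeta."""
    text: str
    rowspan: int = 1
    colspan: int = 1
    # Free-form annotations (e.g. a bounding box), assumed shape for the example.
    annotations: List[Dict] = field(default_factory=list)


def cells_to_text(table: List[List[SketchCell]]) -> List[List[str]]:
    """Recover a plain-string view of a table from metadata-carrying cells."""
    return [[cell.text for cell in row] for row in table]


if __name__ == "__main__":
    table = [
        [SketchCell("Header", colspan=2)],
        [SketchCell("left"), SketchCell("right")],
    ]
    print(cells_to_text(table))
```

Keeping the text inside a cell object lets span and annotation information travel with the cell, while code that only needs strings can still flatten the table in one pass.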
v0.11.2
- Remove plexus-utils-1.1.jar.
- Update installation documentation.
- Add documentation for Tesseract OCR installation.
- Add documentation for annotations.
- Add documentation for secure torch.
- Fix examples.
v0.11.1
- Add bbox annotations in PdfTabbyReader.
- Add bbox annotations for words in PdfTxtlayerReader.
- Add an option plain_text to the return_format parameter.
- Reduce size of the dedoc base image, move dockerfiles to a separate repository.
- Refactor script for tesseract benchmarking.
- Replace pinned dedoc dependency versions with version ranges.
- Add table cell properties in PdfTabbyReader.
v0.11.0
v0.10.0
- Add ConfidenceAnnotation for PdfImageReader.
- Remove version parameter from metadata extractors, structure constructors and parsed document methods.
- Add version file and version resolving for the library.
- Add recursive handling of attachments.
- Add parameter for saving attachments in a custom directory.
- Remove dedoc threaded manager.
- Improve PdfAutoReader.
- Add temporary file name to DocumentMetadata.
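
Recursive handling of attachments, mentioned above, means that files attached to an attachment are processed as well (for example, an archive inside an email). The following is a minimal sketch of such a traversal under assumed data structures: SketchFile, walk_attachments and the max_depth guard are all hypothetical and do not reflect dedoc's implementation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class SketchFile:
    """Hypothetical document node: a file name plus the files attached to it."""
    name: str
    attachments: List["SketchFile"] = field(default_factory=list)


def walk_attachments(doc: SketchFile, max_depth: int = 10, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, name) for the document and every nested attachment, depth-first."""
    yield depth, doc.name
    if depth >= max_depth:
        # Stop descending to guard against very deeply nested files.
        return
    for attachment in doc.attachments:
        yield from walk_attachments(attachment, max_depth, depth + 1)


if __name__ == "__main__":
    email = SketchFile("mail.eml", [
        SketchFile("archive.zip", [SketchFile("scan.pdf")]),
        SketchFile("note.txt"),
    ])
    for level, name in walk_attachments(email):
        print("  " * level + name)
```

A depth limit of this kind is one common way to keep recursion bounded; whether and how dedoc limits recursion is not stated in these notes.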