Bump unstructured from 0.10.29 to 0.11.0 #149

dependabot · 2023-11-20T23:43:51Z

Bumps unstructured from 0.10.29 to 0.11.0.

Release notes

0.11.0

Enhancements

Add a class for the strategy constants. Add a class PartitionStrategy for the strategy constants and use the constants to replace strategy strings.
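As a rough illustration of the pattern described above, a constants class might look like the following. This is a minimal stand-in written for this note, not the library's actual class; the helper `is_expensive` is hypothetical, and the string values are the strategy names that appear elsewhere in these release notes.

```python
class PartitionStrategy:
    """Illustrative stand-in: named constants instead of bare strategy strings."""

    AUTO = "auto"
    FAST = "fast"
    OCR_ONLY = "ocr_only"
    HI_RES = "hi_res"


def is_expensive(strategy: str) -> bool:
    # Hypothetical helper: comparing against constants keeps typos such as
    # "hi-res" from silently matching nothing.
    return strategy in (PartitionStrategy.HI_RES, PartitionStrategy.OCR_ONLY)
```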

Temporary support for paddle language parameter. Users can specify the default language code for paddle with the ENV variable DEFAULT_PADDLE_LANG until a language mapping for paddle is available.

Improve DOCX page-break fidelity. Improve page-break fidelity such that a paragraph containing a page-break is split into two elements, one containing the text before the page-break and the other the text after. Emit the PageBreak element between these two and assign the correct page-number (n and n+1 respectively) to the two textual elements.
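The splitting behaviour described above can be sketched in isolation. Everything here is an assumption made for illustration: the `Element` dataclass, the form-feed marker standing in for a DOCX page break, and leaving the PageBreak element without a page number are not taken from the library.

```python
from dataclasses import dataclass
from typing import List, Optional

PAGE_BREAK = "\f"  # hypothetical in-text marker standing in for a DOCX page break


@dataclass
class Element:
    kind: str  # "Text" or "PageBreak"
    text: str
    page_number: Optional[int]


def split_on_page_break(paragraph: str, page_number: int) -> List[Element]:
    """Split a paragraph at its first page break into before / PageBreak / after."""
    if PAGE_BREAK not in paragraph:
        return [Element("Text", paragraph, page_number)]
    before, after = paragraph.split(PAGE_BREAK, 1)
    return [
        Element("Text", before, page_number),      # text before the break: page n
        Element("PageBreak", "", None),            # emitted between the two halves
        Element("Text", after, page_number + 1),   # text after the break: page n+1
    ]
```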

Features

Add ad-hoc fields to ElementMetadata instance. End-users can now add their own metadata fields simply by assigning to an element-metadata attribute-name of their choice, like element.metadata.coefficient = 0.58. These fields will round-trip through JSON and can be accessed with dotted notation.
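To show what "round-trip through JSON" means for an ad-hoc field, here is a small sketch that uses the standard library's `SimpleNamespace` as a stand-in for an element's metadata object. It does not use or reproduce the real ElementMetadata class; the `filename` value is made up.

```python
import json
from types import SimpleNamespace

# Stand-in for element.metadata; the real class is not used here.
metadata = SimpleNamespace(filename="report.docx")
metadata.coefficient = 0.58  # ad-hoc field added by plain attribute assignment

serialized = json.dumps(vars(metadata))               # to JSON text
restored = SimpleNamespace(**json.loads(serialized))  # and back again
```

After the round trip, `restored.coefficient` is available with dotted notation, just like the original attribute.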

MongoDB Destination Connector. New destination connector added to all CLI ingest commands to support writing partitioned JSON output to MongoDB.

Fixes

Fix TYPE_TO_TEXT_ELEMENT_MAP. Updated Figure mapping from FigureCaption to Image.

Handle errors when extracting PDF text. Certain PDFs throw unexpected errors when being opened by pdfminer, causing partition_pdf() to fail. We expect to be able to partition smoothly using an alternative strategy if text extraction doesn't work. Added exception handling to handle unexpected errors when extracting PDF text and to help determine the PDF strategy.
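One way to picture the exception handling described above is a guard that treats any unexpected extraction error as "no extractable text", so the caller can pick another strategy. This is a sketch under that assumption; `pdf_text_is_extractable` and the `extract_text` callable are hypothetical names, not the library's functions.

```python
from typing import Callable


def pdf_text_is_extractable(extract_text: Callable[[], str]) -> bool:
    """Return True only if extraction succeeds and yields non-blank text."""
    try:
        return bool(extract_text().strip())
    except Exception:
        # Any unexpected error from the extractor is treated as "not
        # extractable" instead of propagating and failing the partition call.
        return False
```

For example, an extractor that raises `KeyError("N")` would make this return `False` rather than crash.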

Fix fast strategy fall back to ocr_only. The fast strategy should not fall back to a more expensive strategy.

Remove default user ./ssh folder. The default notebook user during image build would create the known_hosts file with incorrect ownership. This is legacy and no longer needed, so it was removed.

Include languages in metadata when partitioning strategy=hi_res or fast. User-defined languages were previously used for text detection, but not included in the resulting element metadata for some strategies. languages will now be included in the metadata regardless of partition strategy for PDFs and images.

Handle a case where Paddle returns a list item in ocr_data as None. In partition, while parsing PaddleOCR data, it was assumed that PaddleOCR does not return None for any list item in ocr_data. Removed the assumption by skipping the text region whenever this happens.
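The "skip the region when the item is None" idea can be sketched as below. The shape of each item, a `(box, (text, confidence))` pair in a flat list, is an assumption made for this example, and `collect_texts` is a hypothetical helper rather than the library's parser.

```python
from typing import Iterable, List, Optional, Tuple

# Assumed item shape: (box corner points, (text, confidence)).
Region = Tuple[list, Tuple[str, float]]


def collect_texts(ocr_data: Iterable[Optional[Region]]) -> List[str]:
    texts = []
    for item in ocr_data:
        if item is None:
            # Do not assume every list item is populated; skip empty regions.
            continue
        _box, (text, _confidence) = item
        texts.append(text)
    return texts
```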

Fix some PDFs returning KeyError: 'N'. Certain PDFs were throwing this error when being opened by pdfminer. Added a wrapper function for pdfminer that allows these documents to be partitioned.

Fix mis-splits on Table chunks. Remedies repeated appearance of full .text_as_html on metadata of each TableChunk split from a Table element too large to fit in the chunking window.

Import tables_agent from inference so that we don't have to initialize a global table agent in unstructured OCR again.

Fix empty table is identified as bulleted-table. A table with no text content was mistakenly identified as a bulleted-table and processed by the wrong branch of the initial HTML partitioner.

Fix partition_html() emits empty (no text) tables. A table with cells nested below a <thead> or <tfoot> element was emitted as a table element having no text and unparseable HTML in element.metadata.text_as_html. Do not emit empty tables to the element stream.

Fix HTML element.metadata.text_as_html contains spurious <br> elements in invalid locations. The HTML generated for the text_as_html metadata for HTML tables contained <br> elements in invalid locations, such as between <table> and <tr>. Change the HTML generator such that these do not appear.

Fix HTML table cells enclosed in <thead> and <tfoot> elements are dropped. HTML table cells nested in a <thead> or <tfoot> element were not detected and the text in those cells was omitted from the table element text and .text_as_html. Detect table rows regardless of the semantic tag they may be nested in.
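Detecting rows "regardless of the semantic tag they may be nested in" amounts to searching for row elements at any depth instead of only as direct children of the table. The sketch below shows this with the standard library's `xml.etree.ElementTree` on well-formed markup; it is not the parser the library uses, and `table_rows` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import List

TABLE = (
    "<table>"
    "<thead><tr><th>Name</th></tr></thead>"
    "<tbody><tr><td>Ada</td></tr></tbody>"
    "<tfoot><tr><td>Total: 1</td></tr></tfoot>"
    "</table>"
)


def table_rows(table_html: str) -> List[List[str]]:
    root = ET.fromstring(table_html)
    rows = []
    # iter("tr") walks all descendants, so rows inside <thead>, <tbody>,
    # or <tfoot> are found just like rows directly under <table>.
    for tr in root.iter("tr"):
        rows.append(
            ["".join(cell.itertext()).strip() for cell in tr if cell.tag in ("td", "th")]
        )
    return rows
```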

Remove whitespace padding from .text_as_html. tabulate inserts padding spaces to achieve visual alignment of columns in HTML tables it generates. Add our own HTML generator to do this simple job and omit that padding as well as newlines ("\n") used for human readability.
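A padding-free table generator of the kind described above can be very small. The following is an illustrative sketch, not the generator that was added to the library; `rows_to_html` is a hypothetical name.

```python
from html import escape
from typing import Sequence


def rows_to_html(rows: Sequence[Sequence[str]]) -> str:
    """Render rows as an HTML table with no padding spaces and no newlines."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"
```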

Fix local connector with absolute input path. When passed an absolute filepath for the input document path, the local connector incorrectly wrote the output file to the input file directory. With this fix, the output in this case is written to output-dir/input-filename.json.
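The intended path computation can be sketched with `pathlib`: keep only the file name of the input and place it under the output directory. Whether the original extension is kept before `.json` is an assumption here, and `output_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def output_path(input_path: str, output_dir: str) -> PurePosixPath:
    # Use only the final path component, so an absolute input path cannot
    # redirect the output into the input file's own directory.
    filename = PurePosixPath(input_path).name
    return PurePosixPath(output_dir) / (filename + ".json")
```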

0.10.30

Enhancements

Support nested DOCX tables. In DOCX, like HTML, a table cell can itself contain a table. In this case, create nested HTML tables to reflect that structure and create a plain-text table that captures all the text in nested tables, formatting it as a reasonable facsimile of a table.

Add connection check to ingest connectors. Each source and destination connector now supports a check_connection() method, which makes sure a valid connection can be established with the source/destination given any authentication credentials in a lightweight request.

Features

Add functionality to do a second OCR on cropped table images. Changes to the values for scaling ENVs affect the entire page OCR output (OCR regression), so we now do a second OCR for tables.

Adds ability to pass timeout for a request when partitioning via a url. partition now accepts a new optional parameter request_timeout which if set will prevent any requests.get from hanging indefinitely and instead will raise a timeout error. This is useful when partitioning a url that may be slow to respond or may not respond at all.

Fixes

Fix logic that determines pdf auto strategy. Previously, _determine_pdf_auto_strategy returned hi_res strategy only if infer_table_structure was true. It now returns the hi_res strategy if either infer_table_structure or extract_images_in_pdf is true.
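The change in the condition can be written out as two small predicates. These are illustrative only: the function names are hypothetical and the rest of the strategy selection is not shown.

```python
def wants_hi_res_before(infer_table_structure: bool, extract_images_in_pdf: bool) -> bool:
    # Old behaviour as described: only table inference triggered hi_res.
    return infer_table_structure


def wants_hi_res_after(infer_table_structure: bool, extract_images_in_pdf: bool) -> bool:
    # New behaviour as described: either option triggers hi_res.
    return infer_table_structure or extract_images_in_pdf
```

The two differ only when image extraction is requested without table inference.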

Fix invalid coordinates when parsing tesseract OCR data. Previously, when parsing tesseract OCR data, the OCR data had invalid bboxes if zoom was set to 0. A check has been added to avoid this error.
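A guard of the kind described above might look like this. It assumes that coordinates are divided by the zoom factor and that a non-positive zoom is treated as 1; both are assumptions for illustration, and `scale_bbox` is a hypothetical helper.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def scale_bbox(bbox: BBox, zoom: float) -> BBox:
    if zoom <= 0:
        # Assumed guard: fall back to no scaling rather than dividing by zero
        # or producing degenerate coordinates.
        zoom = 1
    x1, y1, x2, y2 = bbox
    return (x1 / zoom, y1 / zoom, x2 / zoom, y2 / zoom)
```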

Fix ingest partition parameters not being passed to the API. Previously, when using the --partition-by-api flag via unstructured-ingest, none of the partition arguments were forwarded, so these options were disregarded. With this change, all of the relevant partition arguments are passed through to the API. This allows a user to specify all of the same partition arguments they would locally and have them respected when specifying --partition-by-api.

Support tables in section-less DOCX. Generalize solution for MS Chat Transcripts exported as DOCX by including tables in the partitioned output when present.

Support tables that contain only numbers when partitioning via ocr_only. Tables that contain only numbers are returned as floats in a pandas.DataFrame when the image is converted from .image_to_data(). An AttributeError was raised downstream when trying to .strip() the floats.
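The failure mode above is easy to reproduce without pandas: a float has no `.strip()` method. One simple way to avoid it, shown here as a sketch and not necessarily the fix that was applied, is to convert each cell to a string first. `clean_cell` is a hypothetical helper.

```python
def clean_cell(value: object) -> str:
    # str() first, so numeric cells (e.g. floats) can be stripped like text.
    return str(value).strip()
```

Calling `.strip()` directly on a float such as `3.0` raises `AttributeError`, which is the error described in the note.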

... (truncated)

Changelog

Sourced from unstructured's changelog.

Commits

ccda93b chore: bump inference to 0.7.15 release unst 0.11.0 (#2110)
ee9be2a fix: assorted partition_html() bugs (#2113)
13a23de fix: local connector with input path to single file (#2116)
d623d75 Fix: incorrect figure mapping (#2111)
5ba3b9c chore: get eval metrics from ingest in (#2097)
ee62ed7 rfctr(html): clean types and docs in prep for HTML table parsing fixes (#2104)
ef8ac72 Chore: Import tables_agent from inference (#2087)
97a25b0 Chore: move hi res initialization initialize.py file out of ingest (#2096)
9c66eab Fix: handle pdf text extraction errors (#2101)
a589a49 docx: improve page break fidelity (#1631)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot merge will merge this PR after your CI passes on it
@dependabot squash and merge will squash and merge this PR after your CI passes on it
@dependabot cancel merge will cancel a previously requested merge and block automerging
@dependabot reopen will reopen this PR if it is closed
@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
@dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [unstructured](https://github.com/Unstructured-IO/unstructured) from 0.10.29 to 0.11.0.
- [Release notes](https://github.com/Unstructured-IO/unstructured/releases)
- [Changelog](https://github.com/Unstructured-IO/unstructured/blob/main/CHANGELOG.md)
- [Commits](Unstructured-IO/unstructured@0.10.29...0.11.0)

---
updated-dependencies:
- dependency-name: unstructured
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

railway-app · 2023-11-20T23:43:54Z

This PR was not deployed automatically as @dependabot[bot] does not have access to the Railway project.

In order to get automatic PR deploys, please add @dependabot[bot] to the project inside the project settings page.

lintrule-review · 2023-11-20T23:43:54Z

You need to setup a payment method to use Lintrule

You can fix that by putting in a card here.

dependabot · 2023-12-04T23:26:16Z

Superseded by #160.

dependabot bot added the dependencies Pull requests that update a dependency file label Nov 20, 2023

dependabot bot mentioned this pull request Nov 20, 2023

Bump unstructured from 0.10.29 to 0.10.30 #143

Closed

dependabot bot closed this Dec 4, 2023

dependabot bot deleted the dependabot/pip/unstructured-0.11.0 branch December 4, 2023 23:26
