Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump unstructured from 0.10.29 to 0.10.30 #143

Closed
wants to merge 1 commit into from

Conversation

dependabot[bot]
Copy link
Contributor

@dependabot dependabot bot commented on behalf of github Nov 14, 2023

Bumps unstructured from 0.10.29 to 0.10.30.

Release notes

Sourced from unstructured's releases.

0.10.30

Enhancements

  • Support nested DOCX tables. In DOCX, like HTML, a table cell can itself contain a table. In this case, create nested HTML tables to reflect that structure and create a plain-text table with captures all the text in nested tables, formatting it as a reasonable facsimile of a table.
  • Add connection check to ingest connectors Each source and destination connector now support a check_connection() method which makes sure a valid connection can be established with the source/destination given any authentication credentials in a lightweight request.

Features

  • Add functionality to do a second OCR on cropped table images. Changes to the values for scaling ENVs affect entire page OCR output(OCR regression) so we now do a second OCR for tables.
  • Adds ability to pass timeout for a request when partitioning via a url. partition now accepts a new optional parameter request_timeout which if set will prevent any requests.get from hanging indefinitely and instead will raise a timeout error. This is useful when partitioning a url that may be slow to respond or may not respond at all.

Fixes

  • Fix logic that determines pdf auto strategy. Previously, _determine_pdf_auto_strategy returned hi_res strategy only if infer_table_structure was true. It now returns the hi_res strategy if either infer_table_structure or extract_images_in_pdf is true.
  • Fix invalid coordinates when parsing tesseract ocr data. Previously, when parsing tesseract ocr data, the ocr data had invalid bboxes if zoom was set to 0. A logical check is now added to avoid such error.
  • Fix ingest partition parameters not being passed to the api. When using the --partition-by-api flag via unstructured-ingest, none of the partition arguments are forwarded, meaning that these options are disregarded. With this change, we now pass through all of the relevant partition arguments to the api. This allows a user to specify all of the same partition arguments they would locally and have them respected when specifying --partition-by-api.
  • Support tables in section-less DOCX. Generalize solution for MS Chat Transcripts exported as DOCX by including tables in the partitioned output when present.
  • Support tables that contain only numbers when partitioning via ocr_only Tables that contain only numbers are returned as floats in a pandas.DataFrame when the image is converted from .image_to_data(). An AttributeError was raised downstream when trying to .strip() the floats.
  • Improve DOCX page-break detection. DOCX page breaks are reliably indicated by w:lastRenderedPageBreak elements present in the document XML. Page breaks are NOT reliably indicated by "hard" page-breaks inserted by the author and when present are redundant to a w:lastRenderedPageBreak element so cause over-counting if used. Use rendered page-breaks only.
Changelog

Sourced from unstructured's changelog.

0.10.30

Enhancements

  • Support nested DOCX tables. In DOCX, like HTML, a table cell can itself contain a table. In this case, create nested HTML tables to reflect that structure and create a plain-text table with captures all the text in nested tables, formatting it as a reasonable facsimile of a table.
  • Add connection check to ingest connectors Each source and destination connector now support a check_connection() method which makes sure a valid connection can be established with the source/destination given any authentication credentials in a lightweight request.

Features

  • Add functionality to do a second OCR on cropped table images. Changes to the values for scaling ENVs affect entire page OCR output(OCR regression) so we now do a second OCR for tables.
  • Adds ability to pass timeout for a request when partitioning via a url. partition now accepts a new optional parameter request_timeout which if set will prevent any requests.get from hanging indefinitely and instead will raise a timeout error. This is useful when partitioning a url that may be slow to respond or may not respond at all.

Fixes

  • Fix logic that determines pdf auto strategy. Previously, _determine_pdf_auto_strategy returned hi_res strategy only if infer_table_structure was true. It now returns the hi_res strategy if either infer_table_structure or extract_images_in_pdf is true.
  • Fix invalid coordinates when parsing tesseract ocr data. Previously, when parsing tesseract ocr data, the ocr data had invalid bboxes if zoom was set to 0. A logical check is now added to avoid such error.
  • Fix ingest partition parameters not being passed to the api. When using the --partition-by-api flag via unstructured-ingest, none of the partition arguments are forwarded, meaning that these options are disregarded. With this change, we now pass through all of the relevant partition arguments to the api. This allows a user to specify all of the same partition arguments they would locally and have them respected when specifying --partition-by-api.
  • Support tables in section-less DOCX. Generalize solution for MS Chat Transcripts exported as DOCX by including tables in the partitioned output when present.
  • Support tables that contain only numbers when partitioning via ocr_only Tables that contain only numbers are returned as floats in a pandas.DataFrame when the image is converted from .image_to_data(). An AttributeError was raised downstream when trying to .strip() the floats.
  • Improve DOCX page-break detection. DOCX page breaks are reliably indicated by w:lastRenderedPageBreak elements present in the document XML. Page breaks are NOT reliably indicated by "hard" page-breaks inserted by the author and when present are redundant to a w:lastRenderedPageBreak element so cause over-counting if used. Use rendered page-breaks only.
Commits

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [unstructured](https://github.com/Unstructured-IO/unstructured) from 0.10.29 to 0.10.30.
- [Release notes](https://github.com/Unstructured-IO/unstructured/releases)
- [Changelog](https://github.com/Unstructured-IO/unstructured/blob/main/CHANGELOG.md)
- [Commits](Unstructured-IO/unstructured@0.10.29...0.10.30)

---
updated-dependencies:
- dependency-name: unstructured
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
@dependabot dependabot bot added the dependencies Pull requests that update a dependency file label Nov 14, 2023
Copy link

You need to setup a payment method to use Lintrule

You can fix that by putting in a card here.

Copy link

railway-app bot commented Nov 14, 2023

This PR was not deployed automatically as @dependabot[bot] does not have access to the Railway project.

In order to get automatic PR deploys, please add @dependabot[bot] to the project inside the project settings page.

Copy link
Contributor Author

dependabot bot commented on behalf of github Nov 20, 2023

Superseded by #149.

@dependabot dependabot bot closed this Nov 20, 2023
@dependabot dependabot bot deleted the dependabot/pip/unstructured-0.10.30 branch November 20, 2023 23:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

0 participants