Bump unstructured from 0.10.29 to 0.11.2 #160

Closed
wants to merge 24 commits into from

@dependabot dependabot bot commented on behalf of github Dec 4, 2023

Bumps unstructured from 0.10.29 to 0.11.2.

Release notes

Sourced from unstructured's releases.

0.11.2

Enhancements

  • Updated Documentation: (i) Added examples, and (ii) API Documentation, including Usage, SDKs, Azure Marketplace, and parameters and validation errors.

Features

  • Add Pinecone destination connector. Problem: After ingesting data from a source, users might want to produce embeddings for their data and write these into a vector DB. Pinecone is an option among these vector databases. Feature: Added Pinecone destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into Pinecone.

Fixes

  • Process chunking parameter names in ingest correctly. Solves a bug where chunking parameters weren't being processed and used by the ingest CLI by renaming faulty parameter names and prepends; adds relevant parameters to the ingest Pinecone test to verify that the parameters are functional.

0.11.1

Enhancements

  • Use pikepdf to repair invalid PDF structure for PDFminer. When PDFminer raises PSSyntaxError while opening the document and creating the PDFminer pages object, or while processing a single PDF page, repair the PDF structure with pikepdf.

  • Batch Source Connector support. For instances where it is more efficient to read content from a source connector in batches, a new batch ingest doc is added, which creates multiple ingest docs after reading them in batches per process.
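
The batch-reading idea above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the connector code from unstructured; the function name `read_in_batches` and the record ids are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def read_in_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists holding at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # Pull the next slice from the shared iterator; an empty slice means we are done.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical record ids standing in for documents exposed by a source.
record_ids = ["doc-1", "doc-2", "doc-3", "doc-4", "doc-5"]
batches = list(read_in_batches(record_ids, batch_size=2))
```

Each yielded batch could then be expanded into one ingest doc per record, which is roughly the shape the release note describes.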

Features

  • Staging Brick for COCO Format. Staging brick which converts a list of Elements into COCO format.
  • Adds HubSpot connector. Adds a connector to retrieve calls, communications, emails, notes, products and tickets from HubSpot.

Fixes

  • Do not extract text of <style> tags in HTML. <style> tags containing CSS in invalid positions previously contributed to element text. Do not consider text node of a <style> element as textual content.
  • Fix DOCX merged table cell repeats cell text. Only include text for a merged cell, not for each underlying cell spanned by the merge.
  • Fix tables not extracted from DOCX header/footers. Tables defined in DOCX headers and footers, commonly used for layout/alignment purposes, were previously skipped. Extract text from these tables as a string and include it in the Header and Footer document elements.
  • Fix output filepath for fsspec-based source connectors. Previously the base directory was being included in the output filepath unnecessarily.
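
To illustrate the <style> fix listed above, here is a small sketch built on the standard library's `html.parser`. It is not the parser unstructured uses; the class name `VisibleTextExtractor` and the helper `visible_text` are made up for this example.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside a <style> element."""

    def __init__(self):
        super().__init__()
        self._style_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._style_depth += 1

    def handle_endtag(self, tag):
        if tag == "style" and self._style_depth > 0:
            self._style_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # CSS inside <style> is a text node too, so it must be filtered explicitly.
        if text and self._style_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = "<div><style>p { color: red; }</style><p>Hello</p> <p>world</p></div>"
result = visible_text(sample)
```

Here the <style> element sits inside a <div>, which is the kind of invalid position the fix mentions, and its CSS does not leak into the extracted text.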

0.11.0

Enhancements

  • Add a class for the strategy constants. Add a class PartitionStrategy for the strategy constants and use the constants to replace strategy strings.
  • Temporary support for paddle language parameter. Users can specify the default language code for paddle with the environment variable DEFAULT_PADDLE_LANG until a language mapping for paddle is available.
  • Improve DOCX page-break fidelity. Improve page-break fidelity such that a paragraph containing a page-break is split into two elements, one containing the text before the page-break and the other the text after. Emit the PageBreak element between these two and assign the correct page-number (n and n+1 respectively) to the two textual elements.
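
The page-break behavior described above can be sketched as follows. This is only an illustration under stated assumptions: a form feed character stands in for the DOCX page-break marker, the classes `TextElement` and `PageBreakMarker` are stand-ins rather than the library's element types, and only the first page break in a paragraph is handled.

```python
from dataclasses import dataclass
from typing import List, Union

PAGE_BREAK = "\f"  # assumption: a form feed marks the page break in this sketch


@dataclass
class TextElement:
    text: str
    page_number: int


@dataclass
class PageBreakMarker:
    pass


def split_on_page_break(
    paragraph: str, page_number: int
) -> List[Union[TextElement, PageBreakMarker]]:
    """Split a paragraph at its first page break into before/after elements."""
    if PAGE_BREAK not in paragraph:
        return [TextElement(paragraph, page_number)]
    before, after = paragraph.split(PAGE_BREAK, 1)
    elements: List[Union[TextElement, PageBreakMarker]] = []
    if before.strip():
        # Text before the break stays on page n.
        elements.append(TextElement(before.strip(), page_number))
    elements.append(PageBreakMarker())
    if after.strip():
        # Text after the break belongs to page n + 1.
        elements.append(TextElement(after.strip(), page_number + 1))
    return elements


elements = split_on_page_break("end of page one\fstart of page two", page_number=3)
```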

Features

  • Add ad-hoc fields to ElementMetadata instance. End-users can now add their own metadata fields simply by assigning to an element-metadata attribute-name of their choice, like element.metadata.coefficient = 0.58. These fields will round-trip through JSON and can be accessed with dotted notation.
  • MongoDB Destination Connector. New destination connector added to all CLI ingest commands to support writing partitioned JSON output to MongoDB.
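
The ad-hoc metadata feature above amounts to attribute assignment plus a JSON round trip. The sketch below shows that pattern with a stand-in `Metadata` class; it is not the library's ElementMetadata, and the `to_dict`/`from_dict` methods here are assumptions made for the illustration.

```python
import json


class Metadata:
    """Stand-in for a metadata object that accepts arbitrary fields."""

    def __init__(self, **fields):
        for name, value in fields.items():
            setattr(self, name, value)

    def to_dict(self):
        # Every attribute, known or ad hoc, lives in the instance dict.
        return dict(vars(self))

    @classmethod
    def from_dict(cls, data):
        return cls(**data)


meta = Metadata(filename="report.docx")
meta.coefficient = 0.58  # ad-hoc field added by plain assignment

serialized = json.dumps(meta.to_dict())
restored = Metadata.from_dict(json.loads(serialized))
```

After the round trip, `restored.coefficient` is available with dotted notation, matching the behavior the release note describes.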

Fixes

  • Fix TYPE_TO_TEXT_ELEMENT_MAP. Updated Figure mapping from FigureCaption to Image.
  • Handle errors when extracting PDF text. Certain PDFs throw unexpected errors when being opened by pdfminer, causing partition_pdf() to fail. We expect to be able to partition smoothly using an alternative strategy if text extraction doesn't work. Added exception handling to handle unexpected errors when extracting PDF text and to help determine the PDF strategy.
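
The fallback described in the last fix can be sketched as a guarded extraction step that feeds strategy selection. This is a hedged illustration, not the code in partition_pdf(): the function `choose_pdf_strategy`, the two extractor callables, and the choice of "ocr_only" as the fallback are all assumptions.

```python
from typing import Callable


def choose_pdf_strategy(path: str, extract_text: Callable[[str], str]) -> str:
    """Pick a strategy name based on whether text extraction succeeds."""
    try:
        text = extract_text(path)
    except Exception:
        # Deliberately broad for this sketch: any unexpected extraction error
        # is treated as "no extractable text" instead of aborting the run.
        text = ""
    return "fast" if text.strip() else "ocr_only"


def broken_extractor(path: str) -> str:
    # Hypothetical extractor that fails the way a malformed PDF might.
    raise ValueError("malformed PDF")


def working_extractor(path: str) -> str:
    # Hypothetical extractor for a PDF with an embedded text layer.
    return "Some extractable text"


fallback = choose_pdf_strategy("broken.pdf", broken_extractor)
normal = choose_pdf_strategy("ok.pdf", working_extractor)
```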

... (truncated)

Commits
  • 039ae17 build(release): release commit for 0.11.2 (#2191)
  • d80abf0 Reorganized the Examples section in Documentation & add Databricks example (#...
  • ed08773 feat: add pinecone destination connector (#1774)
  • 341f0f4 Add coco staging brick to unstructured base (#2180)
  • c028a14 chore: enable azure destination CI tests (#2172)
  • 92dae8c Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137)
  • 2d450c4 fix: skipped file not found error (#2188)
  • 7ad8e88 feat: leverage logger to hide sensitive data in ingest logs (#2175)
  • 1576e0b docs: update docker image link (#2186)
  • b951d73 feat: add logging to ingest CLI for tests being skipped at the end (#2174)
  • Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

star-nox and others added 24 commits November 6, 2023 12:48
Bumps [pymupdf](https://github.com/pymupdf/pymupdf) from 1.22.5 to 1.23.6.
- [Release notes](https://github.com/pymupdf/pymupdf/releases)
- [Changelog](https://github.com/pymupdf/PyMuPDF/blob/main/changes.txt)
- [Commits](pymupdf/PyMuPDF@1.22.5...1.23.6)

---
updated-dependencies:
- dependency-name: pymupdf
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* should be fully working, in final testing

* trying to fix double nested kwargs

* fixing readable_filename in pdf ingest

* apt install tesseract-ocr, LAME

* remove stupid typo

* minor bug

* Finally fix **kwargs passing

* minor fix

* guarding against webscrape kwargs in pdf

* guarding against webscrape kwargs in pdf

* guarding against webscrape kwargs in pdf

* adding better error messages

* revert req changes

* simplify prints

Bumps [typing-extensions](https://github.com/python/typing_extensions) from 4.7.1 to 4.8.0.
- [Release notes](https://github.com/python/typing_extensions/releases)
- [Changelog](https://github.com/python/typing_extensions/blob/main/CHANGELOG.md)
- [Commits](python/typing_extensions@4.7.1...4.8.0)

---
updated-dependencies:
- dependency-name: typing-extensions
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kastan Day <[email protected]>

Bumps [flask](https://github.com/pallets/flask) from 2.3.3 to 3.0.0.
- [Release notes](https://github.com/pallets/flask/releases)
- [Changelog](https://github.com/pallets/flask/blob/main/CHANGES.rst)
- [Commits](pallets/flask@2.3.3...3.0.0)

---
updated-dependencies:
- dependency-name: flask
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kastan Day <[email protected]>

* updated nomic version in requirements.txt

* initial commit to PR

* created API endpoint

* completed export function

* testing csv export on railway

* code to remove file from repo after download

* moved file storing out of docs folder

Bumps [unstructured](https://github.com/Unstructured-IO/unstructured) from 0.10.29 to 0.11.2.
- [Release notes](https://github.com/Unstructured-IO/unstructured/releases)
- [Changelog](https://github.com/Unstructured-IO/unstructured/blob/0.11.2/CHANGELOG.md)
- [Commits](Unstructured-IO/unstructured@0.10.29...0.11.2)

---
updated-dependencies:
- dependency-name: unstructured
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
@dependabot dependabot bot added the dependencies Pull requests that update a dependency file label Dec 4, 2023

You need to set up a payment method to use Lintrule

You can fix that by putting in a card here.

railway-app bot commented Dec 4, 2023

This PR was not deployed automatically as @dependabot[bot] does not have access to the Railway project.

In order to get automatic PR deploys, please add @dependabot[bot] to the project inside the project settings page.

dependabot bot commented on behalf of github Dec 15, 2023

Superseded by #171.

@dependabot dependabot bot closed this Dec 15, 2023
@dependabot dependabot bot deleted the dependabot/pip/unstructured-0.11.2 branch December 15, 2023 17:10
Labels
dependencies Pull requests that update a dependency file