Bump unstructured from 0.10.29 to 0.11.2 #160

dependabot · 2023-12-04T23:26:13Z

Bumps unstructured from 0.10.29 to 0.11.2.

Release notes

0.11.2

Enhancements

Updated Documentation: (i) Added examples, and (ii) API Documentation, including Usage, SDKs, Azure Marketplace, and parameters and validation errors.

Features

Add Pinecone destination connector. Problem: After ingesting data from a source, users might want to produce embeddings for their data and write these into a vector DB. Pinecone is an option among these vector databases. Feature: Added Pinecone destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into Pinecone.

Fixes

Process chunking parameter names in ingest correctly. Fixes a bug where the ingest CLI did not process or use chunking parameters, by renaming faulty parameter names and prepends; adds relevant parameters to the ingest Pinecone test to verify that the parameters are functional.

0.11.1

Enhancements

Use pikepdf to repair invalid PDF structure for PDFminer. The repair is applied when PDFminer raises PSSyntaxError while opening the document and creating the PDFminer pages object, or while processing a single PDF page.

Batch Source Connector support. For instances where it is more optimal to read content from a source connector in batches, a new batch ingest doc is added which creates multiple ingest docs after reading them in batches per process.
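
The batching idea can be sketched with the standard library alone. This is a minimal illustration, not the connector's actual code: the helper name `batched` and the batch size are assumptions introduced here.

```python
from itertools import islice


def batched(items, size):
    """Yield consecutive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical usage: read five remote records two at a time.
batches = list(batched(range(5), 2))
```

Each yielded batch could then be expanded into one ingest doc per record, which is roughly the behavior the release note describes.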

Features

Staging Brick for Coco Format. Staging brick which converts a list of Elements into Coco Format.

Adds HubSpot connector. Adds a connector to retrieve calls, communications, emails, notes, products and tickets from HubSpot.

Fixes

Do not extract text of <style> tags in HTML. <style> tags containing CSS in invalid positions previously contributed to element text. Do not consider text node of a <style> element as textual content.
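
A minimal sketch of the same rule, using only the standard-library `html.parser` rather than the parser unstructured itself uses. The class and function names are hypothetical; the point is only that text nodes inside `<style>` are dropped wherever the tag appears.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside a <style> element."""

    def __init__(self):
        super().__init__()
        self._style_depth = 0  # greater than 0 while inside <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._style_depth += 1

    def handle_endtag(self, tag):
        if tag == "style" and self._style_depth > 0:
            self._style_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._style_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of `html`, with CSS from <style> removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For example, `extract_text("<p>Hello <style>p { color: red; }</style>world</p>")` keeps only the two words around the misplaced `<style>` tag.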

Fix DOCX merged table cell repeats cell text. Only include text for a merged cell, not for each underlying cell spanned by the merge.
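
One way to picture the fix, assuming (as a simplification) that a merged cell shows up as the same cell object repeated at every grid position it spans. The `Cell` class and `row_texts` helper are hypothetical stand-ins, not the library's code.

```python
class Cell:
    """Hypothetical table cell holding only its text."""

    def __init__(self, text):
        self.text = text


def row_texts(row):
    """Return each distinct cell's text once, in order of first appearance."""
    seen = set()
    texts = []
    for cell in row:
        if id(cell) in seen:
            continue  # same merged cell seen at an earlier grid position
        seen.add(id(cell))
        texts.append(cell.text)
    return texts


merged = Cell("Total")
row = [merged, merged, Cell("42")]  # "Total" spans two grid columns
```

Deduplicating by object identity rather than by text keeps two separate cells that happen to contain the same string.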

Fix tables not extracted from DOCX header/footers. Previously, partitioning the headers and footers of DOCX documents skipped tables defined there, which are commonly used for layout/alignment purposes. Extract text from these tables as a string and include it in the Header and Footer document elements.

Fix output filepath for fsspec-based source connectors. Previously the base directory was being included in the output filepath unnecessarily.
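
The intent can be sketched as computing the output path relative to the remote base directory. The function name, its arguments, and the `.json` suffix are assumptions made for illustration; the release note does not specify them.

```python
import posixpath


def output_path(output_dir, remote_base, remote_file):
    """Place the output under output_dir without repeating remote_base."""
    relative = posixpath.relpath(remote_file, start=remote_base)
    return posixpath.join(output_dir, relative + ".json")


# Hypothetical remote layout: base directory "bucket/docs".
result = output_path("out", "bucket/docs", "bucket/docs/reports/q3.pdf")
```

With the base directory stripped, the file lands at `out/reports/q3.pdf.json` instead of nesting `bucket/docs` under the output directory.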

0.11.0

Enhancements

Add a class for the strategy constants. Add a class PartitionStrategy for the strategy constants and use the constants to replace strategy strings.
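
A constants class of this kind might look like the sketch below. The release note confirms only the class name `PartitionStrategy`; the member names, the string values, and the `validate_strategy` helper are assumptions for illustration.

```python
class PartitionStrategy:
    """Named constants for partitioning strategies (values are assumed)."""

    AUTO = "auto"
    FAST = "fast"
    HI_RES = "hi_res"
    OCR_ONLY = "ocr_only"


VALID_STRATEGIES = frozenset(
    {
        PartitionStrategy.AUTO,
        PartitionStrategy.FAST,
        PartitionStrategy.HI_RES,
        PartitionStrategy.OCR_ONLY,
    }
)


def validate_strategy(strategy):
    """Return the strategy unchanged, or raise if it is not a known constant."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return strategy
```

Referring to `PartitionStrategy.HI_RES` instead of a bare string lets a misspelling fail at attribute lookup rather than silently selecting nothing.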

Temporary Support for paddle language parameter. Users can specify the default language code for paddle with the environment variable DEFAULT_PADDLE_LANG until a language mapping for paddle is available.
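
Reading such a variable typically looks like the sketch below. Only the variable name `DEFAULT_PADDLE_LANG` comes from the release note; the helper name and the `"en"` fallback are assumptions.

```python
import os


def default_paddle_lang(fallback="en"):
    """Return the paddle language code from the environment, if set."""
    # The "en" fallback is an assumption, not a documented default.
    return os.environ.get("DEFAULT_PADDLE_LANG", fallback)
```

Setting the variable before the process starts, or in `os.environ` before the first call, changes the value this helper returns.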

Improve DOCX page-break fidelity. Improve page-break fidelity such that a paragraph containing a page-break is split into two elements, one containing the text before the page-break and the other the text after. Emit the PageBreak element between these two and assign the correct page-number (n and n+1 respectively) to the two textual elements.
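
The splitting rule can be sketched on plain strings. Everything here is a simplified stand-in: the `Element` dataclass, the use of a form-feed character as the page-break marker, and the page number given to the PageBreak element are assumptions, and only the first break in a paragraph is handled.

```python
from dataclasses import dataclass

PAGE_BREAK = "\f"  # stand-in marker for a DOCX page break


@dataclass
class Element:
    kind: str
    text: str
    page_number: int


def split_on_page_break(text, page_number):
    """Split a paragraph at its first page break into separate elements."""
    before, marker, after = text.partition(PAGE_BREAK)
    if not marker:
        return [Element("Text", text, page_number)]
    elements = []
    if before.strip():
        elements.append(Element("Text", before.strip(), page_number))
    # Assumption: the break itself is reported on the page it ends.
    elements.append(Element("PageBreak", "", page_number))
    if after.strip():
        elements.append(Element("Text", after.strip(), page_number + 1))
    return elements


parts = split_on_page_break("end of page one\fstart of page two", 1)
```

The two textual elements come out with page numbers n and n+1, with the PageBreak element emitted between them.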

Features

Add ad-hoc fields to ElementMetadata instance. End-users can now add their own metadata fields simply by assigning to an element-metadata attribute-name of their choice, like element.metadata.coefficient = 0.58. These fields will round-trip through JSON and can be accessed with dotted notation.
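
The round-trip behavior can be modeled with a small stand-in class. This is not the library's `ElementMetadata`; the `Metadata` class and its `to_dict` and `from_dict` methods are hypothetical and only illustrate how an ad-hoc attribute can survive JSON serialization.

```python
import json


class Metadata:
    """Hypothetical stand-in for an element-metadata object."""

    def __init__(self, **fields):
        for name, value in fields.items():
            setattr(self, name, value)

    def to_dict(self):
        # Every instance attribute, declared or ad hoc, is serialized.
        return dict(vars(self))

    @classmethod
    def from_dict(cls, data):
        return cls(**data)


meta = Metadata(filename="report.docx")
meta.coefficient = 0.58  # ad-hoc field added by plain assignment

restored = Metadata.from_dict(json.loads(json.dumps(meta.to_dict())))
```

After the round trip, `restored.coefficient` is available with dotted notation, just like the field set at construction.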

MongoDB Destination Connector. New destination connector added to all CLI ingest commands to support writing partitioned JSON output to MongoDB.

Fixes

Fix TYPE_TO_TEXT_ELEMENT_MAP. Updated Figure mapping from FigureCaption to Image.

Handle errors when extracting PDF text. Certain PDFs throw unexpected errors when being opened by pdfminer, causing partition_pdf() to fail. We expect to be able to partition smoothly using an alternative strategy if text extraction doesn't work. Added exception handling to handle unexpected errors when extracting PDF text and to help determine the PDF strategy.
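
The fallback pattern can be sketched as follows. The function names, the injected `extractor` callable, and the strategy strings are assumptions; the real logic inside `partition_pdf()` is not shown in the release note.

```python
def try_extract_text(path, extractor):
    """Return extracted text, or None if the extractor fails for any reason."""
    try:
        return extractor(path)
    except Exception:
        # Unexpected parser errors are treated as "no extractable text".
        return None


def choose_strategy(path, extractor):
    """Pick a text-based strategy only when text extraction actually works."""
    text = try_extract_text(path, extractor)
    if text and text.strip():
        return "fast"
    return "ocr_only"
```

Passing the extractor in as a callable keeps the sketch independent of any PDF library; a failing extractor leads to the alternative strategy instead of an exception.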

... (truncated)

Commits

039ae17 build(release): release commit for 0.11.2 (#2191)
d80abf0 Reorganized the Examples section in Documentation & add Databricks example (#...
ed08773 feat: add pinecone destination connector (#1774)
341f0f4 Add coco staging brick to unstructured base (#2180)
c028a14 chore: enable azure destination CI tests (#2172)
92dae8c Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137)
2d450c4 fix: skipped file not found error (#2188)
7ad8e88 feat: leverage logger to hide sensitive data in ingest logs (#2175)
1576e0b docs: update docker image link (#2186)
b951d73 feat: add logging to ingest CLI for tests being skipped at the end (#2174)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot merge will merge this PR after your CI passes on it
@dependabot squash and merge will squash and merge this PR after your CI passes on it
@dependabot cancel merge will cancel a previously requested merge and block automerging
@dependabot reopen will reopen this PR if it is closed
@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
@dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [pymupdf](https://github.com/pymupdf/pymupdf) from 1.22.5 to 1.23.6. - [Release notes](https://github.com/pymupdf/pymupdf/releases) - [Changelog](https://github.com/pymupdf/PyMuPDF/blob/main/changes.txt) - [Commits](pymupdf/PyMuPDF@1.22.5...1.23.6) --- updated-dependencies: - dependency-name: pymupdf dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* should be fully working, in final testing
* trying to fix double nested kwargs
* fixing readable_filename in pdf ingest
* apt install tesseract-ocr, LAME
* remove stupid typo
* minor bug
* Finally fix **kwargs passing
* minor fix
* guarding against webscrape kwargs in pdf
* guarding against webscrape kwargs in pdf
* guarding against webscrape kwargs in pdf
* adding better error messages
* revert req changes
* simplify prints

Bumps [typing-extensions](https://github.com/python/typing_extensions) from 4.7.1 to 4.8.0. - [Release notes](https://github.com/python/typing_extensions/releases) - [Changelog](https://github.com/python/typing_extensions/blob/main/CHANGELOG.md) - [Commits](python/typing_extensions@4.7.1...4.8.0) --- updated-dependencies: - dependency-name: typing-extensions dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Kastan Day <[email protected]>

Bumps [flask](https://github.com/pallets/flask) from 2.3.3 to 3.0.0. - [Release notes](https://github.com/pallets/flask/releases) - [Changelog](https://github.com/pallets/flask/blob/main/CHANGES.rst) - [Commits](pallets/flask@2.3.3...3.0.0) --- updated-dependencies: - dependency-name: flask dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Kastan Day <[email protected]>

* updated nomic version in requirements.txt
* initial commit to PR
* created API endpoint
* completed export function
* testing csv export on railway
* code to remove file from repo after download
* moved file storing out of docs folder

Bumps [unstructured](https://github.com/Unstructured-IO/unstructured) from 0.10.29 to 0.11.2. - [Release notes](https://github.com/Unstructured-IO/unstructured/releases) - [Changelog](https://github.com/Unstructured-IO/unstructured/blob/0.11.2/CHANGELOG.md) - [Commits](Unstructured-IO/unstructured@0.10.29...0.11.2) --- updated-dependencies: - dependency-name: unstructured dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]>

lintrule-review · 2023-12-04T23:26:16Z

You need to set up a payment method to use Lintrule

You can fix that by putting in a card here.

railway-app · 2023-12-04T23:26:16Z

This PR was not deployed automatically as @dependabot[bot] does not have access to the Railway project.

In order to get automatic PR deploys, please add @dependabot[bot] to the project inside the project settings page.

dependabot · 2023-12-15T17:09:58Z

Superseded by #171.

star-nox and others added 24 commits November 6, 2023 12:48

Updated Nomic in requirements.txt

9ba6e9d

fix openai version to pre 1.0

ca806eb

upgrade python from 3.8 to 3.10

17a7779

trying to fix tesseract // pdfminer requirements for image ingest

8562de7

adding strict versions to all requirements

be34f01

compatible wheel version

170ed79

upgrade pip during image startup

6b94aac

properly upgrade pip

c084960

Fully lock ALL requirements. Hopefully speed up build times, too

f4b8bd9

Limit unstructured dependencies, image ballooned from 700MB to 6GB. Hopefully resolved

4e80002

Lock version of pip

abf1fc2

Lock (correct) version of pip

8a8eac2

add libgl1 for cv2 in Docker (for unstructured)

cf78800

adding proper error logging to image ingest

62883e8

Installing unstructured requirements individually to hopefully reduce bundle size by 5GB

fcfa485

Reduce use of unstructured, hopefully the install is much smaller now

97bbbd9

Guard against kwargs failures during webscrape

a5b418c

HOTFIX: kwargs in html and pdf ingest for /webscrape

0d371ba

dependabot bot added the dependencies Pull requests that update a dependency file label Dec 4, 2023

dependabot bot mentioned this pull request Dec 4, 2023

Bump unstructured from 0.10.29 to 0.11.0 #149

Closed

KastanDay force-pushed the main branch from d345a88 to 7306dc3 Compare December 15, 2023 17:09

dependabot bot closed this Dec 15, 2023

dependabot bot deleted the dependabot/pip/unstructured-0.11.2 branch December 15, 2023 17:10
