feat: LlamaParseReader: update supported file types, add support for array of file Paths, add support for directory Paths + example #808

ghost · 2024-05-04T09:16:17Z

Should be mostly up to date with the python version now.

Main thing still missing is get_images function, will look at that next.

changeset-bot · 2024-05-04T09:16:21Z

⚠️ No Changeset found

Latest commit: 434175e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

vercel · 2024-05-04T09:16:22Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
llama-index-ts-docs	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	May 5, 2024 2:33pm

ghost · 2024-05-05T14:28:09Z

I tried implementing it more with SimpleDirectoryReader by also defining numWorkers and adding parallel processing, but I had issues with it not respecting the numWorkers from LlamaParseReader, so I added directory reading straight into LlamaParseReader.

If someone wants to check out parallel processing in SimpleDirectoryReader and maybe has an idea how to respect the number of "workers": Here it is on a different branch and the Commit

ghost · 2024-05-05T21:56:36Z

Something causing a Connect Timeout Error at loadFile and loadData, gonna check that out

himself65 · 2024-05-05T23:32:17Z

packages/core/src/readers/type.ts

+  loadData(filePath: string, fs?: CompleteFileSystem): Promise<Document[]>;
+  loadData(filePaths: string[], fs?: CompleteFileSystem): Promise<Document[][]>;
+  loadData(
+    directoryPath: string,
+    fs?: CompleteFileSystem,
+  ): Promise<Document[][]>;


Should we support directory input?

I think directory input is essential in order to take full advantage of the ability to load and send multiple files at once.

But yeah the smarter way would be to integrate it in Simple Directory Reader but I didn't figure out how to not overload the Parser when adding parallel processing to the Directory Reader

I'm gonna check it out again and see. I'll probably have some time in 2-3 days.

The Connect Timeout Error seems to be kind of random. Sometimes it works as expected, sometimes it times out, so I'll recheck everything anyway.

yes, the SimpleDirectoryReader already supports loading multiple files from a directory.

I think we have three concerns:

Walking the filesystem (already done by the SimpleDirectoryReader)

Reading a single file (done by the reader)

Loading files in parallel (new concern in this PR)

If we handle this new concern by a single reader (here LlamaParseReader), then we can only use it with this reader.

How about adding numWorkers to SimpleDirectoryReader instead?

@InsightByAI if you like you can split your PR in two, the first one just adding the missing filetypes

@marcusschiesser The issue for me with adding numWorkers to SimpleDirectoryReader was that it would load in batches just fine, but it would load the next batch right after finishing the 1st without waiting for the parse results.

I think the options would be to either:

have SimpleDirectoryReader wait for a finish signal from the Parser

have SimpleDirectoryReader pass on a list of all filePaths and let the parallel loading be handled by the Parser

I concentrated only on LlamaParse because I think the requirements for numWorkers in most other readers is quite different, since in LlamaParse the "workers" kind of refers to the max amount of files the LlamaParse API can accept at once, which is limited to 10.

If I have powerful enough hardware, I could probably use a 100 or more "workers" to load a 100 files parallel with PDFReader.
Not sure if a numWorkers that limits one reader to 10 workers and has to wait for results from an API endpoint, while allowing another one to use unlimited workers and only be limited by hardware speed would be easy to implement/work that well?

I'm going to check how python implements this. The Llama-Parse standalone can accept arrays of files but not a directory afaik, so I'll check how that works with an equivalent llama-index python directory Reader.

I'll split the PR later. Thanks for the elaborate review!

Can you re-add numWorkers to SimpleDirectorReader and add a test case for the problem? Then we can have a look at it

marcusschiesser · 2024-05-07T02:03:10Z

packages/core/src/readers/type.ts

+  loadData(filePath: string, fs?: CompleteFileSystem): Promise<Document[]>;
+  loadData(filePaths: string[], fs?: CompleteFileSystem): Promise<Document[][]>;
+  loadData(
+    directoryPath: string,
+    fs?: CompleteFileSystem,
+  ): Promise<Document[][]>;


yes, the SimpleDirectoryReader already supports loading multiple files from a directory.

I think we have three concerns:

Walking the filesystem (already done by the SimpleDirectoryReader)

Reading a single file (done by the reader)

Loading files in parallel (new concern in this PR)

If we handle this new concern by a single reader (here LlamaParseReader), then we can only use it with this reader.

How about adding numWorkers to SimpleDirectoryReader instead?

@InsightByAI if you like you can split your PR in two, the first one just adding the missing filetypes

packages/core/src/readers/utils.ts

marcusschiesser · 2024-05-07T02:06:42Z

examples/readers/src/llamaparse_2.ts

  });
-  const documents = await reader.loadData("../data/manga.pdf"); // The manga.pdf in the data folder is just a copy of the TOS, due to copyright laws. You have to place your own. I used "The Manga Guide to Calculus" by Hiroyuki Kojima
+  // Can either accept a single file path an array[] of file paths or a directory path
+  const documents = await reader.loadData("../data/LlamaParseData");


can we just use all the files in the existing data folder?

Suggested change

const documents = await reader.loadData("../data/LlamaParseData");

const documents = await reader.loadData("../data");

we can add new files, but then we need to take care about the license

Yeah sure, could also just split the brk-2022.pdf file into 6 parts and put them into a separate folder to showcase parallel processing if that makes things easier.

KindOfAScam added 3 commits May 2, 2024 22:34

feat: Updated Supported File Types to match Python version

57b3367

feat: added support for multiple filePaths

153af3f

Updated example usage of LlamaParseReader with multiple filePaths

c30575c

vercel bot deployed to Preview May 4, 2024 09:34 View deployment

fix: LlamaParseReader as Standard Reader in SimpleDirectoryReader

cb2b673

vercel bot had a problem deploying to Preview May 4, 2024 14:46 Failure

fix: Second try, BaseReader fixed in IngestionPipeline

3c0f03e

vercel bot deployed to Preview May 4, 2024 15:09 View deployment

feat: add directory reading to LlamaParseReader

434175e

ghost changed the title ~~feat: LlamaParseReader: update supported file types, add support for multiple file Paths + example~~ feat: LlamaParseReader: update supported file types, add support for array of file Paths, add support for directory Paths + example May 5, 2024

vercel bot deployed to Preview May 5, 2024 14:33 View deployment

ghost marked this pull request as draft May 5, 2024 21:54

himself65 reviewed May 5, 2024

View reviewed changes

marcusschiesser reviewed May 7, 2024

View reviewed changes

This was referenced May 8, 2024

feat: LlamaParseReader: update Supported File Types to match python version #823

Merged

feat: add parallel processing to SimpleDirectoryReader #830

Closed

ghost closed this May 10, 2024

KindOfAScam deleted the LlamaParse branch May 26, 2024 11:50

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: LlamaParseReader: update supported file types, add support for array of file Paths, add support for directory Paths + example #808

feat: LlamaParseReader: update supported file types, add support for array of file Paths, add support for directory Paths + example #808

ghost commented May 4, 2024

changeset-bot bot commented May 4, 2024 •

edited

Loading

vercel bot commented May 4, 2024 •

edited

Loading

ghost commented May 5, 2024 •

edited by ghost

Loading

ghost commented May 5, 2024

himself65 May 5, 2024

ghost May 6, 2024

ghost May 6, 2024

marcusschiesser May 7, 2024

ghost May 7, 2024

marcusschiesser May 9, 2024

marcusschiesser May 7, 2024

marcusschiesser May 7, 2024

ghost May 7, 2024

	const documents = await reader.loadData("../data/LlamaParseData");
	const documents = await reader.loadData("../data");

feat: LlamaParseReader: update supported file types, add support for array of file Paths, add support for directory Paths + example #808

feat: LlamaParseReader: update supported file types, add support for array of file Paths, add support for directory Paths + example #808

Conversation

ghost commented May 4, 2024

changeset-bot bot commented May 4, 2024 • edited Loading

⚠️ No Changeset found

vercel bot commented May 4, 2024 • edited Loading

ghost commented May 5, 2024 • edited by ghost Loading

ghost commented May 5, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

changeset-bot bot commented May 4, 2024 •

edited

Loading

vercel bot commented May 4, 2024 •

edited

Loading

ghost commented May 5, 2024 •

edited by ghost

Loading