Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: LlamaParseReader: update supported file types, add support for array of file Paths, add support for directory Paths + example #808

Closed
wants to merge 6 commits into from

Conversation

ghost
Copy link

@ghost ghost commented May 4, 2024

Should be mostly up to date with the python version now.

Main thing still missing is get_images function, will look at that next.

Copy link

changeset-bot bot commented May 4, 2024

⚠️ No Changeset found

Latest commit: 434175e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Copy link

vercel bot commented May 4, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
llama-index-ts-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 5, 2024 2:33pm

@ghost
Copy link
Author

ghost commented May 5, 2024

I tried implementing it more with SimpleDirectoryReader by also defining numWorkers and adding parallel processing, but I had issues with it not respecting the numWorkers from LlamaParseReader, so I added directory reading straight into LlamaParseReader.

If someone wants to check out parallel processing in SimpleDirectoryReader and maybe has an idea how to respect the number of "workers": Here it is on a different branch and the Commit

@ghost ghost changed the title feat: LlamaParseReader: update supported file types, add support for multiple file Paths + example feat: LlamaParseReader: update supported file types, add support for array of file Paths, add support for directory Paths + example May 5, 2024
@ghost ghost marked this pull request as draft May 5, 2024 21:54
@ghost
Copy link
Author

ghost commented May 5, 2024

Something causing a Connect Timeout Error at loadFile and loadData, gonna check that out

Comment on lines +22 to +27
loadData(filePath: string, fs?: CompleteFileSystem): Promise<Document[]>;
loadData(filePaths: string[], fs?: CompleteFileSystem): Promise<Document[][]>;
loadData(
directoryPath: string,
fs?: CompleteFileSystem,
): Promise<Document[][]>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we support directory input?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think directory input is essential in order to take full advantage of the ability to load and send multiple files at once.

But yeah the smarter way would be to integrate it in Simple Directory Reader but I didn't figure out how to not overload the Parser when adding parallel processing to the Directory Reader

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna check it out again and see. I'll probably have some time in 2-3 days.

The Connect Timeout Error seems to be kind of random. Sometimes it works as expected, sometimes it times out, so I'll recheck everything anyway.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the SimpleDirectoryReader already supports loading multiple files from a directory.

I think we have three concerns:

  1. Walking the filesystem (already done by the SimpleDirectoryReader)
  2. Reading a single file (done by the reader)
  3. Loading files in parallel (new concern in this PR)

If we handle this new concern by a single reader (here LlamaParseReader), then we can only use it with this reader.

How about adding numWorkers to SimpleDirectoryReader instead?

@InsightByAI if you like you can split your PR in two, the first one just adding the missing filetypes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marcusschiesser The issue for me with adding numWorkers to SimpleDirectoryReader was that it would load in batches just fine, but it would load the next batch right after finishing the 1st without waiting for the parse results.

I think the options would be to either:

  1. have SimpleDirectoryReader wait for a finish signal from the Parser
  2. have SimpleDirectoryReader pass on a list of all filePaths and let the parallel loading be handled by the Parser

I concentrated only on LlamaParse because I think the requirements for numWorkers in most other readers is quite different, since in LlamaParse the "workers" kind of refers to the max amount of files the LlamaParse API can accept at once, which is limited to 10.

If I have powerful enough hardware, I could probably use a 100 or more "workers" to load a 100 files parallel with PDFReader.
Not sure if a numWorkers that limits one reader to 10 workers and has to wait for results from an API endpoint, while allowing another one to use unlimited workers and only be limited by hardware speed would be easy to implement/work that well?

I'm going to check how python implements this. The Llama-Parse standalone can accept arrays of files but not a directory afaik, so I'll check how that works with an equivalent llama-index python directory Reader.

I'll split the PR later. Thanks for the elaborate review!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you re-add numWorkers to SimpleDirectorReader and add a test case for the problem? Then we can have a look at it

Comment on lines +22 to +27
loadData(filePath: string, fs?: CompleteFileSystem): Promise<Document[]>;
loadData(filePaths: string[], fs?: CompleteFileSystem): Promise<Document[][]>;
loadData(
directoryPath: string,
fs?: CompleteFileSystem,
): Promise<Document[][]>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the SimpleDirectoryReader already supports loading multiple files from a directory.

I think we have three concerns:

  1. Walking the filesystem (already done by the SimpleDirectoryReader)
  2. Reading a single file (done by the reader)
  3. Loading files in parallel (new concern in this PR)

If we handle this new concern by a single reader (here LlamaParseReader), then we can only use it with this reader.

How about adding numWorkers to SimpleDirectoryReader instead?

@InsightByAI if you like you can split your PR in two, the first one just adding the missing filetypes

packages/core/src/readers/utils.ts Show resolved Hide resolved
packages/core/src/readers/utils.ts Show resolved Hide resolved
});
const documents = await reader.loadData("../data/manga.pdf"); // The manga.pdf in the data folder is just a copy of the TOS, due to copyright laws. You have to place your own. I used "The Manga Guide to Calculus" by Hiroyuki Kojima
// Can either accept a single file path an array[] of file paths or a directory path
const documents = await reader.loadData("../data/LlamaParseData");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just use all the files in the existing data folder?

Suggested change
const documents = await reader.loadData("../data/LlamaParseData");
const documents = await reader.loadData("../data");

we can add new files, but then we need to take care about the license

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sure, could also just split the brk-2022.pdf file into 6 parts and put them into a separate folder to showcase parallel processing if that makes things easier.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants