-
Notifications
You must be signed in to change notification settings - Fork 376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: LlamaParseReader: update supported file types, add support for array of file Paths, add support for directory Paths + example #808
Conversation
|
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
I tried implementing it more with SimpleDirectoryReader by also defining numWorkers and adding parallel processing, but I had issues with it not respecting the numWorkers from LlamaParseReader, so I added directory reading straight into LlamaParseReader. If someone wants to check out parallel processing in SimpleDirectoryReader and maybe has an idea how to respect the number of "workers": Here it is on a different branch and the Commit |
Something causing a Connect Timeout Error at loadFile and loadData, gonna check that out |
loadData(filePath: string, fs?: CompleteFileSystem): Promise<Document[]>; | ||
loadData(filePaths: string[], fs?: CompleteFileSystem): Promise<Document[][]>; | ||
loadData( | ||
directoryPath: string, | ||
fs?: CompleteFileSystem, | ||
): Promise<Document[][]>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we support directory input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think directory input is essential in order to take full advantage of the ability to load and send multiple files at once.
But yeah the smarter way would be to integrate it in Simple Directory Reader but I didn't figure out how to not overload the Parser when adding parallel processing to the Directory Reader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm gonna check it out again and see. I'll probably have some time in 2-3 days.
The Connect Timeout Error seems to be kind of random. Sometimes it works as expected, sometimes it times out, so I'll recheck everything anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the SimpleDirectoryReader already supports loading multiple files from a directory.
I think we have three concerns:
- Walking the filesystem (already done by the SimpleDirectoryReader)
- Reading a single file (done by the reader)
- Loading files in parallel (new concern in this PR)
If we handle this new concern by a single reader (here LlamaParseReader), then we can only use it with this reader.
How about adding numWorkers
to SimpleDirectoryReader instead?
@InsightByAI if you like you can split your PR in two, the first one just adding the missing filetypes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marcusschiesser The issue for me with adding numWorkers
to SimpleDirectoryReader was that it would load in batches just fine, but it would load the next batch right after finishing the 1st without waiting for the parse results.
I think the options would be to either:
- have SimpleDirectoryReader wait for a finish signal from the Parser
- have SimpleDirectoryReader pass on a list of all filePaths and let the parallel loading be handled by the Parser
I concentrated only on LlamaParse because I think the requirements for numWorkers
in most other readers is quite different, since in LlamaParse the "workers" kind of refers to the max amount of files the LlamaParse API can accept at once, which is limited to 10.
If I have powerful enough hardware, I could probably use a 100 or more "workers" to load a 100 files parallel with PDFReader.
Not sure if a numWorkers
that limits one reader to 10 workers and has to wait for results from an API endpoint, while allowing another one to use unlimited workers and only be limited by hardware speed would be easy to implement/work that well?
I'm going to check how python implements this. The Llama-Parse standalone can accept arrays of files but not a directory afaik, so I'll check how that works with an equivalent llama-index python directory Reader.
I'll split the PR later. Thanks for the elaborate review!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you re-add numWorkers
to SimpleDirectorReader and add a test case for the problem? Then we can have a look at it
loadData(filePath: string, fs?: CompleteFileSystem): Promise<Document[]>; | ||
loadData(filePaths: string[], fs?: CompleteFileSystem): Promise<Document[][]>; | ||
loadData( | ||
directoryPath: string, | ||
fs?: CompleteFileSystem, | ||
): Promise<Document[][]>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the SimpleDirectoryReader already supports loading multiple files from a directory.
I think we have three concerns:
- Walking the filesystem (already done by the SimpleDirectoryReader)
- Reading a single file (done by the reader)
- Loading files in parallel (new concern in this PR)
If we handle this new concern by a single reader (here LlamaParseReader), then we can only use it with this reader.
How about adding numWorkers
to SimpleDirectoryReader instead?
@InsightByAI if you like you can split your PR in two, the first one just adding the missing filetypes
}); | ||
const documents = await reader.loadData("../data/manga.pdf"); // The manga.pdf in the data folder is just a copy of the TOS, due to copyright laws. You have to place your own. I used "The Manga Guide to Calculus" by Hiroyuki Kojima | ||
// Can either accept a single file path an array[] of file paths or a directory path | ||
const documents = await reader.loadData("../data/LlamaParseData"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just use all the files in the existing data folder?
const documents = await reader.loadData("../data/LlamaParseData"); | |
const documents = await reader.loadData("../data"); |
we can add new files, but then we need to take care about the license
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sure, could also just split the brk-2022.pdf file into 6 parts and put them into a separate folder to showcase parallel processing if that makes things easier.
Should be mostly up to date with the python version now.
Main thing still missing is get_images function, will look at that next.