-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add parallel processing to SimpleDirectoryReader #830
Conversation
🦋 Changeset detectedLatest commit: c8ddbc2 The changes in this PR will be included in the next version bump. This PR includes changesets to release 9 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
); | ||
|
||
const results = await Promise.all(workerPromises); | ||
docs.push(...results.flat()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs.push(...results.flat()); | |
docs.push(...results.flat()); |
use Array.prototype.concat
, push will throw error when results larger than 1000 items
@@ -46,19 +47,53 @@ export class SimpleDirectoryReader implements BaseReader { | |||
const { | |||
directoryPath, | |||
fs = defaultFS, | |||
defaultReader = new TextFileReader(), | |||
defaultReader, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't change default reader
if (fileExtToReader && fileExt in fileExtToReader) { | ||
reader = fileExtToReader[fileExt]; | ||
} else if (defaultReader != null) { | ||
if (defaultReader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert code - defaultReader is supposed to be the fallback to fileExtToReader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, misinterpreted the code then.
In that case, suggestion: add a new param to overwrite all fileExtToReader
if defined.
Could become more important with the addition of more external readers that are capable of reading multiple formats.
} = params; | ||
|
||
// Check if LlamaParseReader is used as the defaultReader and if so checks if numWorkers is in the valid range | ||
if (defaultReader instanceof LlamaParseReader && numWorkers > 9) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't work with setting LlamaParseReader in fileExtToReader -
suggestion: generally don't allow more than 9 workers
filePathQueue.push(filePath); | ||
} | ||
|
||
const workerPromises = Array.from({ length: numWorkers }, () => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this handling queue.length > numWorkers? maybe using something like p-limit helps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is, I thought about using p-limit, I just always try to avoid adding packages if possible. But will check it out and test around. Thanks for the review!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MightBeAScam looks great! just a few notes and it needs to be rebased
let reader: BaseReader; | ||
const limit = pLimit(numWorkers); | ||
const workerPromises = filePathQueue.map((filePath) => | ||
limit(() => this.processFiles(filePath, processFilesParams)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't it processing just one file?
limit(() => this.processFiles(filePath, processFilesParams)), | |
limit(() => this.processFile(filePath, processFilesParams)), |
examples/readers/package.json
Outdated
@@ -11,7 +11,7 @@ | |||
"start:pdf": "node --import tsx ./src/pdf.ts", | |||
"start:llamaparse": "node --import tsx ./src/llamaparse.ts", | |||
"start:notion": "node --import tsx ./src/notion.ts", | |||
"start:llamaparse2": "node --import tsx ./src/llamaparse_2.ts" | |||
"start:llamaparse2": "node --import tsx ./src/simple-directory-reader-with-llamaparse.ts" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"start:llamaparse2": "node --import tsx ./src/simple-directory-reader-with-llamaparse.ts" | |
"start:llamaparse-dir": "node --import tsx ./src/simple-directory-reader-with-llamaparse.ts" |
.changeset/stale-dots-search.md
Outdated
"llamaindex": patch | ||
--- | ||
|
||
add p-limit for enhanced concurrency managment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add p-limit for enhanced concurrency managment | |
add concurrency managment for SimpleDirectoryReader |
@marcusschiesser fixed the suggested changes and rebased the branch. Seems like I did something wrong, is there a way to check the vercel for git logs if the deploy is not in my account? Not really sure what could be the issue and where to start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This account has been deleted, closing
@himself65 sorry my bad, realized 2 accounts are stupid, so I deleted the newer one.. But I messed up rebasing anyway, so let me try again and I'll open a new PR |
Second part of #808
Reworked everything, a lot cleaner and simpler now and implemented straight into SimpleDirectoryReader.
Did not overload LlamaParse when testing this time.
The data in the example is now just the brk-2022.pdf split in 6 parts.