Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add parallel processing to SimpleDirectoryReader #830

Closed
wants to merge 15 commits into from

Conversation

ghost
Copy link

@ghost ghost commented May 10, 2024

Second part of #808

Reworked everything, a lot cleaner and simpler now and implemented straight into SimpleDirectoryReader.

Did not overload LlamaParse when testing this time.
The data in the example is now just the brk-2022.pdf split in 6 parts.

Copy link

changeset-bot bot commented May 10, 2024

🦋 Changeset detected

Latest commit: c8ddbc2

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 9 packages
Name Type
llamaindex Patch
docs Patch
@llamaindex/experimental Patch
@llamaindex/cloudflare-worker-agent-test Patch
@llamaindex/next-agent-test Patch
@llamaindex/nextjs-edge-runtime-test Patch
@llamaindex/waku-query-engine-test Patch
@llamaindex/autotool-01-node-example Patch
@llamaindex/autotool-02-next-example Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

Copy link

vercel bot commented May 10, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
llama-index-ts-docs ❌ Failed (Inspect) May 22, 2024 6:14pm

);

const results = await Promise.all(workerPromises);
docs.push(...results.flat());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
docs.push(...results.flat());
docs.push(...results.flat());

use Array.prototype.concat, push will throw error when results larger than 1000 items

@@ -46,19 +47,53 @@ export class SimpleDirectoryReader implements BaseReader {
const {
directoryPath,
fs = defaultFS,
defaultReader = new TextFileReader(),
defaultReader,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't change default reader

if (fileExtToReader && fileExt in fileExtToReader) {
reader = fileExtToReader[fileExt];
} else if (defaultReader != null) {
if (defaultReader) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert code - defaultReader is supposed to be the fallback to fileExtToReader

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, misinterpreted the code then.
In that case, suggestion: add a new param to overwrite all fileExtToReader if defined.

Could become more important with the addition of more external readers that are capable of reading multiple formats.

} = params;

// Check if LlamaParseReader is used as the defaultReader and if so checks if numWorkers is in the valid range
if (defaultReader instanceof LlamaParseReader && numWorkers > 9) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't work with setting LlamaParseReader in fileExtToReader -
suggestion: generally don't allow more than 9 workers

filePathQueue.push(filePath);
}

const workerPromises = Array.from({ length: numWorkers }, () =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this handling queue.length > numWorkers? maybe using something like p-limit helps?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is, I thought about using p-limit, I just always try to avoid adding packages if possible. But will check it out and test around. Thanks for the review!

@ghost ghost marked this pull request as draft May 14, 2024 16:51
@ghost ghost marked this pull request as ready for review May 15, 2024 18:26
@ghost ghost requested a review from marcusschiesser May 15, 2024 21:07
Copy link
Collaborator

@marcusschiesser marcusschiesser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MightBeAScam looks great! just a few notes and it needs to be rebased

let reader: BaseReader;
const limit = pLimit(numWorkers);
const workerPromises = filePathQueue.map((filePath) =>
limit(() => this.processFiles(filePath, processFilesParams)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't it processing just one file?

Suggested change
limit(() => this.processFiles(filePath, processFilesParams)),
limit(() => this.processFile(filePath, processFilesParams)),

@@ -11,7 +11,7 @@
"start:pdf": "node --import tsx ./src/pdf.ts",
"start:llamaparse": "node --import tsx ./src/llamaparse.ts",
"start:notion": "node --import tsx ./src/notion.ts",
"start:llamaparse2": "node --import tsx ./src/llamaparse_2.ts"
"start:llamaparse2": "node --import tsx ./src/simple-directory-reader-with-llamaparse.ts"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"start:llamaparse2": "node --import tsx ./src/simple-directory-reader-with-llamaparse.ts"
"start:llamaparse-dir": "node --import tsx ./src/simple-directory-reader-with-llamaparse.ts"

"llamaindex": patch
---

add p-limit for enhanced concurrency managment
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
add p-limit for enhanced concurrency managment
add concurrency managment for SimpleDirectoryReader

@KindOfAScam
Copy link
Contributor

@marcusschiesser fixed the suggested changes and rebased the branch. Seems like I did something wrong, is there a way to check the vercel for git logs if the deploy is not in my account? Not really sure what could be the issue and where to start

Copy link
Member

@himself65 himself65 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This account has been deleted, closing

@himself65 himself65 closed this May 23, 2024
@himself65
Copy link
Member

image

@KindOfAScam
Copy link
Contributor

@himself65 sorry my bad, realized 2 accounts are stupid, so I deleted the newer one..

But I messed up rebasing anyway, so let me try again and I'll open a new PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants