feat: LlamaParseReader: update supported file types, add support for array of file Paths, add support for directory Paths + example #808
Changes from all commits
57b3367
153af3f
c30575c
cb2b673
3c0f03e
434175e
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,26 +1,31 @@ | ||
import fs from "fs/promises"; | ||
import { LlamaParseReader } from "llamaindex"; | ||
import { LlamaParseReader, VectorStoreIndex } from "llamaindex"; | ||
|
||
async function main() { | ||
// Load PDF using LlamaParse. Set apiKey here or in the environment variable LLAMA_CLOUD_API_KEY | ||
const reader = new LlamaParseReader({ | ||
resultType: "markdown", | ||
language: "en", | ||
numWorkers: 3, // Load files in batches of 3 | ||
parsingInstruction: | ||
"The provided document is a manga comic book. Most pages do NOT have title. It does not contain tables. Try to reconstruct the dialogue happening in a cohesive way. Output any math equation in LATEX markdown (between $$)", | ||
"The provided documents are datasheets and Quick-Installation-Guides for Solplanet's Ai-LB series of batteries. They contain tables and graphics. There is also a lot of technical information. The goal is to extract and structure the knowledge in a coherent way", | ||
}); | ||
const documents = await reader.loadData("../data/manga.pdf"); // The manga.pdf in the data folder is just a copy of the TOS, due to copyright laws. You have to place your own. I used "The Manga Guide to Calculus" by Hiroyuki Kojima | ||
// Accepts a single file path, an array of file paths, or a directory path | ||
const documents = await reader.loadData("../data/LlamaParseData"); | ||
|
||
// Assuming documents contain an array of pages or sections | ||
const parsedManga = documents.map((page) => page.text).join("\n---\n"); | ||
// Flatten the array of arrays of files | ||
const flatdocuments = documents.flat(); | ||
|
||
// Output the parsed manga to .md file. Will be placed in ../example/readers/ | ||
try { | ||
await fs.writeFile("./parsedManga.md", parsedManga); | ||
console.log("Output successfully written to parsedManga.md"); | ||
} catch (err) { | ||
console.error("Error writing to file:", err); | ||
} | ||
// Split text and create embeddings. Store them in a VectorStoreIndex | ||
const index = await VectorStoreIndex.fromDocuments(flatdocuments); | ||
|
||
// Query the index | ||
const queryEngine = index.asQueryEngine(); | ||
const response = await queryEngine.query({ | ||
query: "Which Batteries can be used in parallel connection?", | ||
}); | ||
|
||
// Output response | ||
console.log(response.toString()); | ||
} | ||
|
||
main().catch(console.error); |
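
With this change `loadData` can return either a flat list (single file) or a list of lists (array of paths or a directory), which is why the example flattens the result before indexing. The following is a minimal sketch of a helper that makes both shapes explicit. `Doc` is a stand-in type and `flattenDocuments` is a hypothetical name; neither is part of llamaindex.

```typescript
// Stand-in for the library's Document class; only the field used here (assumption).
interface Doc {
  text: string;
}

// Hypothetical helper: accepts a flat list (single file) or a list of lists
// (one inner list per file) and always returns a flat list.
function flattenDocuments(result: Doc[] | Doc[][]): Doc[] {
  const flat: Doc[] = [];
  for (const item of result) {
    if (Array.isArray(item)) {
      // One inner list per parsed file: append all of its documents.
      flat.push(...item);
    } else {
      // Already a single document.
      flat.push(item);
    }
  }
  return flat;
}
```

Calling the helper on the result of either call shape yields a flat `Doc[]` that could then be handed to an index builder.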
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,7 +5,7 @@ import type { Document } from "../Node.js"; | |
* A reader imports data into Document objects. | ||
*/ | ||
export interface BaseReader { | ||
loadData(...args: unknown[]): Promise<Document[]>; | ||
loadData(...args: unknown[]): Promise<Document[] | Document[][]>; | ||
} | ||
|
||
/** | ||
|
@@ -15,7 +15,19 @@ export interface FileReader extends BaseReader { | |
loadData(filePath: string, fs?: CompleteFileSystem): Promise<Document[]>; | ||
} | ||
|
||
// For LlamaParseReader.ts | ||
/** | ||
* A reader that takes a single file path, multiple file paths, or a directory path and imports the data into an array of Document objects. | ||
*/ | ||
export interface MultiReader extends BaseReader { | ||
loadData(filePath: string, fs?: CompleteFileSystem): Promise<Document[]>; | ||
loadData(filePaths: string[], fs?: CompleteFileSystem): Promise<Document[][]>; | ||
loadData( | ||
directoryPath: string, | ||
fs?: CompleteFileSystem, | ||
): Promise<Document[][]>; | ||
Comment on lines +22 to +27

Should we support directory input?

I think directory input is essential in order to take full advantage of the ability to load and send multiple files at once. But yeah the smarter way would be to integrate it in Simple Directory Reader but I didn't figure out how to not overload the Parser when adding parallel processing to the Directory Reader

I'm gonna check it out again and see. I'll probably have some time in 2-3 days. The Connect Timeout Error seems to be kind of random. Sometimes it works as expected, sometimes it times out, so I'll recheck everything anyway.

yes, the SimpleDirectoryReader already supports loading multiple files from a directory. I think we have three concerns:
If we handle this new concern by a single reader (here LlamaParseReader), then we can only use it with this reader. How about adding
@InsightByAI if you like you can split your PR in two, the first one just adding the missing filetypes

@marcusschiesser The issue for me with adding I think the options would be to either:
I concentrated only on LlamaParse because I think the requirements for If I have powerful enough hardware, I could probably use a 100 or more "workers" to load a 100 files parallel with PDFReader. I'm going to check how python implements this. The Llama-Parse standalone can accept arrays of files but not a directory afaik, so I'll check how that works with an equivalent llama-index python directory Reader. I'll split the PR later. Thanks for the elaborate review!

Can you re-add |
||
} | ||
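
One point worth noting about the overloads above: TypeScript resolves a call to the first overload whose parameters match, so two signatures that both take a `string` (file path vs. directory path) cannot be told apart by the compiler, and a call with a directory path would be typed as returning a flat list. Below is a minimal sketch, not the PR's implementation, of overloads that differ by parameter type only (`string` vs. `string[]`). `Doc`, `loadAll`, and the placeholder parser are hypothetical.

```typescript
// Stand-in for the library's Document class (assumption).
interface Doc {
  text: string;
}

// Overloads distinguished by parameter type, so the compiler can pick
// the matching return type at each call site.
function loadAll(path: string): Doc[];
function loadAll(paths: string[]): Doc[][];
function loadAll(input: string | string[]): Doc[] | Doc[][] {
  // Placeholder "parser": returns one Doc per path. A real reader would parse the file.
  const parseOne = (p: string): Doc[] => [{ text: `parsed:${p}` }];
  return Array.isArray(input) ? input.map(parseOne) : parseOne(input);
}
```

Under this design a directory would first be expanded into a `string[]` of file paths by the caller, so it reuses the array overload instead of needing a third signature.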
|
||
// For LlamaParseReader | ||
|
||
export type ResultType = "text" | "markdown" | "json"; | ||
export type Language = | ||
|
Can we just use all the files in the existing data folder?
We can add new files, but then we need to take care of the license.

Yeah sure, could also just split the brk-2022.pdf file into 6 parts and put them into a separate folder to showcase parallel processing if that makes things easier.
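
To illustrate the batching idea discussed in this thread (loading several files at once without overloading the parser), here is a minimal sketch of batch-wise parallel processing. `processInBatches` is a hypothetical helper, not a llamaindex API, and it is only an assumption that the reader's `numWorkers` option behaves similarly.

```typescript
// Hypothetical helper: processes items in sequential batches of `batchSize`,
// running the items inside each batch concurrently.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  if (batchSize < 1 || Math.floor(batchSize) !== batchSize) {
    throw new Error("batchSize must be a positive integer");
  }
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // At most `batchSize` worker calls are in flight at any time.
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}
```

With six file parts and a batch size of 3, the worker would run in two rounds of three concurrent calls, and results come back in input order.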