Commit

feat: add page number to read PDFs (#815)
marcusschiesser authored May 7, 2024
1 parent 645fcf6 commit ce94780
Showing 2 changed files with 12 additions and 4 deletions.
5 changes: 5 additions & 0 deletions .changeset/ninety-doors-impress.md
@@ -0,0 +1,5 @@
+---
+"llamaindex": patch
+---
+
+Add page number to read PDFs and use generated IDs for PDF and markdown content
11 changes: 7 additions & 4 deletions packages/core/src/readers/PDFReader.ts
@@ -12,10 +12,13 @@ export class PDFReader implements BaseReader {
     fs: GenericFileSystem = defaultFS,
   ): Promise<Document[]> {
     const content = await fs.readRawFile(file);
-    const text = await readPDF(content);
-    return text.map((text, page) => {
-      const id_ = `${file}_${page}`;
-      return new Document({ text, id_ });
+    const pages = await readPDF(content);
+    return pages.map((text, page) => {
+      const id_ = `${file}_${page + 1}`;
+      const metadata = {
+        page_number: page + 1,
+      };
+      return new Document({ text, id_, metadata });
     });
   }
 }
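
The change above shifts from 0-based array indices to 1-based page numbers, both in the document ID and in the new `page_number` metadata field. A minimal, self-contained sketch of that mapping follows. The `PageDocument` interface and the `pagesToDocuments` helper are hypothetical stand-ins for illustration only, not part of the library, and `report.pdf` is a made-up file name.

```typescript
// Hypothetical stand-in for the library's Document class, so the sketch runs on its own.
interface PageDocument {
  id_: string;
  text: string;
  metadata: { page_number: number };
}

// Mirrors the mapping in the diff: the array index is 0-based,
// while the ID suffix and page_number are 1-based.
function pagesToDocuments(file: string, pages: string[]): PageDocument[] {
  return pages.map((text, page) => ({
    id_: `${file}_${page + 1}`,
    text,
    metadata: { page_number: page + 1 },
  }));
}

// Illustrative input: two pages of already extracted text.
const docs = pagesToDocuments("report.pdf", ["first page", "second page"]);
for (const doc of docs) {
  console.log(`${doc.id_}: page ${doc.metadata.page_number}`);
}
```

Under these assumptions, the first page gets the ID suffix `_1` rather than `_0`, so the ID and the metadata agree with the page numbers a reader would see in a PDF viewer.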