How to get where a Table occurs in the document relative to Lines #162
Comments
Hi @sam-goodwin, are you referring to Python, JS or .NET?
JavaScript. I ended up doing it by sorting by the bounding box. It'd be nice if there were a way to traverse all the elements in the document in reading order, not just lines. But I understand that may not be easy to generalize in a way that works for all documents.
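For anyone landing here, a minimal sketch of the bounding-box sort described above. It assumes the raw Textract JSON shape, where each block carries `Geometry.BoundingBox` with `Top` and `Left` as page-relative ratios. The `Block` interface is a trimmed-down stand-in and `sortByPosition` is a hypothetical helper, not part of TRP.js. The sample blocks are made up for illustration.

```typescript
// Simplified stand-in for a Textract block (assumption: only the fields
// used in this sketch are declared).
interface BoundingBox {
  Top: number;
  Left: number;
  Width: number;
  Height: number;
}

interface Block {
  BlockType: string;
  Text?: string;
  Geometry: { BoundingBox: BoundingBox };
}

// Hypothetical helper: keep LINE and TABLE blocks, ordered top-to-bottom,
// then left-to-right for blocks that share the same Top value.
function sortByPosition(blocks: Block[]): Block[] {
  return blocks
    .filter((b) => b.BlockType === "LINE" || b.BlockType === "TABLE")
    .sort((a, b) => {
      const boxA = a.Geometry.BoundingBox;
      const boxB = b.Geometry.BoundingBox;
      return boxA.Top - boxB.Top || boxA.Left - boxB.Left;
    });
}

// Made-up sample: a table sitting between two lines, deliberately unordered.
const sampleBlocks: Block[] = [
  {
    BlockType: "LINE",
    Text: "Another Line",
    Geometry: { BoundingBox: { Top: 0.6, Left: 0.1, Width: 0.3, Height: 0.05 } },
  },
  {
    BlockType: "TABLE",
    Geometry: { BoundingBox: { Top: 0.3, Left: 0.1, Width: 0.8, Height: 0.2 } },
  },
  {
    BlockType: "LINE",
    Text: "Key: Value",
    Geometry: { BoundingBox: { Top: 0.1, Left: 0.1, Width: 0.3, Height: 0.05 } },
  },
];

// Label blocks without text (such as tables) by their block type.
const ordered = sortByPosition(sampleBlocks).map(
  (b) => b.Text ?? `<${b.BlockType}>`
);
console.log(ordered);
```

Two caveats with this approach: a plain top-then-left sort will misorder multi-column pages, and lines that sit inside a table still show up as separate `LINE` blocks unless they are filtered out by checking whether their box falls within the table's box.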
Understand that this could be useful, but agree it may be difficult to generalise... Since Textract's new native layout analysis feature may give a more generalisable basis than our current pseudo-paragraph heuristic, I'll probably suggest parking this until we've tackled #164.
With today's release of amazon-textract-response-parser v0.4.0, users can run the source document through Amazon Textract with Layout analysis enabled and then use TRP.js to loop through the content elements, which are returned in estimated reading order and can map to

```typescript
import { ApiBlockType, LayoutKeyValue, LayoutTable } from "amazon-textract-response-parser";

// layout.listItems() are implicitly in human reading order:
page.layout.listItems().forEach((layItem) => {
  if (layItem.blockType === ApiBlockType.LayoutKeyValue) {
    // *Usually* multiple K-V fields per LayoutKeyValue block:
    const fields = (layItem as LayoutKeyValue).listFields();
    fields.forEach((field) => console.log(field.key.text));
  } else if (layItem.blockType === ApiBlockType.LayoutTable) {
    // *Usually* just one table per LayoutTable block:
    const tables = (layItem as LayoutTable).listTables();
    tables.forEach((table) => console.log(table.nCells));
  } else {
    // Other items e.g. title, section header, paragraph, etc
    layItem.listTextLines().forEach((line) => console.log(line.text));
  }
});
```

I tentatively believe this should support the original use-case of linking from human reading order to not just
@sam-goodwin (or others), it'd be great to hear if this method already solves your needs. We could consider exposing some kind of I probably wouldn't want to dive in to a big project extending the heuristic
Take, for example, the below table. It occurs after "Key: Value" and before "Another Line". I'd like to be able to process
I'd like to be able to iterate through a page and see the following items in order:
Is this possible?