How to get where a Table occurs in the document relative to Lines #162

Open
sam-goodwin opened this issue Sep 11, 2023 · 4 comments
Labels
enhancement New feature or request javascript Relates to the JavaScript/TypeScript version of TRP

Comments

@sam-goodwin

sam-goodwin commented Sep 11, 2023

Take, for example, the table below. It occurs after "Key: Value" and before "Another Line", and I'd like to be able to process it in that position.

Some Line
Key: Value

| Some Header  | Another Header |
| - | - |
| Some Value | Another Value |

Another Line

I'd like to be able to iterate through a page and see the following items in order:

  1. Line ("Some Line")
  2. KeyValue ("Key", "Value")
  3. Table
  4. Line ("Another Line")

Is this possible?

@schadem
Contributor

schadem commented Sep 11, 2023

Hi @sam-goodwin , are you referring to Python, JS or .NET?

@sam-goodwin
Author

JavaScript. I ended up doing it by sorting by the bounding box. It'd be nice if there were a way to traverse all the elements in the document in reading order, not just lines. But I understand that may not be easy to generalize in a way that works for all documents.
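
For readers wanting to try the bounding-box workaround mentioned above, here is a minimal sketch. It is not the code actually used in this thread: the `Positioned` interface and `sortByPosition` helper are hypothetical, and it assumes each item has already been reduced to normalized `top`/`left` page coordinates (as Textract bounding boxes provide). It only gives sensible results on simple single-column pages.

```typescript
// Hypothetical minimal shape: anything with a normalized (0..1) bounding-box origin.
interface Positioned {
  kind: "LINE" | "KEY_VALUE" | "TABLE";
  label: string;
  top: number;
  left: number;
}

// Sort a copy of the items top-to-bottom, breaking exact ties left-to-right.
// Assumption: single-column layout, so vertical position approximates reading order.
function sortByPosition<T extends Positioned>(items: T[]): T[] {
  return [...items].sort((a, b) => a.top - b.top || a.left - b.left);
}

// Example data mirroring the page described in the issue (coordinates invented).
const pageItems: Positioned[] = [
  { kind: "TABLE", label: "Table", top: 0.22, left: 0.1 },
  { kind: "LINE", label: "Another Line", top: 0.4, left: 0.1 },
  { kind: "LINE", label: "Some Line", top: 0.1, left: 0.1 },
  { kind: "KEY_VALUE", label: "Key: Value", top: 0.15, left: 0.1 },
];

const orderedLabels = sortByPosition(pageItems).map((item) => item.label);
console.log(orderedLabels.join(" > "));
```

Multi-column or heavily skewed pages would need something smarter than a plain top/left sort, which is the generalization problem raised above.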

@athewsey athewsey added enhancement New feature or request javascript Relates to the JavaScript/TypeScript version of TRP labels Oct 2, 2023
@athewsey
Contributor

athewsey commented Oct 2, 2023

Understand that this could be useful but agree it may be difficult to generalise... Since Textract's new native layout analysis feature may help give a more generalisable basis than our current pseudo-paragraph heuristic, I'll probably suggest to park this until we've tackled #164

@athewsey
Contributor

athewsey commented Feb 6, 2024

With today's release of amazon-textract-response-parser v0.4.0, users can run the source document through Amazon Textract with Layout analysis enabled and then use TRP.js to loop through the content elements which are returned in estimated reading order and can map to FORMS and TABLES items... Something like:

import { ApiBlockType, LayoutKeyValue, LayoutTable } from "amazon-textract-response-parser";

// layout.listItems() are implicitly in human reading order:
page.layout.listItems().forEach((layItem) => {
  if (layItem.blockType === ApiBlockType.LayoutKeyValue) {
    // *Usually* multiple K-V fields per LayoutKeyValue block:
    const fields = (layItem as LayoutKeyValue).listFields();
    fields.forEach((field) => console.log(field.key.text));
  } else if (layItem.blockType === ApiBlockType.LayoutTable) {
    // *Usually* just one table per LayoutTable block:
    const tables = (layItem as LayoutTable).listTables();
    tables.forEach((table) => console.log(table.nCells));
  } else {
    // Other items e.g. title, section header, paragraph, etc
    layItem.listTextLines().forEach((line) => console.log(line.text));
  }
});

I tentatively believe this should support the original use-case of linking from human reading order to not just LINEs of text but also other analyses' results - subject to the caveats that:

  1. It depends on the Textract-side LAYOUT analysis feature being enabled, and
  2. Due to the nature of the Layout analysis feature, there's no guarantee of an exact 1-to-1 correspondence from text LINEs in the LayoutTable to text in the Table.
    • A simple approach could be to only use layoutKeyValue.listFields() and layoutTable.listTables() and not pay any attention to what text LINEs those blocks contain.
    • Today we work around this in our LayoutTableGeneric.html() method by scanning through the LayoutTable's linked LINEs and WORDs, inserting the full representation of the TABLE wherever we first see an overlap, and then continuing the LINEs scan but omitting any content that's already been rendered.
    • ...But didn't expose this logic yet because it seemed a bit early/experimental.
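
For caveat 1, the sketch below shows request parameters with Layout analysis turned on alongside FORMS and TABLES. It assumes the synchronous AnalyzeDocument API and the parameter shape accepted by `AnalyzeDocumentCommand` in the AWS SDK for JavaScript v3; the bucket and object names are placeholders, and the actual SDK call is left as a comment so the snippet stands alone.

```typescript
// Request parameters for Textract AnalyzeDocument (shape assumed from the AWS SDK v3).
// "LAYOUT" enables the Layout analysis that page.layout.listItems() relies on;
// "FORMS" and "TABLES" are needed for listFields() / listTables() to return anything.
const analyzeParams = {
  Document: {
    // Placeholder location: replace with your own bucket and key.
    S3Object: { Bucket: "my-example-bucket", Name: "docs/example.png" },
  },
  FeatureTypes: ["LAYOUT", "FORMS", "TABLES"],
};

// With the SDK installed, this would be sent roughly as:
//   const response = await client.send(new AnalyzeDocumentCommand(analyzeParams));
// and the response JSON passed to TRP.js for parsing.
console.log(JSON.stringify(analyzeParams.FeatureTypes));
```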

@sam-goodwin (or others) it'd be great to hear if this method already solves your needs?

We could consider exposing some kind of LayoutTable.listXYZ() API that tries to do the .html()-like reconciliation and return a linear list of {TABLE, LINE, and/or WORD}? I'm just nervous about edge cases e.g. I think in some example docs I've even seen TABLEs that overlap with each other.
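
To make the idea concrete, here is a rough sketch of what such a reconciliation could look like. It is not the logic inside `.html()`: the `LineLike`, `TableLike` and `linearize` names are hypothetical, it works at LINE level only (no WORD-level overlap), and when tables overlap it simply assigns a line to the first matching table.

```typescript
// Hypothetical, simplified stand-ins for TRP.js objects.
interface LineLike {
  id: string;
  text: string;
}
interface TableLike {
  id: string;
  lineIds: Set<string>; // IDs of LINEs whose content falls inside this table
}
type LinearItem =
  | { type: "LINE"; line: LineLike }
  | { type: "TABLE"; table: TableLike };

// Walk the lines in reading order. The first time a line belongs to a table,
// emit the whole table; skip later lines that the same table already covers.
function linearize(lines: LineLike[], tables: TableLike[]): LinearItem[] {
  const result: LinearItem[] = [];
  const emittedTables = new Set<string>();
  for (const line of lines) {
    // Overlapping tables: the first match wins in this sketch.
    const owner = tables.find((table) => table.lineIds.has(line.id));
    if (!owner) {
      result.push({ type: "LINE", line });
    } else if (!emittedTables.has(owner.id)) {
      emittedTables.add(owner.id);
      result.push({ type: "TABLE", table: owner });
    }
  }
  return result;
}

// Example: five lines, of which l3 and l4 sit inside a table.
const sampleLines: LineLike[] = [
  { id: "l1", text: "Some Line" },
  { id: "l2", text: "Key: Value" },
  { id: "l3", text: "Some Header Another Header" },
  { id: "l4", text: "Some Value Another Value" },
  { id: "l5", text: "Another Line" },
];
const sampleTables: TableLike[] = [{ id: "t1", lineIds: new Set(["l3", "l4"]) }];

const linearSummary = linearize(sampleLines, sampleTables).map((item) =>
  item.type === "LINE" ? item.line.text : "TABLE " + item.table.id
);
console.log(linearSummary);
```

A real implementation would also have to decide what to do with lines that only partly overlap a table, which is where the edge cases mentioned above come in.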

I probably wouldn't want to dive into a big project extending the heuristic getLineClustersInReadingOrder to also account for tables/forms when they're present but Layout wasn't enabled, because Layout should be the canonical source for reading order information on non-trivial docs: it's AI-powered and should usually perform much better than our TRP-side heuristics.
