How to get where a Table occurs in the document relative to Lines #162

sam-goodwin · 2023-09-11T06:06:14Z

Take, for example, the table below. It occurs after "Key: Value" and before "Another Line", and I'd like to be able to process it in that position.

Some Line
Key: Value

| Some Header | Another Header |
| - | - |
| Some Value | Another Value |

Another Line

I'd like to be able to iterate through a page and see the following items in order:

Line ("Some Line")
KeyValue ("Key", "Value")
Table
Line ("Another Line")

Is this possible?


schadem · 2023-09-11T12:56:28Z

Hi @sam-goodwin , are you referring to Python, JS or .NET?

sam-goodwin · 2023-09-12T19:29:09Z

JavaScript. I ended up doing it by sorting by bounding box. It'd be nice if there were a way to traverse all the elements in the document in reading order, not just lines. But I understand that may not be easy to generalize in a way that works for all documents.
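
For anyone looking for the same workaround, a minimal sketch of sorting mixed items by bounding box might look like the following. The `Box` and `PageItem` shapes, the `sortByBoundingBox` helper, and the coordinates are all hypothetical stand-ins for illustration, not TRP.js types, and this is not necessarily the exact approach used above.

```typescript
// Hypothetical minimal shapes for illustration; these are NOT TRP.js types.
interface Box {
  top: number; // distance from the top of the page, as a fraction of page height
  left: number; // distance from the left of the page, as a fraction of page width
}

type PageItem =
  | { kind: "line"; text: string; box: Box }
  | { kind: "keyValue"; key: string; value: string; box: Box }
  | { kind: "table"; box: Box };

// Return a copy sorted top-to-bottom, breaking exact ties left-to-right.
function sortByBoundingBox(items: PageItem[]): PageItem[] {
  return [...items].sort(
    (a, b) => a.box.top - b.box.top || a.box.left - b.box.left
  );
}

// Made-up coordinates mirroring the example page from the original question.
const shuffled: PageItem[] = [
  { kind: "table", box: { top: 0.2, left: 0.1 } },
  { kind: "line", text: "Another Line", box: { top: 0.35, left: 0.1 } },
  { kind: "keyValue", key: "Key", value: "Value", box: { top: 0.14, left: 0.1 } },
  { kind: "line", text: "Some Line", box: { top: 0.1, left: 0.1 } },
];

const ordered = sortByBoundingBox(shuffled);
```

A plain top-then-left sort like this only approximates reading order on single-column pages; multi-column or skewed documents would interleave incorrectly, which is the generalization problem mentioned above.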

athewsey · 2023-10-02T12:24:24Z

Understand that this could be useful but agree it may be difficult to generalise... Since Textract's new native layout analysis feature may help give a more generalisable basis than our current pseudo-paragraph heuristic, I'll probably suggest to park this until we've tackled #164

athewsey · 2024-02-06T07:23:49Z

With today's release of amazon-textract-response-parser v0.4.0, users can run the source document through Amazon Textract with Layout analysis enabled and then use TRP.js to loop through the content elements which are returned in estimated reading order and can map to FORMS and TABLES items... Something like:

import { ApiBlockType, LayoutKeyValue, LayoutTable } from "amazon-textract-response-parser";

// layout.listItems() are implicitly in human reading order:
page.layout.listItems().forEach((layItem) => {
  if (layItem.blockType === ApiBlockType.LayoutKeyValue) {
    // *Usually* multiple K-V fields per LayoutKeyValue block:
    const fields = (layItem as LayoutKeyValue).listFields();
    fields.forEach((field) => console.log(field.key.text));
  } else if (layItem.blockType === ApiBlockType.LayoutTable) {
    // *Usually* just one table per LayoutTable block:
    const tables = (layItem as LayoutTable).listTables();
    tables.forEach((table) => console.log(table.nCells));
  } else {
    // Other items e.g. title, section header, paragraph, etc
    layItem.listTextLines().forEach((line) => console.log(line.text));
  }
});

I tentatively believe this should support the original use-case of linking from human reading order to not just LINEs of text but also other analyses' results - subject to the caveats that:

- It depends on the Textract-side LAYOUT analysis feature being enabled, and
- Due to the nature of the Layout analysis feature, there's no guarantee of an exact 1-to-1 correspondence from text LINEs in the LayoutTable to text in the Table.
  - A simple approach could be to only use layoutKeyValue.listFields() and layoutTable.listTables() and not pay any attention to what text LINEs those blocks contain.
  - Today we work around this in our LayoutTableGeneric.html() method by scanning through the LayoutTable's linked LINEs and WORDs, inserting the full representation of the TABLE wherever we first see an overlap, and then continuing the LINEs scan but omitting any content that's already been rendered.
  - ...But didn't expose this logic yet because it seemed a bit early/experimental.
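
On the first caveat, a rough sketch of what enabling Layout on the Textract side could look like is below. The `FeatureTypes` values and the `Document.S3Object` shape follow Textract's AnalyzeDocument request format; the `buildAnalyzeInput` helper, the local type names, and the bucket/key values are hypothetical and only there to keep the example self-contained.

```typescript
// Feature names accepted by the Textract AnalyzeDocument API.
type FeatureType = "TABLES" | "FORMS" | "LAYOUT" | "QUERIES" | "SIGNATURES";

// Local, simplified mirror of the request shape (S3 source only).
interface AnalyzeDocumentInput {
  Document: { S3Object: { Bucket: string; Name: string } };
  FeatureTypes: FeatureType[];
}

// Hypothetical helper: request LAYOUT alongside FORMS and TABLES so that
// layout items can link to key-value fields and tables in the response.
function buildAnalyzeInput(bucket: string, key: string): AnalyzeDocumentInput {
  return {
    Document: { S3Object: { Bucket: bucket, Name: key } },
    FeatureTypes: ["LAYOUT", "FORMS", "TABLES"],
  };
}

// Placeholder bucket and object names.
const analyzeInput = buildAnalyzeInput("example-bucket", "example-doc.png");
```

An object like this would typically be passed to the AWS SDK's `AnalyzeDocumentCommand` (from `@aws-sdk/client-textract`), with the JSON response then loaded into TRP.js as usual.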

@sam-goodwin (or others) it'd be great to hear if this method already solves your needs?

We could consider exposing some kind of LayoutTable.listXYZ() API that tries to do the .html()-like reconciliation and return a linear list of {TABLE, LINE, and/or WORD}? I'm just nervous about edge cases e.g. I think in some example docs I've even seen TABLEs that overlap with each other.
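
To make the idea concrete, here is a simplified sketch of that kind of reconciliation. It works at LINE level only, and every name in it (`Line`, `Table`, `LinearItem`, `linearize`, and the `lineIds` field) is hypothetical; it is not the actual `.html()` logic, which also considers WORDs.

```typescript
// Hypothetical simplified shapes; NOT the TRP.js classes.
interface Line {
  id: string;
  text: string;
}

interface Table {
  id: string;
  lineIds: Set<string>; // IDs of LINEs assumed to overlap this table's cells
}

type LinearItem =
  | { type: "LINE"; line: Line }
  | { type: "TABLE"; table: Table };

// Walk the layout block's LINEs in order. At the first LINE overlapping a
// not-yet-emitted table, emit the whole TABLE instead, then skip any later
// LINEs that table already covers.
function linearize(layoutLines: Line[], tables: Table[]): LinearItem[] {
  const out: LinearItem[] = [];
  const emittedTables = new Set<string>();
  const coveredLines = new Set<string>();
  for (const line of layoutLines) {
    if (coveredLines.has(line.id)) continue;
    const hit = tables.find(
      (t) => !emittedTables.has(t.id) && t.lineIds.has(line.id)
    );
    if (hit) {
      out.push({ type: "TABLE", table: hit });
      emittedTables.add(hit.id);
      hit.lineIds.forEach((id) => coveredLines.add(id));
    } else {
      out.push({ type: "LINE", line });
    }
  }
  return out;
}

// Made-up example: a caption, four LINEs inside one table, then a footnote.
const exampleLines: Line[] = [
  { id: "l1", text: "Caption above" },
  { id: "l2", text: "Some Header" },
  { id: "l3", text: "Another Header" },
  { id: "l4", text: "Some Value" },
  { id: "l5", text: "Another Value" },
  { id: "l6", text: "Footnote" },
];
const exampleTables: Table[] = [
  { id: "t1", lineIds: new Set(["l2", "l3", "l4", "l5"]) },
];
const linear = linearize(exampleLines, exampleTables);
```

Overlapping tables show why this is awkward: in this sketch, a second table whose LINEs are all covered by an earlier table would never be emitted.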

I probably wouldn't want to dive into a big project extending the heuristic getLineClustersInReadingOrder to also account for tables/forms when they're present but Layout wasn't enabled, because Layout should be the canonical source for reading-order information on non-trivial docs: it's AI-powered and should usually perform much better than our TRP-side heuristics.

athewsey added the labels enhancement (New feature or request) and javascript (Relates to the JavaScript/TypeScript version of TRP) on Oct 2, 2023
