co-authored with @nexilus18 @phongcao
This PR uses the cursor points on mouse down/mouse up to detect specific words and recreates a citation from the detected words in reading order. This approach is a departure from using the detected rectangles in PDF.js and composing the readable excerpts from those.
The document analysis JSON files are updated to include the "content" property. Most tests have been updated to account for the slight value changes in detected xy-coordinates.
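To illustrate the word-detection approach described above, here is a minimal sketch, not the code in this PR. It assumes each detected word carries its text and a flattened polygon, and that the words array is already in reading order. The `Word` and `Point` shapes and the function names `nearestWordIndex` and `citationFromPoints` are hypothetical and only serve the example.

```typescript
// Hypothetical, simplified shapes; the real analysis JSON carries more fields.
interface Point { x: number; y: number; }
interface Word {
  content: string;
  // Flattened polygon: [x1, y1, x2, y2, x3, y3, x4, y4]
  polygon: number[];
}

// Squared distance from a point to a word's axis-aligned bounding box
// (0 when the point lies inside the box).
function distanceSq(p: Point, w: Word): number {
  const xs = w.polygon.filter((_, i) => i % 2 === 0);
  const ys = w.polygon.filter((_, i) => i % 2 === 1);
  const dx = Math.max(Math.min(...xs) - p.x, 0, p.x - Math.max(...xs));
  const dy = Math.max(Math.min(...ys) - p.y, 0, p.y - Math.max(...ys));
  return dx * dx + dy * dy;
}

// Index of the word closest to the cursor point, or -1 for an empty page.
function nearestWordIndex(p: Point, words: Word[]): number {
  let best = -1;
  let bestDist = Infinity;
  words.forEach((w, i) => {
    const d = distanceSq(p, w);
    if (d < bestDist) {
      bestDist = d;
      best = i;
    }
  });
  return best;
}

// Build the citation from every word between the mouse-down word and the
// mouse-up word, in reading order (assumed to be the array order).
function citationFromPoints(down: Point, up: Point, words: Word[]): string {
  if (words.length === 0) return "";
  const a = nearestWordIndex(down, words);
  const b = nearestWordIndex(up, words);
  const start = Math.min(a, b);
  const end = Math.max(a, b);
  return words.slice(start, end + 1).map((w) => w.content).join(" ");
}
```

Because the selection is an index range over words rather than a set of rectangles, dragging backwards (mouse up before mouse down in reading order) yields the same citation as dragging forwards.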
Some irregular shapes that now highlight as requested in #105:
Note
Edge case that this doesn't account for: cross-page citations. If the page has a footnote or other text that is not in the expected reading order, it could be included in the citation even if that is not the user's intention. We haven't been able to test this from the UI because the UI only shows one page at a time.
For example, we wouldn't expect a highlight that starts with "They are used..." and ends with "...Google Docs and Zimbra Col-" to include the copyright footnote. Document Intelligence (DI) is sensitive enough to know that the copyright does not fall at that point in the reading order. In this case, "They are used..." is in paragraph ID 11 and "...Google Docs and Zimbra Col-" is in paragraph ID 12; the copyright is technically paragraph 17.
However, suppose there were a highlight across two pages, say from the bottom of that same first page onto the next page:
The "content" property in the DI analysis includes the copyright inline (some newlines added in this copy-paste for readability):
Humans wouldn't read it as such, but we may be at the mercy of DI's results with this one.