Add New Nonrectangular Citation #109

Open
wants to merge 8 commits into
base: main

mindlessroman

co-authored with @nexilus18 @phongcao

This PR uses the cursor points on mouse down/mouse up to detect specific words, then recreates the citation from the detected words in reading order. This approach is a departure from using the detected rectangles in PDFjs and composing the readable excerpts from those.
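
To illustrate the idea, here is a minimal sketch of mapping the two cursor points to words and joining everything between them. It is not the code in this PR: the `Word` shape, the flat `polygon` layout, and the function names are assumptions for illustration, and it assumes the word list is already in reading order and shares a coordinate space with the cursor points.

```typescript
// Hypothetical shapes, loosely modeled on a document-analysis word list.
interface Point {
  x: number;
  y: number;
}

interface Word {
  content: string;
  // Flat [x1, y1, x2, y2, ...] polygon in the same coordinate space as the cursor.
  polygon: number[];
}

// Axis-aligned bounding box of a flat polygon.
function bounds(polygon: number[]) {
  const xs = polygon.filter((_, i) => i % 2 === 0);
  const ys = polygon.filter((_, i) => i % 2 === 1);
  return {
    minX: Math.min(...xs),
    maxX: Math.max(...xs),
    minY: Math.min(...ys),
    maxY: Math.max(...ys),
  };
}

// Squared distance from a point to a word's bounding box (0 when inside it).
function distanceSq(p: Point, word: Word): number {
  const b = bounds(word.polygon);
  const dx = Math.max(b.minX - p.x, 0, p.x - b.maxX);
  const dy = Math.max(b.minY - p.y, 0, p.y - b.maxY);
  return dx * dx + dy * dy;
}

// Index of the word closest to the point, or -1 for an empty list.
function nearestWordIndex(p: Point, words: Word[]): number {
  let best = -1;
  let bestDist = Infinity;
  words.forEach((w, i) => {
    const d = distanceSq(p, w);
    if (d < bestDist) {
      best = i;
      bestDist = d;
    }
  });
  return best;
}

// Join every word between the mouse-down and mouse-up words, in list order.
function excerptBetween(down: Point, up: Point, words: Word[]): string {
  if (words.length === 0) return "";
  const a = nearestWordIndex(down, words);
  const b = nearestWordIndex(up, words);
  const [start, end] = a <= b ? [a, b] : [b, a];
  return words
    .slice(start, end + 1)
    .map((w) => w.content)
    .join(" ");
}
```

Because the excerpt is taken from the word list rather than from rendered rectangles, dragging backwards (mouse up before mouse down in reading order) yields the same text as dragging forwards.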

The document analysis JSON files are updated to include the "content" property. Most tests have been updated to account for the slight value changes in detected xy-coordinates.

Some irregular shapes that now highlight as requested in #105:

| Description | Before | After |
| --- | --- | --- |
| two lines that aren't contiguous | image | image |
| non-rectangular selection, single paragraph | image | image |
| non-rectangular selection, multiple paragraphs, paragraph indent separator | image | image |
| non-rectangular selection, multiple paragraphs, line spacing separator | image | image |

Note

Edge case that this doesn't account for: cross-page citations. If the page has a footnote or other text that's not in the expected reading order, it could be included in the citation even if that's not the user's intention. We haven't been able to test this from the UI because the UI only shows one page at a time.

For example:

Before After
image image
PDFjs (presumably) renders the selected text starting with "They are used..." and ending with "...Google Docs and Zimbra Col-", so that it includes the copyright footnote. Document Intelligence is sensitive enough to know that the copyright does not fall at that point in reading order. In this case, "They are used..." is in paragraph ID 11 and "...Google Docs and Zimbra Col-" is in paragraph ID 12; the copyright is technically paragraph 17. What is rendered in the GUI once you add the citation are the two paragraphs as they are read in order, 11 and 12. This is what we expect!

However, suppose there were a highlight across two pages, say from the bottom of the same first page onto the next page:

Before After
image image ?????
The citation for that span of text will include the copyright because it starts with paragraph 16 and technically ends on paragraph 20, and the copyright (paragraph 17) falls in between. The rendering hasn't been tested for this PR.
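
The two cases above can be sketched as a range check over paragraph IDs. This is an illustration under assumptions, not the PR's implementation: the `Paragraph` shape, the `paragraphsInRange` helper, and the placeholder `content` strings are hypothetical; only the IDs 11, 12, 16, 17, and 20 come from the example.

```typescript
// Hypothetical paragraph shape; IDs follow the analysis's reading order.
interface Paragraph {
  id: number;
  content: string;
}

// Keep every paragraph whose ID lies between the start and end IDs, inclusive.
function paragraphsInRange(
  paragraphs: Paragraph[],
  startId: number,
  endId: number
): Paragraph[] {
  const [lo, hi] = startId <= endId ? [startId, endId] : [endId, startId];
  return paragraphs.filter((p) => p.id >= lo && p.id <= hi);
}

// Placeholder data: paragraphs 11 through 20, with the copyright notice at 17.
const sampleParagraphs: Paragraph[] = Array.from({ length: 10 }, (_, i) => {
  const id = 11 + i;
  return { id, content: id === 17 ? "(copyright notice)" : `(paragraph ${id})` };
});

// Same-page selection: paragraphs 11 and 12 only, so the notice is excluded.
const samePage = paragraphsInRange(sampleParagraphs, 11, 12);

// Cross-page selection: 16 through 20 sweeps in paragraph 17, the notice.
const crossPage = paragraphsInRange(sampleParagraphs, 16, 20);
```

A pure ID range has no way to tell that paragraph 17 is a footnote rather than body text, which is why the cross-page case picks it up.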

The "content" property in the DI analysis includes the copyright inline (some newlines added in this copy-paste for readability):

Each compiled trace covers one path through the program with one mapping of values to types. When the VM executes a
compiled trace, it cannot guarantee that the same path will be followed or that the same types will occur in subsequent loop
iterations.\nPermission to make digital or hard copies of all or part of this work for personal or classroom use is granted 
without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this 
notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, 
requires prior specific permission and/or a fee.\nPLDI'09, June 15-20, 2009, Dublin, Ireland.\nCopyright @ 2009 ACM 
978-1-60558-392-1/09/06 ... $5.00\nHence, recording and compiling a trace speculates that the path

Humans wouldn't read it as such, but we may be at the mercy of DI's results with this one.
