Add New Nonrectangular Citation #109

mindlessroman · 2025-02-27T22:20:02Z

This PR uses the cursor points on mouse down/mouse up to detect specific words and recreates a citation from the detected words in reading order. This approach departs from using the rectangles detected by PDFjs and composing the readable excerpts from those.
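To make the idea concrete, here is a minimal sketch of the point-to-word lookup, not the code in this PR. It assumes the words are already listed in reading order and that each word has an axis-aligned bounding box (the real analysis output may use polygons). The names `Word`, `closest_word_index`, and `excerpt_between` are hypothetical. Falling back to the nearest word when a point does not land inside any word is also an assumption.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Word:
    content: str
    # Assumed axis-aligned box in page units: left, top, right, bottom.
    x0: float
    y0: float
    x1: float
    y1: float


def distance_to_word(word, x, y):
    """Distance from a point to the word's box; 0.0 if the point is inside."""
    dx = max(word.x0 - x, 0.0, x - word.x1)
    dy = max(word.y0 - y, 0.0, y - word.y1)
    return hypot(dx, dy)


def closest_word_index(words, x, y):
    """Index of the word whose box is nearest to the point (x, y)."""
    return min(range(len(words)), key=lambda i: distance_to_word(words[i], x, y))


def excerpt_between(words, down, up):
    """Join the words between the mouse-down and mouse-up points, in reading order."""
    start = closest_word_index(words, *down)
    end = closest_word_index(words, *up)
    if start > end:  # the drag went backwards; normalize the range
        start, end = end, start
    return " ".join(w.content for w in words[start:end + 1])


# Two short lines of text, listed in reading order.
words = [
    Word("The", 0, 0, 20, 10), Word("quick", 25, 0, 55, 10), Word("brown", 60, 0, 90, 10),
    Word("fox", 0, 15, 20, 25), Word("jumps", 25, 15, 55, 25), Word("over", 60, 15, 85, 25),
]

# Drag from inside "quick" (line 1) to inside "fox" (line 2).
forward = excerpt_between(words, (30, 5), (10, 20))
# The same drag in the opposite direction.
backward = excerpt_between(words, (10, 20), (30, 5))
```

Because the excerpt is a slice of the reading-order word list, the selection from the middle of one line to the start of the next comes out as `quick brown fox` rather than a rectangle covering both lines.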

The document analysis JSON files are updated to include the "content" property. Most tests have been updated to account for the slight value changes in detected xy-coordinates.

Some irregular shapes that now highlight as requested in #105:

Description	Before	After
two lines that aren't contiguous
non-rectangular selection, single paragraph
non-rectangular selection, multiple paragraphs, paragraph indent separator
non-rectangular selection, multiple paragraphs, line spacing separator

Note

Edge case that this doesn't account for: cross-page citations. If the page has a footnote or text that's not in the expected order, it could be included in the citation even if that's not the user's intention. We haven't been able to test this from the UI because the UI only shows one page at a time.

For example:

Before	After

PDFjs (presumably) renders the selected text starting with `They are used...` and ending with `...Google Docs and Zimbra Col-`, which includes the copyright footnote. Document Intelligence is sensitive enough to know that the copyright does not fall at that point in reading order. In this case `They are used...` is in paragraph ID 11 and `...Google Docs and Zimbra Col-` is in paragraph ID 12; the copyright is technically paragraph 17.	What is rendered in the GUI once you add the citation are the two paragraphs as they are read in order, 11 and 12. This is what we expect!

However, suppose there were a highlight across two pages, say from the bottom of the same first page onto the next page:

Before	After
	?????
The citation for that span of text will include the copyright because it starts with paragraph 16 and technically ends on paragraph 20.	The rendering hasn't been tested for this PR

The "content" property in the DI analysis includes the copyright inline (some newlines added in this copy-paste for readability):

Each compiled trace covers one path through the program with one mapping of values to types. When the VM executes a
compiled trace, it cannot guarantee that the same path will be followed or that the same types will occur in subsequent loop
iterations.\nPermission to make digital or hard copies of all or part of this work for personal or classroom use is granted 
without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this 
notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, 
requires prior specific permission and/or a fee.\nPLDI'09, June 15-20, 2009, Dublin, Ireland.\nCopyright @ 2009 ACM 
978-1-60558-392-1/09/06 ... $5.00\nHence, recording and compiling a trace speculates that the path

Humans wouldn't read it as such, but we may be at the mercy of DI's results with this one.
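A small sketch of why the copyright ends up in a cross-page citation, under stated assumptions: each paragraph is reduced to a single offset/length span into the "content" string, and the paragraph texts are abbreviated stand-ins for the excerpt above. IDs 16, 17, and 20 follow the description; 18 and 19 are inferred from the newline breaks. `Paragraph`, `index_paragraphs`, and `citation_text` are hypothetical names, not functions from this PR.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    pid: int     # paragraph ID
    offset: int  # start index into the "content" string
    length: int  # number of characters in the paragraph


def index_paragraphs(texts, first_id):
    """Join texts with newlines and return (content, paragraphs with spans)."""
    paragraphs, offset = [], 0
    for i, text in enumerate(texts):
        paragraphs.append(Paragraph(first_id + i, offset, len(text)))
        offset += len(text) + 1  # +1 for the "\n" separator
    return "\n".join(texts), paragraphs


def citation_text(content, paragraphs, start_id, end_id):
    """Slice content from the start of one paragraph to the end of another."""
    by_id = {p.pid: p for p in paragraphs}
    start, end = by_id[start_id], by_id[end_id]
    return content[start.offset:end.offset + end.length]


# Abbreviated stand-ins for paragraphs 16 through 20.
texts = [
    "...in subsequent loop iterations.",
    "Permission to make digital or hard copies ... and/or a fee.",
    "PLDI'09, June 15-20, 2009, Dublin, Ireland.",
    "Copyright @ 2009 ACM 978-1-60558-392-1/09/06 ... $5.00",
    "Hence, recording and compiling a trace speculates that the path",
]
content, paragraphs = index_paragraphs(texts, first_id=16)

# A highlight that starts in paragraph 16 and ends in paragraph 20.
cited = citation_text(content, paragraphs, 16, 20)
```

Any citation built from a contiguous range of the analysis output, from paragraph 16 through paragraph 20, picks up everything DI placed between them, including the permission and copyright lines.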

phongcao and others added 8 commits February 26, 2025 11:30

Add new citation using mouse down/up points.

ce69fe9

Account for highlighting LTR and RTL

d01a36c

Update the prepackaged files.

d7f685a

Closest word for the Start word index

b1d7466

Tighten up code

ff67ade

Adjust expected values for tests.

1707c28

Update the tested condition expectation for single point not in a word.

5f424ce

Fix unit tests

c82a69a
