Improve performance for large documents with many annotations #130

vrish88 · 2022-08-31T22:37:16Z

Hello and thank you for this wonderful project. It's provided some excellent shoulders to stand on.

Context

I'm extracting footnotes embedded in markdown and converting them annotations. Some of these markdown files have over 500k characters in them and have over 100 footnotes. After a quite circuitous route, I'm using mdast/hast/remark to convert the markdown into html and then loading the html into a jsdom Document.

The basic flow is like this:

for each footnote
- convert it to a span that stays in the same spot in the document
- create a range from the footnote span
- invoke describeTextQuote to determine the selector for that node
- save the selector within an annotation saved off in a separate file

The Problem

I found that extracting footnotes for some of the larger files was taking 7 - 10 minutes to process. Running a profiler, it looked like 70% of the time was spent determining if the node intersected the document/scope.

That call is happening when the node is being converted to a chunk, which happens many times, per annotation. It is also only being used to ensure that the node is apart of the document (as far as I can tell).

The Solution

This PR removes that check. It improved the performance on my machine by 75% for the large files.

Behaviorally I think it is the same. The two things which invoke nodeToChunk appear to be already checking if those nodes are a part of the scope.

Treora · 2022-09-15T12:20:01Z

Sorry for the silence, and thanks a lot for the contribution! Glad to hear this library is useful to you; we’d love to hear in what way it helps you, for what purpose, and how it could be more helpful still!

So far we not focussed on actual performance, though we have kept it in mind while designing the APIs (hence the small functions, the generators, async functions everywhere, etc.). Profiling the code execution is a great start already.

I think I added that check myself. Within the class, you may well be right that we do not need such a check at all. I probably added for in case somebody would use the chunker in other ways, to not return an invalid chunk if the given node is outside of the scope. Perhaps we can split the method into a private method without the check, and keep the public method that does the check and then runs the private method?

Remove redundant, poorly performing nodeIntersect check

0051186

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance for large documents with many annotations #130

Improve performance for large documents with many annotations #130

vrish88 commented Aug 31, 2022 •

edited

Loading

Treora commented Sep 15, 2022

Improve performance for large documents with many annotations #130

Are you sure you want to change the base?

Improve performance for large documents with many annotations #130

Conversation

vrish88 commented Aug 31, 2022 • edited Loading

Context

The Problem

The Solution

Treora commented Sep 15, 2022

vrish88 commented Aug 31, 2022 •

edited

Loading