Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance for large documents with many annotations #130

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

vrish88
Copy link
Contributor

@vrish88 vrish88 commented Aug 31, 2022

Hello and thank you for this wonderful project. It's provided some excellent shoulders to stand on.

Context

I'm extracting footnotes embedded in markdown and converting them annotations. Some of these markdown files have over 500k characters in them and have over 100 footnotes. After a quite circuitous route, I'm using mdast/hast/remark to convert the markdown into html and then loading the html into a jsdom Document.

The basic flow is like this:

  • for each footnote
    • convert it to a span that stays in the same spot in the document
    • create a range from the footnote span
    • invoke describeTextQuote to determine the selector for that node
    • save the selector within an annotation saved off in a separate file

The Problem

I found that extracting footnotes for some of the larger files was taking 7 - 10 minutes to process. Running a profiler, it looked like 70% of the time was spent determining if the node intersected the document/scope.
image

That call is happening when the node is being converted to a chunk, which happens many times, per annotation. It is also only being used to ensure that the node is apart of the document (as far as I can tell).

The Solution

This PR removes that check. It improved the performance on my machine by 75% for the large files.

Behaviorally I think it is the same. The two things which invoke nodeToChunk appear to be already checking if those nodes are a part of the scope.

@Treora
Copy link
Contributor

Treora commented Sep 15, 2022

Sorry for the silence, and thanks a lot for the contribution! Glad to hear this library is useful to you; we’d love to hear in what way it helps you, for what purpose, and how it could be more helpful still!

So far we not focussed on actual performance, though we have kept it in mind while designing the APIs (hence the small functions, the generators, async functions everywhere, etc.). Profiling the code execution is a great start already.

I think I added that check myself. Within the class, you may well be right that we do not need such a check at all. I probably added for in case somebody would use the chunker in other ways, to not return an invalid chunk if the given node is outside of the scope. Perhaps we can split the method into a private method without the check, and keep the public method that does the check and then runs the private method?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants