perf(fmt): speed up file diffing #30644
This swaps out dissimilar for imara, which is substantially faster at diffing strings. Note that this is a proof of concept and I did not have enough time to make the output pretty. It just shows that the diff is fast. Applying colors should be doable without changing much about the perf. Fixes denoland#30634
dprint-plugin-typescript = "=0.95.11"
env_logger = "=0.11.6"
fancy-regex = "=0.14.0"
imara-diff = "=0.2.0"
I'm willing to fix this up if I get an OK about the general direction.
I think this sounds good. I kind of wonder if there's a diffing library that allows bailing after X many differences though as it would work well for incredibly large files. I wonder if we could contribute that to dissimilar and if they'd take a patch that does that (maybe it's not too difficult?).
I opened dtolnay/dissimilar#21 -- it might be more worthwhile to pursue this path than rewrite to imara-diff, which still might not be fast enough with very large files.
I'm willing to fix this up if I get an OK about the general direction.
I think this sounds good. I kind of wonder if there's a diffing library that allows bailing after X many differences though as it would work well for incredibly large files. I wonder if we could contribute that to dissimilar and if they'd take a patch that does that (maybe it's not too difficult?).
I think both dissimilar and imara expose an iterator over the patches, so I would assume that we can just stop iterating and thereby abort the computation of the diff early.
I have yet to check if my assumption is correct, though.
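To make the "stop iterating to bail out early" idea concrete, here is a minimal sketch using only the standard library. It does not call dissimilar or imara-diff, and whether either library computes its result lazily is not confirmed here. `LineChange`, `changed_lines` and `first_changes` are hypothetical names, and the comparison is a naive positional one rather than a real diff algorithm.

```rust
/// A single differing line, reported by 1-based line number.
/// Hypothetical type for this sketch; not part of dissimilar or imara-diff.
#[derive(Debug, PartialEq)]
struct LineChange<'a> {
    line: usize,
    old: Option<&'a str>,
    new: Option<&'a str>,
}

/// Lazily yields lines that differ at the same position in both inputs.
/// This is a naive positional comparison, not a real diff algorithm; it only
/// shows how a lazy iterator lets the caller stop early.
fn changed_lines<'a>(old: &'a str, new: &'a str) -> impl Iterator<Item = LineChange<'a>> + 'a {
    let mut old_lines = old.lines();
    let mut new_lines = new.lines();
    let mut line = 0;
    std::iter::from_fn(move || loop {
        line += 1;
        match (old_lines.next(), new_lines.next()) {
            // Both inputs are exhausted: end of iteration.
            (None, None) => return None,
            // Identical lines produce no output; keep scanning.
            (o, n) if o == n => continue,
            // Differing (or missing) line: hand one change to the caller.
            (o, n) => return Some(LineChange { line, old: o, new: n }),
        }
    })
}

/// Collects at most `max` changes. Because the iterator is lazy, input past
/// the cutoff is never compared.
fn first_changes<'a>(old: &'a str, new: &'a str, max: usize) -> Vec<LineChange<'a>> {
    changed_lines(old, new).take(max).collect()
}
```

Calling `first_changes(old, new, 100)` would report at most 100 differing lines and then stop. This only helps if the underlying diff is produced incrementally; a library that computes the full diff before returning would gain nothing from truncating the iteration.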
Is there a reason beyond the cost of migration why you'd like to stay with dissimilar? From my superficial understanding, it looks like imara is simply a better (=faster) diffing lib in all respects that are relevant for deno.
Is there a reason beyond the cost of migration why you'd like to stay with dissimilar?
We know the diff output of dissimilar is ok, but not sure yet about imara. Generally diffs are only shown in error cases so perf doesn't matter too much, but obviously several minutes is not acceptable 😅. How much faster is imara for this diff? I guess if it's fast enough on this case then maybe that's good enough and we don't need to worry about doing some iterator or max results approach.
On my machine, Deno 2.5.0 needs around 36 minutes (!) to check the formatting of the file and print out the diff. My branch (still a debug build, I have not compiled with -r yet) cuts it down to 0.4 seconds.
I did not try larger files using Deno 2.5.0 but I tried them with this branch. The results are as follows:
- 1 MB file: 0.4 seconds
- 10 MB file: 4 seconds
- 100 MB file: 40 seconds
(not evaluated this very scientifically, please take it with a grain of salt)
All files had a format similar to the one shown in #30634.
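As a quick sanity check on the numbers above, the following sketch works out the time per megabyte and the speedup they imply. The inputs are the unscientific debug-build measurements from this comment, and the names `BRANCH_TIMINGS`, `seconds_per_mb` and `speedup` are made up for illustration.

```rust
/// (file size in MB, seconds) as reported above for the imara branch.
/// Rough debug-build measurements, not a proper benchmark.
const BRANCH_TIMINGS: [(f64, f64); 3] = [(1.0, 0.4), (10.0, 4.0), (100.0, 40.0)];

/// Seconds of run time per megabyte of input.
fn seconds_per_mb(size_mb: f64, seconds: f64) -> f64 {
    seconds / size_mb
}

/// How many times faster `after_s` is compared to `before_s`.
fn speedup(before_s: f64, after_s: f64) -> f64 {
    before_s / after_s
}
```

Every entry in `BRANCH_TIMINGS` works out to about 0.4 seconds per MB, which is consistent with roughly linear scaling over this range. For the file from the issue, `speedup(36.0 * 60.0, 0.4)` gives a factor of about 5400.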
I have started working on bringing back properly formatted diffs. One thing I have noticed is that imara is extremely good at finding line diffs, but it does not have built-in word-diffing (see pascalkuthe/imara-diff#1). I will ask if they accept contributions, but otherwise I'm afraid we will have to add the complexity here. This is something that dissimilar provides out of the box, but they seem to do it by not even tokenizing the input at all, which explains why it is so slow. (Note that this also means that imara might get a lot slower once we run word diffs with it.)
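To illustrate why token granularity matters for the cost of a diff, here is a rough sketch using only the standard library. It is not how imara-diff or dissimilar tokenize their input; `line_tokens` and `word_tokens` are hypothetical helpers that show how many more units a word-level diff has to compare than a line-level one.

```rust
/// Splits text into line tokens, the granularity a line diff works on.
fn line_tokens(text: &str) -> Vec<&str> {
    text.lines().collect()
}

/// Splits text into word tokens. Each run of whitespace is kept as its own
/// token, so concatenating the tokens reproduces the input exactly.
fn word_tokens(text: &str) -> Vec<&str> {
    let mut tokens = Vec::new();
    let mut start = 0;
    let mut prev_ws: Option<bool> = None;
    for (i, ch) in text.char_indices() {
        let ws = ch.is_whitespace();
        // A token ends wherever we switch between whitespace and non-whitespace.
        if let Some(p) = prev_ws {
            if p != ws {
                tokens.push(&text[start..i]);
                start = i;
            }
        }
        prev_ws = Some(ws);
    }
    // Flush the final token, if any.
    if start < text.len() {
        tokens.push(&text[start..]);
    }
    tokens
}
```

For the input `"let a = 1;\nlet b = 2;\n"`, `line_tokens` returns 2 tokens while `word_tokens` returns 16. One option, assuming the line diff has already isolated the changed hunks, would be to run the word-level pass only on those hunks so the finer tokenization is applied to a small part of the file.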
Upstream work continues: pascalkuthe/imara-diff#33