Contributor

@KnorpelSenf KnorpelSenf commented Sep 8, 2025

This swaps out dissimilar for imara-diff, which is substantially faster at diffing strings.

Note that this is a proof of concept and I did not have enough time to make the output pretty. It just shows that the diff is fast. Applying colors should be doable without changing much about the perf. I'm willing to fix this up if I get an OK about the general direction.

Fixes #30634

@KnorpelSenf KnorpelSenf changed the title perf: speed up file diffing perf(fmt): speed up file diffing Sep 10, 2025
dprint-plugin-typescript = "=0.95.11"
env_logger = "=0.11.6"
fancy-regex = "=0.14.0"
imara-diff = "=0.2.0"
Member

@dsherret dsherret Sep 15, 2025

I'm willing to fix this up if I get an OK about the general direction.

I think this sounds good. I kind of wonder if there's a diffing library that allows bailing after X many differences though as it would work well for incredibly large files. I wonder if we could contribute that to dissimilar and if they'd take a patch that does that (maybe it's not too difficult?).

Member

I opened dtolnay/dissimilar#21 -- it might be more worthwhile to pursue this path than rewrite to imara-diff, which still might not be fast enough with very large files.

Contributor Author

I'm willing to fix this up if I get an OK about the general direction.

I think this sounds good. I kind of wonder if there's a diffing library that allows bailing after X many differences though as it would work well for incredibly large files. I wonder if we could contribute that to dissimilar and if they'd take a patch that does that (maybe it's not too difficult?).

I think both dissimilar and imara expose an iterator over the patches, so I would assume that we can just stop iterating and thereby abort the computation of the diff early.

I have yet to check if my assumption is correct, though.
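The early-abort idea can be sketched independently of either library. The following is a minimal illustration only, assuming the differences are available as a lazy iterator, which is exactly the assumption that has not been verified yet for dissimilar or imara-diff. `line_mismatches` and `first_n_mismatches` are hypothetical helpers, and the positional line comparison stands in for a real diff algorithm.

```rust
/// Hypothetical stand-in for a lazy diff: yields (line number, old, new)
/// for every position where the two inputs differ. This is a positional
/// comparison, not a real diff algorithm, and it ignores trailing lines
/// that exist in only one of the inputs.
fn line_mismatches<'a>(
    old: &'a str,
    new: &'a str,
) -> impl Iterator<Item = (usize, &'a str, &'a str)> + 'a {
    old.lines()
        .zip(new.lines())
        .enumerate()
        .filter(|(_, (a, b))| a != b)
        .map(|(i, (a, b))| (i + 1, a, b))
}

/// Collects at most `max` mismatches. Because the iterator is lazy,
/// `take` stops pulling items (and therefore stops comparing lines)
/// once the limit has been reached.
fn first_n_mismatches<'a>(
    old: &'a str,
    new: &'a str,
    max: usize,
) -> Vec<(usize, &'a str, &'a str)> {
    line_mismatches(old, new).take(max).collect()
}
```

Whether a real diffing library can stop early like this depends on whether it computes the whole diff up front or produces it incrementally; the sketch only shows what the calling side would look like in the incremental case.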

Contributor Author

Is there a reason beyond the cost of migration why you'd like to stay with dissimilar? From my superficial understanding, it looks like imara is simply a better (=faster) diffing lib in all respects that are relevant for deno.

Member

Is there a reason beyond the cost of migration why you'd like to stay with dissimilar?

We know the diff output of dissimilar is ok, but not sure yet about imara. Generally diffs are only shown in error cases so perf doesn't matter too much, but obviously several minutes is not acceptable 😅. How much faster is imara for this diff? I guess if it's fast enough on this case then maybe that's good enough and we don't need to worry about doing some iterator or max results approach.

Contributor Author

@KnorpelSenf KnorpelSenf Sep 18, 2025

On my machine, Deno 2.5.0 needs around 36 minutes (!) to check the formatting of the file and print out the diff. My branch (still a debug build, I have not compiled with -r yet) cuts it down to 0.4 seconds.

I did not try larger files using Deno 2.5.0 but I tried them with this branch. The results are as follows:

  • 1 MB file: 0.4 seconds
  • 10 MB file: 4 seconds
  • 100 MB file: 40 seconds

(I have not evaluated this very scientifically, so please take it with a grain of salt.)

All files had a format similar to the one shown in #30634.
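For a sense of scale, the numbers above work out to a speedup of roughly 5400x on the original file and a roughly constant throughput of about 2.5 MB/s across the three file sizes, which suggests close to linear scaling on this kind of input. A small sketch of that arithmetic follows; the helper names are made up for illustration, and the inputs are just the informal measurements reported above.

```rust
/// Throughput in MB/s for a file of `size_mb` megabytes processed in `seconds`.
fn throughput_mb_per_s(size_mb: f64, seconds: f64) -> f64 {
    size_mb / seconds
}

/// How many times faster `new_seconds` is compared to `old_seconds`.
fn speedup(old_seconds: f64, new_seconds: f64) -> f64 {
    old_seconds / new_seconds
}

/// The informal measurements from this comment as (size in MB, seconds).
fn reported_measurements() -> [(f64, f64); 3] {
    [(1.0, 0.4), (10.0, 4.0), (100.0, 40.0)]
}
```

Calling `speedup(36.0 * 60.0, 0.4)` compares the 36 minutes on Deno 2.5.0 with the 0.4 seconds on the branch, and mapping `throughput_mb_per_s` over `reported_measurements()` gives the per-size throughput.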

Contributor Author

@KnorpelSenf KnorpelSenf Sep 18, 2025

I have started working on bringing back properly formatted diffs. One thing I have noticed is that imara is extremely good at finding line diffs, but it does not have built-in word-diffing (see pascalkuthe/imara-diff#1). I will ask if they accept contributions, but otherwise I'm afraid we will have to add the complexity here. This is something that dissimilar provides out of the box, but they seem to do it by not even tokenizing the input at all, which explains why it is so slow. (Note that this also means that imara might get a lot slower once we run word diffs with it.)
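To make the granularity point concrete, here is a minimal sketch of diffing the same input at line level and at word level. It assumes nothing about how imara-diff or dissimilar work internally: the tokenizers and the textbook longest-common-subsequence routine below are illustrative stand-ins with made-up names, chosen only because their cost visibly depends on the number of tokens.

```rust
/// Splits text into line tokens (line terminators are dropped).
fn line_tokens(text: &str) -> Vec<&str> {
    text.lines().collect()
}

/// Splits text into word tokens on whitespace. A real word diff would
/// also need to preserve the whitespace; this sketch drops it.
fn word_tokens(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Length of the longest common subsequence of two token slices,
/// using the textbook O(n * m) dynamic program with two rolling rows.
fn lcs_len(a: &[&str], b: &[&str]) -> usize {
    let mut prev = vec![0usize; b.len() + 1];
    let mut curr = vec![0usize; b.len() + 1];
    for x in a {
        for (j, y) in b.iter().enumerate() {
            curr[j + 1] = if x == y {
                prev[j] + 1
            } else {
                prev[j + 1].max(curr[j])
            };
        }
        // The row just filled becomes the previous row for the next token.
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Number of tokens deleted from `a` plus tokens inserted from `b`.
fn changed_tokens(a: &[&str], b: &[&str]) -> usize {
    let common = lcs_len(a, b);
    (a.len() - common) + (b.len() - common)
}
```

For an input such as `"let x = 1;\nlet y = 2;\n"` the word tokenizer produces four times as many tokens as the line tokenizer, so a quadratic routine like this one does about sixteen times the work. That is the kind of slowdown to watch for once word diffs are added, although the actual cost will depend on the algorithm the library uses.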

Contributor Author

Upstream work continues: pascalkuthe/imara-diff#33

Successfully merging this pull request may close these issues.

deno fmt --check gets stuck on large HTML files