
@arybczak

The way folding functions were written (explicit recursion) prevented GHC from inlining them and exposing them to further optimizations (like specialization). Making them inlinable results in massive performance improvements.
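The shape of the change can be sketched like this (a standalone illustration with made-up names, not the actual PR code; whether `INLINE` or `INLINABLE` is the right pragma depends on the function):

```haskell
{-# LANGUAGE BangPatterns #-}

-- Directly recursive: GHC never inlines a self-recursive binding, so at
-- use sites the folding function f stays an unknown call and nothing
-- specializes.
ifoldrRec :: Int -> (Int -> a -> b -> b) -> b -> [a] -> b
ifoldrRec !_ _ z []     = z
ifoldrRec !i f z (x:xs) = f i x (ifoldrRec (i + 1) f z xs)

-- Moving the recursion into a local worker and marking the wrapper
-- INLINE exposes its unfolding at every call site, where GHC can then
-- compile the worker against the concrete f.
ifoldr :: (Int -> a -> b -> b) -> b -> [a] -> b
ifoldr f z = go 0
  where
    go !_ []     = z
    go !i (x:xs) = f i x (go (i + 1) xs)
{-# INLINE ifoldr #-}
```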

Relevant benchmark results (needs #51 as a baseline) with GHC 9.10.3:

$ cabal run traversals -- --baseline traversals.csv
All
  vector
    imap
      native:  OK
        35.3 μs ± 3.3 μs,       same as baseline
      imap:    OK
        35.6 μs ± 2.8 μs,       same as baseline
      default: OK
        91.4 μs ± 4.7 μs, 21% less than baseline
    itraverse: OK
      98.1 μs ± 7.8 μs, 68% less than baseline
...
  list
    imap
      native:  OK
        44.6 μs ± 3.3 μs,       same as baseline
      imap:    OK
        42.0 μs ± 3.5 μs,       same as baseline
      default: OK
        39.0 μs ± 3.8 μs, 33% less than baseline
    itraverse: OK
      60.9 μs ± 1.7 μs, 76% less than baseline
$ cabal run folds -- --baseline folds.csv
All
  vector
    itoList
      native:          OK
        29.2 μs ± 2.3 μs,       same as baseline
      itoList:         OK
        34.5 μs ± 1.4 μs,       same as baseline
    ifoldMap ([]):     OK
      57.1 μs ± 5.5 μs, 55% less than baseline
    ifoldMap (vector): OK
      16.2 ms ± 1.6 ms, 77% less than baseline
    ifoldMap' (sum):   OK
      9.09 μs ± 688 ns,       same as baseline
    ifoldr:            OK
      34.6 μs ± 1.3 μs,        same as baseline
    ifoldl:            OK
      29.9 μs ± 2.9 μs,       same as baseline
    ifoldr':           OK
      6.70 μs ± 516 ns,       same as baseline
    ifoldl':           OK
      9.17 μs ± 596 ns,       same as baseline
...
  list
    itoList
      native:          OK
        34.7 μs ± 1.3 μs,       same as baseline
      itoList:         OK
        39.1 μs ± 3.7 μs, 15% less than baseline
    ifoldMap ([]):     OK
      73.9 μs ± 6.6 μs, 54% more than baseline
    ifoldMap (vector): OK
      191  μs ±  13 μs, 99% less than baseline
    ifoldMap' (sum):   OK
      10.4 μs ± 944 ns, 64% less than baseline
    ifoldr:            OK
      41.0 μs ± 3.4 μs, 12% less than baseline
    ifoldl:            OK
      200  μs ±  11 μs,       same as baseline
    ifoldr':           OK
      154  μs ±  11 μs,       same as baseline
    ifoldl':           OK
      13.1 μs ± 1.1 μs, 57% less than baseline

ifoldMap is slower now when mappend is relatively cheap (on the other hand, it was previously unusable with vector/bytestring/text etc.), so I'm not going to fiercely defend the change to ifoldMapListOff, but I think it's less surprising for it to agree with foldMap from base with respect to time complexity.
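For context, the complexity point can be illustrated with the list monoid (a standalone sketch with hypothetical names, not ifoldMapListOff itself):

```haskell
-- Left-nesting (<>) re-copies the accumulated prefix on every step,
-- which is O(n^2) overall for the list monoid.
foldMapL :: Monoid m => (a -> m) -> [a] -> m
foldMapL f = go mempty
  where
    go acc []     = acc
    go acc (x:xs) = go (acc <> f x) xs  -- ((f x1 <> f x2) <> f x3) <> ...

-- Right-nesting, as foldMap from base does, runs in O(n) for lists.
foldMapR :: Monoid m => (a -> m) -> [a] -> m
foldMapR f = foldr (\x acc -> f x <> acc) mempty  -- f x1 <> (f x2 <> ...)
```

Both produce the same result; only the association order (and hence the cost for monoids with non-constant mappend) differs.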

Btw, I noticed this while looking at the Core generated for checkIxAdjoin from the optics test suite. It gets almost twice as fast with my change.

Before:

$ cabal run perf-test -- --csv perf.csv
All
  test: OK
    35.1 μs ± 2.0 μs

Now:

$ cabal run perf-test -- --baseline perf.csv
All
  test: OK
    18.5 μs ± 1.4 μs, 47% less than baseline
