Benchmark on a bigger example #4

We need to benchmark the supercompiler on some bigger example than those in the examples/ folder, to see which parts take the most time. The usual suspects are substitution and homeomorphic embedding. But before that, Mazeppa needs to be smart enough to emit reasonable code for all programs (see #2); it makes no sense to benchmark an algorithm that still needs conceptual changes.

Note: all the benchmarks below were conducted on an AMD Ryzen 9 5900HX.
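For reference, here is a minimal OCaml sketch of the kind of data those two suspects operate on: an illustrative first-order term type and a naive substitution over it. The type and function names are assumptions for illustration, not Mazeppa's actual definitions.

```ocaml
(* An illustrative first-order term: a variable or a function/constructor
   call applied to arguments (not Mazeppa's actual representation). *)
type term = Var of string | Call of string * term list

(* Naive simultaneous substitution: replace every variable bound in [env]
   with its image. Terms are first-order, so no capture can occur, but
   note that each call rebuilds the entire spine of the term. *)
let rec subst (env : (string * term) list) (t : term) : term =
  match t with
  | Var x -> (match List.assoc_opt x env with Some u -> u | None -> t)
  | Call (f, args) -> Call (f, List.map (subst env) args)
```

Rebuilding the whole term on every substitution step is exactly the kind of cost that only shows up on bigger examples.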
It seems that the self-interpreter example is the best candidate for benchmarking so far. The input list length can be set to 10 or some higher value. After I'm done with a few implementation changes, I'll post benchmarking data here, along with musings about what improvements can be made.
I'm now pretty confident that homeomorphic embedding is what slows down the supercompilation process the most (unsurprisingly). Take this example:
The results are:
However, if I change the implementation of
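For context, the homeomorphic embedding check (the "whistle") is standardly defined by two mutually recursive rules, coupling and diving. Here is a sketch over the term type from the snippet above; it illustrates the textbook definition, not Mazeppa's implementation:

```ocaml
(* Homeomorphic embedding: [embeds t1 t2] holds if [t1] is embedded in [t2]. *)
let rec embeds t1 t2 = coupling t1 t2 || diving t1 t2

(* Coupling: the roots match and the arguments embed pairwise. *)
and coupling t1 t2 =
  match (t1, t2) with
  | Var _, Var _ -> true
  | Call (f, args1), Call (g, args2) ->
      f = g
      && List.length args1 = List.length args2
      && List.for_all2 embeds args1 args2
  | _ -> false

(* Diving: [t1] embeds into some argument of [t2]. *)
and diving t1 t2 =
  match t2 with
  | Call (_, args) -> List.exists (fun u -> embeds t1 u) args
  | Var _ -> false
```

Both rules are tried at every node, and each recurses into every argument, so the naive check repeats a great deal of work on large terms; it is no surprise that it dominates the running time.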
After commit e791c38 on the above example (
A self-interpreter with the following
Before:
After:
Benchmarking a self-interpreter with the following
Before commit 26a88fc (now using
After:
Also, it turns out that both caches, the global size cache and the local result cache, improve performance a lot: if we remove either of them, performance gets much worse.
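A sketch of what such caches can look like, reusing `term` and `embeds` from the snippets above. The names and the exact role of the size cache are my assumptions; one standard trick is to use sizes as a cheap negative filter, since `t1` embedded in `t2` implies `size t1 <= size t2`.

```ocaml
(* Illustrative caches (not Mazeppa's actual code). *)

(* A global size cache: term sizes are requested over and over, so memoize
   them for the lifetime of the whole run. *)
let size_cache : (term, int) Hashtbl.t = Hashtbl.create 1024

let rec size (t : term) : int =
  match Hashtbl.find_opt size_cache t with
  | Some n -> n
  | None ->
      let n =
        match t with
        | Var _ -> 1
        | Call (_, args) -> List.fold_left (fun acc u -> acc + size u) 1 args
      in
      Hashtbl.add size_cache t n;
      n

(* A local result cache: memoize embedding checks so that identical
   (t1, t2) pairs are answered in O(1) on a hit. The size comparison
   rejects impossible embeddings before the expensive recursive check. *)
let make_cached_embeds () =
  let cache : (term * term, bool) Hashtbl.t = Hashtbl.create 1024 in
  fun t1 t2 ->
    if size t1 > size t2 then false
    else
      match Hashtbl.find_opt cache (t1, t2) with
      | Some b -> b
      | None ->
          let b = embeds t1 t2 in
          Hashtbl.add cache (t1, t2) b;
          b
```

One plausible reading of the global/local distinction above: the size cache persists across whistle checks, while a fresh result cache is created per pass (here, per call to `make_cached_embeds`).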
Also, I should note that the above contrived example now takes around 32 seconds to supercompile (before adding the result cache, it used to take around 15 seconds). The reasons are:
However, I think that self-interpretation exhibits a more "real-life" scenario of supercompilation, which is far more important to us. Therefore, I'll keep the result cache anyway.
This is a self-interpretation memory benchmark (yes, now using
That is, Mazeppa has allocated a total of ~782.8 megabytes:
However, only ~13.0 megabytes (
What does this data tell us? Probably that the OCaml GC is doing its best: it collects many short-lived values on the minor heap while keeping major-heap usage to a minimum.
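For reproducibility, figures like these can be read off OCaml's own `Gc` module; a sketch (I'm assuming these are the counters of interest, not that this is the tool that produced the original numbers):

```ocaml
(* Print total allocation and major-heap allocation, in megabytes.
   [Gc.stat] reports counters in words; [minor_words +. major_words
   -. promoted_words] is the total amount ever allocated (promoted
   words would otherwise be counted twice). *)
let report_allocations () =
  let s = Gc.stat () in
  let mb words = words *. float_of_int (Sys.word_size / 8) /. 1_048_576. in
  Printf.printf "total allocated: ~%.1f MB\n"
    (mb (s.Gc.minor_words +. s.Gc.major_words -. s.Gc.promoted_words));
  Printf.printf "major heap:      ~%.1f MB\n" (mb s.Gc.major_words)
```

A large `minor_words` count with few promotions is exactly the "many short-lived values" pattern described above.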
A self-interpreter with the following
Without Flambda (OCaml version
With Flambda (OCaml version
So without any fine-tuning, the performance with Flambda is slightly better. It's probably worth experimenting with different Flambda options for more aggressive optimization.
The same self-interpreter invocation as in the message above:
After commit 192ad81 (with Flambda):
With regard to memory usage, it's about 1.5x better (755.1 / 486.3 ≈ 1.55): previously, Mazeppa allocated ~755.1 MB on the major heap; now it's ~486.3 MB. That is, both speed and memory usage have improved.
After commit 41871ff:
Running time remains the same after commit b3fc70b. Memory usage is reduced by a few megabytes (~484.4 MB on the major heap, compared to ~486.3 MB previously).