-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Effects: double translation of functions and dynamic switching between direct-style and CPS code #1461
base: master
Are you sure you want to change the base?
Conversation
I think this PR is best reviewed as a whole, not commit by commit. |
It looks like these two effect handler benchmarks are slower with this PR, 18 % and 8 % slower, respectively. I need to spend some time on it to understand why.
|
Chameneos runs are too short. You should increase the input size. It takes the input size as a command line argument. Something that runs for a second is more representative as it eliminates the noise due to JIT and other constant time costs. |
Good point. I find that chameneos is 9,8 % slower with this PR, 3.428 s versus 3.753 s. My theory is that effect handlers are slightly slower due to the fact that function calls in CPS code cost an additional field access (applying |
I agree with the reasoning and do not expect real programs to behave like |
Btw, the original numbers aren't useful to understand the improvements brought about by this PR. For this, you need 3 variants:
I'd be interested to see the difference between (2) and (3) in addition to the current numbers which show the difference between (1) and (3). |
Note that I had to do that in #1397: js_of_ocaml/compiler/lib/generate.ml Lines 954 to 959 in 3b3f66b
|
I believe that the form
Here are the graphs showing the difference between |
Thanks. The execution time improvement is smaller than what I would have expected. Is that surprising to you or does it match your expectation? Also, it would be useful to have all the variants plotted in the same graph with |
It more or less matches my expectation. My reasoning is the following: on most of these small, monomorphic benchmarks, the static analysis will eliminate most CPS calls at compile time. Therefore, the dynamic switching will not change the run time a lot and maybe slightly worsen it. On benchmarks that heavily use effect handlers, I also expect the run time to be worse: most of the time is spent in CPS code anyway, the dynamic switching only adds overhead. I therefore expect the biggest improvements to happen on larger programs, on which the static analysis does not work as well due to higher-order and mutability; and which do not spend most of their time in effect handlers. If my hypothesis is verified, then the question is: is this trade-off acceptable? Keeping in mind that there might be ways to improve this PR to save more performance. |
I have updated the PR message with new graphs showing all the variants. |
I tried to build a benchmark that uses Domainslib, as an example of a more typical effect-using program. But the linker complains that the primitive I tried to add it in a new Also, when I build js_of_ocaml I’m getting a lot of primitive-related messages that I don’t really understand:
|
I solved it by downgrading to lockfree 0.3.0 as suggested by @jonludlam. But the resulting program never completes. I assume that the mock parallelism provided by the runtime doesn’t suffice for using Domainslib—a “domain” must be stuck forever spinwaiting or something. |
5362ff4
to
91e352e
Compare
I think that this PR is ready for review. The only two problems that prevent the CI from being green are:
|
Just add it to the stdlib module compiler/lib/stdlib.ml |
Lambda_lifting doesn't seem to be used anymore, is it expected ? Should Lambda_lifting_simple replace Lambda_lifting ? |
From a quick look at the PR, the benefit of such change is not clear, can you highlight examples where we see clear improvements ? |
0625907
to
3cd6c95
Compare
As I discovered just today, There are now two lambda lifting passes for two different reasons. For this reason I am not convinced that there is an interest in merging the two modules.
I am convinced that most real-world effect-using programs will benefit from this PR, for the reasons given in my above message; but it’s hard to prove it, because we don’t have (yet) examples of such typical programs that work in Javascript. Programs using Domainslib don’t work well with js_of_ocaml (and are arguably not really relevant as JS is not a multicore language). Concurrency libraries like Eio are a more natural fit. I am currently trying to cook up a benchmark using the experimental JS backend for Eio. |
Given the size impact of this change, it would be nice to be able to disable (or control) this optimization. There are programs that would not benefit from this optimization, it be nice to not have to pay the size cost for no benefits. The two lambda_lifting modules are confusing, we should either merge them (allowing to share duplicated code) or at least find better names.
Do you know where this come from ? does it mostly come from the new lambda lifting pass ? or are other passes heavily affected as well (generate, js_assign, ..) |
Thank you for the review and sorry for the response delay, I have been prioritizing another objective in the previous weeks. One update is that there is no performance gain on programs that use Eio, which is a shame as it is expected to be one of the central uses of effects. More generally, when the program stays almost all the time within at least one level of effect handlers, there is essentially no performance gain as we run the CPS versions of every function. And I expect this programming pattern (installing a topmost effect handler at the beginning of the program) to be the most common with effects. So it is unclear to me yet if the implementation of double translation can be adapted to accommodate this. |
f4bd063
to
b025490
Compare
2e21f2e
to
dded0e8
Compare
Co-authored-by: hhugo <[email protected]>
@hhugo Is there something remaining for this to be merged? Do you consider ocaml/dune#11222 to be a necessary prerequisite? |
I'd like to do another review with a proper screen. Reviewing on a phone is not ideal |
ocaml/dune#11222 is clearly not a prerequisite. |
(func (export "%perform") (param $eff (ref eq)) (result (ref eq)) | ||
(if (i32.eqz (global.get $effect_allowed)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont see any equivalent logic on the js side. I seems that caml_assume_no_perform completely forbid effect with wasm while effects could still happen with js after resuming a stack ? I must admit my understanding is poor in the area.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In both JS and Wasm, performing under caml_assume_no_perform
must raise Effect.Unhandled
, expect after installing a handler, i.e. resuming a stack. On the js_of_ocaml side, it is ensured by having caml_assume_no_perform be a synonym of caml_callback
, which switches to a blank fiber stack. On the wasm_of_ocaml side, it’s not possible to do it for a reason that @vouillon explained me but that I forgot. An alternative is to have this global boolean to track whether performing an effect should raise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we have different behavior between wasm and js ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the behavior is the same, as checked by the tests in tests-ocaml/lib-effects/
.
@rickyvetter If you want to run benchmarks on this branch and check that nothing behaves unexpectedly, I expect this to be close to the final version. |
This feature makes programs that use OCaml 5 effects run faster in Javascript, by running as little continuation-passing style (CPS) code as possible. Based on an initial suggestion by @lpw25, we generate two versions for each functions that may be called from inside an effect handler (according to the existing static analysis): a direct-style version and a CPS version. At runtime, direct-style versions are used, except when entering an effect handler, in which case only CPS is run until the outermost effect handler is exited. This approach trades speed for program size, since—because a number of functions are compiled to two versions—the generated programs are bigger. For this reason, the feature is opt-in behind the
--enable doubletranslate
flag1. This is a joint work with @vouillon.We encountered a design difficulty: when functions are transformed into pairs of functions, it is unclear how to deal with captured identifiers when the functions are nested. To avoid this problem, functions that must be transformed are lambda-lifted, and thus no longer have any free variables except toplevel ones.
The transform is rather successful in preserving the performance of small / monomorphic programs.
I hypothesize thatEdit: I am not able to reproduce a slowdown on hamming in my latest round of benchmarks.hamming
is slower for the same reason as on current master: it uses lazy values, which are an obstacle for the global flow analysis.fft
is slightly slower, however. The generated codes look very similar.EDIT: Note that in the following benchmarks,
--enable=effects
means the vanilla CPS transform;--enable=effects w. dyn. sw.
(with dynamic switching) means the CPS transform with double translation enabled.The difference becomes negligible on large programs. CAMLboy is actually… 3 % faster with effects enabled (compared to 25 % slower previously): 520 FPS instead of 505 FPS, although the standard deviation is high at ~11 FPS, so it would be fair to say that the difference is not discernable.
ocamlc
is not discernably slower, either (compared to 10 % slower previously).#1461 (comment) breaks down which parts of program are actually made faster or slower, and why typical effect-using programs will be faster with this feature.
As some functions must be generated in two versions, the size of the generated code is larger (up to 76 % larger), a few percents larger when compressed.
Compiling
ocamlc
is about 70 % slower; the resulting file is 64 % larger when compressed.A caveat of this approach is that all benefits are lost as soon as an effect handler is installed. This is an issue for scheduling libraries such as Eio, as they usually work by having an effect handler installed for the program’s entire lifetime. To mitigate this, we provide
Js_of_ocaml.Js.Effect.assume_no_perform : (unit -> 'a) -> 'a
. Evaluatingassume_no_perform f
runs the direct style version off
—the faster version. This also applies to transitive callees off
that do not use effect handlers. The programmer must ensure that these functions do not perform effects (not without installing a new effect handler).Footnotes
During the review process, this has been modified to
--effects=double-translation
. ↩