[WIP] Avoid computing flags even lazy #466

Open
wants to merge 11 commits into master
Conversation

@viorelcanja (Contributor) commented May 7, 2021

This PR tries to optimize lazy flags computation. In the current implementation, information about the operation and its operands is stored and, in some cases, simply overwritten by subsequent operations. In other cases, operations force the generation of particular flags because they only do partial lazy computation.

The proposed optimization adds metadata to the analyzer so it can identify which flags are tested and which are modified by each instruction. Using this information, each jitted basic block is analyzed, and instructions whose flags are completely overwritten by subsequent instructions (and thus never used) are replaced with alternate implementations that skip flags computation entirely, including not storing anything for lazy computation.

This improves performance significantly in some benchmarks, such as a tight loop that does mostly register operations.

Since it was not clear to me what the correct metadata is for some instructions, any instruction without metadata disables the optimization from that point on until the end of the basic block.
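
To illustrate the idea, here is a minimal sketch of the dead-flags analysis as a backward liveness scan over one basic block. All names and types (InstrFlagMeta, flags_needed, the bitmask encoding) are hypothetical and are not the actual v86 analyzer structures, and the handling of missing metadata is simplified to "be maximally conservative" rather than mirroring the exact fallback described above:

// Illustrative sketch only, not the actual v86 analyzer code.
// Each instruction's metadata says which arithmetic flags it tests and which
// it overwrites, encoded as bitmasks over CF/PF/AF/ZF/SF/OF.

const ALL_FLAGS: u8 = 0b11_1111;

#[derive(Clone, Copy)]
struct InstrFlagMeta {
    tested: u8,  // flags this instruction reads (jcc, adc, pushf, ...)
    written: u8, // flags this instruction overwrites
}

/// For each instruction in a basic block, decide whether its flag results can
/// still be observed later. If not, the jit may emit a "_noflags" variant that
/// skips storing lazy-flags state. `None` means "metadata unknown".
fn flags_needed(block: &[Option<InstrFlagMeta>]) -> Vec<bool> {
    // Flags may always be observed after the block ends (interrupts, faults,
    // the next block), so start fully live.
    let mut live = ALL_FLAGS;
    let mut needed = vec![true; block.len()];
    for (i, meta) in block.iter().enumerate().rev() {
        match meta {
            Some(m) => {
                // The flag output is dead only if every flag this instruction
                // produces is overwritten before anything tests it.
                needed[i] = (live & m.written) != 0;
                live = (live & !m.written) | m.tested;
            }
            None => {
                // Unknown instruction: keep its flags and assume it may test
                // anything produced earlier.
                needed[i] = true;
                live = ALL_FLAGS;
            }
        }
    }
    needed
}

An instruction whose entry in needed comes out false could then be lowered to a flag-free variant by the jit.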

Still to do:

  • metadata was generated semi-automatically and could be wrong in some places; it should be double-checked
  • I am just beginning to learn Rust (I know enough to be dangerous), so I am probably writing non-idiomatic Rust and will need to go back and change some code.
  • I need to do more testing
  • I need to do more profiling
  • I need to adjust some tests (generated wasm has been changed)

Challenges:

  • duplicating functions to include a "_noflags" version is tedious and error-prone (a sketch of this duplication follows after this list). On the other hand, handling this at runtime (including the check for flag computation in the jitted code) negates some of the performance advantages.
  • instruction metadata in the analyzer must be kept in sync with the actual implementation of the jit and interpreter; the regular and _noflags jit versions also need to stay in sync
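
As a purely hypothetical illustration of the duplication mentioned in the first point (these are not the actual v86 arithmetic helpers, just a sketch of the shape of the problem):

// Hypothetical illustration: the regular helper records lazy-flags state,
// while the "_noflags" variant only computes the result.

struct LazyFlagsState {
    last_op1: i32,
    last_op2: i32,
    last_result: i32,
    last_op_size: i32,
}

fn add32(state: &mut LazyFlagsState, dest: i32, src: i32) -> i32 {
    let result = dest.wrapping_add(src);
    // Save everything needed to materialise CF/OF/ZF/SF/... later on demand.
    state.last_op1 = dest;
    state.last_op2 = src;
    state.last_result = result;
    state.last_op_size = 32;
    result
}

fn add32_noflags(dest: i32, src: i32) -> i32 {
    // Flags proven dead by the analysis: skip the bookkeeping entirely.
    dest.wrapping_add(src)
}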

Initial benchmarks:

NEW

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          230.87  :       5.92  :       1.94
STRING SORT         :          7.0302  :       3.14  :       0.49
BITFIELD            :      1.0887e+08  :      18.68  :       3.90 <----
FP EMULATION        :          19.962  :       9.58  :       2.21
FOURIER             :          563.87  :       0.64  :       0.36
ASSIGNMENT          :          5.8462  :      22.25  :       5.77 <----
IDEA                :          544.41  :       8.33  :       2.47
HUFFMAN             :          341.86  :       9.48  :       3.03 <----
NEURAL NET          :         0.38348  :       0.62  :       0.26
LU DECOMPOSITION    :          13.835  :       0.72  :       0.52
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 9.261
FLOATING-POINT INDEX: 0.657
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : GenuineIntel 0f/06 1000MHz
L2 Cache            : 6144 KB
OS                  : Linux 4.19.0-16-686
C compiler          : gcc version 8.3.0 (Debian 8.3.0-6)
libc                : libc-2.28.so
MEMORY INDEX        : 2.220
INTEGER INDEX       : 2.381
FLOATING-POINT INDEX: 0.364
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

----

OLD

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          226.04  :       5.80  :       1.90
STRING SORT         :          7.0279  :       3.14  :       0.49
BITFIELD            :      6.4915e+07  :      11.14  :       2.33 <----
FP EMULATION        :           19.46  :       9.34  :       2.15
FOURIER             :          553.47  :       0.63  :       0.35
ASSIGNMENT          :          4.4541  :      16.95  :       4.40 <----
IDEA                :          545.29  :       8.34  :       2.48
HUFFMAN             :          321.95  :       8.93  :       2.85 <----
NEURAL NET          :         0.38921  :       0.63  :       0.26
LU DECOMPOSITION    :          14.207  :       0.74  :       0.53
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 8.150
FLOATING-POINT INDEX: 0.662
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : GenuineIntel 0f/06 1000MHz
L2 Cache            : 6144 KB
OS                  : Linux 4.19.0-16-686
C compiler          : gcc version 8.3.0 (Debian 8.3.0-6)
libc                : libc-2.28.so
MEMORY INDEX        : 1.707
INTEGER INDEX       : 2.320
FLOATING-POINT INDEX: 0.367
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

viorelcanja marked this pull request as ready for review May 7, 2021 07:41
@viorelcanja (Contributor, Author) commented:

I added a new optimization, unrelated to the old one, to be able to see the cumulative effect. Please let me know if I should split the PR.

The new benchmarks are:

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          240.29  :       6.16  :       2.02
STRING SORT         :           7.026  :       3.14  :       0.49
BITFIELD            :      1.1096e+08  :      19.03  :       3.98
FP EMULATION        :          20.457  :       9.82  :       2.27
FOURIER             :          571.01  :       0.65  :       0.36
ASSIGNMENT          :           6.089  :      23.17  :       6.01
IDEA                :          634.61  :       9.71  :       2.88
HUFFMAN             :          360.39  :       9.99  :       3.19
NEURAL NET          :         0.40564  :       0.65  :       0.27
LU DECOMPOSITION    :          14.703  :       0.76  :       0.55
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 9.707
FLOATING-POINT INDEX: 0.686
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : GenuineIntel 0f/06 1000MHz
L2 Cache            : 6144 KB
OS                  : Linux 4.19.0-16-686
C compiler          : gcc version 8.3.0 (Debian 8.3.0-6)
libc                : libc-2.28.so
MEMORY INDEX        : 2.264
INTEGER INDEX       : 2.548
FLOATING-POINT INDEX: 0.380
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

viorelcanja closed this May 8, 2021
viorelcanja reopened this May 8, 2021
@copy (Owner) commented May 8, 2021

Very nice contribution, thanks! I'll need some time to review this in detail, but here are some initial thoughts:

  • First of all, this is clearly a worthwhile optimisation
  • An interesting data point would be how often this optimisation prevents the generation of lazy flags. You can use the existing APIs to generate stats from the generated code at runtime, see profiler.rs, src/browser/print_stats.js and codegen::gen_profiler_stat_increment. Run make debug-with-profiler and use debug.html to see the statistics at the bottom of the page
  • The profiling info may also inform whether extending the set of functions called at runtime is worth it, for example add16, which is called from the generated module, vs add32, which is generated inline without a function call. The former is on its way out (see for example my recent change d929f4d), so we might as well leave called functions like add16 unoptimised for now. Alternatively, generate code in the noflags case and generate a call otherwise.
  • The nasmtest generator script (tests/nasm/create_tests.js) should be extended to hit both cases at least once (for each arithmetic instruction, once where flags are tested after the instruction (e.g. by adding a pushf) and once where they are overridden after the instruction). You can bump NO_TESTS to 10 or so for local testing and randomise which of the two cases is hit.
  • Not urgent, but you should at some point start running rustfmt on your code (cargo fmt --all)

Unfortunately, any instruction that may trigger a fault, including all memory accesses, may read flags, so in the following sequence the flags for add have to be computed:

add eax, ebx
mov [edx], 42

We may later have another optimisation that moves the flags computation into the slow path of the memory access.

Please let me know if I should split the PR.

Yes, please. Your second optimisation seems useful on its own and will make this PR easier to review.

@viorelcanja (Contributor, Author) commented May 9, 2021

I reverted the commit with the mem fast path optimization and created a new PR.

  • re: profiling, I will add it
  • re: tests, I will add tests to cover both code paths
  • re: formatting, of course :) I just need to get a bit used to the whole Rust environment

Re: exceptions, agreed. Although it should not be an issue for "well behaved code", in general it is: a program can deliberately trigger an exception and read the flags, so they have to be computed correctly. In the general case we cannot recover them in the mem access slow path:

  1. add eax, ebx
  2. mov eax, 0xffffffff
  3. mov ebx, 0xffffffff
  4. mov [edx], eax
  5. add eax, ebx

With the current optimization, information about how to compute the flags is not saved for instruction 1, since instruction 5 completely rewrites them. If instruction 4 triggers an exception we need the flags, and without the saved state (which is now optimized away) we cannot produce them. We could partially get back some performance by looking at the instructions at analysis time and still optimize away the saving of state if the operations between 1 and 4 allow the necessary state to be reconstructed. For instance, the following sequence could still be optimized, since the subsequent operations do not destroy the required state and the information could be recovered in the slow path:

  1. add eax, ebx
  2. xchg eax, esi
  3. xchg ebx,edi
  4. mov [edx], eax
  5. add eax,ebx

Still, it gets complicated and I am not sure it's worth implementing. Maybe a flag to ignore corner cases (like enable_optimistic_optimizations) would add more value for people looking to run "well behaved code". I'm curious what you think of this.

Later edit: edited the 2nd example to be more relevant.

@copy (Owner) commented May 17, 2021

I reverted the commit with the mem fast path optimization and created a new PR.

Thanks!

Re: exceptions, agreed. Although it should not be an issue for "well behaved code", in general it is: a program can deliberately trigger an exception and read the flags, so they have to be computed correctly. In the general case we cannot recover them in the mem access slow path:

  1. add eax, ebx
  2. mov eax, 0xffffffff
  3. mov ebx, 0xffffffff
  4. mov [edx], eax
  5. add eax, ebx

With the current optimization, information about how to compute the flags is not saved for instruction 1, since instruction 5 completely rewrites them. If instruction 4 triggers an exception we need the flags, and without the saved state (which is now optimized away) we cannot produce them. We could partially get back some performance by looking at the instructions at analysis time and still optimize away the saving of state if the operations between 1 and 4 allow the necessary state to be reconstructed. For instance, the following sequence could still be optimized, since the subsequent operations do not destroy the required state and the information could be recovered in the slow path:

add eax, ebx
xchg eax, esi
xchg ebx,edi
mov [edx], eax
add eax,ebx

Good analysis, I believe you're correct. Tracking which registers contain which values from a previous arithmetic instruction could also help optimising jmpcc, setcc and cmov, which are currently only handled in cases where the two instructions are next to each other (see ctx.last_instruction, gen_getzf, etc.). So I do think this is a worthwhile strategy, although it doesn't necessarily need to be implemented as part of this PR.
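
A rough sketch of what such register tracking might look like at analysis time; all names here (RegTracker, Holds, etc.) are made up for illustration and none of this is existing v86 code:

// Hypothetical sketch: track whether the operands/result of the last
// flag-producing instruction are still available in registers, so flags could
// be reconstructed later (e.g. in a memory slow path or for jcc/setcc/cmov).

#[derive(Clone, Copy, PartialEq)]
enum Holds {
    Unknown,
    Op1OfLastArith,
    Op2OfLastArith,
    ResultOfLastArith,
}

struct RegTracker {
    regs: [Holds; 8], // eax..edi
}

impl RegTracker {
    fn new() -> Self {
        RegTracker { regs: [Holds::Unknown; 8] }
    }

    // A new flag-producing instruction: its inputs/outputs become the values
    // we care about; everything previously tracked is stale.
    fn on_arith(&mut self, op1_reg: usize, op2_reg: usize, result_reg: usize) {
        self.regs = [Holds::Unknown; 8];
        self.regs[op1_reg] = Holds::Op1OfLastArith;
        self.regs[op2_reg] = Holds::Op2OfLastArith;
        self.regs[result_reg] = Holds::ResultOfLastArith; // may shadow op1
    }

    // A pure register move keeps the tracked value alive in the destination.
    fn on_mov_reg_reg(&mut self, dst: usize, src: usize) {
        self.regs[dst] = self.regs[src];
    }

    // Any other write destroys whatever the destination held.
    fn on_write(&mut self, dst: usize) {
        self.regs[dst] = Holds::Unknown;
    }

    // Flags are reconstructible if op2 plus either op1 or the result are
    // still available somewhere (e.g. for add: op1 = result - op2).
    fn flags_reconstructible(&self) -> bool {
        let have_op2 = self.regs.contains(&Holds::Op2OfLastArith);
        let have_op1_or_result = self.regs.contains(&Holds::Op1OfLastArith)
            || self.regs.contains(&Holds::ResultOfLastArith);
        have_op2 && have_op1_or_result
    }
}

The same query could be consulted either when emitting a memory slow path (to rebuild flags after a fault) or when lowering a later conditional instruction.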

Maybe a flag to ignore corner cases (like enable_optimistic_optimizations) would add more value for people looking to run "well behaved code".

Not a fan of this. I don't know if I would enable it for copy.sh/v86, because whenever some OS doesn't work, I would need to turn off the flag to verify. I think the aforementioned profiling to determine how often a memory instruction prevents flags from being optimised could help inform the decision.

copy mentioned this pull request Oct 7, 2021