
Extremely low performance for large test vectors #280

Open
Kuree opened this issue Nov 3, 2020 · 22 comments

@Kuree
Collaborator

Kuree commented Nov 3, 2020

When I ran an exhaustive test on floating-point operations, I noticed a performance issue with the fault-generated testbench. The test vector size is 0x500 * 0x500 = 0x190000, about 1.6 million data points. Here is the profiling result:
[profiling screenshot]

The entire run takes 22k seconds, about 6 hours. By comparison, a complete exhaustive test using SystemVerilog + DPI takes only 10 hours to finish 4 billion test points, so the per-vector performance gap is orders of magnitude (roughly 1500x).

Notice that the RTL simulation alone takes about 12k seconds. This is due to the sheer size of the generated testbench code, which is 408 MB.

I think there are several places where it can be improved:

  1. hwtypes computation. The native C implementation is about 100x faster. I looked at the actual float implementation, and it is quite inefficient:
```python
exp = 1 - bias
s = ['-0.' if sign[0] else '0.', mantissa.binary_string()]
s.append('e')
s.append(str(exp))
return cls(gmpy2.mpfr(''.join(s), cls.mantissa_size + 1, 2))
```

It converts to and from the native gmp object via strings. I believe a native (numeric) conversion is available.
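For reference, the string round-trip can be replaced by pure integer arithmetic. Here is a minimal stdlib sketch of the idea; `math.ldexp` stands in for gmpy2's native numeric constructors, and the field layout mirrors the snippet above (the helper name is illustrative, not hwtypes API):

```python
import math

def float_from_fields(sign, mantissa_int, mantissa_bits, exp):
    # Matches the snippet above: value = (-1)^sign * 0.<mantissa> * 2^exp,
    # where 0.<mantissa> is mantissa_int / 2^mantissa_bits.
    # math.ldexp scales by a power of two with no string formatting;
    # gmpy2 offers the same style of numeric construction for mpfr.
    value = math.ldexp(mantissa_int, exp - mantissa_bits)
    return -value if sign else value

# 0b101 with 3 mantissa bits and exp = 2 -> 0.625 * 4 = 2.5
```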
  2. Getting frame information, in particular the filename and line number. I had the same performance issue before and was able to resolve it by implementing the same logic with the native Python C API. It is included in the kratos package; I can give you a pointer on how to use it if you're interested. In my benchmark it is about 100-500x faster than doing the same thing in Python.

The real question is whether the inefficient testbench is an intrinsic problem in fault. My understanding is that all the expected values have to be known at compilation time, which makes the testbench unscalable when there are many test vectors. One way to solve that is to let the simulator call Python code during simulation. I have a prototype working: https://github.com/Kuree/kratos-dpi/blob/master/tests/test_function.py#L18-L28
It works with both Verilator and commercial simulators.

Please let me know what you think.

@leonardt
Owner

leonardt commented Nov 3, 2020

For 2., are you using the syntax tester.circuit.O.expect(value)? If so, you can try tester.expect(DUT.O, value) instead. The tester.circuit syntax has some intrinsic performance issues (it uses the inspect library), which tester.expect avoids. A longer-term solution is a deep embedding with AST introspection rather than the tester.circuit interface, since that interface inherently relies on the slow inspect library. While we could work on optimizing it, I think the effort is better spent on an interface that avoids inspect altogether; even if we optimize with the C API, we would still pay an inherent performance penalty versus approaches that avoid inspect.

Points 1 and 3 seem related. If we can do the expect computation in the testbench, we can avoid the overhead of the Python implementation. While calling Python code during simulation is probably useful in general, in this case we could probably compile the expect logic to the target. One simple approach: generate the input vectors into a binary file, run a C program to compute the expected output file (this could use the fast native implementation and avoid Python overhead), then use fault's file I/O (as we do in TBG) to load the input/expected test vectors into memory and check them. It would be interesting to wrap this pattern into a simple Python API.

So two things to try:

  • use tester.expect rather than tester.circuit.expect (you can also use tester.poke rather than tester.circuit.I = value)
  • try a file-based I/O pattern for the input/expected vectors (this should make test generation small/fast, since we only need to compile the inner loop rather than all input/output values)

Can we see how much this improves performance? Then we can look into providing a cleaner/simpler API for this pattern.

@leonardt
Owner

leonardt commented Nov 3, 2020

Looking at https://github.com/Kuree/kratos-dpi/blob/master/tests/test_function.py#L18-L28, it does seem quite nifty. I think we could integrate this into fault by making fault aware of these DPI functions and generating the code to call them.

@leonardt
Owner

leonardt commented Nov 3, 2020

e.g. tester.expect(tester.call(<dpi_func>, ...args)); we could add a call action that accepts a DPI function and generates the code to invoke it.

@leonardt
Owner

leonardt commented Nov 3, 2020

Actually, for 2., I notice now that you're using the lassen PE to compute the expected result, which makes the file-based pattern hard unless we add support for compiling lassen to fast C code. So adding a tester.call for the DPI function might be the best way to avoid having to generate everything ahead of time.

@leonardt
Owner

leonardt commented Nov 3, 2020

Also, are the input values being generated ahead of time? (I'm guessing so, since the expect values are.) That might be a related problem: storing all the inputs in memory and scanning through them is inherently less memory-efficient than generating each input, using it to poke the DUT and compute the expected value, then discarding it (you only keep one set of inputs/outputs in memory rather than an entire test set).
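That streaming discipline can be sketched with a generator (the ranges match this issue's sweep; the DUT interaction is elided):

```python
def input_vectors(n=0x500):
    # Lazily yield (a, b) pairs instead of materializing n*n tuples;
    # each pair can be poked, checked, and discarded immediately.
    for a in range(n):
        for b in range(n):
            yield a, b

# Consume one vector at a time: poke the DUT, compute the expected
# value, compare, then let the pair be garbage collected.
first = next(input_vectors(4))
```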

@leonardt
Owner

leonardt commented Nov 3, 2020

Thinking about this pattern at a higher level, though: ideally the functional model should rarely change (perhaps only when obscure bugs are found). So pre-computing the expected output vectors and storing them in a file (even if this is done in Python using lassen) may not be a bad idea. If we expect the RTL to change more often than the functional model, reusing the input/output vectors is smarter than recomputing them on every test run.

Furthermore, if we want to lift these tests to the tile level or similar, we can avoid recomputing the expected outputs by storing them. So we may want to pursue this file-based approach, but abstract it behind a serializable TestVector data structure that fault is aware of (so the user doesn't have to manage the details of the binary file).

@Kuree
Collaborator Author

Kuree commented Nov 3, 2020

I think writing a functional model in Python is much easier than in C, and probably faster to implement than in C++. The prototype I had does the following:

  1. Takes the Python function and embeds it into a C++ function call via pybind
  2. Uses pybind to embed the Python interpreter in the DPI library

You can see the generated C++ code here:
https://github.com/Kuree/kratos-dpi/blob/master/tests/gold/generator_func.cc

A couple of thoughts:

  1. External libraries are supported, but they are looked up via sys.path. We should prepare the path to support any user libraries.
  2. DPI also allows passing pointers, so we could even support class method calls by properly casting the pointer types in the generated wrapper code.

@leonardt
Owner

leonardt commented Nov 3, 2020

Here are the things I think fault needs from kratos to call the compiled DPI function:

  • the DPI function declaration (i.e. import "DPI-C" function shortint test_add(input shortint a, input shortint b);). We could add an API to introspect the kratos object, determine the input/return types, and generate the rest of the declaration.
  • path to the generated DPI code (i.e. generator_func.cc)

It seems like dpi_compile could return an object/dictionary with this information.
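A sketch of what that introspection could emit; the helper name and the (direction, type, name) tuple format are assumptions for illustration, not kratos API:

```python
def dpi_declaration(name, args, ret="shortint"):
    # args: sequence of (direction, sv_type, arg_name) tuples describing
    # the DPI function signature; emits the SV import declaration.
    params = ", ".join(f"{d} {t} {n}" for d, t, n in args)
    return f'import "DPI-C" function {ret} {name}({params});'

decl = dpi_declaration("test_add",
                       [("input", "shortint", "a"),
                        ("input", "shortint", "b")])
```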

@Kuree
Collaborator Author

Kuree commented Nov 3, 2020

Sounds good. That said, that project was an experiment to explore the potential of DPI-based simulation with Python, so it lacks many capabilities.

Here are some enhancements I can think of:

  1. SV DPI supports the chandle argument type, which is essentially a raw C pointer. We can use it to carry a Python C object, which can be converted back to a Python object natively. Doing so allows a functional model to interact seamlessly with the testbench.
  2. Some complex assertion logic could be handled on the Python side as well. This may allow Verilator and SV simulators to share the same assertion backend.

Point 1 may be a stretch goal, especially the Python class part. If we require class objects to be pickle-able, I think the implementation can be straightforward. If you think this is a good idea, I can sketch out a detailed proposal and some prototypes.

Point 2 requires a way to persist Python objects across different DPI calls, which might be resolved by point 1.
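The chandle idea in point 1 can be prototyped in pure Python with ctypes. This sketch relies on the CPython detail that id() returns the object's address and on the real Py_IncRef C-API entry point; it is an illustration of the mechanism, not the proposed implementation:

```python
import ctypes

def to_handle(obj):
    # Hand out the object's address as an opaque integer
    # (the analogue of an SV chandle crossing the DPI boundary).
    ctypes.pythonapi.Py_IncRef(ctypes.py_object(obj))  # keep the object alive
    return id(obj)  # CPython: id() is the object's address

def from_handle(handle):
    # Recover the Python object from the raw address on the other side.
    return ctypes.cast(handle, ctypes.py_object).value
```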

Please let me know what you think.

@leonardt
Owner

leonardt commented Nov 4, 2020

  1. Why does it need to be pickle-able? Is it because you want to pickle the object at testbench generation time, then load it at simulation run time? Or does the object need to be serialized/deserialized during execution?
  2. This seems related to 1, as you said; serializing/deserializing objects might be expensive. Is this because each invocation of a Python DPI function spins up a fresh interpreter instance? In general it should be possible to keep one active interpreter running and dispatch execution to it, but perhaps there's some reason this isn't possible with the current prototype? I've done this in the past with this prototype code: https://github.com/leonardt/pycoreir/blob/master/libcoreir-python/coreir-python.cpp, but it uses the raw C API rather than pybind, so I'm not sure if it's more difficult there. The GIL is a bit tricky to work with when doing this, though.

@Kuree
Collaborator Author

Kuree commented Nov 4, 2020

  1. It's the former. I want to load the object during simulation. Since at that point we have lost the Python objects, we need a way to reconstruct the functional model. A pure functional DPI call doesn't have this problem since it's stateless, but I doubt any complex functional model can be expressed that way.
  2. Each Python DPI call doesn't spin up a fresh interpreter instance, if done carefully. I will take a closer look at pybind, but I think it's very doable. If we can guarantee that only one thread invokes the Python execution, we should be able to work around the GIL.

@leonardt
Owner

leonardt commented Nov 9, 2020

@Kuree, have you had a chance to check whether tester.poke(port, value) avoids the inspect overhead (versus tester.circuit.port = value)? If you send me a pointer to the test file, I can give it a try.

@Kuree
Collaborator Author

Kuree commented Nov 10, 2020

No, I haven't. I was caught up with other work. I will try to run it this week.

@leonardt
Owner

Another use case for this came up: a stateful functional model in Python that is used to verify RTL behavior. This can't be done at compile time, since the model's behavior depends on runtime values (e.g. random ready/valid signals generated at runtime determine the functional model's behavior). What we'd like is to call into the Python functional model at simulation time when an event occurs (e.g. when a ready/valid handshake occurs, call the functional model so it updates its state). We would need persistent state for functional models during the simulation run, though it doesn't necessarily need to be set up and persisted from compile time (although that could be nice).
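The pattern being described might look like the following; the class name, method names, and FIFO behavior are illustrative, not fault API:

```python
class FifoModel:
    # Stateful functional model updated only when a handshake event fires
    # in the simulation.
    def __init__(self):
        self.queue = []

    def on_push(self, data):
        # Called from simulation when input ready & valid are both high.
        self.queue.append(data)

    def on_pop(self):
        # Called when output ready & valid are both high; the returned
        # value is compared against the DUT's output.
        return self.queue.pop(0)

model = FifoModel()
model.on_push(0xA)
model.on_push(0xB)
```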

@Kuree
Collaborator Author

Kuree commented Nov 15, 2020

I think I'm doing something wrong here. Below is the new profiling result after the change:
[profiling screenshot]

The code is in the container keyi-romantic_ptolemy on kiwi. To attach to the container, run docker attach keyi-romantic_ptolemy. Please detach from the container rather than exiting when you're done inspecting.
To run the code, simply do:

python -m cProfile -o profile.stat tests/test_pe.py 

@Kuree
Collaborator Author

Kuree commented Nov 15, 2020

Just following up on the conversation about native support for Python models: I started implementing a new Python-to-DPI library that's designed to be framework-independent. I'm still refactoring the interface to make it easier to use, but the core implementation is there: https://github.com/Kuree/pysv

What is working now:

  • Foreign pip packages such as numpy and tensorflow
  • Python classes

You can see the examples in the tests:

I'm working on generating SV and C++ class wrappers so that users can call functions directly without dealing with mangled names. I'd appreciate it if you could help me with the integration:

@leonardt
Owner

Hmm, it looks like there's some use of inspect that I forgot was added: https://github.com/leonardt/fault/blame/4ea7b511181aa857a5d85862dc93d8a3b59ac9a0/fault/tester/base.py#L152. @sgherbst, was this used as part of your flow? I'll need to dig into where it's used, but I'm guessing we can add a flag to disable it for performance.

@sgherbst
Collaborator

@leonardt that would be totally fine; it's just for debugging purposes. The traceback is mainly helpful for SPICE simulations, where expect actions are implemented in post-processing.

@leonardt
Owner

Makes sense. We can disable it by default and add a "debug" flag to enable it. Also, it looks like it's using inspect.stack to get the calling frame; we can optimize this with inspect.currentframe and the frame's back pointer, which fetches just the frame you want rather than materializing the entire stack as a list.
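The difference is easy to demonstrate: inspect.stack() builds a FrameInfo object for every level of the stack, while walking f_back from the current frame touches only the frames you need. A sketch of the cheaper lookup (not the actual fault patch):

```python
import inspect

def caller_location():
    # Fetch only the caller's frame via the back pointer instead of
    # materializing the entire stack with inspect.stack().
    frame = inspect.currentframe().f_back
    return frame.f_code.co_filename, frame.f_lineno
```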

@sgherbst
Collaborator

Great - it's OK with me if the feature is disabled by default.

@Kuree mentioned this issue Nov 18, 2020
@leonardt
Owner

#288 removes the inspect logic from the default code path. Can you try it out and see whether it improves performance? (It mainly avoids the inspect calls.)

@Kuree
Collaborator Author

Kuree commented Nov 22, 2020

Here is the result after using #288:
[profiling screenshot]
