Extremely low performance for large test vectors #280
For 2., are you using the syntax … Points 1 and 3 seem related. I think if we can do the expect computation in the test bench, we can avoid the overhead of the Python implementation. While calling Python code in the simulation is probably useful in general, in this case we could probably compile the expect logic to the target. One simple example is to generate the input vectors into a binary file, run a C program to compute the expected output file (this could use the fast native implementation and avoid Python overhead), then use the fault file I/O (like we do in TBG) to simply load the input/expected test vectors into memory and check them. It would be interesting if we could wrap this pattern into a simple Python API. So, two things to try:
Can we see how much this improves performance? Then we can look into providing a cleaner/simpler API for this pattern.
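To make the file-based idea above concrete, here is a minimal sketch of the flow; the compute_expected binary and the file names are hypothetical placeholders, and the last step assumes fault's file-I/O actions as used in TBG:

```python
import subprocess
import numpy as np

# 1. Generate the input vectors ahead of time and dump them to a binary file.
inputs = np.random.randint(0, 2**16, size=1_000_000, dtype=np.uint16)
inputs.tofile("inputs.bin")

# 2. Run a fast native program that reads inputs.bin and writes expected.bin
#    (hypothetical helper -- this is where the C implementation of the
#    functional model would live, avoiding all Python overhead).
subprocess.run(["./compute_expected", "inputs.bin", "expected.bin"], check=True)

# 3. Optional sanity check on the Python side.
expected = np.fromfile("expected.bin", dtype=np.uint16)
assert expected.size == inputs.size

# 4. In the testbench, use fault's file-I/O actions (as in TBG) to stream
#    inputs.bin into the DUT and compare against expected.bin, so no
#    per-vector Python computation happens at testbench generation time.
```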
Looking at https://github.com/Kuree/kratos-dpi/blob/master/tests/test_function.py#L18-L28, it does seem quite nifty. I think we could simply integrate this into fault by having …
e.g. …
Actually, for 2., I noticed now that you're using the …
Also, are input values being generated ahead of time? (I'm guessing so, since expect values are generated ahead of time.) That might be a related problem: having to store all the inputs in memory and scan through them will be inherently less memory efficient than generating the inputs, using them to poke the DUT and compute the expected value, and then discarding them (we only need to keep one set of inputs/outputs in memory, versus scanning through an entire test set).
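For illustration, a minimal sketch of the streaming style, assuming a fault-style Tester with poke/eval/expect; MyCircuit, num_vectors, and model are placeholders:

```python
import random
import fault

tester = fault.Tester(MyCircuit)  # MyCircuit is a placeholder DUT
for _ in range(num_vectors):
    # Generate one input, use it, then let it go; nothing requires holding
    # the full test set in a Python list up front.
    a = random.getrandbits(16)
    b = random.getrandbits(16)
    tester.poke(MyCircuit.a, a)
    tester.poke(MyCircuit.b, b)
    tester.eval()
    tester.expect(MyCircuit.O, model(a, b))  # expected value computed on the fly
```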
Thinking about this pattern at a higher level, though, it seems like ideally the functional model should not change (maybe every so often when obscure bugs are found). So, pre-computing the expected output vectors and storing them in a file (even if this is done in Python using lassen) may not be a bad idea. If we expect the RTL to change more often than the functional model, reusing the input/output vectors might be smarter than re-computing them every time we run the test. Furthermore, if we want to lift these tests to the tile level or something, we can avoid recomputing the expected outputs if we just store them. So, we may want to consider this file-based approach, but perhaps we can abstract it using a TestVector data structure that is serializable and that fault is aware of (so the user doesn't have to manage the details of handling the binary file).
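A rough sketch of what such a serializable container could look like, assuming numpy arrays for the vectors; TestVectors here is hypothetical, not an existing fault data structure:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TestVectors:
    inputs: np.ndarray
    expected: np.ndarray

    def save(self, path):
        # Single compressed file holding both input and expected vectors.
        np.savez_compressed(path, inputs=self.inputs, expected=self.expected)

    @classmethod
    def load(cls, path):
        data = np.load(path)
        return cls(inputs=data["inputs"], expected=data["expected"])

# Compute once with the (slow) Python functional model, then reuse across RTL
# changes:
#   TestVectors(inputs, expected).save("pe_vectors.npz")
#   vectors = TestVectors.load("pe_vectors.npz")
```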
I think writing a functional model in Python is way easier than in C, and probably faster to implement than in C++. The prototype I had does the following:
You can see the generated C++ code here: … A couple of thoughts:
Here are the things I think fault needs from kratos to call the compiled DPI function:
Seems like …
Sounds good. That being said, that project was an experiment to see the potential of DPI-based simulation with Python, so it lacks many capabilities. Here are some enhancements I can think of:
Point 1 may be a stretch goal, especially the Python class part. If we require class objects to be pickle-able, I think the implementation can be straightforward. If you think this is a good idea, I can sketch out a detailed proposal and some prototypes. Point 2 requires a solution for persisting Python objects across different DPI calls, which might be resolved by point 1. Please let me know what you think.
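One common way to get persistent Python objects across DPI calls (sketching point 2, and leaning on the pickle-able classes from point 1) is to keep a module-level registry and only pass integer handles through DPI; the function names below are hypothetical:

```python
import pickle

_registry = {}
_next_handle = 0

def create_model(config_blob: bytes) -> int:
    """Called once from the testbench; the SV side stores the returned handle."""
    global _next_handle
    model = pickle.loads(config_blob)  # pickle-able model object (point 1)
    handle = _next_handle
    _registry[handle] = model
    _next_handle += 1
    return handle

def model_step(handle: int, value: int) -> int:
    """Called on every DPI invocation; state lives in the registry between calls."""
    return _registry[handle].step(value)
```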
@Kuree, have you had a chance to try whether using …
No, I haven't. I was caught up with other work. I will try to run it this week.
Another use case for this came up: a stateful functional model in Python that is used to verify the behavior of the RTL. This can't be done at compile time, since the behavior of the model depends on runtime values (e.g. random ready/valid signals generated at runtime will determine the functional model's behavior). What we'd like to do is be able to call into a Python functional model at simulation time when an event occurs (e.g. when a ready/valid handshake occurs, call the functional model so it updates its state). We would need persistent state for the functional model during the simulation run, but it doesn't necessarily need to be set up/persistent from compile time (although maybe that would be nice).
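For example, a stateful FIFO-style model might look like the sketch below; FifoModel and the on_cycle hook are hypothetical, the point being that the model's state persists across simulation-time calls:

```python
class FifoModel:
    """Golden model whose state is updated only when a handshake fires."""
    def __init__(self, depth):
        self.depth = depth
        self.data = []

    def on_enq(self, value):
        assert len(self.data) < self.depth, "enqueue on full FIFO"
        self.data.append(value)

    def on_deq(self):
        assert self.data, "dequeue on empty FIFO"
        return self.data.pop(0)

model = FifoModel(depth=4)

def on_cycle(enq_valid, enq_ready, enq_data, deq_valid, deq_ready, deq_data):
    # Called from the simulator (e.g. via DPI) every cycle with sampled signals.
    if enq_valid and enq_ready:
        model.on_enq(enq_data)
    if deq_valid and deq_ready:
        assert deq_data == model.on_deq(), "RTL/model mismatch on dequeue"
```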
Just following up on the conversation about native support for Python models: I started to implement a new Python-to-DPI library that's designed to be framework-independent. I'm still trying to refactor the interface to make it easier to use, but the core implementation is there: https://github.com/Kuree/pysv What is working now:
You can see examples in the tests:
I'm working on generating SV and C++ class wrappers to allow users to use it directly without dealing with mangled function names. I'd appreciate it if you could help me with the integration:
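For reference, the decorator-style usage looks roughly like the sketch below; the names (sv, DataType, compile_lib, generate_sv_binding) are based on my reading of the pysv examples and may not match the current API exactly:

```python
from pysv import sv, DataType, compile_lib, generate_sv_binding

@sv(a=DataType.Int, b=DataType.Int, return_type=DataType.Int)
def gold_add(a, b):
    # Plain Python functional model, callable from SV through DPI.
    return a + b

# Compile the shared library and emit the SV bindings for the simulator.
lib_path = compile_lib([gold_add], cwd="build")
generate_sv_binding([gold_add], filename="pysv_pkg.sv")
```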
Hmm, looks like there's some use of …
@leonardt that would be totally fine; it's just for debugging purposes. The traceback is mainly helpful for SPICE simulations, where …
Makes sense; maybe we can disable it by default and add a "debug" flag to enable it? Also, it looks like it's using …
Great - it's OK with me if the feature is disabled by default.
#288 removes the inspect logic from the default code path; can you try it out and see if it improves performance (it mainly avoids the inspect calls)?
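For context, a rough sketch of the kind of change being discussed (gating the traceback collection behind a debug flag, with a cheaper frame lookup when only the filename/line are needed); the debug attribute is hypothetical:

```python
import inspect
import sys

def record_call_site(self):
    if self.debug:
        # Full stack capture: handy for SPICE debugging, but expensive when
        # executed once per poke/expect action.
        self.traceback = inspect.stack()
    else:
        # sys._getframe is far cheaper if only the caller's filename and line
        # number are needed.
        frame = sys._getframe(1)
        self.file_name = frame.f_code.co_filename
        self.line_no = frame.f_lineno
```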
Here is the result after using #288:
When I did an exhaustive test on floating point, I noticed a performance issue with the fault-generated testbench. The test vector count is 0x500 * 0x500 = 0x190000, roughly 1.6 million data points. Here is the profiling result:
The entire run takes 22k seconds, about 6 hours. As a comparison, a complete exhaustive test using SystemVerilog + DPI takes only 10 hours to finish (4 billion test points), so the performance gap is orders of magnitude (~2000x per test point).
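For reference, the per-vector throughput implied by these numbers (with the exact 0x190000 vector count the gap works out to roughly 1500x, in the same ballpark as the ~2000x quoted above):

```python
fault_vectors = 0x500 * 0x500     # 1,638,400 test points
fault_seconds = 22_000            # ~6 hours total runtime
sv_dpi_vectors = 4_000_000_000    # full exhaustive sweep
sv_dpi_seconds = 10 * 3600        # 10 hours

fault_per_vec = fault_seconds / fault_vectors    # ~13 ms per vector
sv_per_vec = sv_dpi_seconds / sv_dpi_vectors     # ~9 us per vector
print(fault_per_vec / sv_per_vec)                # ~1500x slower per vector
```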
Notice that the RTL simulation alone takes about 12k seconds to run. This is due to the sheer size of the generated testbench code, which is 408 MB.
I think there are several places where it can be improved:
1. hwtypes computation. The native C implementation is about 100x faster. I looked at the actual float implementation and it is very inefficient: it converts to and from the native gmp object using strings. I think a native conversion is available? (See the sketch at the end of this comment.)
2. Getting frame information, in particular the filename and line number. I had the same performance issue before and was able to resolve it by implementing the same logic with the native Python C API. It is included in the kratos package. I can give you a pointer on how to use it, if you're interested. On my benchmark it is about 100-500x faster than doing the same thing in Python.

The real question is whether the inefficient testbench is an intrinsic problem in fault. My understanding is that all the expect values have to be known at compile time, which makes the testbench unscalable when there are lots of test vectors. One way to solve that is to allow the simulator to call Python code during simulation. I have a prototype working: https://github.com/Kuree/kratos-dpi/blob/master/tests/test_function.py#L18-L28
It works with both Verilator and commercial simulators.
Please let me know what you think.
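Regarding point 1, here is a quick way to check the string-vs-native conversion cost, assuming the float type is backed by gmpy2's mpfr as described above (the actual hwtypes internals may differ):

```python
import timeit
import gmpy2

x = 1.2345678

# Round-tripping through a string, as the current implementation appears to do.
via_string = timeit.timeit(lambda: gmpy2.mpfr(str(x)), number=100_000)
# gmpy2.mpfr() accepts Python floats directly, so the string step can be skipped.
native = timeit.timeit(lambda: gmpy2.mpfr(x), number=100_000)

print(f"via str: {via_string:.3f}s  native: {native:.3f}s")
```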