-
-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add alloc shims #191
base: master
Are you sure you want to change the base?
Add alloc shims #191
Conversation
There is also another PR #189, that tries to solve the problem. It also uses the standard approach to overriding malloc with LD_PRELOAD (providing its own dummy implementation), and that is OK, that is standard approach. However, it is based on false assumptions that it is enough to provide it during coz initialization (which might not always be true), and that memory allocated during initialization will never be deallocted later. On the other hand, not all allocations during initialization need to be allocated using dummy implementation. It is also falsely assumed that there is no bug in the implementation of malloc shim, and that no program that used to get profiled with coz before, could now crash because of the provided shim. Also, it has larger performance impact due to if condition in malloc. |
How to review this PR?The change is not that big as it seems. The implementation of the shim logic itself, is contained in single file with 200 lines of code, with big portion of it being comments actually. However, it might be intimidating to accept such change, as although there is not much code, the logic is rather tricky and it has to deal with concurrency, which is always error prone. First, take a look at other files (not Then take a look at What I have tested?It is reasonable to assume that there is some situation I haven't thought about, or failed to get it right. I've made all the reasonable efforts to make sure this works even for programs that don't require it and usually will not load this library. I've tested it on couple of programs using tcmalloc, partition alloc from google, jemalloc, and also with standard malloc with preloading shim library. I've made all the assumptions I could think of, about situations that I haven't encountered, but which could happen in production. I've spent days, tweaking and adjusting code, to make it as safe, simple, and reliable as possible. I've tested |
@ccurtsinger Hello. Since you're assigned to the related bug, and since it seems to be a common problem, could you please take a look? At least so that I know it will be considered. Thanks. |
Hello, @kormang! I could not deny that your approach solves issue. And that it is more precise and safe for issue that bothers both of us. I'd be just ok to have your solution in COZ. Though engineer' curiosity will not let ignore critique of my own PR :) It's not for arguing but pure interest. It's been few weeks since I revisited it last time but I'll try:
|
Please take a look at https://github.com/emeryberger/Heap-Layers/blob/master/wrappers/heapredirect.h to see if we can use that instead (I'd prefer to use an existing, time-tested shim). |
@VGubarev Hello, I'll share my thoughts with you in the comments in your PR. @emeryberger Thanks for suggestion. I've only reacted with thumbs up on your comments when I've read it, didn't have time for much more than that. :) I'll try today to figure out what it is, and how to use it, and get back to you. |
@emeryberger I have looked into https://github.com/emeryberger/Heap-Layers/blob/master/wrappers/heapredirect.h. But first, I have read all 3 papers related to heap layers. I find it very interesting, and elegant. Few years ago, I implemented my own allocation algorithm for extremely resource constrained embedded system. It was huge win in terms of performance and fragmentation. But it ended up having obvious layers for different allocations sizes and different strategies, but still being monolithic piece code. I wish I knew about heap layers back then. Then I continued to study source code of heap layers, and I started from example of usage in The most important thing I've learned from the source code is that we can just redefine So let me explain why we only need code from Creating a wrapper around allocation functions is not our goal at all (it was a way of solving another problem). Code from So, I'd like to copy some pieces of |
Well, unfortunately it does not work. I completely forgot that we need One thing I could do, is, simplify code by using thread local, like in One reason, why I don't switch to using this approach with thread local and atomic pointer, now, is because I don't think I'll gain anything. I've already spent a lot of time testing current solution, I'll have to do it all over again. However I will try to simplify my code a little bit, and use different naming where possible, to make it clear that it is in fact really simple. |
I think it is easier to comprehend now. Any way, my reasoning is like this: From my own testing experience the code works well. And I think I have pretty representative cases, because I've tested it on programs with 4 different allocation algorithms, and I have already encountered almost all other bugs in the issue list, and few that haven't been reported yet. Next, I'd like to move on to next bug that prevents me from effectively use coz - which is reporting lines that were never accessed (if it is even possible to solve it). ..but I'm still open for suggestion, the most important thing is to get this done and make everybody happy. |
The problem
Some programs that require more optimized dynamic memory allocation, especially in multithread environments, use alternative implementations of malloc, such as tcmalloc or jemalloc. They replace system malloc with alternative implementation.
Calls to dlsym (from resolve_some_symbol functions in
real.cpp
, in particular) sometimes require dynamic memory allocation (specifically, by callingcalloc
).Alternative malloc implementations call some of the functions wrapped in
real.cpp
, which results in calls to dlsym, which ends up with either stack overflow (infinite recursion), or deadlock.The solution
Provide shim, yet another implementation of malloc, that is preloaded as as separate library from libcoz. That makes it optional. It is perfectly reasonable to assume that there are certain conditions, which are difficult to predict in advance, under which such shim would actually break things instead of improving them. Programs that don't need it, and don't want risk of breaking something because of the shim itself, can simply not load it. Programs that use jemalloc, or tcmalloc, and can not be profiled with coz without this shim. Such programs can preload this library, and benefit from it.
Shim should call real malloc most of the time, and with as little overhead as possible.
Shim should call its own malloc implementation, only while resolving real symbols using libdl, only from the thread that actually resolves a symbol.
Implementation details
Shims are located in alloc_shims.cpp, together with two instrumentation function (
coz_lock_and_set_dummy_alloc_shims
andcoz_restore_real_alloc_shims_and_unlock
), they end up being compiled into separate library -libcozallocshims.so
.The two instrumentation functions are weakly referenced from
libcoz
, and whenlibcozallocshims
is loaded, they are called before and after calls todlsym
inresolve_some_symbol
functions in real.cpp. Whenlibcozallocshims
is not loaded, they are nullptrs inlibcoz
, and they are not called, so there is no effect on running program except for comparing these functions to nullptrs, which happens very limited number of times during runtime of the program.Also, when
libcozallocshims
is not loaded, malloc and calloc it provides are not loaded either, so there is no effect on the program being profiled, they use real malloc directly.By default shims are not loaded, to preload this library user should pass
--with-alloc-shims
command line switch to coz python script, which will add it to LD_PRELOAD.Library tries to find real malloc when either program calls malloc for the first time (which results in
first_malloc
function being called), or when first time symbol is resolved in real.cpp (which ever happens first).Library insures that its own malloc is called only while resolving symbols with
dlsym
inreal.cpp
, and only by that thread. Other threads during that time should end up calling real malloc. To simplify concurrent code, the two instrumentation functions use spinlock, to ensure that only one thread resolves symbols inreal.cpp
at a time. But this doesn't happen very often and usually there should never be congestion there, we just want to make sure.Because this alternative malloc is used only for
dlsym
calls during symbol resolution inreal.cpp
, it can be really simple, and it doesn't have to use large memory pools. That is why it is calleddummy_malloc
.