mcount: Change TLS type to initial-exec #1606
-. Benefit
Changed the TLS model of the variable mtd to initial-exec. The initial-exec model does not cause problems because it makes no calls to access the TLS variable. In the initial-exec model, a variable is accessed by adding or subtracting an offset from the static TLS block, as follows:
mov 0x2d6a6(%rip),%rax # 35fc8 <.got+0x38>
mov %fs:(%rax),%rbx
As a rough benchmark result, an initial-exec TLS variable can be expected to be accessed about 2x faster or more than a thread-specific variable using the POSIX pthread API. In addition, it could be expected to be accessed about 20% faster than the dynamic-model TLS variable created with a plain __thread.
-. Limitation
The initial-exec TLS model has a limitation. TLS is allocated by the loader, and the initial-exec TLS block is allocated statically. Once the process is running and the static TLS block has been allocated, there may not be enough space left in it to allocate mtd. In a typical execution environment there is no need to worry about this.
--. Limitation mitigation
However, to prepare for future updates, the size of `tls_mtd` needs to be minimized. Added two new TLS variables: a pointer `mtd_tls` and a bool `mcount_recursion_marker`. `struct mcount_thread_data` is allocated on the heap and managed through `*tls_mtd`, which is declared as a TLS variable. `mcount_recursion_marker` plays the same role as the `mcount_recursion_marker` inside `struct mcount_thread_data`.
-.. reference
For more information, see the following link: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt
Reference document written in Korean: https://die4taoam.tistory.com/37
Signed-off-by: Hanbum Park <[email protected]>
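The mitigation described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `thread_data_sketch`, `tls_data_sketch`, `recursion_marker_sketch` and `thread_data_get_sketch()` are hypothetical names, and the sketch assumes GCC or Clang on an ELF target where the `tls_model` attribute is available. Only a pointer and a flag live in the static TLS block; the large structure goes on the heap.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for uftrace's struct mcount_thread_data. */
struct thread_data_sketch {
	int depth;
	char buf[4096]; /* large payload that should not live in static TLS */
};

/* Only a pointer and a flag occupy the static TLS block. */
static __thread struct thread_data_sketch *tls_data_sketch
	__attribute__((tls_model("initial-exec")));
static __thread bool recursion_marker_sketch
	__attribute__((tls_model("initial-exec")));

/*
 * Return this thread's data, allocating it on first use.
 * Returns NULL on re-entry during allocation or when calloc() fails.
 */
struct thread_data_sketch *thread_data_get_sketch(void)
{
	if (recursion_marker_sketch)
		return NULL; /* re-entered while allocating, e.g. from a traced calloc */

	if (tls_data_sketch == NULL) {
		recursion_marker_sketch = true;
		tls_data_sketch = calloc(1, sizeof(*tls_data_sketch));
		recursion_marker_sketch = false;
	}
	return tls_data_sketch;
}
```

The separate marker is needed because the guard inside the heap structure cannot be consulted before the structure exists.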
But in my benchmark, it doesn't give a noticeable performance benefit.
Base:
# uftrace bench
Self avg Self min Function
========== ========== ====================
0.089 us 0.072 us bar
45.807 us 3.584 us linux:schedule (pre-empted)
0.030 us 0.025 us baz
0.030 us 0.026 us foo
Your change:
# uftrace bench
Self avg Self min Function
========== ========== ====================
0.091 us 0.073 us bar
46.826 us 3.639 us linux:schedule (pre-empted)
0.029 us 0.025 us foo
0.029 us 0.024 us baz
I've just merged the simple benchmark framework. Please use |
OK, I'll check. @namhyung
The reason for this test is that the execution results sometimes vary greatly. Because of this, I modified it to exclude some test results.
And a weird result: as you can see, some counts are missed while calling bar and foo.
Hmm.. strange. But I don't see it on my machine.
New test results! First of all, the loss of counts didn't happen when I tested on a real machine, but I'm still not sure why. The new test was run on Ubuntu 22.04 on a real machine. 351 items fell within 1 standard deviation, out of approximately 1000 samples.
A point to note: as you can see in the image, the difference between the results with and without the standard-deviation filter is large.
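For readers who want to reproduce this kind of filtering, a minimal sketch follows. It is not the script used for the numbers above; `filter_within_one_std()` is a hypothetical helper, and it assumes the filter keeps samples within one population standard deviation of the mean. Squared distances are compared against the variance so no square root is needed.

```c
#include <stddef.h>

/*
 * Copy into out[] the samples that lie within one (population) standard
 * deviation of the mean, and return how many were kept.
 * out[] must have room for n elements.
 */
size_t filter_within_one_std(const double *samples, size_t n, double *out)
{
	double sum = 0.0, var = 0.0, mean;
	size_t i, kept = 0;

	if (n == 0)
		return 0;

	for (i = 0; i < n; i++)
		sum += samples[i];
	mean = sum / n;

	for (i = 0; i < n; i++)
		var += (samples[i] - mean) * (samples[i] - mean);
	var /= n;

	/* |x - mean| <= std  is equivalent to  (x - mean)^2 <= var */
	for (i = 0; i < n; i++) {
		double d = samples[i] - mean;

		if (d * d <= var)
			out[kept++] = samples[i];
	}
	return kept;
}
```

A single large outlier shifts both the mean and the deviation, which is one reason filtered and unfiltered averages can differ a lot.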
Did you run it with |
I'm really sorry for the late review.
You need to split the change as it contains a lot of changes. Regarding the limitation, I think we can start from the simple implementation and then add complexities gradually.
@@ -763,10 +775,15 @@ static void mcount_init_file(void)
	sigaction(SIGSEGV, &sa, &old_sigact[1]);
}

struct mcount_thread_data *mcount_thread_data_alloc(void)
{
	return get_thread_data() = xzalloc(sizeof(struct mcount_thread_data));
This is weird. Why not just set mtdp directly?
#else
#define TLS __thread
#define get_thread_data() pthread_getspecific(mtd_key)
#define check_thread_data(mtdp) (mtdp == NULL)
#define TLS_ATTR __attribute__((tls_model("initial-exec")))
Maybe we can simply set
#define TLS __thread __attribute__((tls_model("initial-exec")))
@@ -783,15 +800,16 @@ struct mcount_thread_data *mcount_prepare(void)

	compiler_barrier();

	if (!mtdp)
		mtdp = mcount_thread_data_alloc();
Hmm.. you did it here. :)
	pthread_once(&once_control, mcount_init_file);
	prepare_shmem_buffer(mtdp);

	pthread_setspecific(mtd_key, mtdp);
With this change, mtd_dtor() won't get called when a thread exits. We rely on it to synchronize thread status with the uftrace process and to release resources. So even if we don't use the TSD APIs for access anymore, we should keep the pthread_setspecific() call for this reason.
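To illustrate the point, here is a minimal sketch of combining a fast `__thread` pointer with a pthread key that exists only so a destructor runs at thread exit. All names ending in `_sketch` are hypothetical and this is not uftrace's actual code; it assumes GCC or Clang (for `__thread`, `tls_model` and the `__atomic` builtins) and a POSIX threads implementation.

```c
#include <pthread.h>
#include <stdlib.h>

struct thread_data_sketch { int in_use; };

static pthread_key_t cleanup_key_sketch;
static pthread_once_t key_once_sketch = PTHREAD_ONCE_INIT;
static int dtor_calls_sketch; /* bumped by exiting threads */

/* Fast path: the data is reached through an initial-exec TLS pointer. */
static __thread struct thread_data_sketch *tls_ptr_sketch
	__attribute__((tls_model("initial-exec")));

/* Runs at thread exit because the key holds a non-NULL value. */
static void thread_data_dtor_sketch(void *arg)
{
	free(arg);
	__atomic_fetch_add(&dtor_calls_sketch, 1, __ATOMIC_SEQ_CST);
}

static void make_key_sketch(void)
{
	pthread_key_create(&cleanup_key_sketch, thread_data_dtor_sketch);
}

struct thread_data_sketch *thread_data_prepare_sketch(void)
{
	if (tls_ptr_sketch == NULL) {
		pthread_once(&key_once_sketch, make_key_sketch);
		tls_ptr_sketch = calloc(1, sizeof(*tls_ptr_sketch));
		/* The key is never read back; it only arms the destructor. */
		if (tls_ptr_sketch != NULL)
			pthread_setspecific(cleanup_key_sketch, tls_ptr_sketch);
	}
	return tls_ptr_sketch;
}

static void *worker_sketch(void *arg)
{
	(void)arg;
	thread_data_prepare_sketch();
	return NULL;
}

/* Start one thread, wait for it, and report how many destructors have run. */
int run_worker_and_count_dtor_calls_sketch(void)
{
	pthread_t th;

	if (pthread_create(&th, NULL, worker_sketch, NULL) != 0)
		return -1;
	pthread_join(th, NULL);
	return __atomic_load_n(&dtor_calls_sketch, __ATOMIC_SEQ_CST);
}
```

Without the `pthread_setspecific()` call the key would hold NULL for the thread and the destructor would be skipped, which is the regression described in the comment above.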
233 items remained after applying the standard-deviation filter to approximately 240 samples.
The above benchmark was conducted with the code that exists in review/bench.