mcount: Change TLS type to initial-exec #1606

ParkHanbum · 2022-12-18T20:51:07Z

-. Benefit
changed the model of the TLS variable mtd to initial-exec. The initial-exec model doesn't ocurred problem because it is not make any calls to refer the TLS. the way to refer to variables in the intial-exec is adding or subtracting offsets from static TLS blocks as follows.

mov 0x2d6a6 (% rip),% rax # 35fc8 <.got + 0x38>
mov% fs: (% rax),% rbx

as a rough bench result, the initial-exec type TLS variable can be expected to increase the reference speed by about x2 times or more than the TLS variable using the POSIX pthread.

In addition, the initial-exec type TLS variable could be expected to improve the reference speed by about 20% compared to the dynamic type TLS variable created when using a general __thread.

-. Limitation
the initial-exec TLS model have limitation. TLS will allocated by loader and the initial-exec TLS block allocate statically. after process already running and the initial-exec allocated, it can be that there is not enough space in initial-exec static TLS block to allocate mtd. typical execution environment no need to worry about.

--. Limitation mitigation
but to prepare future update, there is need to reduce size of tls_mtd to minimized. added new two TLS variable pointer mtd_tls and bool mcount_recursion_marker.

allocate struct mcount_thread_data to heap. and manage it by *tls_mtd which declared as TLS variable. and mcount_recursion_marker has same role with mcount_recursion_marker that inside of struct mcount_thread_data.

-.. reference
For more information on this, please see the following link: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt

reference document written in Korean :
https://die4taoam.tistory.com/37

233 items excluding standard deviations from approximately 240 samples.

Reduced execution time by 8.98 ms on average.
About 7.2% performance improvement

the above benchmark was conducted with the code that exists in review/bench.

-. Benefit changed the model of the TLS variable mtd to initial-exec. The initial-exec model doesn't ocurred problem because it is not make any calls to refer the TLS. the way to refer to variables in the intial-exec is adding or subtracting offsets from static TLS blocks as follows. mov 0x2d6a6 (% rip),% rax # 35fc8 <.got + 0x38> mov% fs: (% rax),% rbx as a rough bench result, the initial-exec type TLS variable can be expected to increase the reference speed by about x2 times or more than the TLS variable using the POSIX pthread. In addition, the initial-exec type TLS variable could be expected to improve the reference speed by about 20% compared to the dynamic type TLS variable created when using a general __thread. -. Limitation the initial-exec TLS model have limitation. TLS will allocated by loader and the initial-exec TLS block allocate statically. after process already running and the initial-exec allocated, it can be that there is not enough space in initial-exec static TLS block to allocate mtd. typical execution environment no need to worry about. --. Limitation mitigation but to prepare future update, there is need to reduce size of `tls_mtd` to minimized. added new two TLS variable pointer `mtd_tls` and bool `mcount_recursion_marker`. allocate `struct mcount_thread_data` to heap. and manage it by `*tls_mtd` which declared as TLS variable. and `mcount_recursion_marker` has same role with `mcount_recursion_marker` that inside of `struct mcount_thread_data`. -.. reference For more information on this, please see the following link: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt reference document written in Korean : https://die4taoam.tistory.com/37 Signed-off-by: Hanbum Park <[email protected]>

namhyung

But in my benchmark, it doesn't give a noticeable performance benefit.

Base:

# uftrace bench
    Self avg    Self min  Function
  ==========  ==========  ====================
    0.089 us    0.072 us  bar
   45.807 us    3.584 us  linux:schedule (pre-empted)
    0.030 us    0.025 us  baz
    0.030 us    0.026 us  foo

Your change:

# uftrace bench
    Self avg    Self min  Function
  ==========  ==========  ====================
    0.091 us    0.073 us  bar
   46.826 us    3.639 us  linux:schedule (pre-empted)
    0.029 us    0.025 us  foo
    0.029 us    0.024 us  baz

namhyung · 2022-12-18T22:22:06Z

I've just merged the simple benchmark framework. Please use make bench on a real machine (hopefully in idle state) as it would take care of things like cpufreq to get more reliable results.

ParkHanbum · 2022-12-19T13:29:10Z

ok I'll check.

@namhyung
I forgot to write that the above result is the test result of my modified code.
I tested it by modifying the bench-code.

diff --git a/misc/bench.c b/misc/bench.c
index b6041342..323e653a 100644
--- a/misc/bench.c
+++ b/misc/bench.c
@@ -22,10 +22,8 @@ int bench(int count)
        volatile int result = 0;
 
        for (int i = 0; i < count; i++) {
-               if (i % 2 == 0)
-                       foo(&result);
-               else
-                       bar(&result);
+               foo(&result);
+               bar(&result);
        }
        return result;
 }
@@ -36,11 +34,6 @@ int main(int argc, char *argv[])
        int loop = 1000000;
        int result = 0;
 
-       if (argc > 1)
-               n = atoi(argv[1]);
-       if (argc > 2)
-               loop = atoi(argv[2]);
-
        for (int i = 0; i < n; i++)
                result += bench(loop);

the reason for this test is that sometimes the execution results vary greatly. because of this, I modified it to exclude some test results.

ParkHanbum · 2022-12-19T16:16:14Z

and weird result

root@f039c186f545:/uftrace# ./uftrace record --libmcount-path=. misc/bench;./uftrace report;
  Total time   Self time       Calls  Function
  ==========  ==========  ==========  ====================
  155.624 ms    0.167 us           1  main
  155.624 ms   59.497 ms           1  bench
   76.973 ms   57.184 ms      500000  bar
   19.789 ms   19.789 ms      441893  baz
   19.153 ms   19.153 ms      420017  foo
    0.459 us    0.459 us           1  __monstartup
    0.042 us    0.042 us           1  __cxa_atexit
root@f039c186f545:/uftrace# vi misc/bench.c 
root@f039c186f545:/uftrace# ./uftrace record --libmcount-path=. misc/bench;./uftrace report;
  Total time   Self time       Calls  Function
  ==========  ==========  ==========  ====================
  160.097 ms    0.166 us           1  main
  160.097 ms   63.551 ms           1  bench
   77.389 ms   57.533 ms      500000  bar
   19.856 ms   19.856 ms      435272  baz
   19.156 ms   19.156 ms      415368  foo
    0.416 us    0.416 us           1  __monstartup
    0.083 us    0.083 us           1  __cxa_atexit

as you can see, some counting is missed while calling bar and foo.
do you think that reason of this weird test result is because it was not tested on a real machine?

namhyung · 2022-12-20T05:28:18Z

Hmm.. strange. But I don't see it on my machine.

ParkHanbum · 2022-12-20T08:47:43Z

New test results notified!

First of all, the loss of counts didn't happen when I tested it on a real machine, but I'm still not sure why.

The new test was tested on ubuntu 22.04 on a real machine.
Here are the test results.

351 items within 1 std of standard deviation from approximately 1000 samples.

Reduced execution time by 5.04 ms on average.
About 3.3% performance improvement

Points to note. As you can see in the image, the difference between when the standard deviation is applied and when it is not is large.

namhyung · 2022-12-27T04:47:49Z

Did you run it with misc/bench.sh (in other words, did you change the cpufreq governor to performance?) Please share what the script shows otherwise I have no idea what the number means.

namhyung

I'm really sorry for the late review.

You need to split the change as it contains a lot of changes. Regarding the limitation, I think we can start from the simple implementation and then add complexities gradually.

namhyung · 2023-03-07T02:54:38Z

libmcount/mcount.c

@@ -763,10 +775,15 @@ static void mcount_init_file(void)
 	sigaction(SIGSEGV, &sa, &old_sigact[1]);
 }

+struct mcount_thread_data *mcount_thread_data_alloc(void)
+{
+	return get_thread_data() = xzalloc(sizeof(struct mcount_thread_data));


This is weird. Why not just set mtdp directly?

namhyung · 2023-03-07T02:56:09Z

libmcount/internal.h

 #else
 #define TLS __thread
-#define get_thread_data() pthread_getspecific(mtd_key)
-#define check_thread_data(mtdp) (mtdp == NULL)
+#define TLS_ATTR __attribute__((tls_model("initial-exec")))


Maybe we can simply set

#define TLS __thread __attribute__((tls_model("initial-exec")))

namhyung · 2023-03-07T02:57:17Z

libmcount/mcount.c

@@ -783,15 +800,16 @@ struct mcount_thread_data *mcount_prepare(void)

 	compiler_barrier();

+	if (!mtdp)
+		mtdp = mcount_thread_data_alloc();


Hmm.. you did it here. :)

namhyung · 2023-03-07T03:11:25Z

libmcount/mcount.c


 	pthread_once(&once_control, mcount_init_file);
 	prepare_shmem_buffer(mtdp);

-	pthread_setspecific(mtd_key, mtdp);


With this change, mtd_dtor() won't get called when a thread exits. We rely on this to synchronize thread status with uftrace process and release the resources. So even if we don't use the TSD APIs anymore, we should keep it for this reason.

namhyung reviewed Dec 18, 2022

View reviewed changes

namhyung mentioned this pull request Mar 7, 2023

unittest: Fix c99 compiler error #1642

Merged

namhyung reviewed Mar 7, 2023

View reviewed changes

ParkHanbum mentioned this pull request Feb 6, 2025

[suggest] feature for tracing to already running process #1414

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mcount: Change TLS type to initial-exec #1606

mcount: Change TLS type to initial-exec #1606

ParkHanbum commented Dec 18, 2022

namhyung left a comment

namhyung commented Dec 18, 2022 •

edited

Loading

ParkHanbum commented Dec 19, 2022 •

edited

Loading

ParkHanbum commented Dec 19, 2022 •

edited

Loading

namhyung commented Dec 20, 2022

ParkHanbum commented Dec 20, 2022

namhyung commented Dec 27, 2022

namhyung left a comment

namhyung Mar 7, 2023

namhyung Mar 7, 2023

namhyung Mar 7, 2023

namhyung Mar 7, 2023

mcount: Change TLS type to initial-exec #1606

Are you sure you want to change the base?

mcount: Change TLS type to initial-exec #1606

Conversation

ParkHanbum commented Dec 18, 2022

namhyung left a comment

Choose a reason for hiding this comment

namhyung commented Dec 18, 2022 • edited Loading

ParkHanbum commented Dec 19, 2022 • edited Loading

ParkHanbum commented Dec 19, 2022 • edited Loading

namhyung commented Dec 20, 2022

ParkHanbum commented Dec 20, 2022

namhyung commented Dec 27, 2022

namhyung left a comment

Choose a reason for hiding this comment

namhyung Mar 7, 2023

Choose a reason for hiding this comment

namhyung Mar 7, 2023

Choose a reason for hiding this comment

namhyung Mar 7, 2023

Choose a reason for hiding this comment

namhyung Mar 7, 2023

Choose a reason for hiding this comment

namhyung commented Dec 18, 2022 •

edited

Loading

ParkHanbum commented Dec 19, 2022 •

edited

Loading

ParkHanbum commented Dec 19, 2022 •

edited

Loading