Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mcount: Change TLS type to initial-exec #1606

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ParkHanbum
Copy link
Contributor

-. Benefit
changed the model of the TLS variable mtd to initial-exec. The initial-exec model doesn't ocurred problem because it is not make any calls to refer the TLS. the way to refer to variables in the intial-exec is adding or subtracting offsets from static TLS blocks as follows.

mov 0x2d6a6 (% rip),% rax # 35fc8 <.got + 0x38>
mov% fs: (% rax),% rbx

as a rough bench result, the initial-exec type TLS variable can be expected to increase the reference speed by about x2 times or more than the TLS variable using the POSIX pthread.

In addition, the initial-exec type TLS variable could be expected to improve the reference speed by about 20% compared to the dynamic type TLS variable created when using a general __thread.

-. Limitation
the initial-exec TLS model have limitation. TLS will allocated by loader and the initial-exec TLS block allocate statically. after process already running and the initial-exec allocated, it can be that there is not enough space in initial-exec static TLS block to allocate mtd. typical execution environment no need to worry about.

--. Limitation mitigation
but to prepare future update, there is need to reduce size of tls_mtd to minimized. added new two TLS variable pointer mtd_tls and bool mcount_recursion_marker.

allocate struct mcount_thread_data to heap. and manage it by *tls_mtd which declared as TLS variable. and mcount_recursion_marker has same role with mcount_recursion_marker that inside of struct mcount_thread_data.

-.. reference
For more information on this, please see the following link: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt

reference document written in Korean :
https://die4taoam.tistory.com/37

스크린샷 2022-12-19 오전 5 10 24

233 items excluding standard deviations from approximately 240 samples.

  • Reduced execution time by 8.98 ms on average.
  • About 7.2% performance improvement

the above benchmark was conducted with the code that exists in review/bench.

-. Benefit
changed the model of the TLS variable mtd to initial-exec.
The initial-exec model doesn't ocurred problem because it is not make
any calls to refer the TLS. the way to refer to variables in the
intial-exec is adding or subtracting offsets from static TLS blocks
as follows.

mov 0x2d6a6 (% rip),% rax # 35fc8 <.got + 0x38>
mov% fs: (% rax),% rbx

as a rough bench result, the initial-exec type TLS variable can be
expected to increase the reference speed by about x2 times or more
than the TLS variable using the POSIX pthread.

In addition, the initial-exec type TLS variable could be expected to
improve the reference speed by about 20% compared to the dynamic type
TLS variable created when using a general __thread.

-. Limitation
the initial-exec TLS model have limitation. TLS will allocated by
loader and the initial-exec TLS block allocate statically. after
process already running and the initial-exec allocated, it can be
that there is not enough space in initial-exec static TLS block to
allocate mtd. typical execution environment no need to worry about.

--. Limitation mitigation
but to prepare future update, there is need to reduce size of
`tls_mtd` to minimized. added new two TLS variable pointer `mtd_tls`
and bool `mcount_recursion_marker`.

allocate `struct mcount_thread_data` to heap. and manage it by
`*tls_mtd` which declared as TLS variable. and `mcount_recursion_marker`
has same role with `mcount_recursion_marker` that inside of
`struct mcount_thread_data`.

-.. reference
For more information on this, please see the following link:
https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt

reference document written in Korean :
https://die4taoam.tistory.com/37

Signed-off-by: Hanbum Park <[email protected]>
Copy link
Owner

@namhyung namhyung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in my benchmark, it doesn't give a noticeable performance benefit.

Base:

# uftrace bench
    Self avg    Self min  Function
  ==========  ==========  ====================
    0.089 us    0.072 us  bar
   45.807 us    3.584 us  linux:schedule (pre-empted)
    0.030 us    0.025 us  baz
    0.030 us    0.026 us  foo

Your change:

# uftrace bench
    Self avg    Self min  Function
  ==========  ==========  ====================
    0.091 us    0.073 us  bar
   46.826 us    3.639 us  linux:schedule (pre-empted)
    0.029 us    0.025 us  foo
    0.029 us    0.024 us  baz

@namhyung
Copy link
Owner

namhyung commented Dec 18, 2022

I've just merged the simple benchmark framework. Please use make bench on a real machine (hopefully in idle state) as it would take care of things like cpufreq to get more reliable results.

@ParkHanbum
Copy link
Contributor Author

ParkHanbum commented Dec 19, 2022

ok I'll check.

@namhyung
I forgot to write that the above result is the test result of my modified code.
I tested it by modifying the bench-code.

diff --git a/misc/bench.c b/misc/bench.c
index b6041342..323e653a 100644
--- a/misc/bench.c
+++ b/misc/bench.c
@@ -22,10 +22,8 @@ int bench(int count)
        volatile int result = 0;
 
        for (int i = 0; i < count; i++) {
-               if (i % 2 == 0)
-                       foo(&result);
-               else
-                       bar(&result);
+               foo(&result);
+               bar(&result);
        }
        return result;
 }
@@ -36,11 +34,6 @@ int main(int argc, char *argv[])
        int loop = 1000000;
        int result = 0;
 
-       if (argc > 1)
-               n = atoi(argv[1]);
-       if (argc > 2)
-               loop = atoi(argv[2]);
-
        for (int i = 0; i < n; i++)
                result += bench(loop);
 

the reason for this test is that sometimes the execution results vary greatly. because of this, I modified it to exclude some test results.

@ParkHanbum
Copy link
Contributor Author

ParkHanbum commented Dec 19, 2022

and weird result

root@f039c186f545:/uftrace# ./uftrace record --libmcount-path=. misc/bench;./uftrace report;
  Total time   Self time       Calls  Function
  ==========  ==========  ==========  ====================
  155.624 ms    0.167 us           1  main
  155.624 ms   59.497 ms           1  bench
   76.973 ms   57.184 ms      500000  bar
   19.789 ms   19.789 ms      441893  baz
   19.153 ms   19.153 ms      420017  foo
    0.459 us    0.459 us           1  __monstartup
    0.042 us    0.042 us           1  __cxa_atexit
root@f039c186f545:/uftrace# vi misc/bench.c 
root@f039c186f545:/uftrace# ./uftrace record --libmcount-path=. misc/bench;./uftrace report;
  Total time   Self time       Calls  Function
  ==========  ==========  ==========  ====================
  160.097 ms    0.166 us           1  main
  160.097 ms   63.551 ms           1  bench
   77.389 ms   57.533 ms      500000  bar
   19.856 ms   19.856 ms      435272  baz
   19.156 ms   19.156 ms      415368  foo
    0.416 us    0.416 us           1  __monstartup
    0.083 us    0.083 us           1  __cxa_atexit

as you can see, some counting is missed while calling bar and foo.
do you think that reason of this weird test result is because it was not tested on a real machine?

@namhyung
Copy link
Owner

Hmm.. strange. But I don't see it on my machine.

@ParkHanbum
Copy link
Contributor Author

New test results notified!

First of all, the loss of counts didn't happen when I tested it on a real machine, but I'm still not sure why.

The new test was tested on ubuntu 22.04 on a real machine.
Here are the test results.

스크린샷 2022-12-20 오후 5 44 15

351 items within 1 std of standard deviation from approximately 1000 samples.

  • Reduced execution time by 5.04 ms on average.
  • About 3.3% performance improvement

Points to note. As you can see in the image, the difference between when the standard deviation is applied and when it is not is large.

@namhyung
Copy link
Owner

Did you run it with misc/bench.sh (in other words, did you change the cpufreq governor to performance?) Please share what the script shows otherwise I have no idea what the number means.

Copy link
Owner

@namhyung namhyung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really sorry for the late review.

You need to split the change as it contains a lot of changes. Regarding the limitation, I think we can start from the simple implementation and then add complexities gradually.

@@ -763,10 +775,15 @@ static void mcount_init_file(void)
sigaction(SIGSEGV, &sa, &old_sigact[1]);
}

struct mcount_thread_data *mcount_thread_data_alloc(void)
{
return get_thread_data() = xzalloc(sizeof(struct mcount_thread_data));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird. Why not just set mtdp directly?

#else
#define TLS __thread
#define get_thread_data() pthread_getspecific(mtd_key)
#define check_thread_data(mtdp) (mtdp == NULL)
#define TLS_ATTR __attribute__((tls_model("initial-exec")))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can simply set

#define TLS  __thread __attribute__((tls_model("initial-exec")))

@@ -783,15 +800,16 @@ struct mcount_thread_data *mcount_prepare(void)

compiler_barrier();

if (!mtdp)
mtdp = mcount_thread_data_alloc();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. you did it here. :)


pthread_once(&once_control, mcount_init_file);
prepare_shmem_buffer(mtdp);

pthread_setspecific(mtd_key, mtdp);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, mtd_dtor() won't get called when a thread exits. We rely on this to synchronize thread status with uftrace process and release the resources. So even if we don't use the TSD APIs anymore, we should keep it for this reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants