Allow users to mark blocking/unblocking points around direct system calls #44
The difficulty here is that system calls are often made via inline assembly (`int 0x80`), so LD_PRELOAD interception isn't sufficient. We'd probably need to use ptrace, or otherwise require the program being profiled to annotate these manually-implemented synchronization primitives, using something similar to how the COZ_PROGRESS macro calls into the libcoz runtime.
I put some hacky code here: https://gist.github.com/toddlipcon/761fa7f8bd9e91f8a8dd
I'm not excited about paying that cost, though. A long-term solution may be to use perf's trace events, which can count context switches and thread blocking, regardless of the cause of the blocking/unblocking event.
Another option might be for Coz to expose a mechanism that would let programmers indicate that certain functions correspond to blocking or unblocking points.
@emeryberger: Yeah, that's roughly what the patch above does.
@toddlipcon: Your implementation seems like it should be okay. Do you have a simple example where coz seems to do the wrong thing with your extra macros?
I got a chance to look at this again today (thanks for pinging this issue). Looking at the profile results, it looks like coz is just picking the same experiment over and over again, to the point that it basically never explores any interesting parts of the code. In particular, I'm profiling a benchmark program that looks like the following code:
The implementation of 'MakeRPCCall' essentially delegates work to another thread (a libev event loop) and then blocks on a mutex/condvar until the call comes back. The issue seems to be that, in the steady state, most of my threads are blocked in this stack trace waiting for a response, so the task-clock-based profile collection is extremely likely to pick this line of code for the experiment:
(rtest.proxy.cc:84 is the last line within my source code for sending an RPC). It seems almost as if the perf events are only getting collected from the "client" thread and not the "server" threads, which are in the same process. Any ideas?
I just noticed that if I don't pass '-s %%/src/kudu/%%', I get much better spreading of experiments across other files. Maybe something is wrong with the way experiment lines are being filtered.
When you omit the `-s` scope filter, coz considers samples anywhere in your code. If coz gets a sample that's not in the given source scope, it walks back up the stack to find the last callsite that is in scope (if any). I'm guessing your application's runtime is dominated by computation that is invoked (indirectly) from the callsite where your hotspot is.
The fix for #57 should resolve your second issue, once it's done. Could you submit the change in your gist as a pull request? |
Are the pre-block/post-block annotations needed for epoll too? Or is epoll understood properly, so this only applies to condvars and futexes? I have a complicated multiprocess system (processes spawned via forks of a main process that is always idle and uninteresting), and I'm trying to see whether coz will be a good fit for figuring out why some complicated RPC between components is slow.
My application uses futex to implement a spinlock. This is fairly common in high-performance server code (e.g., gperftools uses futexes for spinlocks, as do libraries like Facebook's folly).
For better accuracy, libcoz should intercept the futex syscall and treat it similarly to how condition variables are treated.