Distributed tracing support in Openwhisk #2192
Hi, thank you very much for working on this, sounds like an awesome idea. One important historic bit though: We've kinda built our own tracing using our logs (see the markers we attach to our logs). Those are parseable by, for example, Elasticsearch. It would be nice to integrate with this and generate as little code overhead as possible, as we've been very cautious with instrumenting our code that way. Other than that: Looking forward to your pull-request. 👍 |
I did experiment with a client-side wrapper for actions to enable zipkin support: It does work but there are performance issues. This has also been discussed on the mailing list. |
cc @adriancole |
Damn, small world @jthomas. One option would be to consider a slightly more general-purpose context propagation layer and deploy Zipkin on top of that, see (work in progress / under submission): https://github.com/JonathanMace/tracingplane |
@JonathanMace nice to see you! Here or elsewhere we should revisit brave-tracingplane. I have some examples of swapping its in-process propagation innards here: https://github.com/openzipkin/brave/blob/master/brave/src/test/java/brave/features/log4j2_context/Log4JThreadContextTest.java. Otherwise, the most recent scala-brave stuff is here (though this is not to presume what is the best fit for this project): https://github.com/bizreach/play-zipkin-tracing. Mainly, the underlying tracer api is better now, so it's worth a look. |
FYI I bumped the existing akka project in case they are interested in updating to the latest and greatest: levkhomich/akka-tracing#95 |
Hey @JonathanMace - good to hear from you. Long time no see :) |
Thanks everyone for the feedback on this. @adriancole, can I use https://github.com/openzipkin/zipkin-reporter-java for span propagation via Kafka? Something like: a Producer (an Akka-based actor) starts span sampling and sends spans over Kafka to a Consumer (an Akka-based actor); the Consumer then creates a span with the Producer's span as its parent. |
Hi, Sandeep.
Reporting is out-of-band, and I think what you are asking about is in-band propagation.
Propagating across Kafka is a bit hairy, as John Mace will tell you. We do have folks doing it, but there is no standard, as the choices tend to be not great (abuse the message keys or coordinate an envelope).
https://gist.github.com/adriancole/76d94054b77e3be338bd75424ca8ba30
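To make the envelope option concrete, here is a minimal sketch (not from the thread; the `TraceContext`/`TracedEnvelope` names and the JSON layout are illustrative assumptions) of wrapping an activation message with B3-style trace fields before it goes onto Kafka, using spray-json:

```scala
import spray.json._
import spray.json.DefaultJsonProtocol._

// Hypothetical trace context carried in-band with the message body.
case class TraceContext(traceId: String, spanId: String, parentId: Option[String], sampled: Boolean)

// Hypothetical envelope: the original payload plus the trace context.
case class TracedEnvelope(trace: TraceContext, payload: String)

object TracedEnvelope {
  implicit val traceFormat: RootJsonFormat[TraceContext] = jsonFormat4(TraceContext.apply)
  implicit val envelopeFormat: RootJsonFormat[TracedEnvelope] = jsonFormat2(TracedEnvelope.apply)

  // Serialize before producing to Kafka; the consumer parses it back and
  // uses `trace` as the parent when it starts its own span.
  def encode(env: TracedEnvelope): String = env.toJson.compactPrint
  def decode(s: String): TracedEnvelope = s.parseJson.convertTo[TracedEnvelope]
}
```

The drawback mentioned above still applies: every producer and consumer on the topic has to agree on the envelope, which is why the record-headers approach discussed later in the thread is attractive.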
|
It looks like many people in this thread are big fans of Zipkin. However, one more option could be Pinpoint (https://github.com/naver/pinpoint), which is also based on Google's Dapper. Since the core OW components run on the JVM, Pinpoint could be a good option, I think.
AFAIK, the main differences between Zipkin and Pinpoint are as follows: Zipkin requires code changes on core components. It keeps a "context" throughout all call stacks, which means we have to add that "context" argument to every call, so the tracing code becomes tightly coupled with the core logic. Pinpoint, on the other hand, uses "bytecode instrumentation", so it does not require any code changes on core modules; it intervenes to change the bytecode at class loading time.
[image] <https://cloud.githubusercontent.com/assets/3447251/25522874/8026c21a-2c3f-11e7-85d8-f453af3e0fe7.png>
This brings several advantages. First, it hides the tracing API from core modules: developers of core modules do not need to care about the tracing code, and the tracing code stays decoupled from the core logic. Second, it is easy to enable/disable the tracing. We can enable/disable the Pinpoint tracing logic just by adding/removing JVM startup options:
-javaagent:$AGENT_PATH/pinpoint-bootstrap-$VERSION.jar
-Dpinpoint.agentId=<Agent's UniqueId>
-Dpinpoint.applicationName=<The name indicating a same service (AgentId collection)>
Many languages do not support bytecode instrumentation, but OW is written in Scala and runs on the JVM, so it can take advantage of it. Pinpoint provides plugins for many Java libraries such as httpclient, jetty, log4j, logback, thrift, cassandra, gson and so on, but for Scala libraries we would need to develop plugins ourselves. Even so, Pinpoint is still a good option, I believe.
You can refer to "The Value of Bytecode Instrumentation" for more details: https://github.com/naver/pinpoint/wiki/Technical-Overview-Of-Pinpoint#the-value-of-bytecode-instrumentation
ps. I have not caught up with Zipkin for a few months, so if I am wrong, kindly correct me :) |
I like Pinpoint, and Hyun and folks are earnest and welcome OSS folks.
I would say that there's a false dichotomy, though. There's nothing that says Zipkin instrumentation must not use bytecode instrumentation. For example, we have tossed around some sort of integration between Pinpoint and Zipkin, and there is already progress from stagemonitor, which also does bytecode instrumentation.
Regardless, it's a fair and useful chat to see if OpenWhisk prefers to do bytecode instrumentation and/or whether Pinpoint is a better choice for that or other reasons. That's a call folks here should make.
Thanks for asking the hard questions!
|
@adriancole I just thought that even though Zipkin has many advantages, OW is running on the JVM, so it would be great to take advantage of bytecode instrumentation in Pinpoint. If you share your experience on integration, or on making bytecode instrumentation available in Zipkin, it would help folks in this thread figure out which is better for OW. |
> In which parts could Pinpoint and Zipkin be integrated? Could you share the details?

Hyun and I spoke about trying to get their collector to emit to Zipkin as an alternative. Since the Pinpoint model is richer than Zipkin's, it should be possible. We haven't sat down and tried, yet.

> I just thought that even though Zipkin has many advantages, OW is running on the JVM, so it would be great to take advantage of bytecode instrumentation in Pinpoint.

By running your own JVM, you have abilities beyond normal. For example, normally you can install whatever tracing you want when a platform is initialized. What you are hinting at is that there may be some third-party code you are running and not otherwise able to configure. Can you enumerate concretely what is otherwise unconfigurable that makes you want to primarily use bytecode? This will keep the discussion from wandering into hypotheticals.

> If the effort to make Zipkin use bytecode instrumentation is less than the effort to make a Pinpoint plugin, there is no reason not to use Zipkin.

> If you share your experience on integration, or on making bytecode instrumentation available in Zipkin, it would help folks in this thread figure out which is better for OW.

Presuming bytecode instrumentation is a requirement, you probably would need to see ByteBuddy code examples using Brave or similar libraries to instrument things, right?
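As a point of reference for this trade-off, here is a minimal sketch (my own illustration, not code from the thread; it assumes the Brave and zipkin-reporter libraries, whose builder methods vary a bit across versions) of the explicit, non-agent style of instrumentation that Zipkin users typically write:

```scala
import brave.Tracing
import zipkin2.reporter.AsyncReporter
import zipkin2.reporter.urlconnection.URLConnectionSender

object ExplicitTracing {
  // Assumed Zipkin endpoint; a real deployment would make this configurable.
  private val sender = URLConnectionSender.create("http://localhost:9411/api/v2/spans")
  private val reporter = AsyncReporter.create(sender)

  val tracing: Tracing = Tracing.newBuilder()
    .localServiceName("controller")
    .spanReporter(reporter)
    .build()

  // The "code change on core modules" the thread is talking about:
  // explicitly wrap a unit of work in a span.
  def traced[T](name: String)(block: => T): T = {
    val span = tracing.tracer().nextSpan().name(name).start()
    try {
      block
    } finally {
      span.finish()
    }
  }
}
```

An agent-based setup (Pinpoint, or ByteBuddy driving Brave) would instead inject equivalent calls at class-loading time, which is exactly the coupling-versus-transparency trade-off being debated here.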
|
FWIW, here's one example of the bytecode instrumentation approach (stagemonitor, which uses Brave): https://github.com/stagemonitor/boot-zipkin
|
If we can get the bytecode instrumentation feature of Pinpoint along with Zipkin's rich Scala library support, that would be great.
Regarding my preference for bytecode, it's about code changes rather than configurable components. I am not as much of an expert on distributed tracing as you, so if I am wrong, kindly correct me. To introduce distributed tracing, we need a few procedures. With bytecode instrumentation, those procedures are done by the framework: it inserts the logic automatically at class loading time, and ideally no code changes on core modules are required. But for libraries, if there is no available plugin, we may need to implement a new one to apply bytecode instrumentation to the library code. So if we can use Pinpoint along with Zipkin, that would be the best. But since at first I had no idea about integrating the two frameworks, I preferred Pinpoint. |
If you are unable to control the libraries OpenWhisk uses, and there is an agent that somehow does, it probably sounds like you should do what you prefer, which is to use an agent. Just know that you don't live in isolation: make sure whatever your agent uses for propagation can interop with others. Either that, or mention to all consumers that they too need to use the same agent.
Most frameworks are aware of the libraries they use and don't then need to rely on bytecode instrumentation. Black-box instrumentation is usually done when frameworks haven't chosen a path for tracing. Frameworks that employ instrumentation directly can easily unit test their tracing code and guarantee things like remote in and out are traced.
This is actually the first conversation I have had with a framework which is in control of its code, yet preferred to rely on agents to do tracing. Your call of course, so go with whatever you like, knowing the pros and cons!
|
FYI all, @Xylus is the person on Pinpoint I've chatted with most. He's pretty aware of the differences between it and Zipkin, too, at least at a high level. Although subject to hands available, I know both of us are happy to facilitate things that can make interop smoother. Regardless, it would be cool to let him know if you end up using Pinpoint, and I'm sure he'd answer any questions you have. Best wishes! |
The question that I have, if we are to use Pinpoint, is: can we track one activation through controller -> kafka -> invoker and back? |
In either approach, you'd need to either hijack the message key or wrap the body to propagate the trace through Kafka. Brave folks use the former sometimes.
There's also this, which will make propagation very easy, likely for either approach, but which Kafka version it goes into is unknown:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-82+-+Add+Record+Headers
<https://github.com/openzipkin/brave/pull/url>
P.S. Is your use of Kafka SPSC (one producer sends a message and only one consumer gets it)?
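For reference, here is a rough sketch (my own illustration, assuming the record-headers API that KIP-82 eventually added in Kafka 0.11+, B3 header names, and a hypothetical `withTraceHeaders` helper) of propagating the trace context without touching the key or the body:

```scala
import java.nio.charset.StandardCharsets.UTF_8
import org.apache.kafka.clients.producer.ProducerRecord

object TraceHeaders {
  // Hypothetical helper: attach B3 trace fields as record headers so the
  // consumer can continue the trace without unwrapping the payload.
  def withTraceHeaders(topic: String, key: String, value: String,
                       traceId: String, spanId: String): ProducerRecord[String, String] = {
    val record = new ProducerRecord[String, String](topic, key, value)
    record.headers().add("X-B3-TraceId", traceId.getBytes(UTF_8))
    record.headers().add("X-B3-SpanId", spanId.getBytes(UTF_8))
    record.headers().add("X-B3-Sampled", "1".getBytes(UTF_8))
    record
  }
}
```

Until such headers are available, the key-hijack or envelope options described above remain the only places to carry the context.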
|
What's the desired goal? Is it the tracing of individual transactions through the system?
If you look through the current system logs you can see we are already doing this. Log messages are all timestamped and some carry special markers which are deltas since a previous marker. You can trace how long an activation request, for example, spent in the database query vs Kafka vs the invoker, etc.
Does it make sense to tie into the existing instrumentation vs adding new instrumentation?
|
I assume you're talking about @rabbah is this in the same line with what you were thinking ? |
I think the key thing is that distributed tracing adds causality (vs. correlation, which could be off due to clock skew, etc.). Here are some details on the usual suspects:
https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing
|
@adriancole ❤️ your slide deck! Thanks for sharing! @sandeep-paliwal it looks like the TransactionId class has |
@rabbah JFYI, many tracing systems provide a rich UI with additional information such as JVM heap size, PermGen, CPU usage and so on. Since all this information, including the transaction trace, can be seen together, it is easy to find the problem and see what was happening in the system at that time. As @adriancole shared, we can also populate tracing information from logs. Generally there might not be a big difference if we manually deploy a system which collects all the logs and a few required metrics and creates a combined view of them. JFYI, the following screenshots are what Pinpoint provides. |
I think a good analogy for this is that some APMs are tracing systems and some tracing systems are APMs, but not all tracing systems are also APMs. Pinpoint aims to replace things like New Relic, and has a vastly larger feature set than distributed tracing. I actually intend to make a deck about this point, as it is also something usually confused!
|
Hence the reason for my question. If the goal is to trace a transaction through the system, we have a mechanism for that. As was suggested, you can build an adaptor to the existing logger (one or more) and consume these in your favorite stack. The metrics you describe (heap, cpu, etc.) serve a different purpose. I've used, and built, tools for this. And you won't get that from the logs of course. But for the latter I would rather we made it pluggable so that developers and operators can deploy their preferred tools and we don't have to pick and choose one.
|
You have a means to stain logs with transaction ids, but that won't give causality, right? It won't be able to model any parent-child relationships as they occur. Any system doing this would need to have a way to indicate parent/child relationships. So, I would say you have a correlation system in place, but not tracing. If that's what was meant by this issue, then maybe close it?
|
We shouldn't close the issue, because it's not clear from the discussion that there's agreement on the goal or desired outcome. My point is that we should not pick one platform or another. Instead, we should make it possible to use any of the tools described here, or others, because, for example, an organization might already have a standard platform and policies, and we shouldn't force a particular choice. For operators of the platform, there are many metrics that are useful to generate, monitor, and establish alerts for. Often, tracing a transaction through the system (which I've described above) serves a different purpose from the causal analysis I think you're alluding to. |
In situations like these, where there isn't a clear goal or direction from the project itself, I usually defer to what end users ask for and see if you can make that possible. Arbitrary portability isn't likely to predict what users want.
E.g. if users are clamoring for X and Y, see how to make those work together. If no one is clamoring except people who are not end users, wait until that is not the case.
My 2p
|
Thanks @ddragosd and @rabbah. I am working along the same lines as suggested in previous comments, to integrate tracing with the existing logging. I have made progress in getting the trace working with Zipkin in the context of a given action invocation. Now on to getting the markers used in logging to work with tracing and, in general, fitting the tracing changes into the existing logging. |
Wow, very cool to see this coming along. |
looking awesome @sandeep-paliwal. Besides seeing it in action I'm looking forward to seeing how you've managed to set up a depth of 3 with child spans from the |
Hi, |
@adriancole I'd be interested to get your thoughts on whether tracing should be used in prod (with sampling) vs non-prod environments. |
Tracing is for troubleshooting production, but it can also be used for non-prod :)
Typical concerns are the volume of trace data, which implies a sampling policy.
|
Enables tracing support via Zipkin and OpenTracing. It can be enabled via config:

tracing {
  zipkin {
    url = "http://localhost:9411"   // url to connect to the zipkin server
    sample-rate = "0.01"            // decides whether a request is sampled; sample 1% of requests by default
  }
}

Tracing enables tracking of a request from Controller to Invoker.
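As a usage note, here is a minimal sketch (assuming the Typesafe Config library and the key names shown above; the actual implementation may read its settings differently, e.g. via pureconfig) of picking those values up at startup:

```scala
import com.typesafe.config.ConfigFactory

object TracingConfig {
  private val conf = ConfigFactory.load()

  // Matches the keys in the snippet above. A real loader would guard with
  // hasPath or provide defaults, since tracing is optional.
  val zipkinUrl: String  = conf.getString("tracing.zipkin.url")
  val sampleRate: Double = conf.getString("tracing.zipkin.sample-rate").toDouble
}
```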
The PR is now merged 🎉 |
I am creating this enhancement to discuss/add distributed tracing support to the OpenWhisk project.
This will help gather latency data for various components (controller, invoker, etc.) in various usage contexts (e.g. action invocation). This data can help troubleshoot performance bottlenecks.
One option is to use Zipkin - http://zipkin.io/
There is already an instrumented library which supports the Akka and Spray frameworks - https://github.com/levkhomich/akka-tracing
I have done some initial work around this and I can contribute it back once it's complete.
It would be nice to get thoughts/concerns/suggestions around this.