Real-time applications are characterized by their ability to execute code within a specific amount of time. To verify that the application is indeed "real-time compliant", it is important to collect and analyze data about the timing of code execution. This is known as tracing, where information and timing about the program execution is logged in a specialized manner such that it can be visualized later.
For real-time purposes, the most important data to collect is the duration of certain blocks of code. This information can then be visualized on a timeline, or analyzed statistically (such as via mean, min, max, and percentile analysis), to experimentally determine whether or not an application conforms to its deadline requirements. This benchmarking/validation step is very important during the development of a real-time application.
To make things more concrete, consider an example where a loop controlling a robot is executing in real time at 1000 Hz. Suppose each iteration of the loop consists of the following three sequential actions:
1. Sense
2. Plan
3. Act
The sum of the durations of these actions must not exceed 1 ms (as the loop runs at 1000 Hz). Accounting for other overheads (such as operating system scheduling latency) and a safety margin, a real-world design constraint may limit this sum to 0.5 ms or less. It is important to collect data from real program executions and verify that the application's timing indeed conforms to the specification.
To accomplish this verification without a proper tracing system, one may be tempted to write code similar to this:
```cpp
void Loop() {
  auto t1 = NowNs();
  Sense();
  auto t2 = NowNs();
  Plan();
  auto t3 = NowNs();
  Act();
  auto t4 = NowNs();

  auto sense_duration = t2 - t1;
  auto plan_duration = t3 - t2;
  auto act_duration = t4 - t3;
  auto loop_duration = t4 - t1;
  CollectTiming(sense_duration, plan_duration, act_duration, loop_duration);
}
```
Since this code is supposed to execute in real time, we have to assume that `CollectTiming` is real-time-compliant and is therefore likely to be slightly complex (how this would be done is left as an exercise for the reader). After the timing information is collected, it must be manually analyzed and visualized (either in spreadsheets or via scripts). This is all very tedious and error prone.
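For intuition only, one hypothetical shape for such a `CollectTiming` backend (this is a sketch under the assumption that statistics are accumulated in place on the hot path and read out elsewhere, not cactus_rt's actual implementation) is a pre-allocated running-statistics struct whose update is allocation-free and O(1):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical running-statistics accumulator. Record() performs no
// allocation and no I/O, so it is plausible to call on the hot path;
// reading and reporting the stats is deferred to a non-real-time thread.
struct DurationStats {
  int64_t count = 0;
  int64_t sum_ns = 0;
  int64_t min_ns = INT64_MAX;
  int64_t max_ns = 0;

  void Record(int64_t duration_ns) {
    ++count;
    sum_ns += duration_ns;
    min_ns = std::min(min_ns, duration_ns);
    max_ns = std::max(max_ns, duration_ns);
  }

  double MeanNs() const {
    return count ? static_cast<double>(sum_ns) / count : 0.0;
  }
};
```

Note that even this sketch cannot produce percentiles without storing all samples, which hints at why a proper tracing system is preferable.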
Fortunately, cactus_rt is built with a real-time-compliant tracing system where you can instead write your code as follows (full example is here):
```cpp
void Sense() {
  auto span = Tracer()->WithSpan("Sense");
  // Your custom code...
}

void Plan() {
  auto span = Tracer()->WithSpan("Plan");
  // Your custom code...
}

void Act() {
  auto span = Tracer()->WithSpan("Act");
  // Your custom code...
}

void Loop() {
  Sense();
  Plan();
  Act();
}
```
It then generates a file that can be imported into an offline web UI (not dependent on a remote server), where an interactive timeline plot like the following can be generated:
From the same web UI, we can find all the loop iterations and order them by descending duration:
You can even use SQL to analyze the data in the same UI. Here is an example of selecting the minimum, average, and maximum duration for the loop iteration:
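As a rough illustration (the span name `Loop` is an assumption here; it depends on the name your code passes to `WithSpan`), Perfetto's trace processor exposes spans in its `slice` table with durations in nanoseconds in the `dur` column, so such a query could look like:

```sql
-- Hypothetical example: min/avg/max duration of all spans named 'Loop'.
SELECT
  MIN(dur) AS min_ns,
  AVG(dur) AS avg_ns,
  MAX(dur) AS max_ns
FROM slice
WHERE name = 'Loop';
```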
See the tracing_example for details.
A proper tracing system requires three components:
- The instrumentation API and the collection of data for the threads being traced;
- The serialization and storage of trace data;
- The visualization and analysis of trace data in terms of "spans" (also known as slices; a span is essentially a named duration covering a block of code).
Since cactus_rt is designed to help write real-time applications, the instrumentation API and the passing of trace information out of the real-time threads must be real-time-compliant.
A number of open-source tracing libraries and frameworks already exist that provide some or all of these capabilities. Several of them were evaluated for inclusion into cactus_rt. Unfortunately, none of them satisfied all the requirements. My opinions on the surveyed technologies are as follows:
- [lttng-ust][lttng-ust] is real-time-compliant as far as I can tell, but its API is heavily based on C macros that are difficult to use. Trace analysis can be done via Trace Compass, but the barrier to entry for that software is relatively high (one basically has to install Eclipse for it to work). Trace Compass is also somewhat complex and difficult to learn.
- [BPF + USDT][usdt] is not real-time-compliant, is relatively difficult to use, and the resulting data is relatively difficult to analyze.
- [Perfetto][perfetto] is not real-time-compliant (as internally it uses a mutex without priority inheritance), but has a reasonable API and an intuitive and easy-to-use web UI at https://ui.perfetto.dev or via offline static files. It also features an impressive SQL system that allows developers to analyze the collected data more efficiently.
Through these investigations, it was determined that Perfetto's trace data storage and analysis systems are reasonably decoupled from its instrumentation API and data collection code, and that Perfetto is the most usable system available. This made it possible to develop a custom, real-time-compliant instrumentation API and data collection library that outputs Perfetto-compatible data files, which can then be loaded into Perfetto's analysis tools. Such a library was developed and integrated into cactus_rt.
The overall idea is simple and similar to how Quill operates; indeed, this design is common whenever data must be passed out of a performance-sensitive thread. It is illustrated in the following diagram:
Each `Thread` holds a `ThreadTracer` object, which internally contains an SPSC queue. Tracing data is sent to it via the `TrackEventInternal` struct. These queues are read by a central, non-real-time thread called the `TraceAggregator`, which transforms the `TrackEventInternal` objects into Perfetto-compatible protobuf messages. These are then passed to `Sink` objects, which can write them to files or send them somewhere else (such as over the network).
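The real-time-critical piece of this design is the SPSC queue, which can be lock-free precisely because there is exactly one producer (the real-time thread) and one consumer (the `TraceAggregator`). A minimal sketch of such a queue follows; the field layout of `TrackEventInternal` and the queue implementation are assumptions for illustration, not cactus_rt's actual definitions:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical event struct; cactus_rt's actual TrackEventInternal has
// different fields.
struct TrackEventInternal {
  const char* name;      // should point to a string literal (no allocation)
  int64_t timestamp_ns;
};

// Minimal single-producer single-consumer ring buffer: the producer only
// writes head_, the consumer only writes tail_, so no locks are needed and
// both operations are wait-free.
template <size_t Capacity>
class SpscQueue {
 public:
  bool TryPush(const TrackEventInternal& ev) {
    const size_t head = head_.load(std::memory_order_relaxed);
    const size_t next = (head + 1) % Capacity;
    if (next == tail_.load(std::memory_order_acquire)) return false;  // full
    buffer_[head] = ev;
    head_.store(next, std::memory_order_release);
    return true;
  }

  std::optional<TrackEventInternal> TryPop() {
    const size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) return std::nullopt;  // empty
    TrackEventInternal ev = buffer_[tail];
    tail_.store((tail + 1) % Capacity, std::memory_order_release);
    return ev;
  }

 private:
  TrackEventInternal buffer_[Capacity];
  std::atomic<size_t> head_{0};
  std::atomic<size_t> tail_{0};
};
```

The fixed `Capacity` is what makes the queue allocation-free on the hot path, at the cost of possible data loss when it fills up (discussed later).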
The tracing system also supports the following:
- Tracing can be dynamically enabled and disabled at the process level via a simple API call.
- Threads can be created and traced at any time during the program execution.
- Threads can be stopped at any time without impacting the tracing system.
- Sinks can be registered at any time and the data written to these sinks only contains trace events emitted after the sink is registered.
To minimize performance overhead, whether tracing is enabled or disabled can be checked by reading a single global atomic boolean variable. This variable controls all traces from all threads within the process.
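Such a check can be as cheap as a single relaxed atomic load (this is a sketch; the flag's actual name, placement, and memory ordering in cactus_rt may differ):

```cpp
#include <atomic>

// Hypothetical process-wide flag; in cactus_rt this would be managed by
// the App on StartTraceSession/StopTraceSession.
std::atomic<bool> tracing_enabled{false};

inline bool IsTracingEnabled() {
  // A relaxed load compiles to a plain read on common architectures, so
  // when tracing is disabled each span costs roughly one load and one
  // branch.
  return tracing_enabled.load(std::memory_order_relaxed);
}
```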
Upon enabling tracing via `App::StartTraceSession`, cactus_rt also creates and starts the `TraceAggregator` thread. Perfetto's file format specification indicates that track descriptor packets must be written before the actual trace event packets. Thus, after starting the `TraceAggregator`, a `ProcessDescriptor` packet is first written, followed by a `ThreadDescriptor` packet for each thread known to the `TraceAggregator`. Then the main loop of the `TraceAggregator` runs, writing track event packets to the sinks.
When tracing is disabled via `App::StopTraceSession`, the tracing-enabled atomic bool is set to false. The system then requests that the `TraceAggregator` thread drain all data from the existing `ThreadTracer`s and stop. Once this is done, the file is closed and the `TraceAggregator`'s state (interned data, sequence states) is reset so that a new session can be started later.
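The drain step can be sketched as follows (the types here are simplified stand-ins for illustration, not cactus_rt's actual classes):

```cpp
#include <deque>
#include <vector>

// Stand-ins for the real types; ThreadTracer's queue would actually be the
// SPSC queue, drained via TryPop.
struct Event {
  int seq;
};
struct ThreadTracer {
  std::deque<Event> queue;
};

// Pop every remaining event from every tracer before closing the file, so
// that spans recorded just before StopTraceSession are not lost.
std::vector<Event> DrainAll(std::vector<ThreadTracer*>& tracers) {
  std::vector<Event> out;
  for (auto* t : tracers) {
    while (!t->queue.empty()) {
      out.push_back(t->queue.front());
      t->queue.pop_front();
    }
  }
  return out;
}
```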
Each `Thread` owns a `ThreadTracer`. However, when a thread starts, it must notify the `TraceAggregator` (if tracing is enabled and the aggregator exists) of its existence and thread ID so a `ThreadDescriptor` packet can be written to the data stream before any trace event data. If tracing is not currently enabled, the `TraceAggregator` caches the `ThreadTracer`s so that once tracing is enabled, the `ThreadDescriptor` packets are written out.
The `Thread` is able to communicate with the `TraceAggregator` by storing a `weak_ptr` to it. This pointer is set up during `App::RegisterThread`, so there is no explicit dependency between `Thread` and `App` during construction. This decision may be revisited in the future.
The structure is not ideal and has some problems, but works for the most part.
TODO...
Perfetto allows event names and event categories to be interned by assigning each unique string an ID and emitting the ID rather than the string, to save space in the output trace files. This feature is implemented via the `interned_data`, `name_iid`, and `category_iids` fields. A few details worth noting:

- Perfetto uses `trusted_packet_sequence_id` to identify a single sequence. A single sequence must have monotonically increasing timestamps for its packets. Both Perfetto upstream and cactus-rt's implementation of the tracer assign one sequence ID per thread. In cactus-rt this is especially important, as we pop data from queues originating from multiple threads, so timestamp monotonicity can only be guaranteed per thread, not across the merged stream.
- The `iid`s on their own are not sufficient to identify an interned string, as they are not "globally" unique, only unique per sequence (i.e. the pair `(trusted_packet_sequence_id, iid)` is sufficient to identify an interned string). This, along with the previous point, implies strings must be interned with a per-thread interner.
- When a tracing session stops, the string interner states for all known thread tracers are reset. This means a subsequent session will not have the same string `iid`s.
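The iid scheme can be illustrated with a small per-sequence interner sketch. Note this is for illustrating the scheme only: the `std::unordered_map` here allocates, so cactus_rt cannot use this structure on the real-time hot path, and the actual interner differs.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// One instance per sequence (i.e. per thread), matching the rule that
// (trusted_packet_sequence_id, iid) identifies an interned string.
class StringInterner {
 public:
  // Returns {iid, first_seen}. When first_seen is true, the caller must
  // emit the string into interned_data; afterwards only the iid is emitted.
  std::pair<uint64_t, bool> GetOrIntern(const std::string& s) {
    auto it = iids_.find(s);
    if (it != iids_.end()) return {it->second, false};
    const uint64_t iid = next_iid_++;
    iids_.emplace(s, iid);
    return {iid, true};
  }

  // Called when a tracing session stops: subsequent sessions restart the
  // iid assignment, so the same string may get a different iid.
  void Reset() {
    iids_.clear();
    next_iid_ = 1;
  }

 private:
  std::unordered_map<std::string, uint64_t> iids_;
  uint64_t next_iid_ = 1;  // 0 conventionally means "no iid"
};
```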
- It seems that trace events are emitted via the `TRACE_EVENT` macro, which starts here with `TRACE_EVENT_BEGIN`.
- This then calls `PERFETTO_INTERNAL_TRACK_EVENT_WITH_METHOD`.
  - This also calls `perfetto_track_event::internal::isDynamicCategory`.
- This then calls `perfetto_track_event::TrackEvent::TraceForCategory`, which eventually calls `WriteTrackEventImpl`, which creates the protobuf packet via protozero.
  - This calls `TrackEventInternal::WriteEvent`,
  - which calls `NewTracePacket`,
  - which calls `TraceWriterBase::NewTracePacket`, which might be `TraceWriterImpl::NewTracePacket`.
From the other side:

- `SharedMemoryArbiterImpl::GetNewChunk` grabs a lock.
  - It is called from `TraceWriterImpl::GetNewBuffer`,
  - which is called from `TraceWriterImpl::NewTracePacket` if the current chunk is too full for the data.
This means there is a small chance that a lock is taken, and an even smaller chance that a priority inversion takes place. It is therefore not suitable for hard real-time applications, and a custom tracing library must be written.
- `WriteTrackEventImpl` calls `TrackEventInternal::WriteEventName`, which calls `InternedEventName::Get`, which is implemented here.
  - In here there is memory allocation, due to `LookupAndInsert` (called from here) using a `std::map`.
  - Also, `GetOrCreateIndex` calls `new`, which likely means another potential allocation on the hot path.
This means Perfetto uses memory allocation on the hot path, which is not acceptable for cactus-rt due to real-time constraints.
One problem with passing tracing data through an SPSC queue is that sometimes the queue can get full. To be real-time-compliant, the SPSC queues within the `ThreadTracer`s are statically allocated and do not expand when full. This means trace data can be lost when the volume of trace events is too large.
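The resulting "drop, don't block" policy can be sketched as follows (a simplified stand-in for the real queue; the real `ThreadTracer`'s bookkeeping differs in detail):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: a fixed-size buffer that discards events when full and counts
// the losses, rather than blocking or allocating on the real-time path.
template <size_t Capacity>
struct BoundedEventBuffer {
  int64_t events[Capacity];
  size_t size = 0;
  int64_t dropped = 0;  // lets users detect and report trace data loss

  bool TryPush(int64_t ev) {
    if (size == Capacity) {
      ++dropped;  // never block, never allocate: real-time safety first
      return false;
    }
    events[size++] = ev;
    return true;
  }
};
```

Exposing a dropped-event counter like this is one way a user could detect that the trace is incomplete and, for example, enlarge the queue capacity.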
TODO...