Today, software architectures are increasingly tending towards “micro services”-oriented distributed approaches. The era of the monolithic application with a comprehensive set of permissions is over. Today, we are faced with a multitude of applications, each with their own dedicated and discrete rights. These “micro services” offer greater scalability due to their “loosely coupled” design, better availability, and improved security.
The other side of the coin? More entities operating in parallel in our systems, and consequently increasingly complex systems that are difficult to analyse. One of the challenges of tomorrow’s engineering is observability in complex embedded systems.
Ausy, through its “Embedded Center of Expertise”, is establishing itself as a key player ready to tackle this sizeable challenge, and offers an introduction to the current state of embedded Linux solutions.
1. In practical terms, what is observability?
Observability is the ability to measure the internal states of a system by examining only what it produces. A system is considered to be “observable” if its current state can be estimated using nothing but output information. Observability is a long-established concept rooted in control theory, where it was introduced for the monitoring of self-regulated systems (see Rudolf Kalman: https://en.wikipedia.org/wiki/Observability).
When applied to computing, the observability of a system relies on three kinds of output: metrics, logs and traces.
- Metrics: these make it possible to detect performance anomalies, when values deviate too much from their averages.
- Logs: these reveal WHEN an error occurred. Nevertheless, logs generally do not provide a clear understanding of WHY it occurred. This is even more true in distributed systems, where the logs of several services, containers and processes operating in parallel must be analysed, and where an error can have multiple root causes.
- Traces: these provide an understanding of HOW an error occurred. By placing traces at key points in our system (crossing points), we can follow the life cycle of transactions without significantly disturbing the system’s dynamics. It is then easier to determine whether we are dealing with a “race condition”, a latency problem linked to excessive load, or another kind of problem.
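As an illustration of the “metrics” point above, here is a minimal sketch of deviation-based anomaly detection. The function name, the sample values and the threshold are hypothetical choices made for this example; real monitoring stacks typically use more robust statistics.

```python
from statistics import mean, pstdev

def find_anomalies(samples, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold`
    standard deviations away from the mean of `samples`."""
    mu = mean(samples)
    sigma = pstdev(samples)
    if sigma == 0:
        return []  # all samples identical: nothing deviates
    return [(i, v) for i, v in enumerate(samples)
            if abs(v - mu) / sigma > threshold]

# Hypothetical CPU load samples (in percent) with one obvious spike.
cpu_load = [12, 11, 13, 12, 14, 12, 11, 13, 12, 95]
anomalies = find_anomalies(cpu_load)
```

Such a check tells us that something is wrong at sample 9, but not why or how, which is where logs and traces come in.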
Unlike “logs” and “metrics”, which are generally well known and understood, traces are often less so – because of a lack of knowledge and understanding of existing solutions.
So, in the remainder of this article, we will pay particular attention to them, detailing the various solutions available in Linux, from both the kernel and user space perspective. We will take care to explain how to implement and optimise them. This is to demonstrate how much more effective these solutions are than an “ad-hoc in-house” solution added on top of a logging mechanism.
2. Traces in the Linux kernel
The Linux kernel offers several mechanisms for adding traces:
- Ftrace (the function tracer);
- Tracepoints;
- Kernel probes.
2.1 Ftrace (the Function tracer)
Ftrace is a mechanism for tracing the various functions used in the Linux kernel. It provides an insight into which functions are executed, and when.
Historically, this kernel functionality comes from the “real-time Linux patches”. Several tracers are available:
- function: the default tracer, which displays each called function along with a timestamp.
- function_graph: same functionality as above, but displayed as a function call graph.
- irqsoff, preemptoff, preemptirqsoff, wakeup, wakeup_rt: specialised tracers that measure the latency of different parts of the Linux kernel.
Ftrace works on the same principle as the “mcount” mechanism used by the GCC/Clang profiling options. Each function call and return goes through an “mcount” proxy function, which records which function is called, and when. As we are in kernel space, this mcount function must be optimised to avoid slowing the system down, and it must be possible to activate tracing on demand, only for the functions we want to trace.
To achieve this, Ftrace relies on the following optimisation: for each function call and return, a “5-byte NOP” assembler instruction (on x86) is added. As a reminder, a NOP instruction means “do nothing”. This addition therefore imposes a cost that is close to zero from a performance point of view (whether in CPU execution time or in cache lines). When tracing is activated for a function, this “NOP” instruction is replaced by a “JUMP” instruction, which then executes the “mcount” proxy function. Even when a trace is enabled, its execution time (i.e. the time required to record the trace) must be kept to a minimum, so that it has very little impact on the temporal dynamics of the system we want to measure.
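The NOP/JUMP patching described above can be pictured with a small toy model. This is only a sketch: the class and method names are invented for this example, and a Python `if` stands in for what the kernel achieves by rewriting the instruction bytes themselves (so the real disabled path has no test at all).

```python
class PatchableKernel:
    """Toy model: every traceable function owns a patch site that
    holds either "NOP" (tracing off) or "JUMP" (tracing on)."""

    def __init__(self):
        self.sites = {}       # function name -> "NOP" or "JUMP"
        self.trace_log = []   # stand-in for the trace buffer

    def traceable(self, func):
        """Decorator: register a patch site, initially a NOP."""
        name = func.__name__
        self.sites[name] = "NOP"

        def wrapper(*args, **kwargs):
            # In the kernel this is not a test: the site itself is
            # either a NOP or a jump to the tracer.
            if self.sites[name] == "JUMP":
                self.trace_log.append(name)
            return func(*args, **kwargs)

        wrapper.__name__ = name
        return wrapper

    def enable(self, name):
        self.sites[name] = "JUMP"   # patch NOP -> JUMP

    def disable(self, name):
        self.sites[name] = "NOP"    # patch JUMP -> NOP


kernel = PatchableKernel()

@kernel.traceable
def schedule():
    return "scheduled"

@kernel.traceable
def do_irq():
    return "irq"

schedule()                  # site is a NOP: nothing recorded
kernel.enable("schedule")
schedule()                  # recorded
do_irq()                    # still a NOP: nothing recorded
kernel.disable("schedule")
schedule()                  # back to a NOP: nothing recorded
```

After this sequence, `kernel.trace_log` contains a single entry for `schedule`, which shows that only the sites patched to “JUMP” pay the cost of recording.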
This is a fundamental principle that we learn from physics: observing a physical quantity always introduces a bias. By measuring a quantity, we are implementing a system which will itself disturb the physical quantity that we are trying to measure. The measurement tool can therefore be considered valid only if the disturbance it introduces is of a negligible order of magnitude compared to the measurement itself.
In our case, we are seeking to measure (trace) the temporal behaviour of our system. To achieve this in the least intrusive way possible, a two-step strategy is used:
- First, recording: the traces are saved in kernel memory, in a “per-cpu lockless ring-buffer”, to minimise the time required to save them;
- Then, reading and exporting: the ring-buffers are read and exported by user space for post-mortem analysis. Note: in this case, the time required for the export (display or network export) has no impact, as it is done after the recording session.
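The two-step strategy above can be sketched as follows. This is a simplified model, not the kernel implementation: the names are hypothetical, the buffers are assumed to overwrite their oldest entry when full, and timestamps are assumed to increase monotonically on each CPU.

```python
from collections import deque
import heapq

class PerCpuTraceBuffer:
    """Toy model of per-CPU ring buffers: recording touches only the
    buffer of the current CPU, merging happens later at export time."""

    def __init__(self, nr_cpus, capacity):
        # One bounded buffer per CPU; a full deque drops its oldest entry.
        self.buffers = [deque(maxlen=capacity) for _ in range(nr_cpus)]

    def record(self, cpu, timestamp, event):
        """Fast path: append to the local buffer, no cross-CPU sharing."""
        self.buffers[cpu].append((timestamp, cpu, event))

    def export(self):
        """Slow path, run after the session: merge all buffers by timestamp."""
        return list(heapq.merge(*self.buffers))


trace = PerCpuTraceBuffer(nr_cpus=2, capacity=2)
trace.record(0, 10, "irq_entry")
trace.record(1, 12, "sched_switch")
trace.record(0, 15, "irq_exit")
trace.record(0, 20, "sched_wakeup")   # CPU 0 is full: the event at t=10 is dropped
merged = trace.export()
```

Keeping one buffer per CPU is what makes the recording path cheap: no CPU ever waits for another one, and the ordering work is deferred to the export step.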
From a user point of view, Ftrace is used via the Linux tool of the same name.
Note: Ftrace is not only an API (the one described above), but also a tool that uses this API together with the other technologies presented below.
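In practice, Ftrace is commonly driven through control files exposed by the tracefs filesystem. The sketch below assumes tracefs is mounted at /sys/kernel/tracing (older systems may expose it under /sys/kernel/debug/tracing) and that the script runs with sufficient privileges; the helper names are hypothetical, while `set_ftrace_filter`, `current_tracer`, `tracing_on` and `trace` are the tracefs control files.

```python
from pathlib import Path

# Assumed mount point; adjust to the target system.
TRACEFS = Path("/sys/kernel/tracing")

def write_control(tracefs, name, value):
    """Write a value into one of the tracefs control files."""
    (Path(tracefs) / name).write_text(value)

def trace_functions(tracefs, function_filter, workload):
    """Trace the kernel functions matching `function_filter`
    while `workload()` runs, then return the recorded trace."""
    write_control(tracefs, "set_ftrace_filter", function_filter)
    write_control(tracefs, "current_tracer", "function")
    write_control(tracefs, "tracing_on", "1")
    try:
        workload()
    finally:
        # Stop recording even if the workload raises.
        write_control(tracefs, "tracing_on", "0")
    return (Path(tracefs) / "trace").read_text()

# Example (requires root on a kernel built with the function tracer):
#   import time
#   print(trace_functions(TRACEFS, "vfs_read", lambda: time.sleep(1)))
```

The export step (reading `trace`) only happens once recording has been stopped, in line with the two-step strategy described earlier.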
For more information on Ftrace:
2.2 Tracepoints
Tracepoints are static trace hooks placed in the Linux kernel code that can be activated on demand. They are comparable to “printk” calls that could be switched on and off at runtime.
There are now more than 1,300 different tracepoints, providing the ability to trace the entire operation of the Linux kernel: from memory management, through the scheduler, the network layers and the file system layers.
The cost of running a tracepoint (whether enabled or not) must also be as low as possible, so as not to put an unnecessary burden on the system. The same mechanism is used as for Ftrace, i.e. a simple “5-byte NOP” instruction which is replaced by a JUMP when the tracepoint is enabled. The trace is saved in the same “per-cpu lockless ring-buffer” of the kernel and is then consumed after the fact by user space.
From a user’s perspective, tracepoints are accessible through many tools: Ftrace/Perf (the tool), SystemTap, BPF, LTTng, etc.
Compared to the Ftrace function tracer, which only displays the names of the functions called, a tracepoint can carry additional information, such as the names and values of variables.
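As a sketch of what “activated on demand” looks like from user space, tracepoints can be listed and toggled through tracefs. The helper names below are hypothetical; the `available_events` file and the `events/<subsystem>/<event>/enable` files are part of tracefs, assumed here to be mounted at /sys/kernel/tracing and accessed with sufficient privileges.

```python
from pathlib import Path

def list_tracepoints(tracefs):
    """Return the available tracepoints as 'subsystem:event' strings."""
    text = (Path(tracefs) / "available_events").read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]

def set_tracepoint(tracefs, subsystem, event, enabled):
    """Enable or disable one tracepoint and return the file written."""
    enable_file = Path(tracefs) / "events" / subsystem / event / "enable"
    enable_file.write_text("1" if enabled else "0")
    return enable_file

# Example (requires root):
#   set_tracepoint("/sys/kernel/tracing", "sched", "sched_switch", True)
```

Once a tracepoint such as `sched:sched_switch` is enabled, its records appear in the same trace buffer as the function tracer output.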
For more information on tracepoints:
2.3 Kernel probes (aka Kprobes)
Kprobes is a Linux kernel mechanism that allows you to dynamically add an instrumentation function anywhere in the Linux kernel code. When a Kprobe is registered, it makes a copy of the instruction at the probed address and places a breakpoint there (for example, the int3 assembly instruction on x86).
When a CPU reaches this breakpoint, a CPU exception (CPU trap) is generated; the “exception handler” then calls the associated Kprobe function and, once it has run, executes the original instruction that was copied.
With this mechanism, it is possible to instrument any part of the Linux kernel dynamically, without having to recompile and reload the kernel. Unlike tracepoints, which are static, Kprobes rely on breakpoints/CPU traps. Their impact on the performance of a system is therefore much greater.
Note: Under certain conditions and on certain architectures, the breakpoint can be replaced by a JUMP (kernel config: CONFIG_OPTPROBES).
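The breakpoint mechanism described above can be modelled in a few lines. This is a toy simulation with invented names, in which instructions are plain strings; it only illustrates the “save, replace by a breakpoint, call the handler, run the saved instruction” sequence, not the real kernel code.

```python
BREAKPOINT = "int3"

class ToyCpu:
    """Toy model of a CPU executing a list of instructions."""

    def __init__(self, program):
        self.program = list(program)
        self.probes = {}   # address -> (saved instruction, handler)

    def register_kprobe(self, address, handler):
        """Save the original instruction and put a breakpoint in its place."""
        self.probes[address] = (self.program[address], handler)
        self.program[address] = BREAKPOINT

    def unregister_kprobe(self, address):
        """Restore the original instruction."""
        saved, _ = self.probes.pop(address)
        self.program[address] = saved

    def run(self):
        """Execute the program and return the instructions actually run."""
        executed = []
        for address, instruction in enumerate(self.program):
            if instruction == BREAKPOINT:
                # Trap: call the probe handler, then run the saved instruction.
                saved, handler = self.probes[address]
                handler(address)
                instruction = saved
            executed.append(instruction)
        return executed


hits = []
cpu = ToyCpu(["push rbp", "mov rbp, rsp", "call do_work", "ret"])
cpu.register_kprobe(2, hits.append)
executed = cpu.run()
```

The program behaves exactly as before (`executed` matches the original instruction list), while `hits` records that the probe at address 2 fired; the extra trap on every hit is the source of the overhead mentioned above.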
From a user’s perspective, Kprobes can be enabled through many tools: Ftrace/Perf, SystemTap, BPF, LTTng, etc.
For more information on Kprobes, see:
3. User-space traces
On the user-space side, there are also several mechanisms for adding traces in an application:
- Uprobes (user probes);
- USDT (Userland Statically Defined Tracing);
- lttng-ust (Linux Trace Toolkit Next Generation User-Space Tracer).
3.1 Uprobes (user probes)
Uprobes is the Kprobes equivalent for user space. The principle of operation is identical: the instruction where the probe is inserted is replaced by a breakpoint, which generates a CPU exception (trap). On the kernel side, this exception is handled, the trace function is called, and the trace is saved in the kernel’s ring-buffer (in the same way as for tracepoints) before the program resumes its normal execution in user space.
Uprobes is a powerful mechanism for instrumenting any part of an application’s code, but beware: it carries a significant overhead (use of a breakpoint, and therefore of a CPU exception with a context switch). Depending on what is being measured, this performance impact can be prohibitive. It is also often difficult to know which traces to add, and where, since we do not always know or control all the source code that runs in user space on our systems.
From a user’s perspective, Uprobes can be enabled through many tools: Ftrace/Perf, SystemTap, BPF, LTTng, etc.
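With Ftrace, for instance, a Uprobe is declared by appending a one-line definition to the `uprobe_events` file of tracefs, using the `p:GROUP/EVENT PATH:OFFSET` syntax (`r` instead of `p` for a return probe). The helpers below are a hypothetical sketch; the event name, binary path and offset in the example are placeholders, and the offset of the instruction to probe is assumed to have been looked up beforehand (for example from the binary’s symbol table).

```python
from pathlib import Path

def uprobe_definition(event, binary, offset, group="uprobes", on_return=False):
    """Build a uprobe_events definition line for the given probe."""
    kind = "r" if on_return else "p"   # "r" probes the function return
    return f"{kind}:{group}/{event} {binary}:{offset:#x}"

def add_uprobe(tracefs, definition):
    """Append the definition to uprobe_events (requires root on a real system)."""
    with open(Path(tracefs) / "uprobe_events", "a") as events:
        events.write(definition + "\n")

# Placeholder values, for illustration only.
definition = uprobe_definition("readline_entry", "/bin/bash", 0x4245C0)
```

The resulting line can then be passed to `add_uprobe("/sys/kernel/tracing", definition)`, after which the new event is enabled like any other tracepoint.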
For more information:
3.2 USDT (Userland Statically Defined Tracing)
USDT is a mechanism that originated in Solaris with its DTrace tracing tool. The Linux port was made by SystemTap, which provides a “sys/sdt.h” header to enable USDTs to be added to a C/C++ application. (Note: for interpreted languages, traces can be added via the libstapsdt library.)
Defining a USDT adds two things to the compiled binary:
- a NOP instruction at the trace location;
- a “.note.stapsdt” note (in the ELF binary) describing the trace in question, including where its NOP instruction is located in the code.
As a reminder, notes for a binary in ELF format are not loaded by default when loading an ELF binary into memory. In this case, then, we have no increase in memory consumption when the trace is not enabled.
When the trace is enabled, the information contained in the “.note.stapsdt” note is used to replace the NOP instruction with a breakpoint which, once reached, triggers the trace function (this reuses the Uprobes API mentioned above). Compared to the “Uprobe” mechanism alone, USDTs are simpler to use: rather than having to work out what to instrument via Uprobes, and where, USDTs are tracepoints planned and defined by the developer, which can be activated on demand.
From a user’s perspective, USDTs can be enabled through many tools: Ftrace/Perf, SystemTap, BPF, LTTng, etc.
For more information:
3.3 lttng-ust (Linux Trace Toolkit Next Generation User-Space Tracer)
LTTng traces are based on the principles of Linux tracepoints, but in this case are implemented entirely in user space. The traces are saved in a “per-cpu lockless ring buffer” fully instantiated in user space (created via the “liburcu”, aka Userspace Read-Copy-Update, library). This ring buffer lives in memory that is shared (shm) between the application to be traced and the trace software (LTTng).
This solution is much more efficient than the USDT/Uprobes approach because no “context switch” is required to record a trace. When a trace is disabled, the penalty consists of a simple branch test (an “if” statement optimised with “unlikely”). The overhead is therefore slightly higher than a NOP.
The only usage constraint is the need to “link” your application with the “lttng-ust” library (LGPLv2.1 licence), either statically or dynamically. When tracing in user space, lttng-ust should be the default solution because it is more efficient and has a lower impact. If the constraint of linking your application to the “lttng-ust” library is not acceptable for your project, then the USDT solution may be a good compromise.
For more information:
4. In conclusion
We have just seen the different mechanisms available under Linux for adding observability to our systems. Below is a summary of the different choices:
Some of these technologies have been available for more than 10 years already, and too few projects consider implementing them due to a lack of knowledge. At Ausy, we are convinced of the benefits and utility of these solutions, and we promote them whenever we have the opportunity.
So the next time you are faced with the problem of observability of your software and the need to add more information to it via traces, instead of reinventing the wheel around the syslog API to turn it into a trace system (which it is not!), remember this article and lttng-ust. And feel free to contact Ausy which, through its “Embedded Center Of Expertise” and its state-of-the-art knowledge, will be able to support you and provide you with the best solution.
Special thanks to the blog produced by Brendan Gregg (Senior Performance Engineer at Netflix: https://www.brendangregg.com/blog/index.html) which is a key reference work on this subject.