Observability vs Monitoring
Observability is a measure of how well we can understand and explain any state our system can get into, no matter how weird it is. We must be able to debug that strange state across all dimensions of system state data, and combinations of dimensions, in an ad hoc iterative investigation, without being required to define or predict those debugging needs in advance. If we can understand any bizarre or novel state without needing to ship new code, we have observability.
Observability alone is not the solution to every software engineering problem. But it does help us clearly see what’s happening in all the corners of our software, where we would otherwise be stumbling around in the dark, trying to understand things.
A production software system is observable if we can understand new internal system states without having to make random guesses, predict those failure modes in advance, or ship new code to understand that state.
Why Are Metrics and Monitoring Not Enough?
Monitoring and metrics-based tools were built with certain assumptions about architecture and organization, assumptions that in practice served as a cap on complexity. These assumptions usually stay invisible until we exceed them, at which point they stop being hidden and become the bane of our ability to understand what’s happening. Some of these assumptions might be as follows:
- Our application is a monolith.
- There is one stateful data store (“the database”), which we run.
- Many low-level system metrics are available (e.g., resident memory, CPU load average).
- The application runs on containers, virtual machines (VMs), or bare metal, which we control.
- System metrics and instrumentation metrics are the primary source of information for debugging code.
- We have a fairly static and long-running set of nodes, containers, or hosts to monitor.
- Engineers examine systems for problems only after problems occur.
- Dashboards and telemetry exist to serve the needs of operations engineers.
- Monitoring examines “black-box” applications in much the same way as local applications.
- The focus of monitoring is uptime and failure prevention.
- Examination of correlation occurs across a limited (or small) number of dimensions.
When compared to the reality of modern systems, it becomes clear that traditional monitoring approaches fall short in several ways. The reality of modern systems is as follows:
- The application has many services.
- There is polyglot persistence (i.e., different databases and storage systems).
- Infrastructure is extremely dynamic, with capacity flicking in and out of existence elastically.
- Many far-flung and loosely coupled services are managed, many of which are not directly under our control.
- Engineers actively check to see how changes to production code behave, in order to catch tiny issues early, before they create user impact.
- Automatic instrumentation is insufficient for understanding what is happening in complex systems.
- Software engineers own their own code in production and are incentivized to proactively instrument their code and inspect the performance of new changes as they’re deployed.
- The focus of reliability is on how to tolerate constant and continuous degradation, while building resiliency to user-impacting failures by utilizing constructs like error budget, quality of service, and user experience.
- Examination of correlation occurs across a virtually unlimited number of dimensions.
The last point is important, because it describes the breakdown that occurs between the limits of correlated knowledge that one human can be reasonably expected to think about and the reality of modern system architectures. So many possible dimensions are involved in discovering the underlying correlations behind performance issues that no human brain, and in fact no schema, can possibly contain them.
With observability, comparing high-dimensionality and high-cardinality data becomes a critical component of being able to discover otherwise hidden issues buried in complex system architectures.
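To make that concrete, here is a purely illustrative sketch (the field names are hypothetical and not tied to any particular tool) of a single wide event carrying many dimensions, several of them high-cardinality, which an observability tool can slice and compare across in any combination:

```python
# A hypothetical "wide" event: one structured record per request, with many dimensions.
# The high-cardinality fields (user_id, request_id, build_id) are exactly the ones
# that traditional metrics systems struggle to index but observability tooling depends on.
wide_event = {
    "timestamp": "2024-05-01T12:34:56Z",
    "service": "checkout",
    "endpoint": "/api/cart/checkout",
    "http_status": 500,
    "duration_ms": 1843,
    "user_id": "user-8675309",       # high cardinality
    "request_id": "f3b2c1d0",        # high cardinality
    "build_id": "2024.05.01-rc3",    # high cardinality
    "region": "eu-west-1",
    "platform": "ios",
    "feature_flag.new_pricing": True,
}
```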
Distributed Tracing and Why It Matters
Distributed tracing is a method of tracking the propagation of a single request, called a trace, as it is handled by the various services that make up an application. Tracing in this sense is “distributed” because, to fulfill its function, a single request must often traverse process, machine, and network boundaries.
Traces help us understand system interdependencies. Those interdependencies can obscure problems and make them particularly difficult to debug unless the relationships between services are clearly understood. For example, if a database service experiences a performance bottleneck, that latency stacks up cumulatively. By the time the latency is detected three or four layers upstream, identifying which component of the system is the root of the problem becomes incredibly difficult, because by then the same latency is showing up in dozens of other services.
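The sketch below (service names and timings are invented for illustration) shows how a trace ties those per-service spans together into a tree, so the slow database call at the bottom remains visible from the top-level span:

```python
# A simplified view of one trace: each span records which service did the work,
# its parent span, and how long it took. The 950 ms in the database span accounts
# for most of the latency seen three layers up in the API gateway.
trace_spans = [
    {"span_id": "a1", "parent": None, "service": "api-gateway",   "operation": "GET /checkout", "duration_ms": 1010},
    {"span_id": "b2", "parent": "a1", "service": "checkout-svc",  "operation": "create_order",  "duration_ms": 990},
    {"span_id": "c3", "parent": "b2", "service": "inventory-svc", "operation": "reserve_items", "duration_ms": 970},
    {"span_id": "d4", "parent": "c3", "service": "database",      "operation": "UPDATE stock",  "duration_ms": 950},
]

def slowest_leaf(spans):
    """Walk the trace tree from the root, always following the slowest child."""
    children = {s["span_id"]: [] for s in spans}
    root = None
    for s in spans:
        if s["parent"] is None:
            root = s
        else:
            children[s["parent"]].append(s)
    node = root
    while children[node["span_id"]]:
        node = max(children[node["span_id"]], key=lambda s: s["duration_ms"])
    return node

print(slowest_leaf(trace_spans)["service"])  # -> "database"
```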
Instrumentation with OpenTelemetry
OpenTelemetry is an open-source CNCF (Cloud Native Computing Foundation) project formed from the merger of the OpenCensus and OpenTracing projects. It provides a collection of tools, APIs, and SDKs for capturing metrics, distributed traces, and logs from applications.
With OTel (short for OpenTelemetry), we can instrument our application code only once and send our telemetry data to any backend system of our choice (like Jaeger).
Automatic instrumentation
To minimize the time to first value for users, OTel includes automatic instrumentation. Because OTel’s charter is to ease adoption across the cloud native ecosystem and microservice architectures, it supports the most common frameworks used for interactions between services. For example, OTel automatically generates trace spans for incoming and outgoing gRPC, HTTP, and database/cache calls made by instrumented services. This gives us at least a skeleton of who calls whom in the tangled web of microservices and downstream dependencies.
To implement that automatic instrumentation of request properties and timings, the framework needs to call OTel before and after handling each request. Common frameworks therefore support wrappers, interceptors, or middleware that OTel can hook into to automatically read context-propagation metadata and create spans for each request.
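A minimal sketch of what this looks like, assuming a Python service built on Flask that calls a hypothetical downstream service with the requests library, and with the corresponding opentelemetry-instrumentation packages installed:

```python
from flask import Flask
import requests

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Register a tracer provider once, at start-up.
trace.set_tracer_provider(TracerProvider())

app = Flask(__name__)

# Hook OTel into Flask's middleware and into the requests library: every incoming
# request and outgoing HTTP call now gets a span, with trace context propagated,
# without any changes to the handler code itself.
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

@app.route("/checkout")
def checkout():
    # This outgoing call becomes a child span of the incoming request's span.
    inventory = requests.get("http://inventory-svc/reserve")  # hypothetical downstream service
    return {"status": inventory.status_code}

if __name__ == "__main__":
    app.run(port=8080)
```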
Custom instrumentation
Once we have automatic instrumentation, we have a solid foundation for investing in custom instrumentation specific to our business logic. We can attach fields and rich values, such as user IDs, brands, platforms, and errors, to the auto-instrumented spans inside our code. These annotations make it easier in the future to understand what’s happening at each layer.
By adding custom spans within our application for particularly expensive, time-consuming steps internal to our process, we can go beyond the automatically instrumented spans for outbound calls to dependencies and gain visibility into all areas of our code. This type of custom instrumentation is what helps us practice observability-driven development, where we create instrumentation alongside new features so that we can verify, in real time as the code is released, that it operates in production as we expect.
Adding custom instrumentation to our code helps us work proactively to make future problems easier to debug, by providing full context, including business logic, around a particular code execution path.
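Here is a minimal sketch of both techniques in Python (the attribute names and tracer name are illustrative choices, not prescribed by OTel): enrich the current auto-instrumented span with business context, and wrap an expensive internal step in its own custom span.

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # hypothetical instrumentation scope name

def handle_checkout(user_id, platform, cart):
    # Enrich the span created by automatic instrumentation with business context.
    span = trace.get_current_span()
    span.set_attribute("app.user_id", user_id)
    span.set_attribute("app.platform", platform)
    span.set_attribute("app.cart_size", len(cart))

    # Wrap a particularly expensive internal step in its own custom span.
    with tracer.start_as_current_span("price_cart") as price_span:
        total = sum(item["price"] * item["quantity"] for item in cart)
        price_span.set_attribute("app.cart_total", total)

    return total
```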
Exporting telemetry data to a backend system
After creating telemetry data by using the preceding methods, we’ll want to send it somewhere. OTel supports two primary methods for exporting data from our process to an analysis backend: we can proxy it through the OpenTelemetry Collector, or we can export it directly from our process to the backend.
Exporting directly from our process requires us to import, depend on, and instantiate one or more exporters. Exporters are libraries that translate OTel’s in-memory span and metric objects into the appropriate format for various telemetry analysis tools.
Exporters are instantiated once, on program start-up, usually in the main function.
Typically, we’ll need to emit telemetry to only one specific backend. However, OTel allows us to arbitrarily instantiate and configure many exporters, allowing our system to emit the same telemetry to more than one telemetry sink at the same time. One possible use case for exporting to multiple telemetry sinks might be to ensure uninterrupted access to our current production observability tool, while using the same telemetry data to test the capabilities of a different observability tool we’re evaluating.
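A hedged sketch of the direct-export path in Python, assuming the OTLP exporter package is installed and using a placeholder endpoint: the exporters are instantiated once in main and attached to the tracer provider, and adding a second span processor sends the same telemetry to a second sink (here just the console, standing in for a tool under evaluation).

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

def main():
    provider = TracerProvider(
        resource=Resource.create({"service.name": "checkout-service"})  # hypothetical service name
    )

    # Primary sink: our current production observability tool, reached over OTLP.
    # The endpoint is a placeholder; it could point at a Collector or at the backend itself.
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-backend:4317", insecure=True))
    )

    # Second sink: the same telemetry, exported to another destination in parallel.
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

    # Register the provider once, at start-up, before any spans are created.
    trace.set_tracer_provider(provider)

if __name__ == "__main__":
    main()
```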
Conclusion
Monitoring is best suited to evaluating the health of our systems. Observability is best suited to evaluating the health of our software.
OTel is an open source, vendor-neutral standard that lets us instrument our application once and send its telemetry data to any number of backend data stores we choose, regardless of which observability system we ultimately adopt.