Root Cause Analysis (RCA) Using Distributed Tracing

Distributed tracing is a method of tracking the propagation of a single request as it’s handled by various services that make up an application. Tracing in that sense is “distributed” because in order to fulfill its function, a single request must often traverse process, machine and network boundaries.

Once we have instrumented our application and exported our telemetry data to an observability backend (like Sumo Logic or New Relic), it's time to use this data to debug our production system efficiently. In this article, we will explore debugging techniques applied to observability data and what separates them from the traditional techniques used to debug production applications.

To learn more about tracing and what it means to instrument an application and export telemetry data, please check this article.
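
As a quick refresher, here is a minimal sketch of what that instrumentation might look like in Python with the OpenTelemetry SDK. The span name and attributes are purely illustrative, and a real setup would swap the console exporter for an OTLP exporter pointed at your backend:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Set up a tracer provider that prints finished spans to stdout.
# In production this would be an OTLP exporter pointed at the backend.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# One unit of work, annotated with high-cardinality attributes
# (illustrative names) that we can slice on later during an incident.
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("user.id", "user-123")
    span.set_attribute("cloud.availability_zone", "us-east-1a")
    # ... handle the request here ...
```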

Before we get into how we can use traces and spans to debug our production applications during incidents, it's important to take a brief look at how we used to do it with logs and metrics: the old way.

Old way of debugging an application using logs and metrics

Prior to distributed tracing, system and application debugging mostly occurred by building upon what you already know about a system. This can be observed in the way the most senior members of an engineering team approach troubleshooting. It can seem magical when they know exactly the right question to ask and instinctively know the right place to look. That magic is born from intimate familiarity with the application.

To pass this magic to other team members, managers usually ask senior engineers to write detailed runbooks in an attempt to identify and solve every possible problem (Root Cause) they might encounter. But that time spent creating runbooks and dashboards is largely wasted, because modern systems rarely fail in precisely the same way twice.

Anyone who has ever written or used a runbook can tell you a story about just how woefully inadequate they are. Perhaps they work to temporarily address technical debt: there’s one recurring issue, and the runbook tells other engineers how to mitigate the problem until the upcoming sprint when it can finally be resolved. But more often, especially with distributed systems, a long thin tail of problems that almost never happen is responsible for cascading failures in production. Or, five seemingly impossible conditions will align just right to create a large-scale service failure in ways that might happen only once every few years.

Yet engineers typically embrace that dynamic as just the way that troubleshooting is done—because that is how the act of debugging has worked for decades. First, you must intimately understand all parts of the system—whether through direct exposure and experience, documentation, or a runbook. Then you look at your dashboards and then you…intuit the answer? Or maybe you make a guess at the root cause, and then start looking through your dashboards for evidence to confirm your guess.

Even after instrumenting your applications to emit observability data, you might still be debugging from known conditions. For example, you could take that stream of arbitrarily wide events and pipe it to tail -f and grep it for known strings, just as troubleshooting is done today with unstructured logs. Or you could take query results and stream them to a series of infinite dashboards, as troubleshooting is done today with metrics. You see a spike on one dashboard, and then you start flipping through dozens of other dashboards, visually pattern-matching for other similar shapes.

But what happens when you don’t know what’s wrong or where to start looking? When the debugging conditions are completely unknown to you?

The real power of observability is that you don’t have to know so much in advance of debugging an issue. You should be able to systematically and scientifically take one step after another, methodically following the clues to find the answer, even when you are unfamiliar (or less familiar) with the system. The magic of instantly jumping to the right conclusion by inferring an unspoken signal, relying on past scar tissue, or making some leap of familiar brilliance is instead replaced by a methodical, repeatable, verifiable process.

Debugging a production application using traces

Debugging a production application using traces and spans is different. It doesn't require much experience with the application itself; you just need to be curious about what's actually happening with the application in the production environment. It simply works like this:

  1. Start with the overall view of what prompted your investigation: what did the customer or alert tell you?
  2. Then verify that what you know so far is true: is a notable change in performance happening somewhere in this system? Data visualizations can help you identify changes in behaviour as a change in a curve somewhere on the graph.
  3. Search for dimensions that might drive that change in performance (a code sketch of this step appears below). Approaches to accomplish that might include:
     - Examining sample rows from the area that shows the change: are there any outliers in the columns that might give you a clue?
     - Slicing those rows across various dimensions looking for patterns: do any of those views highlight distinct behaviour across one or more dimensions? Try an experimental group by on commonly useful fields, like status_code.
     - Filtering for particular dimensions or values within those rows to better expose potential outliers.
  4. Do you now know enough about what might be occurring? If so, you’re done! If not, filter your view to isolate this area of performance as your next starting point. Then return to step 3.

You can use this loop as a brute-force method to cycle through all available dimensions to identify which ones explain or correlate with the outlier graph in question, with no prior knowledge or wisdom about the system required.
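
To make step 3 concrete, here is a rough sketch of what that slicing could look like if the spans were exported as a flat table of events. The file name, column names, and time window are assumptions made for this example, not part of any standard schema:

```python
import pandas as pd

# One row per span/event, with wide columns of attributes.
spans = pd.read_csv("spans.csv", parse_dates=["start_time"])

# Step 2: restrict to the window where the latency curve changed.
window = spans[(spans["start_time"] >= "2024-05-01 10:00")
               & (spans["start_time"] < "2024-05-01 11:00")]

# Step 3: try an experimental group-by on commonly useful fields and
# compare p95 latency across each dimension's values.
for dimension in ["status_code", "availability_zone", "instance_type"]:
    p95 = (window.groupby(dimension)["duration_ms"]
                 .quantile(0.95)
                 .sort_values(ascending=False))
    print(f"\np95 duration_ms by {dimension}:")
    print(p95.head())
```

Any dimension whose values split the latency distribution sharply is a candidate to filter on in step 4.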

Example

For example, let's say we have a spike in request latency for some APIs across different users. If we isolate those slow requests, we can quickly see that the slow-performing events are mostly originating from one particular availability zone (AZ) of our cloud infrastructure provider (assuming we have the AZ information in the spans). After digging deeper, we might notice that one particular virtual machine instance type appears to be more affected than others.

This information is tremendously helpful: we now know the conditions that appear to be triggering the slow performance. A particular instance type in one particular AZ is much more prone to very slow performance than the rest of the infrastructure we care about. In that situation, the glaring difference pointed to what turned out to be an underlying network issue affecting our cloud provider’s entire AZ.
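
Continuing the illustrative pandas sketch from earlier, isolating the slow requests and counting them per AZ and instance type might look like this (again, the availability_zone and instance_type columns are assumed to be recorded on each span):

```python
import pandas as pd

# The same illustrative span export as before.
spans = pd.read_csv("spans.csv")

# Treat the slowest 1% of requests as "slow" and see where they concentrate.
threshold = spans["duration_ms"].quantile(0.99)
slow = spans[spans["duration_ms"] >= threshold]

# Count slow spans per (availability zone, instance type); a single dominant
# cell points at a slice of infrastructure rather than at application code.
breakdown = (slow.groupby(["availability_zone", "instance_type"])
                 .size()
                 .sort_values(ascending=False))
print(breakdown.head(10))
```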

Another example

Here’s another example of root cause analysis using spans, to make it really clear. Let's assume that after deploying a new version of our application, we notice that some APIs are getting slower. To investigate, we follow the same method of debugging with distributed tracing. We start by taking a deeper look at the slow APIs and searching for dimensions that might drive the change in performance. Digging in, we find that all of the affected APIs call a payment_service. Looking at the spans related to payment_service, we see that it fetches data from a PostgreSQL database, specifically from a table called user_payments_history. Comparing those spans with similar spans from the same API calls before the deployment, we find that the queries against user_payments_history are new, and that they take a noticeable amount of time to return the required data.

The problem here might be a missing index that makes the query slow, or the user_payments_history table might simply have too many records. We can't be certain of the root cause from the spans alone, but we do know for sure that something is wrong with the queries against user_payments_history in the payment_service.
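
If we wanted to test the missing-index hypothesis, one option is to ask the database directly. The sketch below is only an illustration of that follow-up step: the connection string, the user_id column, and the query shape are all assumptions made for this example, and psycopg2 is just one convenient way to run it:

```python
import psycopg2

# Connection details are placeholders for the example.
conn = psycopg2.connect("dbname=payments user=readonly host=db.internal")
cur = conn.cursor()

# 1. Which indexes already exist on the suspect table?
cur.execute(
    "SELECT indexname, indexdef FROM pg_indexes WHERE tablename = %s",
    ("user_payments_history",),
)
for name, definition in cur.fetchall():
    print(name, definition)

# 2. How does Postgres plan the slow query seen in the spans?
#    (The WHERE clause is a guess at the query shape.)
cur.execute(
    "EXPLAIN (ANALYZE, BUFFERS) "
    "SELECT * FROM user_payments_history WHERE user_id = %s",
    ("user-123",),
)
for (line,) in cur.fetchall():
    print(line)  # a sequential scan over a large table hints at a missing index

cur.close()
conn.close()
```

A sequential scan in the plan, combined with the new query introduced by the deployment, would point toward adding an index (or rethinking the query) rather than blaming the infrastructure.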

Not all issues are as immediately obvious as the underlying infrastructure problem in the first example; often you need to look at other surfaced clues to triage code-related issues. The process remains the same: you may need to slice and dice across dimensions until one clear signal emerges, as in the preceding examples.

Conclusion

With complex distributed systems, it has become really hard to figure out what is actually going on in a production application. That's why metrics and logs alone are not enough to debug those applications and find the root cause of an incident.

Traces and spans can help in that situation. With high-cardinality events, we can collect lots of information about our system that comes in handy when dealing with incidents under time pressure, and we have a systematic approach for finding the root cause of incidents, assuming we collect enough information (dimensions) in our spans.

To learn more about observability, please check:

References