Sampling Traces In OpenTelemetry

At scale, the cost of collecting, processing and storing traces can dramatically outweigh the benefits, because many of these events are virtually identical and successful. The point of debugging is to search for patterns or examine failed events during an outage. That's why it’s wasteful to transmit 100% of events to the observability backend.

To debug effectively, we just need a representative sample of successful events that can be compared with the failed ones.

We can sample events by using the strategies outlined in this article and still provide granular visibility into system state. Unlike pre-aggregated metrics that collapse all events into one coarse representation of system state over a given period of time, sampling allows us to make informed decisions about which events can help us surface unusual behaviour, while still optimizing for resource constraints. The difference between sampled events and aggregated metrics is that full cardinality is preserved on each dimension included in the representative event.

In OpenTelemetry, there are two approaches to sampling: head-based sampling and tail-based sampling. Let's review both and see when to use each.

Head-based sampling

As the name suggests, head-based sampling makes the decision to sample or not at the beginning of the trace.

This is the most common way of sampling today because of its simplicity. But since we don’t know in advance how a trace will turn out, we’re forced to make arbitrary decisions (like sampling a random percentage of all traces) that may limit our ability to understand what happened.

A disadvantage of head-based sampling is that you can’t choose to sample only traces with errors: the decision is made before any error has had a chance to occur.

Built-in samplers include AlwaysOn, AlwaysOff, TraceIDRatioBased and ParentBased.

“AlwaysOn” (AlwaysSample) sampler

As the name suggests, this sampler samples all events and keeps 100% of the spans. In a perfect world with no cost considerations, this is the only sampler we would use.

“AlwaysOff” (NeverSample) sampler

Also as the name suggests, the AlwaysOff sampler samples 0% of the spans, so no data is collected whatsoever. You probably won’t use this one much, but it can be useful in certain cases, for example when you run load tests and don’t want to store the traces they create.

ParentBased Sampler

This is the most popular sampler and is the one recommended by the official OpenTelemetry documentation. When a trace begins, we decide whether to sample it or not. Whatever the decision is, the child spans follow it.

The main advantage of the ParentBased sampler is that you always get the complete picture: a trace is either kept with all of its spans or dropped entirely.

How does this work? For the root span, we decide whether it will be sampled or not. The decision is passed to every child span in the trace via context propagation, so each child knows whether it needs to be sampled.

It is important to understand that this is a composite sampler: it does not make decisions on its own, but delegates to other samplers that we configure for each case. For example, the root sampler defines what to do when a span has no parent.
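The delegation described above can be sketched in a few lines of plain Python. This is an illustrative model only, not the OpenTelemetry SDK's code: `ParentContext`, `always_on`, `always_off` and `parent_based` are hypothetical names chosen for this example, and the sketch covers only the "parent present" and "no parent" cases.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ParentContext:
    """Hypothetical stand-in for the sampling flag carried by context propagation."""
    sampled: bool


# In this sketch a root sampler is just a function: trace ID -> keep or drop.
RootSampler = Callable[[int], bool]


def always_on(trace_id: int) -> bool:
    """Keep every trace (models the AlwaysOn sampler)."""
    return True


def always_off(trace_id: int) -> bool:
    """Drop every trace (models the AlwaysOff sampler)."""
    return False


def parent_based(root: RootSampler):
    """Build a sampler that follows the parent and asks `root` only for root spans."""
    def should_sample(trace_id: int, parent: Optional[ParentContext] = None) -> bool:
        if parent is None:
            # No parent: this is a root span, so the root sampler decides.
            return root(trace_id)
        # A parent exists: follow its decision so the trace stays complete.
        return parent.sampled
    return should_sample


sampler = parent_based(root=always_off)

# Root span: always_off decides, so the trace is dropped.
root_decision = sampler(trace_id=0x1234)
# Child span: inherits the root's decision through the propagated context.
child_decision = sampler(trace_id=0x1234, parent=ParentContext(sampled=root_decision))
# A child whose upstream parent was sampled is kept, whatever the local root sampler says.
forced_child = sampler(trace_id=0x5678, parent=ParentContext(sampled=True))

print(root_decision, child_decision, forced_child)
```

Note that the root sampler is consulted only once per trace. Every later span simply copies the propagated flag, which is what keeps a trace from being half recorded.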

ParentBased(root=TraceIDRatioBased)

It’s recommended to use the ParentBased sampler with TraceIDRatioBased as the root sampler.

The TraceIDRatioBased sampler uses the trace ID to calculate whether the trace should be sampled, according to the sample rate we choose. Because the decision is derived from the trace ID, the same trace always gets the same decision.
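One way to derive a decision from the trace ID is to treat part of the ID as a uniformly distributed number and compare it against a threshold proportional to the sample rate. The sketch below assumes that approach and compares only the low 64 bits of the ID. `trace_id_ratio_sampler` is a hypothetical name, and real SDK implementations may differ in the details.

```python
import random

# Assumption for this sketch: only the low 64 bits of the trace ID are compared.
ID_SPACE = 1 << 64


def trace_id_ratio_sampler(rate: float):
    """Return a function that keeps roughly `rate` of all trace IDs."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    # IDs whose low 64 bits fall below this bound are sampled.
    bound = round(rate * ID_SPACE)

    def should_sample(trace_id: int) -> bool:
        return (trace_id & (ID_SPACE - 1)) < bound

    return should_sample


sample_10_percent = trace_id_ratio_sampler(0.1)

# Simulate random 128-bit trace IDs and measure how many are kept.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept_ratio = sum(sample_10_percent(t) for t in trace_ids) / len(trace_ids)
print(f"kept {kept_ratio:.1%} of traces")

# The decision is deterministic: asking twice about the same ID gives the same answer.
same_answer = sample_10_percent(trace_ids[0]) == sample_10_percent(trace_ids[0])
```

Since no random number is drawn at decision time, any service that sees the same trace ID and uses the same rate reaches the same conclusion without coordination.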

Tail-based sampling

In contrast to head-based sampling, tail-based sampling makes the decision at the collector level, after the spans of a trace have been received. This is useful when the decision depends on data that isn't available up front. For example, to sample by latency we must know the exact start and end times, which cannot be known in advance.

Also, what was a disadvantage of head-based sampling is an advantage of tail-based sampling: we can choose to sample only traces with errors.
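To make the idea concrete, here is a minimal sketch of a tail-based decision in plain Python. It is not the collector's implementation: `Span`, `TailSampler` and the 500 ms threshold are hypothetical, and the sketch assumes the caller signals when a trace is complete, whereas a real collector usually has to wait for some time before deciding.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical, simplified span record."""
    trace_id: str
    name: str
    duration_ms: float
    is_error: bool = False


class TailSampler:
    """Buffers spans per trace and decides only once the whole trace is known."""

    def __init__(self, latency_threshold_ms: float):
        self.latency_threshold_ms = latency_threshold_ms
        self._buffer = defaultdict(list)  # trace_id -> spans received so far

    def add_span(self, span: Span) -> None:
        # No decision yet: we only hold the span in memory.
        self._buffer[span.trace_id].append(span)

    def finish_trace(self, trace_id: str) -> list:
        """Return the spans to export, or an empty list if the trace is dropped."""
        spans = self._buffer.pop(trace_id, [])
        keep = any(
            s.is_error or s.duration_ms > self.latency_threshold_ms
            for s in spans
        )
        return spans if keep else []


tail = TailSampler(latency_threshold_ms=500)
tail.add_span(Span("t1", "GET /cart", 120))
tail.add_span(Span("t1", "SELECT items", 40))
tail.add_span(Span("t2", "POST /checkout", 180))
tail.add_span(Span("t2", "charge card", 90, is_error=True))
tail.add_span(Span("t3", "GET /search", 900))

fast_ok = tail.finish_trace("t1")      # fast and successful: dropped
with_error = tail.finish_trace("t2")   # contains an error: kept in full
slow = tail.finish_trace("t3")         # slower than the threshold: kept
print(len(fast_ok), len(with_error), len(slow))
```

The buffer is the price of this flexibility: every span has to be held until its trace is decided.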

So where should sampling be implemented?

That depends on your specific use case; there is no one-size-fits-all solution.

If you do it at the OTEL distro level (head-based sampling), you remove redundant data at the source and never need to worry about it again. You also minimize the data sent over the network. However, you have to redeploy your services each time you update the sample rate.

If you implement it in the collector, you have a centralized place that controls sampling, so you don’t need to redeploy your services when you change the sample rate. However, making the sampling decision requires buffering the data until a decision can be made, which adds overhead.

Conclusion

Sampling traces and spans is almost always a good idea, since it saves a lot of money and most likely won't hurt debugging with traces in production. There are different approaches to implementing it in OpenTelemetry. Head-based sampling is simpler to implement, but it requires a redeploy for each change. Tail-based sampling is a little harder to implement, but it lets us sample only the traces with errors.