Scaling IoT Monitoring and Observability Solutions

Spotflow

- Last Updated: December 2, 2024

In the previous article, we discussed the essentials of monitoring and observability in IoT. Mainly, we presented how to leverage logs, metrics, traces, and structured events to enhance the observability of your IoT systems. Operating tens of thousands of IoT devices is not unusual. At that scale, an observability solution can quickly run into insufficient performance and unbearable infrastructure costs. This article therefore focuses on handling large scale.

We’ll discuss a few techniques that can help you balance the trade-offs that come with scaling IoT observability:

Choosing a Performant Database
Sampling the Data
Setting Up Retention Policies

Choosing a Performant Database

Okay, we know what to collect, so now we just dump all the data into our MySQL and we’re ready to observe, right? Well, not so fast (pun intended). This might not be the best idea for several reasons. We’ll look at our requirements for the database and then suggest a storage option that serves large-scale IoT deployments better.

First, let’s review a few characteristics of storing IoT observability data:

The querying speed is important. When dealing with a production outage, the last thing you want is to wait several minutes until your debugging queries finish.
We will deal with many dimensions and high cardinality. The high number of dimensions comes from capturing many attributes of your operation to prepare for unknown conditions. There will also be important columns with high cardinality (the number of unique values in the column), such as device IDs.
We need to query across all dimensions efficiently. We don’t know which attributes will be important when debugging a specific issue.
We will usually be interested in data coming from a limited time range. The time range will often correspond to the periods when you observe degraded service of your system.

There’s more to it, but this small set of characteristics will be enough to make our point.

General-purpose SQL Databases Might Be Insufficient

We’re probably all familiar with SQL databases, so it’s natural to consider them as a place to store our observability data. However, several technical aspects make general-purpose SQL databases a poor fit for large-scale observability data.

Traditional row-oriented databases, like MySQL or PostgreSQL, struggle to efficiently handle queries on tables with many dimensions when only a subset of columns is required. Because each row is stored together, the database has to read entire rows from disk even if the query needs just two or three of their columns.

Another issue of high dimensionality is the difficulty of implementing efficient indexing. We can’t create database indices for a subset of columns beforehand, because we don’t know which dimensions will be important during troubleshooting. So we would either need to index all columns (which would be quite expensive), or the queries would be slow when filtering based on the unindexed columns.

Also, without explicit time-based data partitioning, there is usually no efficient way of discarding old data. Time-partitioning allows efficiently deleting large chunks of data when they get stale.

If you have good reasons to use a traditional SQL database for observability data, you might want to consider Timescale. It is a PostgreSQL extension that addresses some of the challenges mentioned above with time partitioning and better compression while keeping the familiar PostgreSQL SQL interface.

Signal-Specific Storages for IoT Scaling

The categorization of observability signals into metrics, logs, and traces has led to the development of specialized storages tailored to each signal type. For example, there is Mimir for metrics, Loki for logs, and Tempo/Jaeger for traces. Each of these storages is made with the specific signal type in mind, which makes them effective for monitoring use cases within the specific signal. However, it might be cumbersome to query data across these storages.

Additionally, certain storages have specific limitations. For instance, traditional time series databases (TSDBs, such as Mimir) struggle with high-cardinality data. TSDBs store a separate time series for each unique set of attributes. This approach can be very efficient with a limited number of dimensions and low cardinality, as writing and querying within a single time series is very performant.

However, with high cardinality, the database needs to create a new series very often because it often encounters a unique combination of attributes. As a result, when retrieving aggregate values, the database needs to read through each time series, making the operation inefficient. This issue is particularly problematic within the IoT sector.
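The effect can be illustrated with a toy model. The sketch below is not how Mimir or any real TSDB is implemented; `ToySeriesStore` and its methods are hypothetical names used only to show that one series is created per unique label combination and that an aggregate has to visit every one of them.

```python
from collections import defaultdict


class ToySeriesStore:
    """Hypothetical toy model: one time series per unique label set."""

    def __init__(self):
        # key: sorted tuple of (label, value) pairs -> list of (timestamp, value)
        self.series = defaultdict(list)

    def write(self, labels, ts, value):
        key = tuple(sorted(labels.items()))
        self.series[key].append((ts, value))

    def sum_all(self):
        # An aggregate across devices must walk through every series.
        return sum(v for points in self.series.values() for _, v in points)


low = ToySeriesStore()
high = ToySeriesStore()
for device in range(1000):
    # Without the device ID label, all points land in a single series.
    low.write({"metric": "temperature"}, ts=0, value=1.0)
    # With the device ID label, every device opens its own series.
    high.write({"metric": "temperature", "device_id": f"dev-{device}"}, ts=0, value=1.0)

print(len(low.series), len(high.series))
```

Both stores hold the same number of points, but the second one has to manage a thousand series instead of one.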

Use Column-Oriented, Time-Partitioned Storage for the Best Scalability

With the increasing demand for analytical workloads similar to ours (as described above), a new wave of databases emerged. They employ columnar storage, which makes read operations more efficient as they only touch the columns required for the particular query. Thanks to time-partitioning, the database can also restrict reads to the partitions that overlap the queried time range, making the queries even more efficient.

The combination of these design choices also makes compression more effective, because the algorithm operates on single columns of similar values bounded by a time range. Notable examples of such storages include InfluxDB, QuestDB, and ClickHouse.
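To make the two ideas concrete, here is a minimal in-memory sketch. It is not how InfluxDB, QuestDB, or ClickHouse work internally; `ToyColumnStore` and the daily partition granularity are assumptions chosen for illustration. Each daily partition keeps one list per column, so a query reads only the requested column and skips partitions outside the time range.

```python
from datetime import datetime


class ToyColumnStore:
    """Hypothetical sketch: daily partitions, one list per column."""

    def __init__(self):
        self.partitions = {}  # date -> {column name -> list of values}

    def insert(self, ts, row):
        part = self.partitions.setdefault(ts.date(), {"ts": []})
        n = len(part["ts"])  # rows already stored in this partition
        part["ts"].append(ts)
        for col in set(part) | set(row):
            if col == "ts":
                continue
            # New columns are back-filled with None to keep columns aligned.
            part.setdefault(col, [None] * n).append(row.get(col))

    def select(self, column, start, end):
        out = []
        for day, part in self.partitions.items():
            if day < start.date() or day > end.date():
                continue  # partition pruning: skip days outside the range
            # Only the timestamp column and the requested column are read.
            for ts, value in zip(part["ts"], part.get(column, [])):
                if start <= ts <= end:
                    out.append(value)
        return out


store = ToyColumnStore()
store.insert(datetime(2024, 12, 1, 10), {"device_id": "a", "temp": 20.5})
store.insert(datetime(2024, 12, 2, 10), {"device_id": "b", "temp": 21.0, "fw": "1.2"})
store.insert(datetime(2024, 12, 2, 11), {"device_id": "a", "temp": 22.0})
print(store.select("temp", datetime(2024, 12, 2), datetime(2024, 12, 2, 23)))
```

The query for December 2 never touches the December 1 partition, nor the `device_id` and `fw` columns.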

Figure: A diagram of a column-oriented, time-partitioned database

Sampling the Data

At a certain scale, it becomes impractical to collect and store every observability signal that your devices produce. Thankfully, this is usually unnecessary, as you can successfully debug issues with only a fraction of the observability data.

For example, the events describing successful scenarios are often not as important as the ones describing failures. This is why we can discard most of these events and store only a few examples that are representative enough to reconstruct the particular historical situation.

Various sampling strategies exist to ensure that only a limited number of events are collected while still preserving sufficient detail. It's essential to choose a sampling approach that aligns with your specific needs. Instrumentation libraries, such as OpenTelemetry SDKs, often provide implementations of such sampling strategies. This makes sampling a relatively easy way to reduce storage and processing costs.

In the context of tracing, we distinguish two kinds of sampling based on the point where the sampling decision is made: head sampling and tail sampling. Head sampling decides whether a span/trace will be sampled right at the device, while tail sampling makes this decision later, once all the spans of the particular trace are collected.

The main advantages of head sampling are simplicity and cost efficiency. It reduces network traffic, which can be constrained in IoT environments, and avoids storing and processing unsampled data in observability backends.
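A head sampling decision can be as small as the following sketch. It is a hand-written illustration, not the OpenTelemetry SDK implementation; the function name `head_sample` and the use of SHA-256 are assumptions. Hashing the trace ID makes the decision deterministic, so every span of the same trace gets the same verdict without any coordination.

```python
import hashlib


def head_sample(trace_id: str, ratio: float) -> bool:
    """Keep roughly `ratio` of all traces, decided from the trace ID alone."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio


# Hypothetical trace IDs; on a device, the check would run before export.
kept = sum(head_sample(f"trace-{i}", ratio=0.1) for i in range(10_000))
print(f"kept {kept} of 10000 traces")
```

Because unsampled traces are dropped on the device, they never consume network bandwidth or backend storage.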

However, tail sampling becomes necessary if you prefer to make sampling decisions based on the entire trace. This approach is useful if you want to sample traces with errors differently than the successful ones.
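As a rough sketch of that idea, the snippet below groups finished spans by trace and then keeps every trace containing an error while keeping only a fraction of the successful ones. The span shape (`trace_id` and `status` keys) and the function names are assumptions for illustration; production collectors implement this with buffering and timeouts that are omitted here.

```python
import random
from collections import defaultdict


def group_by_trace(spans):
    """Buffer spans until the whole trace is available (simplified)."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return traces


def tail_sample(trace_spans, ok_ratio, rng=random.random):
    """Always keep traces with an error; keep successful ones with probability ok_ratio."""
    if any(span["status"] == "error" for span in trace_spans):
        return True
    return rng() < ok_ratio


spans = [
    {"trace_id": "t1", "status": "ok"},
    {"trace_id": "t1", "status": "error"},
    {"trace_id": "t2", "status": "ok"},
]
kept = [tid for tid, s in group_by_trace(spans).items() if tail_sample(s, ok_ratio=0.0)]
print(kept)
```

With `ok_ratio=0.0`, only the trace that contains an error survives, which is a decision head sampling could not have made on the first span alone.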

Setting Up Retention Policies

Observability data tends to lose its value quickly over time. The telemetry received today is usually much more valuable than data from last year. This gives us another way to significantly trim storage costs.

Retention policies allow the automatic removal of data beyond a specified timeframe. Time-based partitioning simplifies the implementation of retention policies, which is why many modern databases support them out of the box.

Another strategy is utilizing tiered storage, that is, storing older data in low-cost object storage such as Amazon S3 or Azure Blob Storage. Although querying data from object storage might have higher latency than querying local disks, it allows you to retain the data longer while still reducing storage costs.
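The reason partitioning helps is that expiring data becomes a matter of dropping whole partitions rather than deleting individual rows. A minimal sketch, assuming daily partitions kept in a dictionary keyed by date (the function name `apply_retention` and the data layout are hypothetical):

```python
from datetime import date, timedelta


def apply_retention(partitions: dict, today: date, keep_days: int) -> list:
    """Drop every daily partition older than `keep_days`; return the dropped days."""
    cutoff = today - timedelta(days=keep_days)
    expired = [day for day in partitions if day < cutoff]
    for day in expired:
        del partitions[day]  # whole partition removed at once, no row scans
    return expired


partitions = {
    date(2024, 10, 1): ["...rows..."],
    date(2024, 11, 25): ["...rows..."],
    date(2024, 12, 2): ["...rows..."],
}
dropped = apply_retention(partitions, today=date(2024, 12, 2), keep_days=30)
print(dropped, sorted(partitions))
```

Databases with built-in retention support run an equivalent job for you on a schedule.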

Lastly, it is possible to reduce the resolution of historical data further. One approach is to perform a secondary round of downsampling on older data. An alternative approach is to explicitly create aggregates of historical data while discarding the original raw records.
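The aggregate approach could look like the sketch below, which collapses raw points into hourly summaries. The hourly granularity, the choice of count/min/max/avg, and the name `hourly_rollup` are assumptions for illustration; pick the resolution and statistics that your historical queries actually need.

```python
from collections import defaultdict
from datetime import datetime


def hourly_rollup(points):
    """Collapse (timestamp, value) points into per-hour summary statistics."""
    buckets = defaultdict(list)
    for ts, value in points:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {
        hour: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for hour, values in sorted(buckets.items())
    }


raw = [
    (datetime(2024, 12, 2, 10, 5), 1.0),
    (datetime(2024, 12, 2, 10, 35), 3.0),
    (datetime(2024, 12, 2, 11, 0), 5.0),
]
rollup = hourly_rollup(raw)
print(rollup[datetime(2024, 12, 2, 10)])
```

Once the rollup is stored, the raw records for that period can be discarded.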

Wrap Up: Choose Efficient Storage and Keep Only Essential Data

When setting up an IoT observability stack, you must decide where to store the data and select an appropriate observability backend. In this article, we have described various aspects to consider when making this decision to optimize cost-efficiency and scalability. The main points to remember are the following:

Optimize Storage Selection: Evaluate the access patterns to your observability storage and go with a database tailored to your needs. Choose a general-purpose database only when you’re really sure it will suffice. Otherwise, go with battle-tested observability databases for better scalability.
Set Up Data Sampling: Employ data sampling techniques to save on storage costs without compromising critical insights.
Fine-Tune Retention Policies: Configure retention policies to discard obsolete data, keeping your storage lean and reducing storage costs even further.