Thanos

Admin

6 months ago


Overview

Thanos is a set of components that can be composed into a highly available metric system with unlimited storage capacity, which can be added seamlessly on top of existing Prometheus deployments.
Thanos leverages the Prometheus 2.0 storage format to cost-efficiently store historical metric data in any object storage while retaining fast query latencies. Additionally, it provides a global query view across all Prometheus installations and can merge data from Prometheus HA pairs on the fly.
Concretely, the aims of the project are:

Global query view of metrics.
Unlimited retention of metrics.
High availability of components, including Prometheus.

Features

Global querying view across all connected Prometheus servers
Deduplication and merging of metrics collected from Prometheus HA pairs
Seamless integration with existing Prometheus setups
Any object storage as its only, optional dependency
Downsampling historical data for massive query speedup
Cross-cluster federation
Fault-tolerant query routing
Simple gRPC “Store API” for unified data access across all metric data
Easy integration points for custom metric providers
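
To make the deduplication feature more concrete, here is a minimal Python sketch. It is not Thanos’s actual algorithm. It assumes each Prometheus server in an HA pair attaches a hypothetical external label named `replica`, and that series which become identical once that label is dropped should be merged into one. The function name and data layout are illustrative assumptions.

```python
from collections import defaultdict

def deduplicate(series_list, replica_label="replica"):
    """Merge series that differ only in the replica label.

    series_list: list of (labels_dict, [(timestamp, value), ...]).
    Returns a dict mapping the label set without the replica label
    (as a sorted tuple of pairs) to a time-ordered list of samples.
    """
    merged = defaultdict(dict)
    for labels, samples in series_list:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        for ts, value in samples:
            # Keep the first value seen for a timestamp; later replicas only fill gaps.
            merged[key].setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}

# Two replicas scraped the same target; each one missed a different sample.
replica_a = ({"job": "api", "replica": "a"}, [(10, 1.0), (20, 2.0)])
replica_b = ({"job": "api", "replica": "b"}, [(20, 2.0), (30, 3.0)])
result = deduplicate([replica_a, replica_b])
```

In this toy example the two replica series collapse into a single series with samples at timestamps 10, 20, and 30, which is the effect a user sees when querying an HA pair through Thanos.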

Design

Thanos is a set of components that can be composed into a highly available Prometheus setup with long-term storage capabilities. Its main goals are operational simplicity and retaining Prometheus’s reliability properties.

The Prometheus metric data model and the 2.0 storage format are the foundational layers of all components in the system.

Components

Following the KISS and Unix philosophies, Thanos is composed of a set of components, each of which fulfills a specific role.

Sidecar: connects to Prometheus, reads its data for query and/or uploads it to cloud storage.
Store Gateway: serves metrics inside of a cloud storage bucket.
Compactor: compacts, downsamples, and applies retention on the data stored in the cloud storage bucket.
Receiver: receives data from Prometheus’s remote write write-ahead log, exposes it, and/or uploads it to cloud storage.
Ruler/Rule: evaluates recording and alerting rules against data in Thanos for exposition and/or upload.
Querier/Query: implements Prometheus’s v1 API to aggregate data from the underlying components.
Query Frontend: implements Prometheus’s v1 API to proxy it to Querier while caching the response and optionally splitting it by queries per day.

[Figure: Deployment with Thanos Sidecar for Kubernetes]

[Figure: Deployment via Receive in order to scale out or integrate with other remote write-compatible sources]

1. Sidecar

Thanos integrates with existing Prometheus servers as a sidecar process, which runs on the same machine or in the same pod as the Prometheus server.

The purpose of Thanos Sidecar is to back up Prometheus’s data into an object storage bucket, and give other Thanos components access to the Prometheus metrics via a gRPC API.

Sidecar makes use of Prometheus’s reload endpoint. Make sure it’s enabled with the flag --web.enable-lifecycle.
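
As an illustration of what the reload endpoint looks like from a client’s point of view, the following Python sketch builds the HTTP request that triggers a configuration reload. The address `http://localhost:9090` is an assumption (Prometheus’s usual default port), and the helper name is hypothetical. The request is only constructed here, not sent.

```python
import urllib.request

def build_reload_request(prometheus_url):
    """Build (but do not send) the request that asks Prometheus to reload its configuration."""
    url = prometheus_url.rstrip("/") + "/-/reload"
    # Prometheus expects a POST (or PUT) on this endpoint.
    return urllib.request.Request(url, method="POST")

request = build_reload_request("http://localhost:9090")
# To actually send it against a running server:
# urllib.request.urlopen(request, timeout=5)
```

If Prometheus was started without --web.enable-lifecycle, it rejects this request, which is why the flag matters for Sidecar.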

2. Store Gateway

As Thanos Sidecar backs up data into the object storage bucket of your choice, you can decrease Prometheus’s retention in order to store less data locally. However, we need a way to query all that historical data again. Store Gateway does just that, by implementing the same gRPC data API as Sidecar, but backing it with data it can find in your object storage bucket. Just like sidecars and query nodes, Store Gateway exposes a Store API and needs to be discovered by Thanos Querier.

3. Compactor

A local Prometheus installation periodically compacts older data to improve query efficiency. Since Sidecar backs up data into an object storage bucket as soon as possible, we need a way to apply the same process to data in the bucket.

Thanos Compactor simply scans the object storage bucket and performs compaction where required. At the same time, it is responsible for creating downsampled copies of data in order to speed up queries.
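
The idea behind downsampling can be sketched in a few lines of Python. This is a simplified illustration, not the Compactor’s implementation: it assumes a 5-minute window and keeps only count, sum, min, and max for each window, so that a query over a long time range reads one aggregate per window instead of every raw sample.

```python
def downsample(samples, window_seconds=300):
    """Aggregate (timestamp_seconds, value) samples into fixed-size windows.

    Returns {window_start: {"count", "sum", "min", "max"}}.
    """
    windows = {}
    for ts, value in samples:
        start = ts - ts % window_seconds  # align to the start of the window
        agg = windows.get(start)
        if agg is None:
            windows[start] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            agg["count"] += 1
            agg["sum"] += value
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
    return windows

raw = [(0, 1.0), (60, 3.0), (120, 2.0), (300, 5.0)]
downsampled = downsample(raw)
```

Here the four raw samples shrink to two windows. An average over the first window can still be recovered as sum divided by count, which is why several aggregates are kept rather than a single value.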

4. Receiver

The Thanos receive command implements the Prometheus Remote Write API. It builds on top of existing Prometheus TSDB and retains its usefulness while extending its functionality with long-term storage, horizontal scalability, and downsampling. Prometheus instances are configured to continuously write metrics to it, and then Thanos Receive uploads TSDB blocks to an object storage bucket every 2 hours by default. Thanos Receive exposes the StoreAPI so that Thanos Queriers can query received metrics in real time.

5. Ruler/Rule

In case Prometheus running with Thanos Sidecar does not have enough retention, or if you want to have alerts or recording rules that require a global view, Thanos has just the component for that: the Ruler, which does rule and alert evaluation on top of a given Thanos Querier.

6. Querier/Query

Now that we have set up Sidecar for one or more Prometheus instances, we want to use Thanos’s global Query Layer to evaluate PromQL queries against all instances at once.

The Querier component is stateless and horizontally scalable, and can be deployed with any number of replicas. Once connected to Thanos Sidecar, it automatically detects which Prometheus servers need to be contacted for a given PromQL query.
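
One way to picture how a query is routed only to the relevant backends is the following Python sketch. It assumes each store advertises the time range it can serve and filters on overlap with the query range. The `StoreInfo` type, its fields, and the store names are hypothetical, and the real Querier takes more information into account (for example labels), so treat this as an illustration only.

```python
from dataclasses import dataclass

@dataclass
class StoreInfo:
    name: str
    min_time: int  # oldest timestamp the store can serve
    max_time: int  # newest timestamp the store can serve

def stores_for_query(stores, query_start, query_end):
    """Return the stores whose time range overlaps [query_start, query_end]."""
    return [s for s in stores if s.min_time <= query_end and s.max_time >= query_start]

stores = [
    StoreInfo("sidecar-a", min_time=900, max_time=1000),
    StoreInfo("store-gateway", min_time=0, max_time=920),
]
recent = [s.name for s in stores_for_query(stores, 950, 1000)]
historical = [s.name for s in stores_for_query(stores, 100, 200)]
```

In this toy setup a query over recent data touches only the sidecar, while a query over old data touches only the store gateway.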

Thanos Querier also implements Prometheus’s official HTTP API and can thus be used with external tools such as Grafana. It also serves a derivative of Prometheus’s UI for ad-hoc querying and checking the status of the Thanos stores.
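
Because the Querier speaks the Prometheus HTTP API, a client can talk to it the same way it would talk to Prometheus. The Python sketch below builds an instant-query URL. The address `http://localhost:10902` and the helper name are assumptions for illustration; `dedup` is a Thanos-specific query parameter that controls deduplication. The URL is only constructed here, not requested.

```python
from urllib.parse import urlencode

def build_instant_query_url(querier_url, promql, dedup=True):
    """Build the URL for an instant query against the Prometheus-compatible HTTP API."""
    params = {"query": promql, "dedup": "true" if dedup else "false"}
    return querier_url.rstrip("/") + "/api/v1/query?" + urlencode(params)

url = build_instant_query_url("http://localhost:10902", "up")
# To run the query against a live Querier:
# import json, urllib.request
# data = json.load(urllib.request.urlopen(url, timeout=10))
```

Pointing a Grafana Prometheus data source at the same base address works on the same principle.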

7. Query Frontend

The thanos query-frontend command implements a service that can be put in front of Thanos Queriers to improve the read path. It is based on the Cortex Query Frontend component so you can find some common features like Splitting and Results Caching.

Query Frontend is fully stateless and horizontally scalable.
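
The splitting feature can be illustrated with a small Python sketch. It assumes a split interval of one day and cuts a query’s time range at interval boundaries, so each piece can be executed and cached independently. This is a simplified model, not the Query Frontend’s code; the function name and the use of Unix seconds are assumptions.

```python
def split_by_interval(start, end, interval=24 * 60 * 60):
    """Split [start, end] (Unix seconds, inclusive) into sub-ranges aligned to `interval`."""
    ranges = []
    cursor = start
    while cursor <= end:
        # Next multiple of `interval` strictly after the cursor.
        boundary = cursor - cursor % interval + interval
        sub_end = min(boundary - 1, end)
        ranges.append((cursor, sub_end))
        cursor = sub_end + 1
    return ranges

day = 24 * 60 * 60
# A query starting at midday on day 0 and ending ten seconds into day 2.
parts = split_by_interval(day // 2, 2 * day + 10)
```

The example range is split into three sub-queries: the rest of the first day, one full day, and the first seconds of the third day. Results for the full day are the best candidates for caching because that range does not change between requests.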

When Should You Use Thanos?

Thanos is an excellent choice if you:

Need high availability for your monitoring stack.
Want to store Prometheus data beyond retention limits.
Require a centralized view of metrics across multiple Prometheus instances.
Need cost-effective storage without compromising query performance.
Have a Kubernetes-based infrastructure where Prometheus instances are frequently restarted.

However, Thanos may introduce additional complexity in terms of infrastructure management, so consider your use case before deploying it.

Alternatives to Thanos

Several alternatives to Thanos exist for scaling Prometheus, providing long-term storage, high availability, and global querying. Here are some of the most popular options:

1. Cortex

Best for: Multi-tenancy, cloud-native environments, and horizontal scaling.

Cortex is a horizontally scalable, multi-tenant Prometheus-based monitoring system.
It offers high availability and long-term storage by sharding and distributing Prometheus data across multiple backend storage solutions.
Unlike a sidecar-based Thanos deployment, Cortex ingests data directly through Prometheus remote write instead of relying on Prometheus sidecars.

🔗 More Info: https://cortexmetrics.io/

2. Mimir (Grafana Mimir)

Best for: Organizations using Grafana for visualization and needing a scalable Prometheus backend.

Developed by Grafana Labs, Mimir is an evolution of Cortex with optimizations for large-scale Prometheus usage.
It provides long-term storage, deduplication, and multi-tenancy support.
Mimir supports efficient query performance and low-latency storage solutions.

🔗 More Info: https://grafana.com/oss/mimir/

3. VictoriaMetrics

Best for: High-performance storage and efficient resource utilization.

VictoriaMetrics is a fast, cost-effective, and scalable time-series database compatible with Prometheus.
Unlike Thanos, the single-node version of VictoriaMetrics ships as a single binary, making it easier to deploy and manage.
It supports long-term storage, global querying, and downsampling with less operational complexity than Thanos.

🔗 More Info: https://victoriametrics.com/

4. OpenTelemetry (OTel) with Prometheus Exporter

Best for: Organizations already adopting OpenTelemetry for observability.

OpenTelemetry provides end-to-end observability and can integrate with Prometheus exporters.
While it is not a direct replacement for Thanos, it allows Prometheus metrics to be stored in different backends like Elasticsearch, Kafka, or cloud-native solutions.

🔗 More Info: https://opentelemetry.io/

Choosing the Right Alternative

Feature	Thanos	Cortex	Mimir	VictoriaMetrics	OpenTelemetry
Global Querying	✅	✅	✅	✅	⚠️ Limited
Long-Term Storage	✅	✅	✅	✅	✅
Multi-Tenancy	⚠️ Limited	✅	✅	⚠️ Limited	✅
High Availability	✅	✅	✅	✅	✅
Ease of Deployment	⚠️ Complex	⚠️ Complex	⚠️ Complex	✅ Easy	⚠️ Complex
Cost-Effectiveness	✅	⚠️ Can be expensive	✅	✅ Very efficient	⚠️ Depends on backend

🚀 How OpsBridge Can Help

At OpsBridge, we specialize in designing and implementing scalable monitoring solutions using Prometheus and Thanos. Whether you need help with deploying Thanos, optimizing your Prometheus setup, or managing long-term storage efficiently, our DevOps experts can provide the right strategy and hands-on support.

Our services include:

✅ Setting up and managing Thanos and Prometheus for high availability.

✅ Optimizing storage and query performance for cost efficiency.

✅ Implementing alerting and monitoring best practices to improve system reliability.

✅ Providing custom solutions tailored to your infrastructure needs.

👉 If you’re looking for expert guidance on scaling your monitoring stack, contact us today!

Conclusion

Thanos is a powerful solution for scaling Prometheus, providing high availability, global querying, and long-term storage for metrics. By leveraging Thanos, DevOps and SRE teams can ensure reliability and observability across large-scale deployments without losing valuable monitoring data.

If you’re looking to enhance your monitoring setup, integrating Thanos with Prometheus is a great step forward. Have experience with Thanos? Share your thoughts with us!

Source: Thanos