Recently I became a maintainer of Thanos – a highly available Prometheus solution with long-term storage & global view capabilities. Thanos gives you an easy way to connect multiple Prometheis into a single mesh without sacrificing availability. Additionally, you can configure global recording rules and evaluate alerts on top of the global Thanos mesh.
Just like Prometheus, Thanos was born in a company that doesn’t sell monitoring solutions. SREs at Improbable just had a problem scaling Prometheis, and thus Thanos was born. I’m really thankful to Improbable for deciding to open source it. They have many cool projects – grpc-web, go-flagz and polyglot, to name a few. I use their tools every day and they are awesome!
What I love about the project is that it’s completely community driven. There is no vendor lock-in, as currently there are no vendors 😀 I think it fits very well with the CNCF mantra of cloud-native & vendor-neutral projects.
Okay, let’s get back to today’s topic: the 0.4.0-rc.0 release.
This release is huge, so I’m not going to cover all the details – please check out the release page. But I definitely recommend trying it out.
We planned to do a release earlier, but there were a couple of pull requests we really wanted to make it in. So we delayed, and then the PR queue grew. Stuff got reviewed & merged… and that’s how we ended up with this jumbo release 😀
So finally Bartek and I sat down on a Thursday evening to finish this. We went through the changelog and did some final minor fixes. Then I had to leave him in the middle of the release (how nice :D), because I badly needed to go for a walk. So Bartek finished things solo, close to midnight. By the way, the heading picture is from that evening walk with my wife 🙂
This is the last release that supports gossip. From Thanos v0.5.0, gossip will be completely removed.
We changed the code to disable gossip by default, so if you are still using it, please migrate away or explicitly enable it.
There are a lot of performance improvements in this release. Definitely check out all the new flags to tune concurrency. Additionally, we switched to Go 1.12, which includes a change in how memory is released to Linux. This will cause RSS to be reported higher, but it’s actually harmless – the memory is still available to the kernel when it needs it. If you really need the old behaviour back, you can start your binaries with GODEBUG=madvdontneed=1.
Most notably, the Store Gateway startup process is massively improved in both efficiency and memory consumption. You can configure the Thanos Compactor to produce the index cache file straight into S3, meaning your Thanos Store Gateway won’t need to compute it at startup. Try it out using the new flag on the Compactor.
Additionally, you can control concurrency and sample limits on the Store Gateway. Giedrius, another Lithuanian Thanos maintainer, wrote a good blog post about how to tune those parameters. Definitely check it out.
Better support for newer Prometheus versions
TSDB got updated to 0.6.1.
Thanos now supports the new flag which sets the default evaluation interval for subqueries.
Additionally, we added a bunch of new endpoints that Prometheus natively supports. For example, you can now access labels via a dedicated endpoint.
query_range endpoints now support the HTTP POST method.
Azure & S3 storage backends have graduated to the stable maturity level. This means that they are now considered stable and generally recommended.
There is a new DNS resolver option. If you are using kube-dns below v0.14.0, you need to use it due to this issue in Go.
Some more fixes landed in the Thanos Query and Thanos Rule web UIs. Unresponsive Thanos nodes are now automatically removed after 5 minutes. This can be changed via a flag.
Network is reliable
Thanos is a distributed system, so it’s important to acknowledge the Fallacies of distributed computing – most importantly the first fallacy: you can’t assume that the network is reliable.
There is a good description on Wikipedia of applications that don’t acknowledge this fallacy:
Software applications are written with little error-handling on networking errors. During a network outage, such applications may stall or infinitely wait for an answer packet, permanently consuming memory or other resources. When the failed network becomes available, those applications may also fail to retry any stalled operations or require a (manual) restart.
That’s why in this release I worked on timeouts. Furthermore, Bartek improved things by adding the Partial Response strategy, which I will discuss later.
It’s really important that people running Thanos understand the implications of configuring alerts on Thanos Rule vs. having them in a local Prometheus instance.
Running Alerts on Prometheus vs Thanos Ruler
When rules are configured locally on Prometheus, all the alerting rule evaluation, metric processing and firing is done in one process, so it isn’t affected by the “network is reliable” fallacy. That isn’t the case for Thanos Ruler.
Thanos Ruler calls out to Thanos Query to get its metrics over the HTTP API. Thanos Query then asks the Thanos Store Gateways and Prometheis that are part of the Thanos mesh for the metrics.
As you can see, there are a lot of network hops involved in getting the data and actually evaluating it. So we need proper timeouts in place, plus strategies for dealing with missing data in responses.
Thanos Querier has a flag, --query.partial-response, which when enabled will return partial data when it couldn’t get all the responses back. If it’s disabled, the query will simply fail.
Thanos query & response timeouts
Since the old days we have had --query.timeout=2m, which kills long-running queries. I suggest you check it out and tune it to your needs.
In this release, I added --store.response-timeout=0ms, which is disabled by default. But if you are running alerts on Thanos Ruler and seeing slow responses or errors, I suggest you turn it on and tune it.
Currently, the response timeout applies only to the Series endpoint, as a timeout on top of the gRPC metric stream. It has the very nice property of actually killing slow nodes in the mesh without slowing down queries by much.
Timing out a gRPC stream
Let’s simplify gRPC streaming: you can think of it as sending multiple messages through the same pipe.
What the response timeout does is wait for each message to arrive within the configured time. If it doesn’t, it stops waiting and removes that node from the query, then continues getting data from the other nodes.
So imagine you have a slow Thanos Store and you set the query timeout to 2 minutes. The whole query would then always take 2 minutes, due to the slow node not responding. If instead you set the response timeout to 30 seconds, Thanos Query will wait 30s, kill that node and respond with partial data, if that’s enabled.
Due to implementation details, the response timeout isn’t exact. If you have multiple slow nodes, you might have to wait for multiple --store.response-timeout periods to finish. So generally I suggest setting --query.timeout way larger than --store.response-timeout. Let’s say you put --store.response-timeout=30s and you expect at most 4 slow Thanos nodes; then you should set --query.timeout to be bigger than 30s * 4 = 2 minutes.
Partial Response Strategy
Bartek did an amazing job with the Partial Response Strategy. If you are running alerts on Thanos Ruler, please take a look at the documentation.
Basically, you can now specify per rule group what to do about partial responses. Right now there are 2 options:
```yaml
groups:
- name: "warn strategy"
  partial_response_strategy: "warn"
  rules:
  - alert: "some"
    expr: "up"
- name: "abort strategy"
  partial_response_strategy: "abort"
  rules:
  - alert: "some"
    expr: "up"
- name: "by default strategy is abort"
  rules:
  - alert: "some"
    expr: "up"
```
Warn means that partial data will be tolerated and warnings will be logged. Abort, which is the default behaviour, means that alerts will start failing.
Thanos Receiver
Basically, Thanos Receiver exposes the Prometheus Remote Write API and ships the blocks to S3 buckets. This way, if you can’t add a Thanos Sidecar to Prometheus due to * reasons, you can configure Prometheus to push data via the Remote Write API to Thanos Receiver, which will push the data into S3, just like the Thanos Sidecar does.
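On the Prometheus side this is just a standard remote_write entry. A minimal sketch – the host, port and path here are placeholders, so check the Thanos Receive docs for your actual endpoint:

```yaml
# Hypothetical Prometheus config snippet: push samples to a Thanos Receive
# instance via the Remote Write API. Host, port and path are placeholders.
remote_write:
  - url: "http://thanos-receive.example.org:19291/api/v1/receive"
```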
I don’t think many users will need this component, but it’s a really interesting way to integrate with Prometheus. Additionally, as of today it’s still experimental, so please be cautious when trying it out.
That’s all that I have for today. Thanks for reading!