A different and often better way to downsample your Prometheus metrics (timescale.com)
96 points by LoriP on Oct 22, 2021 | hide | past | favorite | 21 comments


At some point somebody will "invent" the circular buffers with multiple data resolutions that RRDtool had, and maybe we'll get compact and fast time-series storage and reporting again.
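A toy sketch of that idea, assuming nothing about RRDtool's actual on-disk format: a fine-resolution ring buffer that rolls each completed window up into a coarser ring (all names illustrative):

```python
from collections import deque

class MultiResolutionBuffer:
    """Toy RRD-style store: a fine ring plus a coarse ring that keeps
    the average of every `factor` fine samples."""

    def __init__(self, fine_size=60, coarse_size=60, factor=10):
        self.fine = deque(maxlen=fine_size)      # e.g. 1-second samples
        self.coarse = deque(maxlen=coarse_size)  # e.g. 10-second averages
        self.factor = factor
        self._pending = []                       # samples awaiting rollup

    def push(self, value):
        self.fine.append(value)
        self._pending.append(value)
        if len(self._pending) == self.factor:
            # a fine window is complete: downsample it into the coarse ring
            self.coarse.append(sum(self._pending) / self.factor)
            self._pending.clear()
```

Old data simply falls off the end of each fixed-size deque, which is what keeps the storage footprint constant.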


RRD is a terrible format. It's not compact, it quickly becomes a burden when you have lots of cardinality, there's no metadata, HA is a joke, and visualisation tools are basically non-existent. You need a whole set of extra tooling for metadata, visualisation, HA, and querying to come even close to anything usable.

If RRD wasn't so terrible there wouldn't have been a myriad of replacements.


Interesting - it looks like this is the only past HN thread about it:

Beyond NoSQL: Using RRD to store temporal data - https://news.ycombinator.com/item?id=2742486 - July 2011 (18 comments)

I found a few other tiny threads asking about replacements, and that was it.


I'm curious how this can both avoid the average-of-averages problem (presumably by using the original full-rate data to compute multiple aggregates) and also support backfilling. Is there a danger of the full-rate data expiring and having a different behavior for backfills past that horizon? Or am I wholly misunderstanding both these features?


(NB: post author here)

Great question. We handle averages of averages correctly by storing the intermediate state of the aggregate (for average, that's the sum and the count), so we can cleanly re-aggregate.

Eventually, we'll be able to incrementally update the aggregate on backfill even if the raw data is no longer available. That isn't implemented yet, though: today, backfill updates the aggregate only if the raw data is still around, by re-computing the intermediate state of the aggregate from the raw data for the affected buckets. In most cases that isn't actually an issue, since most people have a data retention period longer than their backfill horizon.
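A rough sketch of the scheme described above (all names hypothetical, not Promscale's actual implementation): each bucket stores the partial state (sum, count) rather than a finished average, and a backfilled bucket is rebuilt from whatever raw samples are still retained:

```python
def ingest(raw_samples, aggregates, bucket_width, ts, value):
    """Store a raw sample and incrementally update its bucket's partial state."""
    raw_samples.append((ts, value))
    start = (ts // bucket_width) * bucket_width
    s, c = aggregates.get(start, (0.0, 0))
    aggregates[start] = (s + value, c + 1)   # partial state: (sum, count)

def rebuild_bucket(raw_samples, bucket_width, aggregates, t):
    """Recompute one bucket's partial state from raw data after a backfill.
    Only possible while the raw samples for that bucket are still retained."""
    start = (t // bucket_width) * bucket_width
    in_bucket = [v for (ts, v) in raw_samples if start <= ts < start + bucket_width]
    aggregates[start] = (sum(in_bucket), len(in_bucket))
```

The final average for any bucket (or any merge of buckets) is just total sum divided by total count, computed at query time.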


Thanks for the answer! I'd love to know more :) Also, I'm not following how you deal with unique counts. For example, let's say you've got 100 unique visitors on Monday and 100 on Tuesday. The unique visitors across both days might be anywhere between 100 and 200, and averaging counts between days doesn't work.


Not sure about this specific implementation, but normally you handle this with approximations that support merging, e.g. HyperLogLog. You can merge two HyperLogLog counters and maintain proper distinct counts.
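A minimal, illustrative HyperLogLog (not production-grade; real libraries add further bias corrections) that shows why merging works: registers combine with element-wise max, so merging two sketches yields exactly the sketch of the union of their inputs:

```python
import hashlib
import math

class HLL:
    """Tiny HyperLogLog sketch with 2**p registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def _hash(self, item):
        # 64-bit hash derived from SHA-1 (any good 64-bit hash works)
        return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")

    def add(self, item):
        x = self._hash(item)
        idx = x >> (64 - self.p)                      # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading-zero run + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        out = HLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)   # linear counting for small n
        return round(est)
```

Because merge is just max, duplicates between the two inputs are counted once, which is exactly what a naive sum of counts gets wrong.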


Yep!

And in fact, that's exactly what TimescaleDB supports - things like hyperloglog to support approximate count distinct, including as part of continuous aggregates. [0]

This blog post - "How PostgreSQL aggregation works and how it inspired our hyperfunctions’ design" - provides a really nice description of how the API design of some of our analytical functions is motivated by the ability to "split" processing into "pre-aggregation" and "finalization" steps, with the blog post focusing on the example of percentile approximation. (I think it was on HN a while back as well.) [1]

[0] https://blog.timescale.com/blog/introducing-hyperfunctions-n...

[1] https://blog.timescale.com/blog/how-postgresql-aggregation-w...


Awesome, thank you!


You avoid average-of-averages by storing multiple summaries. For example, you don't compute and store average, you compute and store sum and count.
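A tiny numeric illustration of the difference, with deliberately uneven bucket sizes:

```python
# Two buckets with uneven sample counts.
bucket1 = [1, 2, 3]        # avg 2.0, sum 6, count 3
bucket2 = [10]             # avg 10.0, sum 10, count 1

# Wrong: averaging the per-bucket averages ignores bucket sizes.
avg_of_avgs = (sum(bucket1) / len(bucket1) + sum(bucket2) / len(bucket2)) / 2

# Right: merge the stored partials (sum, count), then finalize.
total = sum(bucket1) + sum(bucket2)   # 16
count = len(bucket1) + len(bucket2)   # 4
true_avg = total / count              # 4.0
```

The two answers only coincide when every bucket holds the same number of samples, which real scrape data rarely guarantees.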


Just one more note. Timescale is hiring, including for roles working on Promscale.

https://www.timescale.com/careers

Promscale roles are listed in the "Observability" section.


Congrats Timescale on being #1 on the front page 3 days in a row!


(Timescale co-founder)

Thank you for noticing :-)

This is really a testament to all of the amazing products, new features, R&D, and overall work that the team has been shipping.

We are firing on all cylinders. Move fast without breaking things :-)

If this looks like fun to anyone - we're hiring!

Come and help us build the next great database company:

https://www.timescale.com/careers


This sounds awesome! But is it the right approach if I am just running a simple Prometheus instance on my home NAS? I've wondered for a while how I can persist my Prometheus timeseries, I guess I could use promscale for this, but maybe it's overkill for something this simple. Advice appreciated :)


Indefinite persistence of time-series data "as-is" is a somewhat different use-case from putting them in a data warehouse so you can efficiently do rollups on them. Timescale seems to be useful for the latter, but I'm not sure it offers much value for the former.

I believe the state-of-the-art for plain-old Prometheus data retention is https://thanos.io/ — my understanding is that it's a Prometheus remote storage integration (https://prometheus.io/docs/prometheus/latest/storage/#remote...) that archives time-series data from Prometheus into an object-store, and then fetches/streams chunks back out from said object-store to serve requests.

You could use it locally on your NAS, by running a Minio instance on there. But IMHO there wouldn't be much point in doing that, over just keeping all the data in Prometheus's own internal storage.


> But IMHO there wouldn't be much point in doing that, over just keeping all the data in Prometheus's own internal storage.

So far, from what I've read, Prometheus isn't built for that. It stores all its data in RAM, and asking it to store any more leads to running out of memory quickly. What have I missed?


(NB: Post author and Promscale dev here)

Promscale does both data storage and analysis/rollups. It's like Thanos in that you can use it as a remote storage backend. It has the additional functionality of then aggregating/analyzing the data in SQL.


Just be careful with Thanos if you're storing in S3. We got a 17k bill for rolling up/compacting our metrics, which far outweighed the cost of just keeping them at the original resolution. :}


Years ago I had a Graphite installation where I configured retention policies, and the same for InfluxDB if my memory doesn't fail me.

The downsampling feature at first glance seems to serve a different use case than Prometheus was built for, which I think is observability and alerting for a relatively short time period. For systems that need to work with years of data it totally makes sense, but I don't think Prometheus is used in those cases.

Since this feature was built for a reason, however, I could be wrong.


Prometheus without any supporting tooling isn't really designed for long-term storage, as I understand it; however, it is built to support long-term storage and querying via its remote read/write protocol. Prometheus will write data to remote storage, and can delegate queries to that storage rather than using its own local storage as it does by default.
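For reference, wiring Prometheus to a remote backend looks roughly like this in prometheus.yml (the Promscale host/port and endpoint paths shown are illustrative defaults; check the docs for your deployment):

```yaml
# prometheus.yml — illustrative fragment
remote_write:
  - url: "http://promscale:9201/write"
remote_read:
  - url: "http://promscale:9201/read"
    read_recent: true   # also serve recent data from the remote store
```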

Of the various tools that expose the remote read/write APIs, I like the looks of Promscale/TimescaleDB the most so far, but other options like Thanos might make more sense if you need to collect metrics from a bunch of Prometheus instances. That said, maybe you can still use Promscale/TimescaleDB with Thanos as the storage backend; I can't recall the details of its requirements, so it might not be suitable for that case. For my own use cases, though, Promscale is a great solution.


(NB: Promscale team member)

Thanks for the positive feedback!

Is there anything in particular you are missing in Promscale to be used as a backend for multiple Prometheus instances?

We added support for multi-tenancy a couple of months ago (https://blog.timescale.com/blog/simplified-prometheus-monito...)

And thanks to a community contribution by 2nick on GitHub, Promscale can be integrated with Thanos :) (https://github.com/timescale/promscale/pull/664)



