There are some newer data structures that take this to the next level such as T-...

djk447 · on Sept 14, 2021

NB: Post author here.

Yeah, that was one of the reasons we chose it as one of the ones to implement; it seemed like a really interesting tradeoff. We also used UDDSketch[1], which provides relative error guarantees, which is pretty nifty. We thought the two provided different enough tradeoffs that we wanted to implement both.

[1]: https://arxiv.org/abs/1908.10693
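
To illustrate what a relative error guarantee means in practice, here is a minimal Python sketch of logarithmic bucketing, the general idea behind sketches in this family. It is not the Toolkit's implementation and leaves out any mechanism for bounding the number of buckets; `LogBucketSketch`, `alpha`, and the method names are assumptions made up for this example, and it handles positive values only.

```python
import math
from collections import Counter


class LogBucketSketch:
    """Illustrative log-bucketed quantile sketch (hypothetical, positive values only)."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha                       # target relative error
        self.gamma = (1 + alpha) / (1 - alpha)   # ratio between bucket boundaries
        self.buckets = Counter()                 # bucket index -> count
        self.count = 0

    def add(self, x):
        if x <= 0:
            raise ValueError("this sketch only handles positive values")
        # Bucket i covers the interval (gamma^(i-1), gamma^i].
        index = math.ceil(math.log(x, self.gamma))
        self.buckets[index] += 1
        self.count += 1

    def merge(self, other):
        # Merging adds bucket counts, so the result does not depend on order.
        self.buckets.update(other.buckets)
        self.count += other.count

    def quantile(self, q):
        if self.count == 0:
            raise ValueError("empty sketch")
        rank = q * (self.count - 1)
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen > rank:
                # This point is within a factor of alpha of every value in the bucket.
                return 2 * self.gamma ** index / (self.gamma + 1)
        raise AssertionError("unreachable for 0 <= q <= 1")
```

Because every value in a bucket is within a factor of `alpha` of the bucket's representative point, the returned estimate is within that relative error of the value at the requested rank, whether that value is 0.5 or 5,000,000.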

vvern · on Sept 14, 2021

Is it using https://github.com/tvondra/tdigest under the hood, or a separate implementation?

mfreed · on Sept 14, 2021

If folks are interested:

https://github.com/timescale/timescaledb-toolkit/blob/main/e...

(The TimescaleDB Toolkit is also implemented in Rust)

rogers18445 · on Sept 15, 2021

Facebook seems to have an even higher-performance implementation using sqrt. Might make sense to port that over to Rust. https://github.com/facebook/folly/blob/master/folly/stats/TD...

Lockerman · on Sept 14, 2021

a separate implementation

_0ffh · on Sept 15, 2021

Hi, an unrelated nitpick: the relative error should be calculated by dividing the error by the true value, not by its approximation. Still, very nice writeup!

djk447 · on Sept 15, 2021

Thanks! I messed up the formula but had it right in the text :( Fixed now.

jeremysalwen · on Sept 15, 2021

Not an expert on this topic, but I noticed that the KLL algorithm (published in 2016) was not mentioned in this thread. It provides theoretically optimal performance for streaming quantiles with guaranteed worst-case performance (and is pretty fast in practice): http://courses.csail.mit.edu/6.854/20/sample-projects/B/stre...

djk447 · on Sept 15, 2021

NB: Post author here.

Interesting, will have to take a look! Thanks for sharing!

riskneutral · on Sept 14, 2021

Also relevant: Single-Pass Online Statistics Algorithms

[1] http://www.numericalexpert.com/articles/single_pass_stat/

breuleux · on Sept 14, 2021

That's pretty neat! Can these be used to efficiently compute rolling percentiles (over windows of the data), or just incremental?

WireBaron · on Sept 14, 2021

The UDDSketch (default) implementation will allow rolling percentiles, though we still need a bit of work on our end to support it. There isn't a way to do this with TDigest, however.

jeffbee · on Sept 14, 2021

Sure there is. You simply maintain N phases of digests, and every T time you evict a phase and recompute the summary (because T-digests are easily merged).
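
A minimal Python sketch of the phase idea, under loud assumptions: `ExactSummary` is a stand-in that simply stores every value (a real t-digest would compress them), and `PhasedWindow`, `rotate`, and the other names are hypothetical. The only property it relies on is that summaries can be merged.

```python
from collections import deque


class ExactSummary:
    """Stand-in for a mergeable digest; keeps all values instead of compressing."""

    def __init__(self):
        self.values = []

    def add(self, x):
        self.values.append(x)

    def merge(self, other):
        self.values.extend(other.values)

    def quantile(self, q):
        if not self.values:
            raise ValueError("empty summary")
        ordered = sorted(self.values)
        return ordered[int(q * (len(ordered) - 1))]


class PhasedWindow:
    """Keeps the most recent `num_phases` summaries and merges them on query."""

    def __init__(self, num_phases):
        # A deque with maxlen drops the oldest phase when a new one is appended.
        self.phases = deque([ExactSummary()], maxlen=num_phases)

    def add(self, x):
        self.phases[-1].add(x)

    def rotate(self):
        # Call once every T: starts a new phase and evicts the oldest if full.
        self.phases.append(ExactSummary())

    def quantile(self, q):
        merged = ExactSummary()
        for phase in self.phases:
            merged.merge(phase)
        return merged.quantile(q)
```

The window here advances in steps of one phase rather than one observation, so its resolution is T.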

djk447 · on Sept 14, 2021

I think this would be a tumbling window rather than a true "rolling" tdigest. I suppose you could decrement the buckets, but it gets a little weird, as splits can't really be unsplit. The tumbling-window approach would probably work, though TDigest is a little weird on merge: it's not completely deterministic with respect to ordering and merging (UDDSketch is). So you'd likely get something that is more than good enough, but it wouldn't be the same as if you had calculated it directly, which makes it a little confusing and difficult.

(NB: Post author here).

cyral · on Sept 14, 2021

This is what I do. It's not a true rolling digest, but it works well enough for my purposes.

fotta · on Sept 14, 2021

Yep, I had to implement t-digest in a monitoring library. Another alternative (although older) that the Prometheus libraries use is CKMS quantiles [0].

[0] http://dimacs.rutgers.edu/~graham/pubs/papers/bquant-icde.pd...

convolvatron · on Sept 14, 2021

I think the new ones started with Greenwald-Khanna. But I definitely agree - p^2 can be a little silly and misleading. In particular, it is really poor at finding those little modes on the tail that correspond to interesting system behaviours.

cyral · on Sept 14, 2021

That sounds familiar, I remember reading about Greenwald-Khanna before I found T-Digest after I ran into the "how to find a percentile of a massive data set" problem myself.