There are some newer data structures that take this to the next level such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%)
Yeah, that was one of the reasons we chose it as one of the ones to implement, seemed like that was a really interesting tradeoff, we also used uddsketch[1] which provides relative error guarantees, which is pretty nifty. We thought they provided different enough tradeoffs that we wanted to implement both.
Hi, in an unrelated nitpick: The relative error should be calculated by dividing the error by the true value, not by it's approximation. Still, very nice writeup!
Not an expert on this topic but I noticed that the KLL algorithm (published in 2016) was not mentioned in this thread, which provides theoretically optimal performance for streaming quantiles with guaranteed worst case performance: http://courses.csail.mit.edu/6.854/20/sample-projects/B/stre... (And is pretty fast in practice).
The UDDSketch (default) implementation will allow rolling percentiles, though we still need a bit of work on our end to support it. There isn't a way to do this with TDigest however.
Sure there is. You simply maintain N phases of digests, and every T time you evict a phase and recompute the summary (because T-digests are easily merged).
I think this would be a tumbling window rather than a true "rolling" tdigest. I suppose you could decrement the buckets, but it gets a little weird as splits can't really be unsplit. The tumbling window one would probably work, though Tdigest is a little weird on merge etc as it's not completely deterministic with respect to ordering and merging (Uddsketch is) so it's likely you get something that is more than good enough, but wouldn't be the same as if you just calculated it directly so it gets a little confusing and difficult.
i think the new ones started wtih Greenwald-Khanna. but i definately agree - p^2 can be a little silly and misleading. in particular it is really poor at finding those little modes on the tail that correspond to interesting system behaviours.
That sounds familiar, I remember reading about Greenwald-Khanna before I found T-Digest after I ran into the "how to find a percentile of a massive data set" problem myself.
[1]: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest