swquantiles: sliding window quantiles

Recently I have been investigating efficient implementations of sliding window quantiles. The motivating case is to be able to grab statistics about high frequency events such as page loads or search requests. Currently I've focused on exact methods, but have a fair amount of literature on inexact solutions saved to read. My focus has been on trying to answer whether we can throw enough space at the problem to bring the complexity down to constant time per event per quantile being monitored.

First, a quick rundown of the restrictions I've been using. Our sliding window is based on both a maximum age, t, and a maximum number of events in that time window, N. All observed values must fall into a range [0, V) -- though a potential extension is discussed in the future work section. We will be observing q distinct, arbitrary, quantiles.

It's clear that certain quantiles can be measured very efficiently in the unbounded window case. Both min and max can be recorded in constant space and time over a stream. On the other hand, when the current min goes out of a bounded window, we must be able to efficiently compute the next lowest value. We can do this by tracking a min heap, but this increases time to O(log n). There is a classic algorithms for computing the streaming median using two min/max heaps which works well with a bounded window size. How this can be extended to arbitrary quantiles, I do not yet know.

Current proposals

I've been writing a small c++ library in my spare time to benchmark and test various solutions I've thought of. My current focus has been towards maintaining a simple dictionary of potential values to frequency of occurrence, potentially with some extra supporting structures. In any case, to have exact results the original values and their timestamps must be maintained. I use a circular buffer for this and encode times as deltas.

Quantile pointers

The first solution I worked with maintains a "pointer" into the sorted array of values in the window where the quantile in question resides. Since we do not maintain a sorted array, this is a pointer into the conceptual array that we can manipulate by examining the frequency dictionary. When a value lower than the current value of the quantile is inserted, the index of the pointer moves forward by one but does not change value, and similarly if the value is greater neither the index nor value changes. If the index of this pointer is now different from the index it should be at the new size, it is moved.

This is efficient when the quantile points inside a run of the same value. For example, if our sorted values are [0, 1, 1, 1, 2] and we're maintaining the median then values can be added on either side in constant time. If the median eventually moves outside of the run of "1"s in the middle, we need to find the next lowest or next highest value. This is called the fixed universe successor problem. If we maintain no extra structures beyond the list of events and the frequency dictionary, then this is a brute force search over the value space.

As long as we do not enter a case where we need the next/previous value, this ends up being O(q) time per event, simple increments and decrements into a directly addressed array. When a search must be performed for the next or previous value however, we gain an extra O(V) step.

Bitsets

One way the predecessor / successor problem can be sped up is by maintaining a bitset of the values in the frequency dictionary. This adds an overhead of 1 bit per potential value, but lets us make leaps and bounds over holes we wouldn't otherwise be able to detect. Suppose the code is running on a machine with a 64 bit word, and we must find the successor to the value 5. We can lookup the word in the bitset with 5 in it (the first word), and mask off the first 6 bits, leaving us with the remaining 58. If the remaining word is 0, then there is no successor in the first block of 64 values and we continue with the next word in the bitset. When we encounter a non-zero word, we can leverage native instructions such as bit scan reverse to find the index of the first bit set and map it back to our value space.

Overall, this has the same complexity as before, but with a much reduced constant. Under "normal" operation we are still O(q), or constant time per event per quantile, but when we enter a search for the next or previous value we hit a O(V/w) step, where w is the word size.

Van Emde Boas trees

The two solutions so far in essence still scroll across the entire value space worst case, but perhaps with a low (1/64) constant. We could instead maintain a search tree of the values currently present in the frequency dictionary. The obvious first choice, some type of binary search tree presents an O(log V) time complexity where V is the number of distinct values in the value space. Using a more complicated tree such as Van Emde Boas trees, this can be reduced to O(log log V).

Future work

The current code requires a fixed range of unitless values. For instance, the value space could be limited to [0, 255] ms for response times. In practical terms, at some point a response time might as well be infinity for the purpose of informing decisions on future work and we can use saturated arithmetic for out of range values.

On the other hand, it may be best to have a dynamic range where we have guarantees about the maximum relative error of the values in the results. In decimal, we may choose to ensure that we have a certain number of significant digits. For instance we store exact values for 1-99, a number within 10 for 100-999, a number within 100 for 1,000-9,999 etc.. I believe this can be implemented efficiently again with bitmasking and an instruction like bit scan reverse, or a simple if tree.

All my posts need more time :p Some not entirely disjoint documentation can be found in the project's current README.

originally posted 2016-02-03