How to improve the efficiency of the time series rolling sort function (TS_RANK) in Python?

1. What is TS_RANK?

TS_RANK(X, n) takes a time series X and, for each fixed-length window, returns the rank of the window's last value within that window. Put simply, at each moment it tells you where the current value of X stands relative to the recent past. This function is worth discussing on its own because it is used very frequently when mining signals.

For example, take the time series [1,2,3,4,5,6] with a fixed window of 3. The first two positions cannot be computed because there is not enough history. For [1,2,3], 3 is the largest, so its rank is 3. Rolling forward in the same way, for [2,3,4], [3,4,5] and [4,5,6], the rank of the last value is also 3. In the end we get [3,3,3,3].

Because different window lengths put the raw ranks on different scales, the results are hard to compare with each other, so each rank is divided by the window length to normalize the result into [0,1]. For the example above this gives [1,1,1,1]. For another example, [1,6,5,2,4,3] gives [0.66, 0.33, 0.66, 0.66].
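To make the definition concrete, here is a minimal brute-force sketch (the name ts_rank_naive and the "count of values less than or equal to the last" tie rule are my own choices, chosen to match the worked example above):

import numpy as np

def ts_rank_naive(x, n):
    # Ascending rank of the last element of each length-n window, divided by n
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)            # positions without a full window stay NaN
    for i in range(n - 1, len(x)):
        window = x[i - n + 1 : i + 1]
        out[i] = np.sum(window <= window[-1]) / n
    return out

print(ts_rank_naive([1, 6, 5, 2, 4, 3], 3))  # approx. [nan nan 0.66 0.33 0.66 0.66]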

2. Python implementation of TS_RANK

A common way to implement this kind of rolling calculation on a time series is pandas.rolling(). Assuming the price DataFrame is df, the usual way to write it is:

df.rolling(n).apply(lambda x: get_sort_value(x)/n)

Since pandas' rolling does not come with a built-in way to get the rank of the last element, we use apply plus a lambda and a self-written get_sort_value. For our purposes, get_sort_value is a function that takes the window array and returns the rank of its last element; dividing by n normalizes the result.
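For concreteness, a minimal sketch of this pattern might look as follows (the get_sort_value below is just one possible implementation, and the price column is made up):

import numpy as np
import pandas as pd

def get_sort_value(arr):
    # Ascending rank of the last element (ties counted as "less than or equal")
    return np.sum(arr <= arr[-1])

n = 3
df = pd.DataFrame({'price': [10.0, 12.5, 11.0, 9.5, 13.0, 12.0]})  # hypothetical prices
ts_rank = df['price'].rolling(n).apply(lambda x: get_sort_value(x) / n, raw=True)
print(ts_rank.tolist())  # approx. [nan, nan, 0.67, 0.33, 1.0, 0.67]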

There are many possible implementations of the core get_sort_value. The snippets below come from the discussion at https://github.com/pandas-dev/pandas/issues/9481 ; they contained some minor errors, which the author has fixed.

import numpy as np
import pandas as pd
import bottleneck as bn
from scipy import stats

# All four helpers return the DESCENDING rank of the last element
# (1 = largest), as in the linked GitHub thread.

def rollingRankOnSeries(array):
    s = pd.Series(array)
    return s.rank(method='min', ascending=False).iloc[-1]

def rollingRankSciPy(array):
    return array.size + 1 - stats.rankdata(array)[-1]

def rollingRankBottleneck(array):
    return array.size + 1 - bn.rankdata(array)[-1]

def rollingRankArgSort(array):
    return array.size - array.argsort().argsort()[-1]

Among them, the first implementation uses the pandas rank function; because converting the array to a Series on every call is costly, its efficiency is not discussed further. The second implementation uses SciPy's rankdata, the third uses the bottleneck library, and the last one uses numpy's argsort. Note that, as written, all four return the descending rank (1 = largest); flip them if you want the ascending convention from section 1.

Experiments show that bottleneck's rankdata is slightly faster than SciPy's and numpy's: about 4 s on average, versus roughly 6 s for SciPy and numpy.
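For reference, a minimal timing harness along these lines might look like this (illustrative only: the array length and window are made up, and absolute timings depend heavily on data size and hardware):

import timeit
import numpy as np
import pandas as pd

x = pd.Series(np.random.randn(100_000))  # hypothetical data
n = 10

for fn in (rollingRankSciPy, rollingRankBottleneck, rollingRankArgSort):
    # raw=True passes a plain numpy array to the helper, avoiding Series construction
    t = timeit.timeit(lambda: x.rolling(n).apply(fn, raw=True) / n, number=1)
    print(fn.__name__, f"{t:.2f} s")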

3. Speed up

In fact, we can go even faster. The reason is that the approach so far treats every window as an independent sequence, which costs an O(n log n) sort each time. But because consecutive windows overlap, the sequence we sort at the current moment differs from the one at the previous moment by only a single element.

For example, for the time series [1,2,3,4,5,6] and a window of 4, once [1,2,3,4] is sorted, for the next window [2,3,4,5] we only need to remove 1, add 5, and read off the rank of 5. So we can use a better data structure to reach the goal: one that stores the window and supports insertion, deletion, and rank queries efficiently.

The idea is good, but we also have to consider the speed of the Python implementation. Efficient library functions are implemented in C/C++, so if we cannot find a suitable library, hand-rolling a data structure ourselves will only be slower. After a quick search, the author found a workable option: SortedList. SortedList comes from the sortedcontainers package and keeps its elements sorted while supporting add and remove operations.

from numba import jit
from sortedcontainers import SortedList
import numpy as np

@jit(forceobj=True)  # SortedList is a plain Python object, so numba can only run this in object mode
def TS_RANK(x, n):
    res = [np.nan] * (n - 1)                        # no full window for the first n-1 positions
    sl = SortedList(x[:n])
    res.append((sl.bisect_left(x[n - 1]) + 1) / n)  # rank of the last element of the first window
    for i in range(n, len(x)):
        sl.remove(x[i - n])                         # drop the element leaving the window
        sl.add(x[i])                                # add the newly arrived element
        res.append((sl.bisect_left(x[i]) + 1) / n)  # normalized ascending rank of x[i]
    return res

The new TS_RANK function is implemented as above, with a numba jit decorator on top (here numba can only run it in object mode, so most of the gain comes from the incremental algorithm itself). After testing, the running time drops to about 0.11 s, more than 50 times faster than the brute-force numpy version.
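As a quick sanity check, reusing the toy series from section 1 (illustrative):

print(TS_RANK([1, 6, 5, 2, 4, 3], 3))
# [nan, nan, 0.666..., 0.333..., 0.666..., 0.666...]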

Fortunately, Python offers an even more elegant option, and it lives in the bottleneck library we already mentioned: move_rank computes the rank of the last value of each moving window, and the call is simple:

bn.move_rank(x, window=n)

After testing, this runs in about 0.09 s, roughly on par with our handwritten version.
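A small usage sketch (note: per the bottleneck documentation, move_rank scales its output to [-1, 1] rather than dividing a 1..n rank by n, so the rescaling line below is my own and ignores ties):

import numpy as np
import bottleneck as bn

x = np.random.randn(1_000_000)             # hypothetical price series
n = 10

r = bn.move_rank(x, window=n)               # rank of the last element per window, scaled to [-1, 1]
ts_rank = ((r + 1) * (n - 1) / 2 + 1) / n   # map back to the (0, 1] convention used above
print(ts_rank[:n + 2])                      # the first n-1 entries are NaN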

However, this function has a drawback: its handling of the first n-1 elements is limited, and they are all assigned missing values. If n is large, this can cause problems. With a self-written function we can flexibly choose how to fill these leading values according to our needs, so which one to use deserves some thought.
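One partial mitigation, if I read the bottleneck docs correctly, is the min_count argument, which ranks the leading windows over however many observations are available instead of returning NaN:

r = bn.move_rank(x, window=n, min_count=1)  # rank over partial windows at the start instead of NaN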

4. Conclusion

This article shows that numpy + a good algorithmic idea + numba can get remarkably close to a C implementation written by someone else. At the same time, many Python beginners may not even reach the 4-6 s baseline solution, and when the amount of data grows sharply this can seriously hurt a quant's research efficiency.

Origin: blog.csdn.net/weixin_52071682/article/details/115148717