Implementing similar K-lines with the Pearson correlation coefficient, and optimizing its performance

Concept introduction

The similar K-line is a classic product built on the adage that "history repeats itself", and many stock-trading applications now offer a similar K-line feature. The figure below shows one product's similar K-line view:
(figure: a similar K-line rendering from a trading product)
From the matches a similar K-line displays, investors can gauge the possible future trend of an individual stock, which offers some guidance for investment decisions.
This article briefly describes how to compute similar K-lines and discusses some of the trickier details encountered in the implementation.

Calculation and implementation

The implementation of similar K-lines splits into two parts: the first is similarity matching; the second is ranking and filtering.

Similarity matching

For similarity matching we use the Pearson product-moment correlation coefficient for correlation testing. Detailed derivations of the Pearson coefficient are easy to find online; here we use the final formula directly:

P_{X,Y}=\frac{\sum{XY}-\frac{\sum{X}\sum{Y}}{N}}{\sqrt{\left(\sum{X^2}-\frac{(\sum{X})^2}{N}\right)\left(\sum{Y^2}-\frac{(\sum{Y})^2}{N}\right)}}

//(∑XY-∑X*∑Y/N)/Math.sqrt((∑X^2-(∑X)^2/N)*(∑Y^2-(∑Y)^2/N))

The formula measures similarity via means and covariance. For K-line data, the inputs X and Y are two series of consecutive price data. Evaluating the Pearson formula yields a correlation coefficient between -1 and 1; the closer the result is to 1, the higher the similarity. The following JavaScript code is a complete implementation of the formula:

/*
 * Pearson correlation coefficient
 * (∑XY-∑X*∑Y/N)/Math.sqrt((∑X^2-(∑X)^2/N)*(∑Y^2-(∑Y)^2/N))
 */
pearsonManager=(function(){
    var compare,calcCov,calcDenominator;

    /*
     * Covariance numerator: ∑XY-∑X*∑Y/N
     * @param {array} source source K-line data
     * @param {array} data K-line data to compare, data.length=source.length
     * @param {string} field field to compare on
     */
    calcCov=function(source,data,field){
        var i,l,mulE,sourceE,dataE;
        mulE=0;
        sourceE=0;
        dataE=0;
        for(i=0,l=source.length;i<l;i++){
            mulE+=source[i][field]*data[i][field];
            sourceE+=source[i][field];
            dataE+=data[i][field];
        }
        return mulE-sourceE*dataE/l;
    };

    /*
     * Pearson denominator: Math.sqrt((∑X^2-(∑X)^2/N)*(∑Y^2-(∑Y)^2/N))
     * @param {array} source source K-line data
     * @param {array} data K-line data to compare, data.length=source.length
     * @param {string} field field to compare on
     */
    calcDenominator=function(source,data,field){
        var i,l,sourceSquareAdd,sourceAdd,dataSquareAdd,dataAdd;
        sourceSquareAdd=0;
        sourceAdd=0;
        dataSquareAdd=0;
        dataAdd=0;
        for(i=0,l=source.length;i<l;i++){
            sourceSquareAdd+=source[i][field]*source[i][field];
            sourceAdd+=source[i][field];
            dataSquareAdd+=data[i][field]*data[i][field];
            dataAdd+=data[i][field];
        }
        return Math.sqrt((sourceSquareAdd-sourceAdd*sourceAdd/l)*(dataSquareAdd-dataAdd*dataAdd/l));
    };

    /*
     * Compare the similarity of two input series
     * @param {array} source source K-line data
     * @param {array} data K-line data to compare, data.length=source.length
     * @param {string} field field to compare on
     */
    compare=function(source,data,field){
        var numerator,denominator;
        if(source.length!=data.length){
            console.error("length is different!");
            return ;
        }
        numerator=calcCov(source,data,field);
        denominator=calcDenominator(source,data,field);
        return numerator/denominator;
    };

    return {
        compare:compare
    };
})();

We can try to calculate the correlation coefficient with the following data:

var testSource,testData;
testSource=[{value:1},{value:2},{value:3}];
testData=[{value:3},{value:2},{value:1}];
console.log(pearsonManager.compare(testSource,testData,"value"));

The output is -1: the two series are perfectly negatively correlated.
This gives us a way to measure similarity. Next, we look for the most similar K-line segments in the historical data of the whole market.
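One property worth noting before applying the coefficient to prices: Pearson correlation is invariant to the level and scale of the inputs, so only the shape of a series matters. A minimal standalone sketch (the `pearson` helper below is our illustrative restatement of the same formula `pearsonManager` implements, operating on plain number arrays):

```javascript
// Standalone Pearson on two equal-length numeric arrays.
function pearson(xs, ys) {
    var n = xs.length;
    var sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (var i = 0; i < n; i++) {
        sx += xs[i];
        sy += ys[i];
        sxx += xs[i] * xs[i];
        syy += ys[i] * ys[i];
        sxy += xs[i] * ys[i];
    }
    return (sxy - sx * sy / n) /
        Math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
}

// A 10-yuan stock and a 100-yuan stock with the same shape correlate at 1.
var cheap = [10, 11, 12, 11.5];
var dear = [100, 110, 120, 115];
console.log(pearson(cheap, dear)); // 1
```

This is why a cheap stock's history can still match an expensive stock's recent window: the coefficient compares shapes, not absolute prices.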

Ranking and filtering optimization

Note!
The code below has had its business-layer logic stripped out for this article; only the core computation remains. The timing results were measured on August 28, 2017. The measurements traverse the history of a single stock only, without full market data; the cost of a whole-market scan can be extrapolated from the single-stock running time. Concretely, the example compares the latest 30 days of 600570.SS against the full history of 600571.SS.

Brute-force traversal

The simplest implementation computes over the whole market: traverse by brute force, sort all of the results, and take the top matches as the similar K-lines:

/*
 * Brute-force traversal
 * 2 fields, 20ms; highest similarity 0.9505
 * 600570:600571 {position: 20130415, similar: 0.9505145006910938}
 */
compareSimilarKViolent=function(code,period,data){
    var i,l,compareData,startTime,result;
    compareData=[];
    result=[];
    for(i=0,l=data.length;i<l;i++){
        compareData[i]={
            date:data[i][0],
            open:data[i][1],
            high:data[i][2],
            low:data[i][3],
            close:data[i][4],
            amount:data[i][5]
        };
    }
    startTime=new Date().getTime();
    for(i=0,l=data.length-31;i<l;i++){
        result[i]={
            start:data[i][0],
            end:data[i+compareCount][0],
            similar:calcSimilar(sourceData,compareData.slice(i,i+compareCount))
        };
    }
    result.sort(function(a,b){
        return b.similar- a.similar;
    });
    result=result.slice(0,10);
    console.log(result);
    console.log("calc cost:",new Date().getTime()-startTime);
};

This algorithm is simple to implement and is guaranteed to find the most similar data. However, it traverses and computes over the entire data set, stores a result for every position along the way, and only finds the best matches after a full sort, so it performs extremely poorly in both time and space.
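A rough, back-of-the-envelope operation count makes the cost concrete. The sizes come from this article's example (1404 historical bars, a 30-bar window); the variable names are ours:

```javascript
// Approximate per-element additions/multiplications needed to maintain
// the Pearson sums over every window position.
var N = 1404; // total history length (600571.SS in this article)
var W = 30;   // comparison window length (compareCount)

// Brute force: each of the (N - W) window positions re-sums W elements.
var bruteOps = (N - W) * W;

// Sliding window (the trick used later in the dynamic-programming section):
// one full window, then an O(1) update per remaining position.
var slidingOps = W + (N - W);

console.log(bruteOps, slidingOps); // 41220 1404
```

The roughly 30x gap between the two counts is the headroom the later optimizations chase.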

Traversal optimization

An obvious first optimization targets the sort: since no position is ever skipped, we only need to keep the current best match during the scan, which saves both the result array's space and the final sorting time:

/*
 * Brute-force traversal, optimized to track only the running maximum;
 * no need to store the full result set
 * 2 fields, 17ms; highest similarity 0.9505
 * 600570:600571 {position: 20130415, similar: 0.9505145006910938}
 */
compareSimilarKViolentOptimize=function(code,period,data){
    var i,l,compareData,startTime,result,similarValue;
    compareData=[];
    for(i=0,l=data.length;i<l;i++){
        compareData[i]={
            date:data[i][0],
            open:data[i][1],
            high:data[i][2],
            low:data[i][3],
            close:data[i][4],
            amount:data[i][5]
        };
    }
    startTime=new Date().getTime();
    i=0;
    result={
        start:data[i][0],
        end:data[i+compareCount][0],
        similar:calcSimilar(sourceData,compareData.slice(i,i+compareCount))
    };
    for(i=1,l=data.length-31;i<l;i++){
        similarValue=calcSimilar(sourceData,compareData.slice(i,i+compareCount));
        if(result.similar<similarValue){
            result={
                start:data[i][0],
                end:data[i+compareCount][0],
                similar:similarValue
            };
        }
    }
    console.log(result);
    console.log("calc cost:",new Date().getTime()-startTime);
};

As you can see, the runtime improves by roughly 15%, and the new version is also better in space.

Divide-and-conquer algorithm

Even with that optimization we still cannot avoid large-scale computation, a weakness that becomes more pronounced in market-wide calculations. So we propose a feasible optimization: introduce the idea of divide and conquer. In this article's case we match the latest 30 days of 600570, so on each step the Pearson formula compares two arrays of length 30. To reduce the work, we select a few feature points from each window, for example the head, the middle, and the tail, and compute the correlation on those three values instead of the full window. After this rough pass, we take the top-ranked candidates and run the complete calculation only on those positions:

/*
 * Divide and conquer: rough pass on feature points, exact pass on the top cut
 * 2 fields, 4.3ms; with divide=3 the highest similarity is 0.9501
 * (the larger cut is, the closer the approximation)
 * cut*compareCount+totalLength*divide=totalLength*compareCount;
 * with compareCount=30, totalLength=1404, divide=3: cut=1263
 */
compareSimilarKCut=function(code,period,data){
    var i,j,l,compareData,startTime,result,cut,position,
        divide,divideStep,divideIndex,tempSource,tempCompare;
    compareData=[];
    result=[];
    for(i=0,l=data.length;i<l;i++){
        compareData[i]={
            date:data[i][0],
            open:data[i][1],
            high:data[i][2],
            low:data[i][3],
            close:data[i][4],
            amount:data[i][5]
        };
    }
    startTime=new Date().getTime();
    cut=100;
    divide=3;
    divideIndex=[0];
    divideStep=compareCount/(divide-1);
    tempSource=[sourceData[divideIndex[0]]];
    for(i=1;i<divide-1;i++){
        divideIndex[i]=divideStep*i;
        tempSource[i]=sourceData[divideIndex[i]];
    }
    divideIndex[i]=compareCount-1;
    tempSource[i]=sourceData[divideIndex[i]];
    for(i=0,l=data.length-1-compareCount;i<l;i++){
        tempCompare=[];
        for(j in divideIndex){
            tempCompare.push(compareData[i+divideIndex[j]]);
        }
        result[i]={
            start:i,
            similar:calcSimilar(tempSource,tempCompare)
        };
    }
    result.sort(function(a,b){
        return b.similar- a.similar;
    });
    result=result.slice(0,cut);
    for(i=0,l=result.length;i<l;i++){
        position=result[i].start;
        result[i]={
            start:data[position][0],
            end:data[position+compareCount][0],
            similar:calcSimilar(sourceData,compareData.slice(position,position+compareCount))
        };
    }
    result.sort(function(a,b){
        return b.similar- a.similar;
    });
    console.log(result);
    console.log("calc cost:",new Date().getTime()-startTime);
};

For this algorithm there is a cost-balance formula for reference (variable names match the program):

cut * compareCount + totalLength * divide = totalLength * compareCount

In this formula, cut is the number of top candidates kept after the rough pass for exact calculation (100 in this example); compareCount is the length of the comparison window (30 here); totalLength is the length of the historical data (600571's history here is 1404 bars); divide is the number of feature points (3 here, standing in for the full 30 values). divide and cut are chosen subjectively by the developer. The larger divide is, the more the rough pass costs; when divide equals compareCount the rough pass is the full calculation and the algorithm degenerates into the brute-force traversal. The larger cut is, the more the final exact pass costs, but cut must not be too small either: the rough pass is effectively a greedy step and may discard the global optimum in favor of a local one. Also note that during the calculation this algorithm must still keep a similarity array as long as the data for the final sort, so its space cost matches the brute-force version (as a minor optimization you can instead maintain a descending array of length cut holding only the current top candidates, which improves performance by about 15%).
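The balance formula can be checked mechanically. The helper below (`breakEvenCut` is a name we introduce for illustration) solves the formula for cut with this article's numbers:

```javascript
// cut*compareCount + totalLength*divide = totalLength*compareCount
// => cut = totalLength * (compareCount - divide) / compareCount
function breakEvenCut(compareCount, totalLength, divide) {
    return Math.floor(totalLength * (compareCount - divide) / compareCount);
}

console.log(breakEvenCut(30, 1404, 3)); // 1263
```

With compareCount=30, totalLength=1404 and divide=3 this yields 1263, matching the cut value quoted in the code comment above; choosing cut below this bound is what makes the two-pass scheme cheaper than the brute-force scan.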

Dynamic programming algorithm

For this problem, is there an algorithm that saves both space and time? The answer is dynamic programming.
Detailed introductions to dynamic programming are easy to find online; this article only sketches the idea: each step of the calculation reuses the results of the previous step. For the Pearson formula we can store the intermediate sums and, as the window slides, update only the head and tail elements. We then no longer need to store every similarity value and sort at the end; it suffices to maintain the intermediate state over a window of length compareCount (30 in this example), which solves the space problem. And since every step builds on the previous one, only the first step requires a full window of compareCount calculations, so in theory the algorithm's time performance is also excellent. Moreover, it still finds the exact global optimum. The code is as follows:
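The core sliding-window trick can be shown in isolation before the full implementation. This is a minimal sketch with an illustrative name (`windowSums`), not the article's production code: as the window advances one step, subtract the element that leaves and add the element that enters; the same move maintains ∑Y and ∑Y^2 in the code below.

```javascript
// Maintain the sum of every length-`width` window in O(1) per step.
function windowSums(data, width) {
    var sums = [];
    var sum = 0;
    var i;
    for (i = 0; i < width; i++) { // first window: full summation
        sum += data[i];
    }
    sums.push(sum);
    for (i = 1; i + width <= data.length; i++) {
        sum = sum - data[i - 1] + data[i + width - 1]; // drop head, add tail
        sums.push(sum);
    }
    return sums;
}

console.log(windowSums([1, 2, 3, 4, 5], 3)); // [ 6, 9, 12 ]
```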

/*
 * Dynamic programming (sliding-window updates of the Pearson sums)
 * 2 fields, 4.2ms; highest similarity 0.9505
 */
compareSimilarKDynamic=function(code,period,data){
    var i,l,compareData,startTime,result,similarValue,atomOpen,atomClose,
        tempCompare,mulOpen,mulClose;
    var calcPearson,calcAtom,calcMulAdd,dynamic;

    /*
     * Atomic sums for the Pearson formula
     * returns [∑X,∑Y,∑X^2,∑Y^2,N]
     * @param {array} source source K-line data
     * @param {array} data K-line data to compare, data.length=source.length
     * @param {string} field field to compare on
     */
    calcAtom=function(source,data,field){
        var i,l,sourceSquareAdd,sourceAdd,dataSquareAdd,dataAdd;
        sourceSquareAdd=0;
        sourceAdd=0;
        dataSquareAdd=0;
        dataAdd=0;
        for(i=0,l=source.length;i<l;i++){
            sourceAdd+=source[i][field];
            dataAdd+=data[i][field];
            sourceSquareAdd+=source[i][field]*source[i][field];
            dataSquareAdd+=data[i][field]*data[i][field];
        }
        return [sourceAdd,dataAdd,sourceSquareAdd,dataSquareAdd,l];
    };

    /*
     * Cumulative product sum ∑XY
     * @param {array} source source K-line data
     * @param {array} data K-line data to compare, data.length=source.length
     * @param {string} field field to compare on
     */
    calcMulAdd=function(source,data,field){
        var i,l,mulAdd;
        mulAdd=0;
        for(i=0,l=source.length;i<l;i++){
            mulAdd+=source[i][field]*data[i][field];
        }
        return mulAdd;
    };

    /*
     * Compute the Pearson value
     * (∑XY-∑X*∑Y/N)/Math.sqrt((∑X^2-(∑X)^2/N)*(∑Y^2-(∑Y)^2/N))
     */
    calcPearson=function(mul,data){
        return (mul-data[0]*data[1]/data[4])/(Math.sqrt((data[2]-data[0]*data[0]/data[4])*(data[3]-data[1]*data[1]/data[4])));
    };

    /*
     * Per-step sliding-window update: subtract the element leaving the
     * window, add the element entering it. Only the compare-side sums
     * (atom[1], atom[3]) change; the source sums are fixed.
     * (relies on the loop index i from the enclosing scope)
     */
    dynamic=function(atom,field){
        var value;
        value=compareData[i+compareCount-1];
        atom[1]=atom[1]-compareData[i-1][field]+value[field];
        atom[3]=atom[3]-compareData[i-1][field]*compareData[i-1][field]+value[field]*value[field];
        return atom;
    };

    compareData=[];
    for(i=0,l=data.length;i<l;i++){
        compareData[i]={
            date:data[i][0],
            open:data[i][1],
            high:data[i][2],
            low:data[i][3],
            close:data[i][4],
            amount:data[i][5]
        };
    }
    startTime=new Date().getTime();
    i=0;
    tempCompare=compareData.slice(i,i+compareCount);
    mulOpen=calcMulAdd(sourceData,tempCompare,"open");
    mulClose=calcMulAdd(sourceData,tempCompare,"close");
    atomOpen=calcAtom(sourceData,tempCompare,"open");
    atomClose=calcAtom(sourceData,tempCompare,"close");
    similarValue=0.5*calcPearson(mulOpen,atomOpen)+0.5*calcPearson(mulClose,atomClose);
    result={
        start:data[i][0],
        end:data[i+compareCount][0],
        similar:similarValue
    };
    for(i=1,l=data.length-31;i<l;i++){
        tempCompare=compareData.slice(i,i+compareCount);
        mulOpen=calcMulAdd(sourceData,tempCompare,"open");
        mulClose=calcMulAdd(sourceData,tempCompare,"close");
        atomOpen=dynamic(atomOpen,"open");
        atomClose=dynamic(atomClose,"close");
        similarValue=0.5*calcPearson(mulOpen,atomOpen)+0.5*calcPearson(mulClose,atomClose);
        if(result.similar<similarValue){
            result={
                start:data[i][0],
                end:data[i+compareCount][0],
                similar:similarValue
            };
        }
    }
    console.log(result);
    console.log("calc cost:",new Date().getTime()-startTime);
};

The only drawback of this algorithm is implementation complexity. Turning the simple loop into a dynamic-programming loop is no small challenge, and in this scenario we also had to restructure the Pearson formula to expose its intermediate state; in the code above, calcAtom and calcPearson are that restructured implementation.

Code tuning

For production-grade, high-TPS scenarios, this article offers one more implementation after "code tuning". The algorithm is still dynamic programming (algorithm 4), but the style shifts from "object-oriented" to "procedural". In the code below we merge redundant loops, remove unnecessary method-call stack overhead, and eliminate redundant memory allocation (the per-window Array.prototype.slice calls), arriving at a calculation program roughly 22 times faster. With it, a single-threaded traversal of the whole market takes only about 30 minutes.

/*
 * Dynamic programming
 * memory optimization (no per-window array allocation);
 * method-call stack overhead removed (procedural style)
 * 2 fields, 0.17ms; highest similarity 0.9505
 */
compareSimilarKDynamicOptimize=function(code,period,data){
    var i,l,compareData,startTime,result,similarValue,atomOpen,atomClose,
        j,k,mulOpen,mulClose,sourceSquareAdd,sourceAdd,dataSquareAdd,dataAdd,
        m;
    compareData=[];
    for(i=0,l=data.length;i<l;i++){
        compareData[i]={
            date:data[i][0],
            open:data[i][1],
            high:data[i][2],
            low:data[i][3],
            close:data[i][4],
            amount:data[i][5]
        };
    }
    startTime=new Date().getTime();
    //tempCompare=compareData.slice(i,i+compareCount);
    /*
     * mulOpen=calcMulAdd(sourceData,tempCompare,"open");
     * mulClose=calcMulAdd(sourceData,tempCompare,"close");
     */
    i=0;
    mulOpen=0;
    mulClose=0;
    for(l=i+compareCount;i<l;i++){
        mulOpen+=sourceData[i].open*compareData[i].open;
        mulClose+=sourceData[i].close*compareData[i].close;
    }
    /*
     * atomOpen=calcAtom(sourceData,tempCompare,"open");
     * atomClose=calcAtom(sourceData,tempCompare,"close");
     */
    sourceSquareAdd=0;
    sourceAdd=0;
    dataSquareAdd=0;
    dataAdd=0;
    for(i=0;i<l;i++){
        sourceAdd+=sourceData[i].open;
        dataAdd+=compareData[i].open;
        sourceSquareAdd+=sourceData[i].open*sourceData[i].open;
        dataSquareAdd+=compareData[i].open*compareData[i].open;
    }
    atomOpen=[sourceAdd,dataAdd,sourceSquareAdd,dataSquareAdd,l];
    sourceSquareAdd=0;
    sourceAdd=0;
    dataSquareAdd=0;
    dataAdd=0;
    for(i=0;i<l;i++){
        sourceAdd+=sourceData[i].close;
        dataAdd+=compareData[i].close;
        sourceSquareAdd+=sourceData[i].close*sourceData[i].close;
        dataSquareAdd+=compareData[i].close*compareData[i].close;
    }
    atomClose=[sourceAdd,dataAdd,sourceSquareAdd,dataSquareAdd,l];
    /*
     * similarValue=0.5*calcPearson(mulOpen,atomOpen)+0.5*calcPearson(mulClose,atomClose);
     */
    similarValue=0.5*(mulOpen-atomOpen[0]*atomOpen[1]/atomOpen[4])/(Math.sqrt((atomOpen[2]-atomOpen[0]*atomOpen[0]/atomOpen[4])*(atomOpen[3]-atomOpen[1]*atomOpen[1]/atomOpen[4])))+0.5*(mulClose-atomClose[0]*atomClose[1]/atomClose[4])/(Math.sqrt((atomClose[2]-atomClose[0]*atomClose[0]/atomClose[4])*(atomClose[3]-atomClose[1]*atomClose[1]/atomClose[4])));
    result={
        start:data[0][0],
        end:data[compareCount][0],
        similar:similarValue
    };
    for(i=1,l=data.length-31;i<l;i++){
        //tempCompare=compareData.slice(i,i+compareCount);
        /*
         * mulOpen=calcMulAdd(sourceData,tempCompare,"open");
         * mulClose=calcMulAdd(sourceData,tempCompare,"close");
         */
        mulOpen=0;
        mulClose=0;
        for(j=0,k=i,m=i+compareCount;k<m;k++,j++){
            mulOpen+=sourceData[j].open*compareData[k].open;
            mulClose+=sourceData[j].close*compareData[k].close;
        }
        /*
         * atomOpen=dynamic(atomOpen,"open");
         * atomClose=dynamic(atomClose,"close");
         */
        var value;
        value=compareData[i+compareCount-1];
        atomOpen[1]=atomOpen[1]-compareData[i-1].open+value.open;
        atomOpen[3]=atomOpen[3]-compareData[i-1].open*compareData[i-1].open+value.open*value.open;
        atomClose[1]=atomClose[1]-compareData[i-1].close+value.close;
        atomClose[3]=atomClose[3]-compareData[i-1].close*compareData[i-1].close+value.close*value.close;
        /*
         * similarValue=0.5*calcPearson(mulOpen,atomOpen)+0.5*calcPearson(mulClose,atomClose);
         */
        similarValue=0.5*(mulOpen-atomOpen[0]*atomOpen[1]/atomOpen[4])/(Math.sqrt((atomOpen[2]-atomOpen[0]*atomOpen[0]/atomOpen[4])*(atomOpen[3]-atomOpen[1]*atomOpen[1]/atomOpen[4])))+0.5*(mulClose-atomClose[0]*atomClose[1]/atomClose[4])/(Math.sqrt((atomClose[2]-atomClose[0]*atomClose[0]/atomClose[4])*(atomClose[3]-atomClose[1]*atomClose[1]/atomClose[4])));
        if(result.similar<similarValue){
            result={
                start:data[i][0],
                end:data[i+compareCount][0],
                similar:similarValue
            };
        }
    }
    //console.log(result);
    //console.log("calc cost:",new Date().getTime()-startTime);
    return new Date().getTime()-startTime;
};

Summary

Based on the discussion and tests above (program 5 is excluded, since it is a code-tuning technique rather than an algorithm design), the algorithms compare as follows:
Time cost: algorithm 3 < algorithm 4 < algorithm 2 < algorithm 1
Space cost: algorithm 2 < algorithm 4 < algorithm 3 = algorithm 1
Implementation difficulty: algorithm 1 < algorithm 2 < algorithm 3 < algorithm 4

The practical application of similar K-lines to financial investment

In practice, for most query segments you can find segments with similarity above 0.9 in the history of any stock listed for more than five or six years. The top similar K-lines we see on many products are just a drop in that ocean, and their subsequent trends cannot represent all of the data, so the author remains skeptical of the predictive validity of similar K-lines. What does that mean? For a given segment, one matching K-line may suggest the price will rise while another suggests it will fall; how is an investor to judge? Such divergence is bound to exist: if the similar K-lines on some product show a 100% future rise, that product simply has not computed all of the data. Similar K-lines provide only an approximation of shape; they cannot also match the current stock's financial context (news, fundamentals, and so on) against the historical data.

It's not easy to stay up late, please have a glass of wine with the author!

Origin blog.csdn.net/yuhk231/article/details/80810427