Implementing gray relational analysis (GRA) in Python

Original text: https://mp.weixin.qq.com/s/Uuri-FqRWk3V5CH7XrjArg

1 Introduction to gray correlation analysis method

A white system is one whose information is completely known, a black system is one whose information is completely unknown, and a gray system lies between the two: its internal information and characteristics are partly known and partly unknown.

Gray system theory puts forward the concept of gray correlation analysis, whose purpose is to find the numerical relationships among the various factors.

The basic idea of gray correlation analysis is to judge how closely different sequences are connected from the similarity of the geometric shapes of their curves. The discrete behavioral observations of the system factors are converted by linear interpolation into piecewise-continuous polylines, and a model is then constructed to measure the degree of correlation from the geometric characteristics of those polylines.

The closer the geometric shapes of the polylines are, the greater the correlation between the corresponding sequences, and vice versa. Gray correlation analysis is in effect a quantitative analysis of dynamic indicators, so it reflects how the indicators develop over time.

Gray Relational Analysis (GRA) measures the degree of correlation between two systems or two factors. From its magnitude, we can tell which factors have a primary and which a secondary influence on the system's development.

Note: the "correlation" here only shows which indicator is most closely related to the indicator being compared against; it does not measure statistical correlation. It is unrelated to the correlation coefficient and is not equivalent to it. The final values are only used for ranking.

Features of gray relational analysis method:

  • No dependent variable is required, the variables need not follow a normal distribution, and there are no requirements on the sample data, so the method is suitable for small samples.
  • It ranks the independent variables by their correlation with the reference variable, which gives an importance ranking. The correlation value itself has no practical meaning. For example, a value of 0.95 is only used for ranking; unlike a correlation coefficient of 0.95, it does not imply that the variables are highly positively correlated.
  • The disadvantage is that it can only rank; it cannot give an exact correlation. It does not belong to the category of correlation analysis and is classed only as a comprehensive evaluation method.

2 Calculation using gray relational analysis method

1) Determine the indicators

2) Obtain the statistical data and form it into a matrix R

Suppose X indicators are selected and each indicator has Y years of statistical data.

For ease of calculation, the Y years of data for the X indicators are arranged in a matrix R.
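One way to write this matrix is sketched below. The entry name $r_{ij}$ (the value of indicator $i$ in year $j$) and the rows-as-indicators layout are assumptions made here; the layout is chosen to match the code in section 3, where each row is one indicator.

```latex
\[
R =
\begin{pmatrix}
r_{11} & r_{12} & \cdots & r_{1Y} \\
r_{21} & r_{22} & \cdots & r_{2Y} \\
\vdots & \vdots & \ddots & \vdots \\
r_{X1} & r_{X2} & \cdots & r_{XY}
\end{pmatrix}
\]
```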

3) Perform dimensionless processing on the matrix R

There are many ways to compute gray correlation, such as absolute correlation, slope correlation, rate correlation, B-type correlation and area correlation. For the dimensionless step, min-max normalization is chosen here.
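As a rough illustration of min-max normalization, the sketch below rescales each indicator (row) to the range [0, 1]. It assumes that rows are indicators and columns are years; the helper name `min_max_normalize` is made up for this example, and the two rows are taken from the data set used in section 3.

```python
import numpy as np

def min_max_normalize(R):
    """Rescale every row of R to [0, 1] (hypothetical helper, not from the article)."""
    R = np.asarray(R, dtype=float)
    row_min = R.min(axis=1, keepdims=True)
    row_max = R.max(axis=1, keepdims=True)
    span = row_max - row_min
    if np.any(span == 0):
        # A constant row has no range, so it cannot be rescaled this way.
        raise ValueError("a constant row cannot be min-max normalized")
    return (R - row_min) / span

# Two indicator rows from the data set in section 3.
R = np.array([
    [13.6, 14.01, 14.54, 15.64, 15.69],
    [85, 85, 90, 100, 105],
])
R_norm = min_max_normalize(R)
print(R_norm.round(3))
```

Note that the implementation in section 3 offers initial-value and mean normalization rather than min-max normalization, so its numbers are not directly comparable with this sketch.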

4) Determine the reference column and comparison column

Denote the dimensionless comparison sequence as $x_0(t)$ and the reference sequence as $x_1(t)$; the values of the two sequences at the same time $k$ are then $x_0(k)$ and $x_1(k)$.

The absolute difference between $x_0(k)$ and $x_1(k)$ is $\Delta_1(k) = |x_0(k) - x_1(k)|$.

5) Determine the minimum and maximum values of the absolute difference
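Written out, with $\Delta_i(k)$ denoting the absolute difference for sequence $i$ at time $k$, the two extremes would be taken over both indices. The names $\Delta_{\min}$ and $\Delta_{\max}$ are introduced here for convenience; they correspond to `global_min` and `global_max` in the code of section 3.

```latex
\[
\Delta_{\min} = \min_i \min_k \Delta_i(k),
\qquad
\Delta_{\max} = \max_i \max_k \Delta_i(k)
\]
```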

6) Calculate the correlation coefficient

$\Delta_i(k)$ is the absolute difference between the two sequences being compared at time $k$. In gray theory $\rho \in (0, 1)$; researchers usually take $\rho = 0.5$.
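For reference, the coefficient that the code in section 3 computes can be written as below. The symbol $\alpha_i(k)$ for the coefficient is an assumption based on the notation used in the following paragraphs; $\Delta_{\min}$ and $\Delta_{\max}$ are the minimum and maximum absolute differences from step 5.

```latex
\[
\alpha_i(k) = \frac{\Delta_{\min} + \rho\,\Delta_{\max}}{\Delta_i(k) + \rho\,\Delta_{\max}}
\]
```

Because $\Delta_i(k) \ge \Delta_{\min}$, each coefficient is at most 1, and it equals 1 exactly where the difference attains the minimum.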

If $x_0(k)$ is the optimal-value data column, the larger $\alpha_i$ is, the better; if $x_0(k)$ is the worst-value data column, the smaller $\alpha_i$ is, the better.

7) Calculate the target layer correlation degree

$W$ in the formula can be calculated by the AHP method, or the mean of $\alpha$ can be taken directly.
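Steps 4 to 7 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `gra_degree` is hypothetical, the inputs are assumed to be dimensionless already, and the weights are assumed to apply to the time points $k$ (equal weights reduce to the plain mean). The rows come from the data set in section 3, divided by their first value.

```python
import numpy as np

def gra_degree(reference, comparisons, rho=0.5, weights=None):
    """Relational degree of each comparison row to the reference row (sketch)."""
    reference = np.asarray(reference, dtype=float)
    comparisons = np.asarray(comparisons, dtype=float)
    delta = np.abs(comparisons - reference)            # step 4: absolute differences
    d_min, d_max = delta.min(), delta.max()            # step 5: global extremes
    if d_max == 0:
        # Every comparison row equals the reference, so every degree is 1.
        return np.ones(comparisons.shape[0])
    coeff = (d_min + rho * d_max) / (delta + rho * d_max)  # step 6: coefficients
    if weights is None:
        return coeff.mean(axis=1)                      # step 7: plain mean
    weights = np.asarray(weights, dtype=float)
    return coeff @ (weights / weights.sum())           # step 7: weighted mean

# Rows 0 (reference), 1 and 13 of the data in section 3, initial-value normalized.
x0 = np.array([13.6, 14.01, 14.54, 15.64, 15.69]) / 13.6
x1 = np.array([11.50, 13.00, 15.15, 15.30, 15.02]) / 11.50
x13 = np.array([120, 125, 130, 140, 140]) / 120

equal = gra_degree(x0, [x1, x13])
weighted = gra_degree(x0, [x1, x13], weights=[1, 1, 1, 1, 2])
print(equal, weighted)
```

Since the extremes here are taken over only two comparison rows, the numbers differ from the full 17-row result shown in section 3, although row 13 still ranks above row 1.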

8) Determine the evaluation level

3 Python implementation

Consider the following data set. A follow-up study of an elite female shot-putter recorded her best annual result together with 16 indicators of event-specific and physical fitness for each year from 1982 to 1986 (see the table). The task is to analyze which factors most influence her shot put performance.

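# Reference: https://blog.csdn.net/PY_smallH/article/details/121491094
# Three helper functions (gain, cost and level_) do the dimensionless processing and are called by GRA.
# gain and cost are defined below; level_ is not included in this listing.
import numpy as np
import pandas as pd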

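def GRA(df, normaliza="initial", level=None, r=0.5):
    '''
    df : 2-D data (a DataFrame here); each row is one evaluation indicator,
         and the reference indicator to compare against goes in the first row
    normaliza : ["initial", "mean"] normalization method, "initial" by default;
         initial-value and mean normalization are provided, write other methods yourself
    level : None means gain type by default;
            use "gain" (larger is better) or "cost" (smaller is better),
            or the name of a column in the DataFrame, e.g. level="level",
            where that column marks gain and cost rows with 1 and 0
    r : distinguishing coefficient in (0, 1], usually 0.5;
            the smaller it is, the more strongly the coefficients are separated
    '''
    # Check the input type
    if not isinstance(df, pd.DataFrame):
        df = pd.DataFrame(df)

    # Check the parameters (r = 0 would give 0/0 wherever the difference is 0)
    if (normaliza not in ["initial", "mean"]) or not (0 < r <= 1):
        raise ValueError("invalid parameter value")

    # Dimensionless processing for gain-type indicators
    if level == "gain" or level is None:
        df_ = gain(df, normaliza)

    # Dimensionless processing for cost-type indicators
    elif level == "cost":
        df_ = cost(df, normaliza)

    else:  # Dimensionless processing for mixed gain and cost indicators
        if level not in df.columns:  # make sure the named column exists
            raise KeyError("the table has no such column")
        df_ = level_(df, normaliza, level)  # level_ is not defined in this listing
        # The level column carries no data, so drop it before the analysis
        df_ = df_.drop(level, axis=1)

    # Subtract the reference row from every row and take absolute values
    df_ = np.abs(df_ - df_.iloc[0, :])
    global_max = df_.max().max()
    global_min = df_.min().min()
    df_r = (global_min + r * global_max) / (df_ + r * global_max)  # coefficient matrix
    return df_r.mean(axis=1)  # mean coefficient of each row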

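# gain type (larger is better)
def gain(df, normaliza):
    df = df.astype(float)  # float copy, so the caller's data is not modified
    if normaliza == "initial" or normaliza is None:
        df = df.div(df.iloc[:, 0], axis=0)    # divide each row by its first value
    elif normaliza == "mean":
        df = df.div(df.mean(axis=1), axis=0)  # divide each row by its own mean
    return df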

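# cost type (smaller is better)
def cost(df, normaliza):
    df = df.astype(float)  # float copy, so the caller's data is not modified
    if normaliza == "initial" or normaliza is None:
        df = df.rdiv(df.iloc[:, 0], axis=0)    # first value of each row divided by each entry
    elif normaliza == "mean":
        df = df.rdiv(df.mean(axis=1), axis=0)  # row mean divided by each entry
    return df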

# The data are as follows (row 0 is the reference indicator)
x = np.array([
    [13.6,14.01,14.54,15.64,15.69],
    [11.50,13.00,15.15,15.30,15.02],
    [13.76,16.36,16.90,16.56,17.30],
    [12.41,12.70,13.96,14.04,13.46],
    [2.48,2.49,2.56,2.64,2.59],
    [85,85,90,100,105],
    [55,65,75,80,80],
    [65,70,75,85,90],
    [12.80,15.30,16.24,16.40,17.05],
    [15.30,18.40,18.75,17.95,19.30],
    [12.71,14.50,14.66,15.88,15.70],
    [14.78,15.54,16.03,16.87,17.82],
    [7.64,7.56,7.76,7.54,7.70],
    [120,125,130,140,140],
    [80,85,90,90,95],
    [4.2,4.25,4.1,4.06,3.99],
    [13.1,13.42,12.85,12.72,12.56]
])

df1 = pd.DataFrame(x)
print(GRA(df1))  # default settings: gain type, initial-value normalization, r = 0.5

This gives the final correlation value for each row:

0 1.000000
1 0.588106
2 0.662749
3 0.853618
4 0.776254
5 0.854873
6 0.502235
7 0.659223
8 0.582007
9 0.683125
10 0.695782
11 0.895453
12 0.704684
13 0.933405
14 0.846704
15 0.745373
16 0.726079

The first row is the shot put result x0 itself, which is trivially the most related to itself, so it can be ignored. Of the remaining rows, x13 (full squat) has the strongest correlation, so training can target it. Finally, the rows can be sorted by their values.
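The sorting step might look like the sketch below. The values are copied from the output above; the variable names `degrees` and `ranking` are chosen here for illustration.

```python
# Correlation values copied from the output above (row index -> value).
degrees = {
    0: 1.000000, 1: 0.588106, 2: 0.662749, 3: 0.853618,
    4: 0.776254, 5: 0.854873, 6: 0.502235, 7: 0.659223,
    8: 0.582007, 9: 0.683125, 10: 0.695782, 11: 0.895453,
    12: 0.704684, 13: 0.933405, 14: 0.846704, 15: 0.745373,
    16: 0.726079,
}

# Drop the reference row 0, then sort from most to least related.
ranking = sorted(
    ((idx, val) for idx, val in degrees.items() if idx != 0),
    key=lambda item: item[1],
    reverse=True,
)
for idx, val in ranking:
    print(f"x{idx}: {val:.6f}")
```

The list starts with x13 and x11 and ends with x6, the indicator least related to the shot put result.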

Let's look again at the correlation coefficient matrix, which is obtained in step 6 of the procedure (`df_r` in the code). For the data above, the correlation matrix is as follows:

To plot it, pick out only a few rows; otherwise there are too many lines to see clearly.

The rows selected here are [0, 1, 2, 3, 4, 13, 16], where 0 is the reference indicator.
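The coefficient matrix and the row selection can be reproduced with a short NumPy sketch. It assumes the default settings of the code above (gain type, initial-value normalization, coefficient 0.5); the variable names are chosen here for illustration.

```python
import numpy as np

x = np.array([
    [13.6, 14.01, 14.54, 15.64, 15.69],
    [11.50, 13.00, 15.15, 15.30, 15.02],
    [13.76, 16.36, 16.90, 16.56, 17.30],
    [12.41, 12.70, 13.96, 14.04, 13.46],
    [2.48, 2.49, 2.56, 2.64, 2.59],
    [85, 85, 90, 100, 105],
    [55, 65, 75, 80, 80],
    [65, 70, 75, 85, 90],
    [12.80, 15.30, 16.24, 16.40, 17.05],
    [15.30, 18.40, 18.75, 17.95, 19.30],
    [12.71, 14.50, 14.66, 15.88, 15.70],
    [14.78, 15.54, 16.03, 16.87, 17.82],
    [7.64, 7.56, 7.76, 7.54, 7.70],
    [120, 125, 130, 140, 140],
    [80, 85, 90, 90, 95],
    [4.2, 4.25, 4.1, 4.06, 3.99],
    [13.1, 13.42, 12.85, 12.72, 12.56],
])

rho = 0.5
norm = x / x[:, [0]]                  # initial-value normalization, row by row
delta = np.abs(norm - norm[0])        # absolute difference to the reference row 0
d_min, d_max = delta.min(), delta.max()
coeff = (d_min + rho * d_max) / (delta + rho * d_max)  # 17 x 5 coefficient matrix

rows = [0, 1, 2, 3, 4, 13, 16]        # the rows picked out for plotting
selected = coeff[rows]
print(np.round(selected, 3))
```

The row means of `coeff` agree with the values listed earlier. If matplotlib is available, `plt.plot(selected.T)` would draw one line per selected row across the five years.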

It can be seen that x13 has the strongest relationship, so its line lies closest to the line of row 0, the reference. Among rows 1, 2 and 3, row 3 is the strongest, so its line (the red one) also lies above the other two.

This is the point made at the beginning: if two factors show the same trend of change, that is, they change highly synchronously, the degree of correlation between them is high; otherwise it is low.

Reference:
Research on comprehensive evaluation of residential landscape in Shenhe District, Shenyang City based on AHP-gray correlation analysis method - Wang Wenwen.
https://blog.csdn.net/PY_smallH/article/details/121491094

Origin blog.csdn.net/mengjizhiyou/article/details/127780455