Gray correlation

        The previous article, "Correlation analysis, correlation coefficient matrix heat map", gave the code for calculating the Pearson, Spearman, and Kendall correlation coefficients. This article gives the code for gray correlation analysis.

        Gray correlation analysis is a method for quantitatively describing and comparing how a system develops and changes. Its basic idea is to judge how closely sequences are connected from the similarity of the geometric shapes of their curves. The closer the curves are, the greater the correlation between the corresponding sequences, and vice versa.

        This method is usually used to analyze the impact of each factor on a result (system analysis), and it can also be used to solve comprehensive evaluation problems that change over time. The core is to establish a mother sequence that changes over time according to certain rules, treat the changes of each evaluation object over time as a subsequence, find the correlation degree between each subsequence and the mother sequence, and draw conclusions from the magnitude of that correlation degree. (The theory in this article is drawn from other blogs. The code and data are my own, written mainly for my own future reference. Feel free to take the code if you need it.)

Step 1: Clean data and standardize data

The data used in this article is shown below. We calculate the correlation between the profit rate and each of the following: the development cycle, the inventory removal cycle, the unit land price, the average price of a single land parcel, the total land transaction amount, the number of land parcels traded, and the land area traded. We therefore first remove the irrelevant data, that is, the time column, and then standardize the data to eliminate dimensions. For background on standardization, see the earlier post "Introduction to the three standardization and denormalization methods of the sklearn library".

# Read the data
import pandas as pd
df = pd.read_excel(r"C:\Users\86177\Desktop\datas.xlsx")
df

 After removing irrelevant data, the data is standardized to make the data dimensionless.

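df = df.iloc[:, 1:]  # drop the first column (time), which is irrelevant here
# Standardize the data
import pandas as pd
from sklearn.preprocessing import StandardScaler
columns = df.columns
# Z-score standardization: rescale each column to mean 0 and variance 1.
# Its advantage is that it is less affected by outliers. Formula: (X-μ)/σ
standard_s1 = StandardScaler()  # create a StandardScaler instance
standard_s1_data = standard_s1.fit_transform(df)  # standardize each column of the DataFrame separately
standard_s1_data_pd = pd.DataFrame(standard_s1_data, columns=columns)  # convert the result back to a DataFrame
datas = standard_s1_data_pd
datas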
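
As a rough check of what the standardization step does, the sketch below applies (X-μ)/σ by hand with pandas. The toy DataFrame and its column names are made up for illustration. σ here is the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, so the result should agree with the `fit_transform` output.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 50.0]})

# Z-score by hand: subtract each column's mean, divide by its population std.
zscore = (toy - toy.mean()) / toy.std(ddof=0)

# After scaling, every column should have mean ~0 and population std ~1.
print(zscore.mean().abs().max())
print(zscore.std(ddof=0).tolist())
```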

Step 2: Determine the analysis sequence (reference sequence, comparison sequence)

        Determine the reference sequence that reflects the behavior characteristics of the system and the comparison sequence that affects the system behavior. The data sequence that reflects the behavioral characteristics of the system is called a reference sequence. The data sequence composed of factors that affect the behavior of the system is called a comparison sequence.

        Gray correlation analysis can also be described as calculating the correlation between the comparison sequences and the reference sequence. In the example of this article, the reference sequence is the profit rate, and the comparison sequences are the development cycle, the inventory removal cycle, the unit land price, the average price of a single land parcel, the total land transaction amount, the number of land parcels traded, and the land area traded.

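(1) Reference sequence (also called mother sequence): $x_0 = \{x_0(k) \mid k = 1, 2, \dots, n\}$

(2) Comparison sequences (also called subsequences): $x_i = \{x_i(k) \mid k = 1, 2, \dots, n\}$, $i = 1, 2, \dots, m$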

Step 3: Calculate correlation coefficient

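The correlation coefficient between the reference sequence $x_0$ and the $i$-th comparison sequence $x_i$ at point $k$ is:

$$\xi_i(k) = \frac{\min_i \min_k |x_0(k) - x_i(k)| + \rho \max_i \max_k |x_0(k) - x_i(k)|}{|x_0(k) - x_i(k)| + \rho \max_i \max_k |x_0(k) - x_i(k)|}$$

The formula seems complicated at first glance, but it is actually very simple. Let's break it down: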

(1) $|x_0(k) - x_i(k)|$

This is the difference sequence: the absolute value of the reference sequence minus the i-th comparison sequence, taken point by point. Taking the minimum over $k$ gives the minimum value of that difference sequence, and taking the maximum over $k$ gives its maximum value.

(2) $\min_i \min_k |x_0(k) - x_i(k)|$ and $\max_i \max_k |x_0(k) - x_i(k)|$

The first is the minimum difference, which is the minimum of the minimum values of all difference sequences. For example, we have 7 comparison sequences here, so there are 7 difference sequences. Each difference sequence has a minimum value, and the smallest of these 7 minimum values is the minimum difference.

The second is the maximum difference, which is likewise the largest of the maximum values of all difference sequences.

(3) $\rho$

This is called the resolution coefficient. The smaller ρ is, the greater the resolution. ρ generally takes a value in (0, 1), chosen according to the situation. The resolution is best when ρ ≤ 0.5463, and ρ = 0.5 is the usual choice.

Summary: Once the reference sequence and the comparison sequences are determined, the minimum difference and the maximum difference are fixed values. The only thing that changes in the formula is $|x_0(k) - x_i(k)|$.
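
To make the breakdown concrete, here is a minimal NumPy sketch that evaluates the coefficient formula on made-up numbers. The arrays `x0` and `X` are hypothetical toy sequences, not the article's data, and ρ is set to 0.5 as in the text.

```python
import numpy as np

# Hypothetical toy data: one reference sequence and two comparison sequences.
x0 = np.array([1.0, 2.0, 3.0])
X = np.array([[1.0, 2.5, 2.0],
              [2.0, 2.0, 5.0]])
rho = 0.5  # resolution coefficient

# Difference sequences |x0(k) - xi(k)|, one row per comparison sequence.
delta = np.abs(X - x0)

# Minimum and maximum difference over all sequences and all points.
mmin = delta.min()
mmax = delta.max()

# Correlation coefficient at every point; only delta varies in this formula.
xi = (mmin + rho * mmax) / (delta + rho * mmax)
print(xi)
```

Points where a comparison sequence coincides with the reference sequence get a coefficient of 1 when the minimum difference is 0, and the coefficient shrinks as the gap grows.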

Code to calculate the minimum and maximum differences:

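# Compute the minimum difference and the maximum difference
# Assumes the reference column '利润率' (profit rate) is the first column of datas
rather_columns = columns[1:].tolist()  # comparison sequences
print('Comparison sequences: {}'.format(rather_columns))
min_s = []  # minimum absolute difference between the reference sequence and each comparison sequence
max_s = []  # maximum absolute difference between the reference sequence and each comparison sequence
for column in rather_columns:
    min_ = (datas['利润率'] - datas[column]).abs().min()  # smallest absolute difference for this comparison sequence
    max_ = (datas['利润率'] - datas[column]).abs().max()  # largest absolute difference for this comparison sequence
    min_s.append(min_)
    max_s.append(max_)
print('Minimum values: {}'.format(min_s))
print('Maximum values: {}'.format(max_s))
mmin = min(min_s)  # minimum difference
mmax = max(max_s)  # maximum difference
print('Minimum difference: {}'.format(mmin))
print('Maximum difference: {}'.format(mmax))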

 The code results are as follows:

Calculate correlation coefficient matrix

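# Compute the correlation coefficient matrix
rho = 0.5  # resolution coefficient
for column in rather_columns:
    # Note: this overwrites each comparison column in place with its correlation coefficients
    datas[column] = (mmin + rho*mmax) / (abs(datas['利润率'] - datas[column]) + rho*mmax)
datas[rather_columns]  # correlation coefficient matrix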

The result is as follows:

 

Step 4: Calculate the correlation degree

The correlation degree is simply the mean of each column of the correlation coefficient matrix. For example, the average of the 10 numbers in the development cycle column is the correlation degree between the development cycle and the profit rate.

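# Compute the correlation degree between each comparison sequence and the reference sequence
corr = []
for column in rather_columns:
    corr.append(datas[column].mean())
print('Correlation degrees between the 7 comparison sequences and the reference sequence:')
print(corr)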

Note: The code for calculating the correlation degree in step 4 and the code for calculating the correlation coefficient matrix in step 3 can be merged. They are separated here only for clarity.
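
For reuse, steps 3 and 4 could be wrapped in a single function that leaves the input data untouched. This is only a sketch under assumptions: the function name `gray_relational_degree`, its parameters, and the toy data are all hypothetical and not part of the original code.

```python
import pandas as pd

def gray_relational_degree(data, ref_col, rho=0.5):
    """Return the correlation degree of every other column with ref_col.

    Hypothetical helper. It assumes the data are already standardized and
    that the columns are not all identical to the reference (mmax > 0).
    """
    comp = data.drop(columns=[ref_col])          # comparison sequences
    diff = comp.sub(data[ref_col], axis=0).abs() # difference sequences
    mmin = diff.to_numpy().min()                 # minimum difference
    mmax = diff.to_numpy().max()                 # maximum difference
    coef = (mmin + rho * mmax) / (diff + rho * mmax)  # coefficient matrix
    return coef.mean()                           # column means = degrees

# Usage with made-up numbers.
toy = pd.DataFrame({
    "ref": [1.0, 2.0, 3.0],
    "f1": [1.0, 2.5, 2.0],
    "f2": [2.0, 2.0, 5.0],
})
degrees = gray_relational_degree(toy, "ref")
print(degrees)
```

Unlike the loop in step 3, this version does not overwrite the comparison columns, so the standardized data stay available for later use.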

The complete code is as follows:

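import pandas as pd
from sklearn.preprocessing import StandardScaler
# Read and clean the data
df = pd.read_excel(r"C:\Users\86177\Desktop\层次分析法所用指标.xlsx")
df = df.iloc[:, 1:]  # drop the first column (time)
# Standardize the data
columns = df.columns
# Z-score standardization: rescale each column to mean 0 and variance 1.
# Its advantage is that it is less affected by outliers. Formula: (X-μ)/σ
standard_s1 = StandardScaler()  # create a StandardScaler instance
standard_s1_data = standard_s1.fit_transform(df)  # standardize each column of the DataFrame separately
standard_s1_data_pd = pd.DataFrame(standard_s1_data, columns=columns)  # convert the result back to a DataFrame
datas = standard_s1_data_pd
# Compute the minimum difference and the maximum difference
# Assumes the reference column '利润率' (profit rate) is the first column of datas
rather_columns = columns[1:].tolist()  # comparison sequences
min_s = []
max_s = []
for column in rather_columns:
    diff = (datas['利润率'] - datas[column]).abs()
    min_s.append(diff.min())
    max_s.append(diff.max())
mmin = min(min_s)  # minimum difference
mmax = max(max_s)  # maximum difference
# Compute the correlation coefficients and the correlation degrees
rho = 0.5
corr = []  # correlation degree between each comparison sequence and the reference sequence
for column in rather_columns:
    datas[column] = (mmin + rho*mmax) / (abs(datas['利润率'] - datas[column]) + rho*mmax)
    corr.append(datas[column].mean())
print(datas[rather_columns])  # correlation coefficient matrix
print('Correlation degrees between the 7 comparison sequences and the reference sequence: {}'.format(corr))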

Note: In comprehensive evaluation problems, gray correlation analysis is used to calculate the weight of each indicator (comparison sequence) from the correlation degree between that comparison sequence and the reference sequence. When time permits, I will summarize the commonly used methods and code for calculating weights in a later post.
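
One simple way to turn correlation degrees into weights is to divide each degree by their sum. The sketch below assumes that normalization, which is a common choice but not necessarily the method the later post will use. The indicator names and degree values are made up.

```python
# Hypothetical correlation degrees for three indicators (made-up values).
degrees = {"f1": 0.72, "f2": 0.61, "f3": 0.55}

# Assumed weighting rule: each weight is the degree's share of the total.
total = sum(degrees.values())
weights = {name: value / total for name, value in degrees.items()}

for name, weight in weights.items():
    print(name, round(weight, 4))
```

With this rule the weights sum to 1 and keep the same ranking as the correlation degrees.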

Origin blog.csdn.net/m0_56839722/article/details/127709631