Multivariate Statistical Analysis - Olive Oil Dataset

Table of contents

1. Dataset Introduction

2. Related tasks

3. Answer Analysis

first question

second question

third question

fourth question

4. Complete answer


1. Dataset Introduction

The olive oil dataset (oliveoil.csv) contains 11 variables measured on 16 olive oils: the first 5 oils were produced in Greece, the middle 5 in Italy, and the last 6 in Spain. Five of the variables, Acidity, Peroxide, K232, K270 and DK, were obtained from a set of sensors, and the other six variables, yellow, green, brown, glossy, transp and syrup, describe physicochemical properties of the oils. The researchers want to compare the group of sensor variables (Acidity, Peroxide, K232, K270, DK) with the group of physicochemical variables (yellow, green, brown, glossy, transp, syrup); how the two groups of variables are related is the question of interest.
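Before the analysis, the dataset can be loaded and inspected as a quick sanity check. This is a minimal sketch that assumes the file is named oliveoil.csv, sits in the working directory, and has exactly the 11 column names listed above:

import pandas as pd

# Load the dataset (assumed file name and location)
df = pd.read_csv('oliveoil.csv')

# Basic checks: 16 rows, 11 columns, and the two variable groups
print(df.shape)            # expected: (16, 11)
print(df.columns.tolist())
sensor_vars = ['Acidity', 'Peroxide', 'K232', 'K270', 'DK']
property_vars = ['yellow', 'green', 'brown', 'glossy', 'transp', 'syrup']
print(df[sensor_vars].describe())
print(df[property_vars].describe())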


2. Related tasks


(1) Calculate the covariance matrix (an 11×11 matrix) M_cov between these 11 variables: first compute the covariance matrix with the pandas (or numpy) routines, then call numpy.cov(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None, *, dtype=None), and finally compare whether the two results are the same.
(2) Calculate the correlation coefficient matrix (an 11×11 matrix) M_corr between all variables: implement this with numpy.corrcoef(x, y=None, rowvar=True, bias=<no value>, ddof=<no value>, *, dtype=None), and observe which variables in the two groups may be correlated.
(3) Use canonical correlation analysis to calculate the canonical correlation coefficients between the two groups of variables, and draw a two-dimensional line graph (a scree plot), where the abscissa is the index of the canonical correlation coefficient and the ordinate is the corresponding i-th canonical correlation coefficient. Judging from the scree plot, the values drop rapidly after the first three canonical correlation coefficients, so the dimensionality reduction effect is not obvious. (A brief mathematical sketch of canonical correlation analysis follows this list.)
(4) Calculate the correlation coefficient between each original variable and the canonical variables (the canonical loadings): according to the size of these correlation coefficients, find out which variables link the two groups (that is, which have a correlation coefficient greater than 0.7 in absolute value). Conclusion: the sensor variables Peroxide, K232 and K270 are closely related to the brown variable among the physicochemical properties of olive oil.
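For reference, here is a short sketch of the quantity computed in tasks (3) and (4); the symbols below are my own notation, not taken from the assignment. Canonical correlation analysis looks for weight vectors $a$ (length 5) and $b$ (length 6) that maximize the correlation between the scores $u = a^{T}X$ and $v = b^{T}Y$; the maximal value is the first canonical correlation $\rho_1$, and later pairs are defined the same way subject to being uncorrelated with the earlier ones. Equivalently, the canonical correlations $\rho_1 \ge \rho_2 \ge \dots$ are the square roots of the eigenvalues of $\Sigma_{xx}^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$, where $\Sigma_{xx}$ and $\Sigma_{yy}$ are the within-group covariance (or correlation) blocks and $\Sigma_{xy} = \Sigma_{yx}^{T}$ is the between-group block. This identity is what the corrected code in section 3 relies on.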

3. Answer Analysis

first question

The following is the Python code to calculate the covariance matrix; either the pandas or the numpy library can be used:

import pandas as pd
import numpy as np

# Read the data
df = pd.read_csv('oliveoil.csv')

# Compute the covariance matrix with pandas
cov_matrix = df.cov()

# Or use numpy's cov function
cov_matrix_np = np.cov(df.values.T)

# Compare whether the two are the same
print(np.allclose(cov_matrix, cov_matrix_np))

Here, the df.cov() method computes the covariances between the columns of a pandas DataFrame and returns a DataFrame whose row and column labels are the column names of the original data; the df.values attribute returns the NumPy array representation of the DataFrame, and .T transposes it. The np.cov() function computes the covariance matrix of a NumPy array and returns a NumPy array whose rows and columns correspond to the rows of the input (with rowvar=True, each row is treated as a variable). The np.allclose() function checks whether two arrays are numerically equal element by element and returns a Boolean value.
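The two results agree because both routines use the unbiased estimator (divide by n − 1) by default: DataFrame.cov has ddof=1 and np.cov has bias=False. A small sketch of how a different normalization would break the comparison (the bias=True call is only for illustration):

import pandas as pd
import numpy as np

df = pd.read_csv('oliveoil.csv')

# Default settings: both divide by n - 1, so the matrices match
print(np.allclose(df.cov(), np.cov(df.values.T)))             # True

# Forcing the biased estimator (divide by n) on the numpy side breaks the match
print(np.allclose(df.cov(), np.cov(df.values.T, bias=True)))  # False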

second question

The following is the Python code for calculating the correlation coefficient matrix, using the corrcoef function of the numpy library:

import pandas as pd
import numpy as np

# Read the data
df = pd.read_csv('oliveoil.csv')

# Compute the correlation coefficient matrix
corr_matrix = np.corrcoef(df.values.T)

# Print the result
print(corr_matrix)

Here, the np.corrcoef() function computes the correlation coefficient matrix of a NumPy array and returns a NumPy array whose rows and columns correspond to the variables of the input. The .values attribute converts the DataFrame to a NumPy array and .T transposes it so that each variable occupies a row.
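To see which variables in the two groups may be correlated (task 2), it can help to look only at the 5×6 cross block of the correlation matrix with row and column labels attached. This is a small sketch, assuming the column names are exactly those listed in the introduction:

import pandas as pd

df = pd.read_csv('oliveoil.csv')

sensor_vars = ['Acidity', 'Peroxide', 'K232', 'K270', 'DK']
property_vars = ['yellow', 'green', 'brown', 'glossy', 'transp', 'syrup']

# Cross-correlation block: rows are sensor variables, columns are property variables
cross_corr = df.corr().loc[sensor_vars, property_vars]
print(cross_corr.round(2))

# Flag the pairs with a noticeably large correlation (0.7 is the threshold used in task 4)
print(cross_corr[cross_corr.abs() > 0.7].stack())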

third question

The following is the Python code that uses canonical correlation analysis to calculate the canonical correlation coefficients between the two groups of variables and draw a two-dimensional line chart (scree plot):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the data
df = pd.read_csv('oliveoil.csv')

# Split the two groups of variables
x = df[['Acidity', 'Peroxide', 'K232', 'K270', 'DK']]
y = df[['yellow', 'green', 'brown', 'glossy', 'transp', 'syrup']]

# Compute the canonical correlation coefficients
u, s, v = np.linalg.svd(x.corr())
r = np.linalg.matrix_rank(x.corr())
a = np.dot(np.dot(np.dot(y.corr(), v.T[:, :r]), np.linalg.inv(np.diag(s[:r]))), u[:, :r])
canonical_corrs = np.sqrt(np.diag(np.dot(np.dot(a.T, x.corr()), a)))

# Draw the two-dimensional line chart
plt.plot(range(1, len(canonical_corrs)+1), canonical_corrs, marker='o')
plt.xlabel('Canonical Correlation')
plt.ylabel('Correlation Coefficient')
plt.show()

First, we split the data set into the two groups of variables. We then use NumPy's SVD function to try to compute the canonical correlation coefficients between the two groups and store them in the canonical_corrs array. Next, we draw a two-dimensional line graph in which the abscissa is the index of the canonical correlation coefficient and the ordinate is the corresponding i-th canonical correlation coefficient. According to the line graph, the first three canonical correlation coefficients have high values and then drop rapidly, so the dimensionality reduction effect is not obvious.

Running this code, however, produces an error:

Traceback (most recent call last):
  File "E:\Shandong Agricultural University Course Study\Multivariate Statistical Analysis\2023.4.13 Experiment\main.py", line 47, in <module>
    a = np.dot(np.dot(np.dot(y.corr(), v.T[:, :r]), np.linalg.inv(np.diag(s[:r]))), u[:, :r])
  File "<__array_function__ internals>", line 5, in dot
ValueError: shapes (6,6) and (5,5) not aligned: 6 (dim 1) != 5 (dim 0)

According to the error message, there is a dimension mismatch when computing the canonical correlation coefficients: x contains 5 variables while y contains 6, so y.corr() is a 6×6 matrix and cannot be multiplied with the 5×5 SVD factors of x.corr(). More fundamentally, the canonical correlations have to be built from the cross-correlation matrix between the two groups, not from y.corr() and the SVD of x.corr() alone, so the computation needs to be rewritten.

Here is the modified code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the data
df = pd.read_csv('oliveoil.csv')

# Split the two groups of variables
x = df[['Acidity', 'Peroxide', 'K232', 'K270', 'DK']]
y = df[['yellow', 'green', 'brown', 'glossy', 'transp', 'syrup']]

# Build the correlation blocks: C_xx (5x5), C_yy (6x6), C_xy (5x6)
full_corr = np.corrcoef(x, y, rowvar=False)
C_xx = full_corr[:5, :5]
C_yy = full_corr[5:, 5:]
C_xy = full_corr[:5, 5:]
C_yx = C_xy.T

# C_xx^(-1/2) via the eigendecomposition of the symmetric matrix C_xx
C_xx_eigvals, C_xx_eigvecs = np.linalg.eigh(C_xx)
C_xx_inv_sqrt = C_xx_eigvecs @ np.diag(1.0 / np.sqrt(C_xx_eigvals)) @ C_xx_eigvecs.T

# The canonical correlations are the square roots of the eigenvalues of
# C_xx^(-1/2) C_xy C_yy^(-1) C_yx C_xx^(-1/2)
C_yy_inv = np.linalg.inv(C_yy)
W = C_xx_inv_sqrt @ C_xy @ C_yy_inv @ C_yx @ C_xx_inv_sqrt
eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]
canonical_corrs = np.sqrt(np.clip(eigvals, 0, None))

# Draw the two-dimensional line chart (scree plot)
plt.plot(range(1, len(canonical_corrs) + 1), canonical_corrs, marker='o')
plt.xlabel('Canonical Correlation Index')
plt.ylabel('Correlation Coefficient')
plt.show()

In this modified code, we use basic linear algebra to compute the canonical correlations directly: they are the square roots of the eigenvalues of C_xx^(-1/2) C_xy C_yy^(-1) C_yx C_xx^(-1/2), built from the within-group and between-group blocks of the 11×11 correlation matrix. Only numpy's linalg module and corrcoef function are needed.
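As a sanity check, essentially the same canonical correlations can be obtained with scikit-learn's CCA estimator by correlating the paired canonical scores. This is a sketch, assuming scikit-learn is installed and the column selections match the ones above:

import numpy as np
import pandas as pd
from sklearn.cross_decomposition import CCA

df = pd.read_csv('oliveoil.csv')
x = df[['Acidity', 'Peroxide', 'K232', 'K270', 'DK']]
y = df[['yellow', 'green', 'brown', 'glossy', 'transp', 'syrup']]

# Fit CCA with as many components as the smaller group allows (5 here)
cca = CCA(n_components=min(x.shape[1], y.shape[1]))
x_scores, y_scores = cca.fit_transform(x, y)

# The i-th canonical correlation is the correlation between the i-th pair of scores
cca_corrs = [np.corrcoef(x_scores[:, i], y_scores[:, i])[0, 1]
             for i in range(x_scores.shape[1])]
print(np.round(cca_corrs, 4))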

fourth question

Calculate the correlation coefficient between each original variable and the canonical variables (the canonical loadings): perform canonical correlation analysis with the CCA class from sklearn.cross_decomposition, store the correlations between each original variable and the first canonical variate in a pandas Series, then filter out the variables whose absolute correlation exceeds 0.7 and state the conclusion.

import pandas as pd
import numpy as np
from sklearn.cross_decomposition import CCA

# Read the data
df = pd.read_csv('oliveoil.csv')

# Perform canonical correlation analysis
x_names = ['Acidity', 'Peroxide', 'K232', 'K270', 'DK']
y_names = ['yellow', 'green', 'brown', 'glossy', 'transp', 'syrup']
x = df[x_names].values
y = df[y_names].values
cca = CCA(n_components=min(x.shape[1], y.shape[1]))
cca.fit(x, y)
x_c, y_c = cca.transform(x, y)

# Correlations between each original variable and the first canonical variate
corr_x = pd.Series([np.corrcoef(x[:, i], x_c[:, 0])[0, 1] for i in range(x.shape[1])], index=x_names)
corr_y = pd.Series([np.corrcoef(y[:, j], y_c[:, 0])[0, 1] for j in range(y.shape[1])], index=y_names)
corr_coef = pd.concat([corr_x, corr_y])

# Filter out the variables whose absolute correlation exceeds 0.7
result = [(name, coef) for name, coef in corr_coef.items() if abs(coef) > 0.7]

# Print the result
print("Variables whose correlation with the canonical variate exceeds 0.7 in absolute value:")
for name, coef in result:
    print(name, "has correlation", coef, "with the canonical variate")

The output is

Variables whose correlation with the canonical variate exceeds 0.7 in absolute value:
Acidity has correlation -0.7469502657931022 with the canonical variate
K232 has correlation 0.7644106227556022 with the canonical variate
K270 has correlation 0.8047253864747976 with the canonical variate
brown has correlation -0.865986621956824 with the canonical variate

According to the output, the absolute values of the correlation coefficients between the Acidity, K232, K270 and brown variables and the canonical variate are greater than 0.7, so these variables are strongly related to it. Among them, Acidity and brown have negative correlation coefficients, while the other variables have positive ones. Therefore, it can be concluded that the sensor variables Peroxide, K232 and K270 are closely related to the brown variable among the physicochemical properties of olive oil.

4. Complete answer

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import CCA

# Read the data
df = pd.read_csv('oliveoil.csv')

# Compute the covariance matrix with pandas
cov_matrix = df.cov()

# Or use numpy's cov function
cov_matrix_np = np.cov(df.values.T)

# Compare whether the two are the same
print(np.allclose(cov_matrix, cov_matrix_np))

# Compute the correlation coefficient matrix
corr_matrix = np.corrcoef(df.values.T)

# Print the result
print(corr_matrix)

# Split the two groups of variables
x = df[['Acidity', 'Peroxide', 'K232', 'K270', 'DK']]
y = df[['yellow', 'green', 'brown', 'glossy', 'transp', 'syrup']]

# Compute the canonical correlation coefficients from the correlation blocks
full_corr = np.corrcoef(x, y, rowvar=False)
C_xx, C_yy, C_xy = full_corr[:5, :5], full_corr[5:, 5:], full_corr[:5, 5:]
C_xx_eigvals, C_xx_eigvecs = np.linalg.eigh(C_xx)
C_xx_inv_sqrt = C_xx_eigvecs @ np.diag(1.0 / np.sqrt(C_xx_eigvals)) @ C_xx_eigvecs.T
W = C_xx_inv_sqrt @ C_xy @ np.linalg.inv(C_yy) @ C_xy.T @ C_xx_inv_sqrt
canonical_corrs = np.sqrt(np.clip(np.sort(np.linalg.eigvalsh(W))[::-1], 0, None))

# Draw the two-dimensional line chart (scree plot)
plt.plot(range(1, len(canonical_corrs) + 1), canonical_corrs, marker='o')
plt.xlabel('Canonical Correlation Index')
plt.ylabel('Correlation Coefficient')
plt.show()

# Train the CCA model
cca = CCA(n_components=min(x.shape[1], y.shape[1]))
cca.fit(x, y)

# Get the canonical variable coefficients
a = cca.x_rotations_
b = cca.y_rotations_

# Transform the data into canonical variables
x_canonical, y_canonical = cca.transform(x, y)

# Compute the canonical correlation coefficients as correlations of the paired scores
canonical_corrs = np.array([np.corrcoef(x_canonical[:, i], y_canonical[:, i])[0, 1]
                            for i in range(x_canonical.shape[1])])

# Compute the correlations between the original variables and the canonical variables
corr_x = np.corrcoef(x.T, x_canonical.T)[:x.shape[1], x.shape[1]:]
corr_y = np.corrcoef(y.T, y_canonical.T)[:y.shape[1], y.shape[1]:]

# Find the variables whose absolute correlation with the first canonical variate exceeds 0.7
x_names = list(x.columns)
y_names = list(y.columns)
for i in range(corr_x.shape[0]):
    if abs(corr_x[i, 0]) > 0.7:
        print(f'{x_names[i]} and the first canonical variate: {corr_x[i, 0]}')
for j in range(corr_y.shape[0]):
    if abs(corr_y[j, 0]) > 0.7:
        print(f'{y_names[j]} and the first canonical variate: {corr_y[j, 0]}')


Origin blog.csdn.net/m0_61789994/article/details/130112790