
1. Dataset Introduction

Faculty Data (aaup.csv): This data comes from the annual census of the American Association of University Professors (AAUP) and records average faculty salary and compensation by rank, together with faculty counts (1994). There are 13 quantitative variables: ASF (average salary of full professors), ASA1 (average salary of associate professors), ASA2 (average salary of assistant professors), ASALL (average salary of all ranks), ACF (average compensation of full professors), ACA1 (average compensation of associate professors), ACA2 (average compensation of assistant professors), ACALL (average compensation of all ranks), NF (number of full professors), NA1 (number of associate professors), NA2 (number of assistant professors), NIN (number of instructors), NALL (number of faculty at all ranks). The data cover 1161 schools. The goal is to analyze the potential relationship between two sets of variables: the salary and compensation variables (8 variables starting with the letter A) and the faculty-count variables (5 variables starting with the letter N).
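As a small sketch of how the two groups can be separated by the first letter of the variable name: the helper name split_by_prefix is hypothetical and introduced only for illustration, while the column names are the ones listed in the description above.

```python
# Column names as listed in the dataset description.
columns = ['ASF', 'ASA1', 'ASA2', 'ASALL', 'ACF', 'ACA1', 'ACA2', 'ACALL',
           'NF', 'NA1', 'NA2', 'NIN', 'NALL']


def split_by_prefix(names):
    """Return (a_vars, n_vars): names starting with 'A' and names starting with 'N'."""
    a_vars = [name for name in names if name.startswith('A')]
    n_vars = [name for name in names if name.startswith('N')]
    return a_vars, n_vars


a_vars, n_vars = split_by_prefix(columns)
print(len(a_vars), len(n_vars))  # 8 5
```

The same two lists can then be used to index the DataFrame returned by pd.read_csv('aaup.csv').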

2. Related tasks

(1) Calculate the correlation coefficient matrix M_corr (a 13×13 matrix) between all variables using numpy.corrcoef(x, y=None, rowvar=True, bias=<no value>, ddof=<no value>, *, dtype=None), and observe which variables in the two groups may be correlated.

(2) Use canonical correlation analysis to calculate the canonical correlation coefficients between the two groups of variables, and draw a two-dimensional line graph (a scree plot), where the abscissa is the index i of the canonical correlation coefficient and the ordinate is the value of the i-th canonical correlation coefficient.

(3) Calculate the correlation coefficient between each original variable and the canonical variates, that is, the canonical loadings: based on the size of these coefficients, identify which variables are related across the two groups (that is, the ones with the larger coefficients).

Answer analysis

First question

To calculate the correlation coefficient matrix among all variables, use the corrcoef() function in the numpy library. The code is as follows:

import numpy as np
import pandas as pd

# Load the data
data = pd.read_csv('aaup.csv')

# Select the variables to include
variables = ['ASF', 'ASA1', 'ASA2', 'ASALL', 'ACF', 'ACA1', 'ACA2', 'ACALL', 'NF', 'NA1', 'NA2', 'NIN', 'NALL']
data_selected = data[variables]

# Compute the correlation coefficient matrix (columns are variables, hence rowvar=False)
M_corr = np.corrcoef(data_selected, rowvar=False)
print(M_corr)

The output is a 13×13 matrix of correlation coefficients, where M_corr[i, j] is the correlation coefficient between the variables at (zero-based) positions i and j. For example, M_corr[0, 1] is the correlation coefficient between the first variable ASF and the second variable ASA1. From the output, we can see which variables may be correlated.

[[1. 0.94751341 0.91960584 0.9702947 0.99021094 0.93145465
0.90281695 0.95937286 0.57800877 0.5617014 0.55266483 0.15147169
0.58070552]
[0.94751341 1. 0.9447734 0.9445597 0.94446422 0.98266658
0.93114968 0.93951923 0.48926499 0.49609538 0.48375808 0.13675529
0.50064036]
[0.91960584 0.9447734 1. 0.93172582 0.91435193 0.92891304
0.97410036 0.92363623 0.52900121 0.53179999 0.51483915 0.15006731
0.53914622]
[0.9702947 0.9445597 0.93172582 1. 0.96478088 0.93219237
0.9176687 0.98944384 0.60597732 0.54293021 0.50925084 0.07835664
0.57028188]
[0.99021094 0.94446422 0.91435193 0.96478088 1. 0.94994155
0.91962159 0.97222777 0.55381693 0.53823947 0.52405721 0.13081159
0.55386278]
[0.93145465 0.98266658 0.92891304 0.93219237 0.94994155 1.
0.94990325 0.95211478 0.47025354 0.4744814 0.45578275 0.11423847
0.47691177]
[0.90281695 0.93114968 0.97410036 0.9176687 0.91962159 0.94990325
1. 0.93836008 0.51779867 0.51646517 0.49540531 0.13753547
0.52343893]
[0.95937286 0.93951923 0.92363623 0.98944384 0.97222777 0.95211478
0.93836008 1. 0.58301449 0.52423068 0.48697379 0.06677791
0.54792357]
[0.57800877 0.48926499 0.52900121 0.60597732 0.55381693 0.47025354
0.51779867 0.58301449 1. 0.89171977 0.86127378 0.35576106
0.96402734]
[0.5617014 0.49609538 0.53179999 0.54293021 0.53823947 0.4744814
0.51646517 0.52423068 0.89171977 1. 0.92342206 0.4534978
0.96280162]
[0.55266483 0.48375808 0.51483915 0.50925084 0.52405721 0.45578275
0.49540531 0.48697379 0.86127378 0.92342206 1. 0.53562668
0.95125713]
[0.15147169 0.13675529 0.15006731 0.07835664 0.13081159 0.11423847
0.13753547 0.06677791 0.35576106 0.4534978 0.53562668 1.
0.48208017]
[0.58070552 0.50064036 0.53914622 0.57028188 0.55386278 0.47691177
0.52343893 0.54792357 0.96402734 0.96280162 0.95125713 0.48208017
1. ]]
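To read the cross-group relationships off the matrix more systematically, one option is to slice out the block of A-by-N correlations and rank its entries. The following is a minimal sketch under the assumption that the matrix rows and columns follow the order of the variables list; the function name top_cross_pairs is hypothetical, and the toy matrix copies six entries from the output above (ASF and ASALL against NF and NALL).

```python
import numpy as np


def top_cross_pairs(M, names_a, names_b, k=3):
    """Rank cross-group variable pairs by absolute correlation.

    M is a correlation matrix whose first len(names_a) rows/columns belong
    to group A and whose remaining rows/columns belong to group B.
    """
    p = len(names_a)
    cross = np.asarray(M)[:p, p:]                      # A-by-B block
    order = np.argsort(-np.abs(cross), axis=None)[:k]  # largest |r| first
    rows, cols = np.unravel_index(order, cross.shape)
    return [(names_a[i], names_b[j], float(cross[i, j])) for i, j in zip(rows, cols)]


# Toy 4x4 example with values copied from M_corr: ASF, ASALL vs NF, NALL
M_toy = np.array([
    [1.0,        0.9702947,  0.57800877, 0.58070552],
    [0.9702947,  1.0,        0.60597732, 0.57028188],
    [0.57800877, 0.60597732, 1.0,        0.96402734],
    [0.58070552, 0.57028188, 0.96402734, 1.0],
])
pairs = top_cross_pairs(M_toy, ['ASF', 'ASALL'], ['NF', 'NALL'], k=2)
print(pairs)
```

With the full matrix, the call would be top_cross_pairs(M_corr, variables[:8], variables[8:]).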


Second question

Canonical correlation analysis can be implemented in Python with the CCA class from sklearn.cross_decomposition. Each canonical correlation coefficient is then estimated as the correlation between one pair of canonical variates (the transformed scores of the two groups). The code is as follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import CCA

# Load the data
data = pd.read_csv('aaup.csv')

# Select the two groups of variables
group1_vars = ['ASF', 'ASA1', 'ASA2', 'ASALL', 'ACF', 'ACA1', 'ACA2', 'ACALL']
group2_vars = ['NF', 'NA1', 'NA2', 'NIN', 'NALL']
data_group1 = data[group1_vars]
data_group2 = data[group2_vars]

# Fit CCA with as many components as the smaller group has variables
num_canonical = min(len(group1_vars), len(group2_vars))
cca = CCA(n_components=num_canonical)
cca.fit(data_group1, data_group2)
U, V = cca.transform(data_group1, data_group2)

# The i-th canonical correlation is the correlation between U[:, i] and V[:, i]
r = np.array([np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(num_canonical)])
print(r)

# Draw the two-dimensional line graph (scree plot)
plt.plot(range(1, num_canonical + 1), r, 'o-')
plt.xlabel('Canonical variable')
plt.ylabel('Canonical correlation coefficient')
plt.show()

The output consists of the canonical correlation coefficients r and the corresponding two-dimensional line chart. Each canonical correlation coefficient measures the correlation between one pair of canonical variates from the two groups; the larger the value, the stronger the correlation. The abscissa of the line chart is the index of the canonical variate pair, and the ordinate is the corresponding canonical correlation coefficient.
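As a cross-check on any library result, the canonical correlation coefficients can also be computed directly with NumPy. The sketch below uses the textbook QR-plus-SVD formulation, which is a swapped-in technique and not necessarily what any particular library does internally. It assumes both data matrices have full column rank; if one column is an exact linear combination of the others (for example, a total that equals the sum of its parts), that column would need to be dropped first. The function name canonical_correlations and the synthetic data are illustrative only.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Classical CCA: singular values of Qx^T Qy after centring each group.

    Assumes X (n x p) and Y (n x q) have full column rank.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)   # orthonormal basis of the centred X columns
    Qy, _ = np.linalg.qr(Yc)   # orthonormal basis of the centred Y columns
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against rounding slightly above 1


# Synthetic check: one shared latent factor z drives a column in each group
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([z + 0.1 * rng.normal(size=n), rng.normal(size=n)])
r = canonical_correlations(X, Y)
print(r)  # two values, sorted from largest to smallest
```

The number of coefficients equals the number of variables in the smaller group, which is why the scree plot for this data set has at most five points.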

Third question

Calculate the correlation coefficient between each original variable and the canonical variates

The purpose of canonical correlation analysis is to find the linear relationship between two groups of variables, that is, to find a linear combination of each group such that the correlation between the two combinations is maximized. After running the analysis, we therefore examine the correlation between each original variable and the canonical variates.
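In symbols, and under notation assumed here rather than taken from the text (X and Y are the centred data matrices of the A and N groups, a and b are weight vectors, and the Σ terms are the within-group and between-group covariance matrices), the first canonical correlation can be written as:

```latex
\rho_1 = \max_{a,\,b}\ \operatorname{corr}(Xa,\ Yb)
       = \max_{a,\,b}\ \frac{a^{\top}\Sigma_{XY}\,b}
                            {\sqrt{a^{\top}\Sigma_{XX}\,a}\ \sqrt{b^{\top}\Sigma_{YY}\,b}},
\qquad U_1 = X a_1,\quad V_1 = Y b_1 .
```

Later pairs (U_i, V_i) maximize the same ratio subject to being uncorrelated with the earlier pairs, so at most min(8, 5) = 5 pairs exist for this data set.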

We can use the Pearson correlation coefficient to measure the correlation between each original variable and the canonical variates. Specifically, we calculate the correlation coefficient between each original variable and the first canonical variate, as well as the second canonical variate.

The following code implements this process:

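import numpy as np
import pandas as pd
from sklearn.cross_decomposition import CCA

# Load the data
df = pd.read_csv('aaup.csv')

# Select the two groups of variables
X = df[['ASF', 'ASA1', 'ASA2', 'ASALL', 'ACF', 'ACA1', 'ACA2', 'ACALL']]
Y = df[['NF', 'NA1', 'NA2', 'NIN', 'NALL']]

# Fit CCA and compute the canonical variates: U for the A group, V for the N group
cca = CCA(n_components=2)
cca.fit(X, Y)
U, V = cca.transform(X, Y)

# Canonical correlations of the first two pairs of canonical variates
r1 = np.corrcoef(U[:, 0], V[:, 0])[0, 1]
r2 = np.corrcoef(U[:, 1], V[:, 1])[0, 1]

# Correlation of each original variable with the canonical variate of its own group.
# "First" below is U1 (A group); "second" is its partner V1 (N group).
corr1 = np.array([np.corrcoef(X[col], U[:, 0])[0, 1] for col in X.columns])
corr2 = np.array([np.corrcoef(Y[col], V[:, 0])[0, 1] for col in Y.columns])

# Print the results
print('Correlation coefficients between the first canonical variate and the original variables:\n', corr1)
print('Correlation coefficients between the second canonical variate and the original variables:\n', corr2)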

The output reported for this step is as follows:

Correlation coefficients between the first canonical variate and the original variables:
[ 0.9600715   0.96537444  0.96400612  0.96476051 -0.6079373  -0.63305492 -0.65590156 -0.63097731]
Correlation coefficients between the second canonical variate and the original variables:
[ 0.88875839  0.8999307   0.90195803  0.7972077  -0.34218824]

The output above shows that the correlation coefficients between the first canonical variate and the original variables are relatively large in absolute value. The largest are those for the average salaries of full professors, associate professors, assistant professors, and all ranks, each about 0.96.
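The coefficients discussed above are plain Pearson correlations between a column of original data and a canonical variate, so they can be computed with a short helper. This is a minimal sketch: the function name loadings is hypothetical, and the toy variate is built from made-up weights purely to show the sign and size pattern, not taken from the AAUP analysis.

```python
import numpy as np


def loadings(data, variate):
    """Pearson correlation of each column of `data` with one canonical variate."""
    data = np.asarray(data, dtype=float)
    variate = np.asarray(variate, dtype=float)
    return np.array([np.corrcoef(data[:, j], variate)[0, 1]
                     for j in range(data.shape[1])])


# Toy check: a variate built from known (made-up) weights on two of three columns
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
u = 2.0 * X[:, 0] - 1.0 * X[:, 1]   # hypothetical weights, for illustration only
L = loadings(X, u)
print(L)  # strongly positive, moderately negative, near zero
```

A variable with a large absolute loading contributes strongly to the canonical variate, and the sign shows the direction of that contribution.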

Fourth question

Draw a scatter plot

Based on the results of the third question, we can select the variables most strongly correlated with the canonical variates and draw scatter plots to observe the relationships between them. Here we take the first pair of canonical variates (the pair with the largest canonical correlation coefficient) and their related variables as an example.

First, we need to extract the data for these variables. The code is as follows:

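# Extract the data for the variables related to the first pair of canonical variates
# (each A-group variable appears once; `data` is the DataFrame loaded earlier)
X_var = ['ASA2', 'ACALL', 'ASALL', 'ACA1', 'ASF', 'ACF', 'ACA2', 'ASA1']
Y_var = ['NA2', 'NALL', 'NF', 'NIN', 'NA1']
X_data = data[X_var]
Y_data = data[Y_var]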

Next, we use the pairplot function of the seaborn library to draw the scatter plots. The code is as follows:

import matplotlib.pyplot as plt
import seaborn as sns

# Combine the two data sets column-wise
plot_data = pd.concat([X_data, Y_data], axis=1)
# Draw the scatter plot matrix
sns.pairplot(plot_data)
plt.show()

As can be seen from the figure, there is no strong linear relationship between the faculty-count variables (5 variables starting with the letter N) and the salary and compensation variables (8 variables starting with the letter A). However, some patterns are still visible:

There is a moderate positive relationship between the total number of faculty (NALL) and the average full professor salary (ASF), with a correlation of about 0.58 in M_corr; that is, schools with higher average full professor salaries tend to have more faculty.
There is a strong positive correlation between the average salary of associate professors (ASA1) and the average salary of assistant professors (ASA2), about 0.94 in M_corr; that is, when the average salary of associate professors is high, the average salary of assistant professors is also high.
The average full professor compensation (ACF) and the average associate professor compensation (ACA1) are only weakly to moderately correlated with the faculty-count variables (roughly 0.11 to 0.55 in M_corr), although they track the salary variables closely.

Complete answer

# First question: correlation coefficient matrix
import numpy as np
import pandas as pd

# Load the data
data = pd.read_csv('aaup.csv')

# Select the variables to include
variables = ['ASF', 'ASA1', 'ASA2', 'ASALL', 'ACF', 'ACA1', 'ACA2', 'ACALL', 'NF', 'NA1', 'NA2', 'NIN', 'NALL']
data_selected = data[variables]

# Compute the correlation coefficient matrix (columns are variables, hence rowvar=False)
M_corr = np.corrcoef(data_selected, rowvar=False)
print(M_corr)

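# Second question: canonical correlation coefficients and scree plot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import CCA

# Load the data
df = pd.read_csv("aaup.csv")

# Extract the two groups of variables
A_vars = ["ASF", "ASA1", "ASA2", "ASALL", "ACF", "ACA1", "ACA2", "ACALL"]
N_vars = ["NF", "NA1", "NA2", "NIN", "NALL"]
X = df[A_vars].to_numpy()
Y = df[N_vars].to_numpy()

# Fit CCA and compute the canonical correlation coefficients
n_components = min(X.shape[1], Y.shape[1])
cca = CCA(n_components=n_components)
cca.fit(X, Y)
U, V = cca.transform(X, Y)
# The i-th coefficient is the correlation between the paired scores U[:, i] and V[:, i]
r = [np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(n_components)]

# Draw the two-dimensional line graph (scree plot)
fig, ax = plt.subplots()
ax.plot(range(1, len(r) + 1), r, "o-", label="Canonical Correlation Coefficients")
ax.set_xlabel("Canonical variable")
ax.set_ylabel("Canonical correlation coefficient")
ax.legend()
plt.show()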

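# Third question: correlations between the canonical variates of the two groups
import pandas as pd
import numpy as np
from sklearn.cross_decomposition import CCA

# Load the data
df = pd.read_csv('aaup.csv')

# Select the two groups of variables to analyze
X = df[['ASA1', 'ASA2', 'ASF', 'ASALL', 'ACA1', 'ACA2', 'ACF', 'ACALL']]
Y = df[['NA1', 'NA2', 'NF', 'NALL', 'NIN']]

# Fit CCA with two components and correlate the canonical variates
n_components = 2
cca = CCA(n_components=n_components)
cca.fit(X, Y)
U, V = cca.transform(X, Y)
# In the stacked matrix, the first rows are U1, U2 and the last rows are V1, V2;
# keep the U-by-V block
corr = np.corrcoef(U.T, V.T)[:n_components, n_components:]

# Print the results
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        print(f"Correlation between U{i+1} and V{j+1}: {corr[i,j]:.4f}")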

# First canonical correlation coefficient only
import numpy as np
import pandas as pd
from sklearn.cross_decomposition import CCA

# Load the data
data = pd.read_csv('aaup.csv')

# Extract the variables to analyze
A_vars = ['ASA2', 'ASA1', 'ASF', 'ASALL', 'ACA2', 'ACA1', 'ACF', 'ACALL']
N_vars = ['NA2', 'NA1', 'NF', 'NALL', 'NIN']
A_data = data[A_vars]
N_data = data[N_vars]

# Run CCA with a single component
cca = CCA(n_components=1)
cca.fit(A_data, N_data)
A_c, N_c = cca.transform(A_data, N_data)

# Compute the canonical correlation coefficient as the correlation of the paired scores
corr_coef = np.corrcoef(A_c[:, 0], N_c[:, 0])[0, 1]

# Print the canonical correlation coefficient
print(corr_coef)

Multivariate Statistical Analysis - Faculty Data