Multivariate statistical analysis and R language exercises

Multivariate Exam Practice:

1. Multiple linear regression model:

q1 = read.table("clipboard", header = TRUE)  # read the copied data, with column names in the first row

1. Build a regression model

fm = lm(y ~ x1 + x2 + x3 + x4, data = q1)  # x4 is included because the later steps select and discuss it
fm
summary(fm)

2. Stepwise selection

Stepwise regression:

fm_step = step(fm, direction = "both")  # "both": stepwise selection; "forward": forward selection; "backward": backward elimination

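Best-subset (global optimum) selection:

Among candidate subsets, a model is better the smaller its RSS, AIC and BIC, and the larger its R2 and adjusted R2. RSS and R2 always improve as variables are added, so subsets of different sizes should be compared by adjusted R2, Cp, AIC or BIC.

> library(leaps)  # load the leaps package (install it first if necessary)
> varsel = regsubsets(y ~ x1 + x2 + x3 + x4, data = yX)  # variable selection for multiple linear regression; yX is the data frame
> result = summary(varsel)  # results of the variable selection
> data.frame(result$outmat, RSS = result$rss, R2 = result$rsq, adjR2 = result$adjr2, Cp = result$cp, BIC = result$bic)  # RSS, R2, adjusted R2, Cp and BIC for each subset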
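The comparison criteria above can also be computed without leaps. The following is only a minimal sketch in base R: it uses the built-in mtcars data and an arbitrary set of four candidate predictors as stand-ins (both are assumptions, not the exercise data), fits every non-empty subset, and tabulates RSS, R2, adjusted R2, AIC and BIC.

```r
# Exhaustive subset search in base R on the built-in mtcars data (illustrative only).
response   <- "mpg"
candidates <- c("wt", "hp", "disp", "qsec")

# Every non-empty subset of the candidate predictors.
subsets <- unlist(
  lapply(seq_along(candidates), function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each subset and collect the comparison criteria.
criteria <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = response), data = mtcars)
  data.frame(
    model = paste(vars, collapse = " + "),
    RSS   = sum(resid(fit)^2),
    R2    = summary(fit)$r.squared,
    adjR2 = summary(fit)$adj.r.squared,
    AIC   = AIC(fit),
    BIC   = BIC(fit)
  )
}))

# Smaller BIC is better; show the five best subsets.
best_first <- criteria[order(criteria$BIC), ]
print(head(best_first, 5), row.names = FALSE)
```

The full model always has the smallest RSS, which is why the ranking here is done on BIC rather than on RSS or R2.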

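3. Standardized equation of the selected model and the most influential variable

# x2 and x4 were selected here; use whichever variables your own selection returns
> fm_2=lm(formula = y ~ x2 + x4, data = q1)
> fm_2  # unstandardized equation after selection
> summary(fm_2)  # confirm that every p-value is below the required level
> # Q: Based on the variable selection result, write the standardized regression equation and state which independent variable has the greatest influence.
> library(mvstats)
> coef.sd(fm_2)
$coef.sd
         x2          x4 
 1.02788885 -0.03972031 

> # A: The standardized equation is y = 1.028x2 - 0.040x4. The larger the absolute value of a coefficient, the greater the influence, so x2 has the greatest influence here.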
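If mvstats is not available, standardized coefficients can be obtained by hand. The sketch below assumes that coef.sd() uses the usual definition (each slope multiplied by sd(x)/sd(y)), which is not confirmed by the source, and uses the built-in mtcars data as a stand-in. It cross-checks the result by refitting the model on standardized variables.

```r
# Standardized regression coefficients computed two ways (mtcars is a stand-in data set).
fit <- lm(mpg ~ wt + hp, data = mtcars)
b   <- coef(fit)[-1]                      # slopes without the intercept

# Definition: slope * sd(predictor) / sd(response).
std_manual <- b * sapply(mtcars[names(b)], sd) / sd(mtcars$mpg)

# Cross-check: refit after standardizing every variable.
z         <- as.data.frame(scale(mtcars[c("mpg", "wt", "hp")]))
std_refit <- coef(lm(mpg ~ wt + hp, data = z))[-1]

print(round(std_manual, 4))
# The predictor with the largest absolute standardized coefficient has the greatest influence.
most_influential <- names(which.max(abs(std_manual)))
print(most_influential)
```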

4. Best-subset selection (using R version 4.2.1):

> library(leaps)  # load the leaps package
> varsel = regsubsets(y ~ x1 + x2 + x3 + x4, data = yX)
> result = summary(varsel)
> data.frame(result$outmat, RSS = result$rss, R2 = result$rsq, adjR2 = result$adjr2, Cp = result$cp, BIC = result$bic)

5. Analysis

The p-values of the partial regression coefficients b2 and b4 are both less than 0.01, so the explanatory variables tax x2 and economically active population x4 are significant. The p-values of b1 and b3 are greater than 0.50, so the hypotheses b1 = 0 and b3 = 0 cannot be rejected, and gross domestic product x1 and total import and export trade x3 can be regarded as having no significant effect on fiscal revenue y. The partial regression coefficients for GDP and economically active population are both negative, which is inconsistent with economic reality. A possible reason is a high degree of collinearity among the explanatory variables.
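One common way to check the collinearity suspected above is the variance inflation factor (VIF). The sketch below is an illustration in base R on the built-in mtcars data, not the fiscal revenue data; the helper name vif_manual is hypothetical. Each VIF is 1 / (1 - R2), where R2 comes from regressing one predictor on the others.

```r
# Variance inflation factors computed by hand (mtcars is a stand-in data set).
vif_manual <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    # R2 from regressing predictor v on all remaining predictors.
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

predictors <- c("wt", "hp", "disp", "qsec")
vif <- vif_manual(mtcars, predictors)
print(round(vif, 2))
```

A frequently quoted rule of thumb treats VIF values above about 10 as a sign of serious collinearity, although the threshold is a convention rather than a test.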

6. Standardized partial regression coefficients and analysis of variance results

> coef.sd(fm)
> anova(fm)

2. Discriminant analysis

q2 = read.table("clipboard", header = TRUE); q2

library(MASS)

attach(q2)  # attach the data so that each column can be referenced by name

1. Accuracy of the linear and Bayesian discriminants

Linear (equal prior probabilities):

> ld = lda(G~x1+x2+x3+x4+x5+x6+x7,prior = c(1,1,1)/3)
> z = predict(ld)
> newG = z$class
> cbind(q2$G,z$x,newG)  # replace q2 with your own data name; G may need changing too
> tab=table(q2$G,newG)
> tab
> sum(diag(prop.table(tab)))  # the value printed here is the accuracy rate
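The accuracy computed this way is a resubstitution rate, because the same observations are used for fitting and for checking. The sketch below uses the built-in iris data as a stand-in (an assumption, not the exam data) and shows the same confusion-table calculation, together with the leave-one-out estimate that lda() returns when CV = TRUE.

```r
library(MASS)

# Linear discriminant with equal priors on the built-in iris data (illustrative only).
ld   <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3)
pred <- predict(ld)$class

# Confusion table: rows are the true groups, columns the predicted groups.
tab <- table(actual = iris$Species, predicted = pred)
print(tab)

# Resubstitution accuracy = share of observations on the diagonal.
acc <- sum(diag(tab)) / sum(tab)

# Leave-one-out cross-validated accuracy, usually a little lower.
cv     <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3, CV = TRUE)
acc_cv <- mean(cv$class == iris$Species)

print(c(resubstitution = acc, leave_one_out = acc_cv))
```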

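Bayesian (prior probabilities proportional to the group sizes):

> ld2 = lda(G~x1+x2+x3+x4+x5+x6+x7,prior = c(24,10,14)/48)
> # or simply omit prior, which then defaults to the sample proportions:
> ld2 = lda(G~x1+x2+x3+x4+x5+x6+x7,data = q2)
> ld2
> z2=predict(ld2)
> cbind(G,z2$x,z2$class)
> tab2 = table(G,z2$class)
> sum(diag(prop.table(tab2)))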

Quadratic discriminant:

# quadratic discriminant (unequal covariance matrices)
> qd=qda(G~Q+C+P,data=X, prior = c(1, 1, 1)/3)

2. Prediction

# Template: predict(ld, data.frame(x1 = ..., x2 = ...)), where ld is the model used for prediction and each x is given its new value.

> predict(ld,data.frame(x1=45,x2=1,x3=0,x4=1,x5=2,x6=33,x7=5.675))  # supply the new values

In the output, the $class component gives the predicted group (the G value).

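3. Linear discriminant functions and proportion of trace

> (ld=lda(G~Q+C+P,prior=c(1,1,1)/3))

Call:
lda(G ~ Q + C + P, prior = c(1, 1, 1)/3)

Prior probabilities of groups: # prior probabilities (set equal here; by default they are each group's share of the sample)
        1         2         3 
0.3333333 0.3333333 0.3333333 

Group means: # mean of each variable within each group
         Q        C      P
1 8.400000 5.900000 48.200
2 7.712500 7.250000 69.875
3 5.957143 3.714286 34.000

Coefficients of linear discriminants: # used to write the linear discriminant functions
          LD1         LD2
Q -0.92307369  0.76708185
C -0.65222524  0.11482179
P  0.02743244 -0.08484154

Proportion of trace: # discriminating power of each of the two discriminant functions
   LD1    LD2 
0.7259 0.2741

The two linear discriminant functions, applied to the centered variables, are therefore:
LD1 = -0.923Q - 0.652C + 0.027P
LD2 = 0.767Q + 0.115C - 0.085P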
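To see how the printed coefficients and the proportion of trace are used, the sketch below recomputes both by hand. It uses the built-in iris data as a stand-in for the Q, C, P example (an assumption). The discriminant scores are the centered data multiplied by the coefficient matrix, and the proportion of trace is each squared singular value divided by their sum.

```r
library(MASS)

ld <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3)
X  <- as.matrix(iris[, 1:4])

# lda centers the data at the prior-weighted average of the group means.
centre <- colSums(ld$prior * ld$means)

# Discriminant scores: centered data times the coefficients of linear discriminants.
scores_manual <- scale(X, center = centre, scale = FALSE) %*% ld$scaling
scores_pred   <- predict(ld)$x

# Proportion of trace from the singular values stored in the fitted object.
trace_prop <- ld$svd^2 / sum(ld$svd^2)
print(round(trace_prop, 4))
```

The manual scores agree with predict(ld)$x, which confirms that the discriminant functions act on centered variables rather than on the raw values.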

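4. Other examples and posterior probabilities

(1) Linear discriminant
> library(MASS)  # load the MASS package
> ld=lda(G~.,data=Case5,prior = c(1, 1)/2);ld  # linear discriminant; the dot means all other columns are used as predictors
> Zld=predict(ld)  # classify
> data.frame(Case5$G,Zld$class,round(Zld$x,3))
> addmargins(table(Case5$G,Zld$class))  # add marginal totals to the contingency table
       1  2 Sum
  1   24  1  25
  2    3 18  21
  Sum 27 19  46
The accuracy is (24+18)/46 = 91.3%.
(2) Quadratic discriminant
> qd=qda(G~.,data=Case5,prior=c(1,1)/2);qd  # quadratic discriminant
> Zqd=predict(qd)
> data.frame(Case5$G,Zqd$class,round(Zqd$post,3)*100)
> addmargins(table(Case5$G,Zqd$class))
     
       1  2 Sum
  1   24  1  25
  2    2 19  21
  Sum 26 20  46

The accuracy is (24+19)/46 = 93.5%.
(3) Bayesian discriminant
> ld2=lda(G~.,data=Case5);ld2  # no prior specified, so it defaults to the sample proportions
> Zld2=predict(ld2)
> data.frame(Case5$G,Zld2$class,round(Zld2$x,3))
> addmargins(table(Case5$G,Zld2$class))
     
       1  2 Sum
  1   24  1  25
  2    3 18  21
  Sum 27 19  46
> Zld2$post  # posterior probabilities
The posterior probabilities can also be extracted with predict(model)$posterior:
> predict(ld2)$posterior
Note when using lda and qda: both assume that the populations follow multivariate normal distributions, so use them with caution if this assumption does not hold.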

3. Logistic model

Note that the G value can only be 0 or 1

1. Optimal model after stepwise selection, and prediction

> q3 = read.table("clipboard",head=T)
> q3
# Recommended: go straight to stepwise selection.
> logit.glm = glm(G~x1+x2,family=binomial,data=q3)
> # summary(logit.glm)  # optionally check here whether the p-values meet the requirement
> glm.new = step(logit.glm)
> summary(glm.new)


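The results show that all coefficients pass the significance test at α = 0.1. If some did not, stepwise selection with glm.new = step(logit.glm) would be required. The regression model is then written from the coefficients reported by summary(glm.new).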

2. Predict the probability of y=1

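# Both approaches compute the probability that y = 1. The names in data.frame() must match the predictors kept in glm.new.
> p=predict(glm.new, data.frame(x=3.5), type="response")
> p
# Equivalently, predict on the link scale (pre is the predicted linear predictor) and apply the inverse logit:
> pre<-predict(glm.new, data.frame(x2=2,x3=0))
> p<-exp(pre)/(1+exp(pre));p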
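The two ways of getting the probability of y = 1 can be checked against each other. The sketch below fits a logistic model to the built-in mtcars data (am as the 0/1 response and wt as the predictor, both stand-ins chosen for illustration) and confirms that type = "response" matches the inverse logit of the default link-scale prediction.

```r
# Logistic regression on a stand-in data set: am (0/1) explained by wt.
fit     <- glm(am ~ wt, family = binomial, data = mtcars)
new_obs <- data.frame(wt = 3)            # column name must match the predictor

# Route 1: ask predict() for the probability directly.
p_response <- predict(fit, new_obs, type = "response")

# Route 2: predict the linear predictor, then apply the inverse logit.
eta      <- predict(fit, new_obs)        # default type = "link"
p_manual <- exp(eta) / (1 + exp(eta))

print(c(response = unname(p_response), manual = unname(p_manual)))
```

Base R also provides plogis(eta) for the inverse logit, which avoids overflow when eta is large.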

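3. Inverse prediction

The model can also be used for control. For example, what current intensity makes 50% of the cows respond?

> X<- - glm.sol$coefficients[1]/glm.sol$coefficients[2];X  # glm.sol is the fitted logistic model
(Intercept) 
   2.649439
That is, a current intensity of 2.65 mA makes 50% of the cows respond.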

When p = 0.5, ln(p/(1-p)) = 0, that is, b0 + b1*x = 0, so x = -b0/b1, where b0 is the intercept and b1 is the slope of the fitted model.
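The inverse prediction can be verified numerically. The sketch below again uses mtcars (am against wt) as a stand-in for the cow data, and the helper name x_for_p is hypothetical. It solves b0 + b1*x = logit(p) for x and checks that the model predicts p at that x.

```r
fit <- glm(am ~ wt, family = binomial, data = mtcars)
b   <- coef(fit)

# Solve b0 + b1 * x = logit(p) for x; with p = 0.5 this reduces to -b0 / b1.
x_for_p <- function(p) unname((qlogis(p) - b[1]) / b[2])

x50 <- x_for_p(0.5)
print(x50)

# Check: the predicted probability at x50 should be 0.5.
p_at_x50 <- predict(fit, data.frame(wt = x50), type = "response")
print(unname(p_at_x50))
```

The same helper gives the predictor value for any other target probability, for example x_for_p(0.9).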

4. Log-linear model, completely randomized design model, randomized block design model, factorial design model, orthogonal experimental design model

log.glm <-glm(y~x1+x2,family=poisson(link=log),data=x)

4. Principal component analysis (correlation coefficient matrix)

q4 = read.table("clipboard", header = TRUE)
q4

1. Principal component correlation

Typical questions: when the cumulative variance contribution rate must reach a given percentage, what is the minimum number of principal components, what is their cumulative variance contribution rate, what is the variance of the first principal component, and what is the expression of the first principal component?

> PCA=princomp(q4,cor = T)  # cor = T uses the correlation matrix
> PCA
> summary(PCA,loadings = T)
> screeplot(PCA,type="lines")  # scree plot

The signs of the coefficients of the first principal component show that the higher the consumption on x1 to x8, the smaller the value of Z1 and the larger its absolute value. In the second principal component the positive coefficients outweigh the negative ones, so the higher the consumption on x1 to x8, the greater the value of Z2.

2. Comprehensive score formula and ranking

Calculate the principal component scores:

> predict(PCA)

Comprehensive score and ranking:

> princomp.rank(PCA,m=2,plot=T)  # ranking based on the first m = 2 principal components
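princomp.rank() comes from mvstats. A comparable comprehensive score can be sketched in base R under the assumption, not confirmed by the source, that the function weights each retained component by its share of the retained variance. The built-in USArrests data stands in for q4.

```r
# Comprehensive score from the first m principal components (USArrests is a stand-in).
pca <- princomp(USArrests, cor = TRUE)
m   <- 2

# Weights: variance of each retained component over the total retained variance.
w <- pca$sdev[1:m]^2 / sum(pca$sdev[1:m]^2)

# predict(pca) returns the same component scores that are stored in pca$scores.
scores    <- predict(pca)[, 1:m]
composite <- drop(scores %*% w)

# Rank 1 = highest comprehensive score.
ranking <- rank(-composite)
print(head(sort(ranking), 5))
```

The sign of each principal component is arbitrary, so the direction of the ranking should be checked against the loadings before it is interpreted.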

3. Other

Principal component variance = square of the standard deviation reported by summary(PCA).
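This relationship, and the exam questions about contribution rates, can be worked through in a few lines. The sketch below uses the built-in USArrests data as a stand-in, and the 85% threshold is an arbitrary example value. With cor = TRUE the component variances are the eigenvalues of the correlation matrix.

```r
pca <- princomp(USArrests, cor = TRUE)

# Variance of each component = squared standard deviation.
pc_var <- pca$sdev^2

# Variance contribution rate and cumulative contribution rate.
contrib <- pc_var / sum(pc_var)
cum     <- cumsum(contrib)
print(round(rbind(variance = pc_var, contribution = contrib, cumulative = cum), 4))

# Smallest number of components whose cumulative contribution reaches the threshold.
threshold <- 0.85
m <- unname(which(cum >= threshold)[1])
print(m)

# Coefficients of the first principal component (its expression in the standardized variables).
print(round(pca$loadings[, 1], 3))
```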

5. Cluster analysis

q5 = read.table("clipboard", header = TRUE); q5

qb5 = scale(q5)  # standardize the data

1. Using different methods for clustering

The samples are divided into three clusters. With the Ward method, the smallest cluster contains ____ samples; with the longest distance (complete linkage) method, the largest cluster contains ____ samples. The more reasonable of the two methods is ____.

d is the distance measure: euclidean (Euclidean distance), maximum (Chebyshev distance), manhattan (absolute value distance), canberra (Lance distance), minkowski (Minkowski distance).

m is the hierarchical clustering method: single (shortest distance method), complete (longest distance method), average (group average method), median (median distance method), centroid (centroid method), ward.D (Ward method). proc controls whether the clustering process is printed, and plot controls whether the dendrogram is drawn.

# Euclidean distance + Ward method
> qb5_eu_wd = H.clust(qb5,"euclidean","ward.D",plot=T);rect.hclust(qb5_eu_wd,k=3)

# Euclidean distance + longest distance (complete linkage) method
> qb5_eu_cop = H.clust(qb5,"euclidean","complete",plot=T);rect.hclust(qb5_eu_cop,k=3)
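H.clust() is a convenience wrapper from mvstats. Assuming it combines dist() and hclust() (an assumption about its internals), the cluster sizes asked for in the question can be read off in base R as sketched below, with the built-in USArrests data standing in for q5.

```r
# Hierarchical clustering with two linkage methods (USArrests is a stand-in data set).
Z <- scale(USArrests)                 # standardize first, as in the exercise
D <- dist(Z, method = "euclidean")    # Euclidean distance matrix

hc_ward     <- hclust(D, method = "ward.D")
hc_complete <- hclust(D, method = "complete")

# Cut each tree into 3 clusters and count the members of each cluster.
size_ward     <- table(cutree(hc_ward, k = 3))
size_complete <- table(cutree(hc_complete, k = 3))

print(size_ward)
print(size_complete)
cat("smallest Ward cluster:", min(size_ward),
    " largest complete-linkage cluster:", max(size_complete), "\n")
```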

2. kmeans

> (km<-kmeans(Z,5))  # k-means clustering of data Z into 5 clusters
> plot(km$cluster)  # plot the cluster assignments
> identify(km$cluster,labels=names(km$cluster),n=length(km$cluster),tolerance =0.25)  # click a point to show its label
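k-means starts from random centers, so repeated runs can give different partitions. A reproducible variant is sketched below, with the built-in USArrests data standing in for Z; the seed and nstart values are arbitrary choices.

```r
set.seed(1)                              # fix the random starting centers
Z  <- scale(USArrests)                   # stand-in for the standardized data
km <- kmeans(Z, centers = 5, nstart = 25)  # keep the best of 25 random starts

print(table(km$cluster))                 # cluster sizes

# Share of the total sum of squares explained by the clustering.
explained <- km$betweenss / km$totss
print(round(explained, 3))
```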

6. Factor analysis

The principal component method was used to conduct factor analysis on the samples, and the number of common factors was 4.

Maximum likelihood + rotation: FA0=factanal(X,3,rotation="varimax")

Maximum likelihood, no rotation: FA0=factanal(X,3,rotation="none")

Principal component + rotation: FA1=factpc(X,3,rotation="varimax")

Principal component, no rotation: FA1=factpc(X,3)

1. Build a factor analysis model

Use the variance maximization (varimax) method for factor rotation, write out the factor model, give the variance contribution rates of the first two factors, and identify the variable with the largest communality.

Factor model:

> q6_zcf_xz = factpc(q6,4,rotation="varimax");q6_zcf_xz

 Factor Analysis for Princomp in Varimax: 

$Vars
         Vars Vars.Prop Vars.Cum
Factor1 3.040     33.78    33.78
Factor2 1.932     21.46    55.24
Factor3 1.277     14.19    69.43
Factor4 1.079     11.98    81.42

$loadings  # rotated loading matrix
     Factor1  Factor2   Factor3   Factor4
x1  0.049430  0.92535  0.076684 -0.103147
x2  0.249707  0.85154 -0.253035 -0.184190
x3  0.715246  0.41793 -0.057583 -0.144488
x4 -0.002995 -0.23212  0.005695  0.935301
x5  0.796492  0.09432  0.461953  0.105807
x6  0.063679 -0.11709  0.927519 -0.008258
x7  0.865334  0.10109 -0.098568  0.173427
x8  0.702326  0.28098  0.346513  0.051310
x9  0.763471 -0.09983 -0.027029 -0.307292

$scores  # rotated factor scores
         Factor1  Factor2   Factor3   Factor4
张三   -0.913224 -0.09734 -0.006191  0.796441
刘明    0.577338  0.61277  0.186427 -0.214724
安宁   -0.321587  0.36309  0.422861 -1.224428
王浩    1.545656  0.14090  0.570897 -0.007294
田一杰 -0.371850 -0.26984  0.784478 -0.277053
杨桐    0.395438  1.07307  0.858988  0.527898
邹文杰  0.008653  0.19757 -1.583081  2.035428
王哲    0.751191  1.37037  0.980507  0.557261
罗丽    0.381450 -0.09022  1.224255  2.128442
郑涛    1.272876  0.85438  0.743912 -1.510462
张磊   -0.456377  0.95169 -0.053060 -1.325094
王晓    1.193712 -0.82341  0.030743 -0.216747
兰陵   -1.430795  0.46809 -0.052472  0.657802
孙鑫   -2.100622 -0.13697  0.611105 -0.960262
陈翔   -0.053436 -0.94006 -0.882093 -0.812503
常广    0.627591 -3.40662  0.831341 -0.140313
石飞跃 -1.112887 -0.03864 -0.968510 -0.725039
唐伯虎 -0.470003 -0.42217 -0.852541  0.272069
马一杰  1.390098  0.29068 -2.841374 -0.357864
徐盛   -0.913224 -0.09734 -0.006191  0.796441

$Rank  # comprehensive score (F) and rank (Ri)
             F   Ri
张三   -0.2883 13.5
刘明    0.4019  6.0
安宁   -0.1442 12.0
王浩    0.7768  2.0
田一杰 -0.1294 10.0
杨桐    0.6744  3.0
邹文杰  0.0793  9.0
王哲    0.9258  1.0
罗丽    0.6612  4.0
郑涛    0.6606  5.0
张磊   -0.1427 11.0
王晓    0.2516  7.0
兰陵   -0.3825 15.0
孙鑫   -0.9424 20.0
陈翔   -0.5434 18.0
常广   -0.5134 17.0
石飞跃 -0.7474 19.0
唐伯虎 -0.4149 16.0
马一杰  0.1053  8.0
徐盛   -0.2883 13.5

$common  # communalities
    x1     x2     x3     x4     x5     x6     x7     x8     x9 
0.8752 0.8854 0.7104 0.9287 0.8679 0.8781 0.7988 0.6949 0.6880 
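The quantities in this output can be reproduced in outline with base R. The sketch below is an assumption about what a principal component factor analysis such as factpc() does, not its actual source: loadings are eigenvectors of the correlation matrix scaled by the square roots of the eigenvalues, rotation uses stats::varimax(), and each communality is the row sum of squared loadings. Six mtcars columns stand in for q6.

```r
# Principal component factor method with varimax rotation (stand-in data).
R <- cor(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
e <- eigen(R)
m <- 2                                         # number of common factors

# Unrotated loadings: eigenvectors scaled by sqrt(eigenvalues).
L <- e$vectors[, 1:m] %*% diag(sqrt(e$values[1:m]), nrow = m)
rownames(L) <- rownames(R)

# Varimax rotation of the loading matrix.
L_rot <- unclass(varimax(L)$loadings)

# Communality of each variable; an orthogonal rotation leaves it unchanged.
comm_before <- rowSums(L^2)
comm_after  <- rowSums(L_rot^2)

# Variance contribution rate of each rotated factor.
var_contrib <- colSums(L_rot^2) / nrow(R)

print(round(L_rot, 3))
print(round(comm_after, 4))
print(round(var_contrib, 4))
```

The variable with the largest communality is names(which.max(comm_after)). In the factpc output above, the corresponding entry of $common is x4 (0.9287).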

2. Which variables each factor F can be regarded as a common factor of, and which variable has the largest loading

3. Calculate the comprehensive factor score for comprehensive ranking

> factanal.rank(q6_zcf_xz,plot=T)

Comprehensive scoring formula:
F = (0.40366*F1 + 0.32449*F2 + 0.15937*F3) / 0.8875

Here 0.40366, 0.32449 and 0.15937 are the variance contribution rates of F1, F2 and F3, and 0.8875 is the cumulative variance contribution rate of the first three factors.

4. Interpretation of the factors

The rotated factor loading matrix shows the following.

Common factor F1 has large loadings on x1 (food expenditure per capita), x5 (traffic and communication expenditure per capita), x7 (living expenditure per capita) and x8 (miscellaneous goods and service expenditure per capita), so it can be regarded as a common factor reflecting consumption of daily necessities.

Common factor F2 has large loadings on x3 (per capita expenditure on household equipment and services), x4 (per capita expenditure on health care) and x6 (per capita expenditure on entertainment, education and culture), so it can be regarded as a common factor reflecting relatively high-end consumption.

Common factor F3 has a large loading only on x2 (clothing expenditure per capita), so it can be regarded as a clothing factor.

In this way, the consumption of each province, municipality and autonomous region can be evaluated.

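7. Given the pairwise distances, perform hierarchical clustering with the shortest distance (single linkage) method and draw the dendrogram.

   1  2  3  4  5
1  0  4  6  1  6
2  4  0  9  7  3
3  6  9  0 10  5
4  1  7 10  0  8
5  6  3  5  8  0
(x=matrix(c(0,4,6,1,6,4,0,9,7,3,6,9,0,10,5,1,7,10,0,8,6,3,5,8,0),5))  # build the 5 x 5 distance matrix
D=as.dist(x)  # x already holds the pairwise distances, so convert it directly instead of applying scale() and dist()
hc=hclust(D,"single")  # shortest distance (single linkage) method
cbind(hc$merge,hc$height)  # merge steps and merge heights
plot(hc)  # draw the dendrogram
rect.hclust(hc,k=2)  # draw boxes around the clusters; k=2 means 2 clusters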
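For a by-hand answer, it helps to trace the merges of the shortest distance method step by step. The sketch below is a naive implementation written for illustration (not how hclust() works internally) that repeatedly merges the two clusters with the smallest pairwise distance and prints each step for the matrix in this exercise.

```r
# Naive single-linkage agglomeration on the distance matrix from the exercise.
d <- matrix(c(0,4,6,1,6, 4,0,9,7,3, 6,9,0,10,5, 1,7,10,0,8, 6,3,5,8,0), nrow = 5)

clusters <- as.list(1:5)      # each observation starts as its own cluster
heights  <- numeric(0)        # distance at which each merge happens

while (length(clusters) > 1) {
  best <- c(NA, NA)
  best_d <- Inf
  for (i in 1:(length(clusters) - 1)) {
    for (j in (i + 1):length(clusters)) {
      # Single linkage: cluster distance = smallest distance between members.
      dij <- min(d[clusters[[i]], clusters[[j]]])
      if (dij < best_d) {
        best_d <- dij
        best <- c(i, j)
      }
    }
  }
  cat("merge {", clusters[[best[1]]], "} and {", clusters[[best[2]]],
      "} at height", best_d, "\n")
  clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters[[best[2]]] <- NULL   # drop the cluster that was absorbed
  heights <- c(heights, best_d)
}
```

The merge heights can be compared with hclust(as.dist(d), "single")$height as a cross-check.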

8. Open-ended analysis of a given data set

  1. Cluster analysis (macro analysis, regional division)

In 2003 the telecommunications industry in Guangdong Province showed both regional differences and a trend toward concentrated development. Cluster analysis can divide the cities of Guangdong Province into several categories, where each category represents a different level of development and contains cities with similar levels of development. The analysis also offers a lesson: when developing its telecommunications industry, a city should not only emphasize total communication volume but also pay attention to per capita volume and to penetration across the whole region. Only when the per capita level rises can a city's telecommunications level be said to have truly improved, and only a telecommunications industry that develops in an all-round way can withstand the impact of WTO accession and remain competitive. At the same time, although Guangdong's telecommunications industry ranked first in the country by total volume in 2003, the differences among its regions are serious. The Pearl River Delta is developing rapidly, with a large volume of telecommunications business and a high market share, while economically underdeveloped areas, especially mountainous and rural areas, are developing slowly, with a small volume and a low share. The Guangdong provincial government should therefore speed up telecommunications construction in economically underdeveloped areas, vigorously expand the telecommunications market in mountainous areas, and take supportive measures to build up rural markets and promote coordinated development across the province. Cities that lag behind should also actively take measures to accelerate their own development and improve their competitiveness, so that they do not become a drag on the province.

  2. Principal component analysis (micro analysis, comprehensive ranking)

Because there are many indicators, a direct comprehensive analysis is inconvenient. Principal component analysis is therefore used first to extract the principal components, and the analysis is then carried out on them. Running the analysis in R shows that two principal components, Comp.1 and Comp.2, can be extracted. Together they account for 96.14% of the total variance, so they essentially carry the information of all the indicators.
The first principal component, Comp.1, is mainly determined by five indicators: total amount of telecommunication services, international Internet users, time spent online by Internet users, long-distance call volume, and long-distance call duration. These five indicators are aggregate factors that reflect the scale of a city's telecommunications industry and the development level of its communication services.
The second principal component, Comp.2, is mainly determined by the number of fixed telephones per 100 people and the number of mobile phones per 100 people. These two indicators are per capita measures that reflect telephone penetration.
Since the two selected principal components, PC1 and PC2, represent 96.14% of the information, they essentially represent all of the indicators, so the extracted components are used for the comprehensive analysis of each city.
The seven indicators can thus be replaced by two composite indicators with little loss of information. On this basis we can calculate the component scores of each city and also weight each principal component by its contribution rate, that is, use the formula (0.738 × PC1 + 0.223 × PC2) / (0.738 + 0.223) to calculate the comprehensive score of each city's telecommunications development level and rank the cities accordingly. The principal component scores and rankings are shown in the figure below.

  3. Analysis

After ranking the cities, we find that Shenzhen, Guangzhou, Dongguan, Huizhou and Foshan rank at the top, while the relatively backward areas include Shanwei, Zhanjiang, Maoming and Yangjiang.
The principal component score chart also shows clearly that Shenzhen has the highest scores on both the first principal component Comp.1 and the second principal component Comp.2. Huizhou, Zhongshan and Maoming are worth comparing. Looking back at the data, although Huizhou's Comp.1 score, that is, its level of communication development, is lower than that of Zhongshan, its Comp.2 score, that is, its level of telephone penetration, is far higher than that of Zhongshan, and Comp.2 accounts for 22.34% of the total variance, which cannot be ignored. Maoming, by contrast, does not have enough Internet users or per capita telephone penetration, so its scores on both principal components are not high and its Comp.2 score is particularly low, which is why it ranks relatively low. The principal component score chart shows the following:
(1) Guangzhou lies in the second quadrant, far from both the Comp.1 and Comp.2 axes. Its Comp.1 score is relatively high, second only to Shenzhen, but its Comp.2 score is low. Comp.1 represents the total level of communication business development in the telecommunications industry, while Comp.2 represents the per capita level. Guangzhou is the capital of Guangdong, its economic and cultural development is good, and the overall development of its telecommunications industry is also good, so its Comp.1 score is second only to Shenzhen. However, Guangzhou is also a large open city with a large population, and its population has grown faster than its telecommunications industry, so its per capita figures are not as high as Shenzhen's.
(2) Meizhou and Huizhou are roughly the opposite of Guangzhou. Their total telecommunications volume is lower than Guangzhou's, but because their populations are relatively small and their per capita traffic is high, their Comp.1 scores are relatively low while their Comp.2 scores are high. In the principal component chart they lie very close to the Comp.2 axis and far from the Comp.1 axis. Because of this distinctive pattern, we place them in a separate category.
(3) Shenzhen lies relatively far from the origin and relatively far from both the Comp.1 and Comp.2 axes, which shows that its scores on both components are relatively high. As a special economic zone, Shenzhen has developed rapidly in all respects since the reform and opening up. It is a developed city with many mobile phone users, and in recent years mobile phones have risen quickly to occupy an important position in the telecommunications industry. Unlike Guangzhou, Shenzhen's total population is not very large, so its telephone penetration rate can reach a high level, which gives it a higher Comp.2 score. At the same time, it has many telephone and Internet users and the total development of its telecommunications industry is also good, so its Comp.1 score is high, ranking first among all cities in Guangdong. The very high Comp.1 score and relatively high Comp.2 score put Shenzhen ahead of Guangzhou in first place.

  4. Factor analysis

Analysis of the results: ① The factor score table shows that the four companies with the highest scores on the profitability factor are Conch Cement, Fujian Cement, Jidong Cement and Qilianshan. Their scores are much higher than those of the other companies, which indicates much stronger profitability, while the companies with relatively weak profitability are Jianfeng Group, Xishui and Mudanjiang. ② Fujian Cement, Conch Cement and Sichuan Jinding score higher on the solvency factor, which indicates that these three companies have relatively good solvency within the cement industry, while Lionhead and Datong Cement score low on this factor, which indicates relatively poor solvency that they should work to improve. ③ On the development ability factor, Xishui and Conch Cement score much higher than the other companies, which is consistent with the steady rise of these two stocks from 2008 to the present and reflects their good development capabilities. It also shows that, among listed companies in the cement industry, only a few have good development capabilities, and many focus on short-term profits rather than long-term, stable development, a point that deserves the attention of the companies concerned. Sichuan Jinding has the lowest score on the development ability factor, which indicates the weakest development ability. Its scores on the first two factors are not high either, and it is at the bottom of the comprehensive ranking. This company should therefore start from within, carry out rectification, and improve its operating capabilities as a whole in order to improve its operating performance.

9. Data visualization related

10. Correspondence analysis

Origin blog.csdn.net/Destinyxzc/article/details/130574906