This blog mainly records the summaries I made after studying "Data Analysis Methods" for a semester: the concepts, ideas, and key points of various data analysis methods, together with their SAS implementations and the interpretation of the results. (Because the course covered relatively little, I only summarize the methods I have actually studied.) Where an implementation is given, it uses SAS 9.2, and the examples focus on explaining the SAS output.
1. Linear regression analysis
Thought
The exact relationship between a variable Y and related variables X1, X2, ..., Xk cannot be known precisely. The value of Y consists of two parts: one part is determined by X1, X2, ..., Xk and can be expressed as some function of them, Y = f(X1, X2, ..., Xk); the other part is the influence of many unconsidered factors, which is regarded as a random error and denoted ε. Thus:
Y = f(X1,X2,…,Xk) + ε
In medicine, people's height and weight, body temperature and pulse rate, age and blood pressure, drug dosage and efficacy are all related. It is the task of correlation analysis to explain the closeness of the relationship between objective things or phenomena and express them with appropriate statistical indicators. Representing the relationship between objective things or phenomena in functional form is the problem to be solved by regression analysis.
Regression analysis determines the relationship between one continuous variable and other continuous variables, for the purposes of interpretation and prediction.
example
To understand the relationship between the annual salary Y of researchers at a certain institute and their paper quality X1, working years X2, and funding index X3, survey data on 24 researchers were collected as follows:
Assuming the error follows a normal distribution, establish a regression equation; then, given a new person's observations (x01, x02, x03) = (5.1, 20, 7.2), predict the annual salary and give a 95% prediction interval.
The running code is as follows:
data examp2_3;
input y x1-x3@@;
cards;
33.2 3.5 9 6.1
40.3 5.3 20 6.4
38.7 5.1 18 7.4
46.8 5.8 33 6.7
41.4 4.2 31 7.5
37.5 6.0 13 5.9
39.0 6.8 25 6.0
40.7 5.5 30 4.0
30.1 3.1 5 5.8
52.9 7.2 47 8.3
38.2 4.5 25 5.0
31.8 4.9 11 6.4
43.3 8.0 23 7.6
44.1 6.5 35 7.0
42.8 6.6 39 5.0
33.6 3.7 21 4.4
34.2 6.2 7 5.5
48.0 7.0 40 7.0
38.0 4.0 35 6.0
35.9 4.5 23 3.5
40.4 5.9 33 4.9
36.8 5.6 27 4.3
45.2 4.8 34 8.0
35.1 3.9 15 5.0
. 5.1 20 7.2
;
run;
proc reg data=examp2_3;
model y=x1-x3/i r cli clm;
output out=d h=f;
run;
/* model y=x1-x3 fits the linear regression of y on x1, x2, x3;
   the i option prints (X'X)^(-1);
   the r option prints residual and influence diagnostics, including the
     standard errors of the fitted values, the residuals, the studentized
     residuals, and Cook's distance;
   clm prints 95% confidence limits for the mean response, and cli prints
     95% prediction limits for an individual response;
   output out=d h=f saves the leverage values x_i'(X'X)^(-1)x_i to data set d
*/
Main running results:
The REG Procedure
Model: MODEL1
Dependent Variable: y

Parameter Estimates

                       Parameter    Standard
Variable        DF      Estimate       Error    t Value    Pr > |t|
Intercept        1      17.84693     2.00188       8.92      <.0001
x1               1       1.10313     0.32957       3.35      0.0032
x2               1       0.32152     0.03711       8.66      <.0001
x3               1       1.28894     0.29848       4.32      0.0003
Here all the test p-values are less than 0.05, so every coefficient in the model is statistically significant. If one or more coefficients had failed the test, the model would need to be re-specified with a different equation, or the variables transformed based on residual-plot analysis (a common data transformation is the Box-Cox transformation).
This gives the regression model: Y = 17.84693 + 1.10313X1 + 0.32152X2 + 1.28894X3
Interval estimates (last rows of the output):
       Dependent  Predicted    Std Error                                                       Std Error    Student
 Obs    Variable      Value  Mean Predict      95% CL Mean       95% CL Predict     Residual    Residual   Residual
17 34.2000 34.0262 0.9062 32.1359 35.9164 29.9103 38.1420 0.1738 1.500 0.116
18 48.0000 47.4522 0.6708 46.0530 48.8515 43.5374 51.3670 0.5478 1.619 0.338
19 38.0000 41.2463 0.7798 39.6197 42.8729 37.2446 45.2480 -3.2463 1.570 -2.068
20 35.9000 34.7173 0.7960 33.0568 36.3778 30.7017 38.7328 1.1827 1.562 0.757
21 40.4000 41.2814 0.6008 40.0280 42.5347 37.4163 45.1464 -0.8814 1.647 -0.535
22 36.8000 38.2479 0.6460 36.9005 39.5954 34.3514 42.1445 -1.4479 1.629 -0.889
23 45.2000 44.3852 0.8309 42.6520 46.1184 40.3390 48.4313 0.8148 1.543 0.528
24 35.1000 33.4166 0.5819 32.2029 34.6304 29.5643 37.2690 1.6834 1.653 1.018
25 . 39.1837 0.5639 38.0073 40.3600 35.3429 43.0244 . . .
So for (x01, x02, x03) = (5.1, 20, 7.2), the predicted value of y0 is 39.1837, with a 95% prediction interval of (35.3429, 43.0244).
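As a cross-check of the SAS output above, the same least-squares fit can be reproduced by solving the normal equations (X'X)b = X'y directly. This is an illustrative sketch in plain Python (not part of the course's SAS material), using the 24 complete observations from the example:

```python
# Ordinary least squares via the normal equations, solved by
# Gaussian elimination; pure standard-library Python.
data = [
    (33.2, 3.5,  9, 6.1), (40.3, 5.3, 20, 6.4), (38.7, 5.1, 18, 7.4),
    (46.8, 5.8, 33, 6.7), (41.4, 4.2, 31, 7.5), (37.5, 6.0, 13, 5.9),
    (39.0, 6.8, 25, 6.0), (40.7, 5.5, 30, 4.0), (30.1, 3.1,  5, 5.8),
    (52.9, 7.2, 47, 8.3), (38.2, 4.5, 25, 5.0), (31.8, 4.9, 11, 6.4),
    (43.3, 8.0, 23, 7.6), (44.1, 6.5, 35, 7.0), (42.8, 6.6, 39, 5.0),
    (33.6, 3.7, 21, 4.4), (34.2, 6.2,  7, 5.5), (48.0, 7.0, 40, 7.0),
    (38.0, 4.0, 35, 6.0), (35.9, 4.5, 23, 3.5), (40.4, 5.9, 33, 4.9),
    (36.8, 5.6, 27, 4.3), (45.2, 4.8, 34, 8.0), (35.1, 3.9, 15, 5.0),
]
y = [row[0] for row in data]
X = [[1.0, row[1], row[2], row[3]] for row in data]  # intercept column first

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

p = 4
XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
Xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(p)]
beta = solve(XtX, Xty)

# Point prediction for the new observation (5.1, 20, 7.2)
y0 = beta[0] + 5.1 * beta[1] + 20 * beta[2] + 7.2 * beta[3]
print([round(b, 5) for b in beta], round(y0, 4))
```

The coefficients should match the SAS Parameter Estimates table, and y0 should match the predicted value for observation 25. (The interval itself would additionally need the residual variance and the leverage value, which SAS computes via the cli option.)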
2. Analysis of variance
Thought
The total variation of the data is separated into systematic error and random error.
The two are then compared under the given conditions. If the systematic error is not much larger than the random error, the factor is judged to have little influence on the index; if the systematic error is much larger than the random error, the condition has a great influence.
A typical problem: there are several different raw materials, and we want to examine whether they have a significant effect on product quality.
Or a new drug and several traditional drugs are tested on groups of patients, to examine whether the cure rates differ significantly between drugs. The conditions being examined here (raw materials, drugs) are called factors.
When only one factor is examined, we call it a single-factor (one-way) problem. If two or more factors are considered at the same time, it is a multi-factor analysis of variance, which is computationally more complex.
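The decomposition described above can be sketched in a few lines. This is an illustrative Python version of one-way ANOVA (the course itself used SAS): the between-group mean square plays the role of the systematic error and the within-group mean square the random error, and their ratio is the F statistic.

```python
# Minimal one-way ANOVA sketch: split total variation into
# between-group (systematic) and within-group (random) parts.
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of sample lists."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: variation of group means around the grand mean
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: variation of observations around their group mean
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return msb / msw, k - 1, n - k

F, df1, df2 = one_way_anova([[1, 2, 3], [2, 3, 4]])
print(F, df1, df2)
```

The F value would then be compared against the F distribution with (df1, df2) degrees of freedom to decide whether the factor's influence is significant.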
3. Principal component analysis
Thought
Need and possibility (dimension reduction): in practical problems, to obtain relevant information as completely as possible, one often has to consider numerous variables. This avoids omitting important information, but it increases the complexity of the analysis. Generally, the many variables involved in the same problem are correlated to some degree, and this correlation means that the information in the variables "overlaps". If a small number of new variables with non-overlapping information can reflect most of the information provided by the original variables, the problem can be solved by analysing only those few new variables.
Principal component analysis and canonical correlation analysis are both statistical methods for handling high-dimensional data under the idea of dimension reduction; both extract information by constructing appropriate linear combinations of the original variables. Principal component analysis focuses on the "dispersion" (variance) information of the variables. Its main purpose is to "transform" the original variables so as to reduce their dimension as much as possible without losing too much of their information, that is, to replace the original variables with fewer "new variables". This involves:
(1) dimensionality reduction of variables;
(2) Explanation of principal components.
geometric meaning
From an algebraic point of view, principal components are particular linear combinations of the p variables; from a geometric point of view, these linear combinations form a new coordinate system obtained by rotating the coordinate system defined by X1, ..., Xp, with the new axes pointing in the directions of greatest sample variation (largest sample variance).
There are n observations, each observation has p variables X1,...,Xp, and their comprehensive indicators (principal components) are recorded as Y1,...,Yp.
In general, p variables form a p-dimensional space, and n sample points are n points in that space. For p-variate normally distributed variables, finding the principal components amounts to finding the principal axes of the ellipsoid in p-dimensional space.
seek law
The idea of principal component analysis is to construct a series of linear combinations of the original variables that successively maximize the (sample) variance.
The method is to find all the eigenvalues of the covariance matrix (or correlation matrix) together with the corresponding orthonormal eigenvectors; the variance of the kth principal component is the kth largest eigenvalue, and its coefficients are the corresponding orthonormal eigenvector.
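For the two-variable case, the eigen-decomposition just described has a closed form, so the procedure can be sketched directly. This is an illustrative Python example (not from the course material): the eigenvalues of the 2x2 sample covariance matrix are the variances of the two principal components.

```python
# PCA sketch for 2-D data: eigenvalues of the sample covariance matrix,
# computed in closed form via the trace and determinant.
import math

def pca_2d_variances(points):
    """Return the variances of PC1 and PC2 (eigenvalues, largest first)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    root = math.sqrt(max(tr * tr - 4 * det, 0.0))
    return (tr + root) / 2, (tr - root) / 2

# Points lying exactly on a line: all variance lands on the first component.
l1, l2 = pca_2d_variances([(0, 0), (1, 1), (2, 2), (3, 3)])
print(l1, l2)
```

Here the second eigenvalue is zero, so one "new variable" carries all the information of the two original ones; this is the dimension-reduction idea in its purest form.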
4. Canonical Correlation Analysis
concept
It is a statistical method for identifying and quantifying the correlation between two groups of variables; it can effectively reveal the linear interdependence between the two groups.
Canonical correlation analysis rests entirely on identifying and quantifying the statistical correlations between the two groups of variables.
Thought
Construct an appropriate linear combination within each group of variables, converting the correlation between the two groups of variables into the correlation between two single variables for analysis; this accomplishes the dimension reduction.
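In the degenerate case where each "group" contains only one variable, no linear combination needs to be chosen and the canonical correlation reduces to the ordinary Pearson correlation between the two variables. A small illustrative Python sketch of that base case (an assumption for illustration, not the general multi-variable algorithm):

```python
# Pearson correlation: the canonical correlation when each group
# of variables contains exactly one variable.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# y is an exact linear function of x, so the correlation is 1.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
print(r)
```

In the general case, the method searches over linear combinations of each group to maximize exactly this quantity between the two combined variables.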
5. Discriminant Analysis
concept
Using historical, already-classified data and some optimality criterion, determine a discriminant rule that decides which class a new sample belongs to.
thought model
There are n samples, each with measurements on p indices (variables). Each sample is known to belong to one of k classes (populations) G1, G2, ..., Gk, whose distribution functions are F1(x), F2(x), ..., Fk(x).
We hope to use these data to construct a discriminant function with certain optimal properties: one that separates sample points from different classes as well as possible and, given a new sample measured on the same p indices (variables), can determine which class that sample belongs to.
method
- Discriminant analysis is rich in content and has many methods.
- By the number of populations to be distinguished, there are two-population discriminant analysis and multi-population discriminant analysis;
- By the mathematical model used to distinguish the populations, there are linear discrimination and nonlinear discrimination;
- By how the variables are handled during discrimination, there are stepwise discrimination and sequential discrimination.
- Discriminant analysis can pose the problem from different angles, leading to different discriminant criteria, such as the minimum Mahalanobis distance criterion, the Fisher criterion, the minimum average loss criterion, the least squares criterion, the maximum likelihood criterion, and the maximum probability criterion, each giving rise to a different discrimination method.
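The minimum-distance idea in the list above can be sketched very simply. This is an illustrative Python example (an assumption for demonstration, not the course's SAS procedure): assign a new sample to the class whose mean is nearest. With an identity covariance matrix the Mahalanobis distance reduces to the ordinary Euclidean distance used here.

```python
# Nearest-mean discriminant rule: Euclidean distance, i.e. the
# Mahalanobis distance under an assumed identity covariance matrix.
import math

def nearest_mean(sample, class_means):
    """Return the label of the class whose mean is closest to the sample."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    return min(class_means, key=lambda label: dist(sample, class_means[label]))

# Hypothetical class means estimated from historical, classified data.
means = {"G1": (0.0, 0.0), "G2": (10.0, 10.0)}
label = nearest_mean((1.0, 2.0), means)
print(label)
```

A full Mahalanobis-distance rule would additionally estimate each population's covariance matrix from the historical data and use it to weight the distance.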
6. Cluster analysis
Thought
Cluster analysis is a numerical classification method (i.e., based solely on relationships in the data). To carry out cluster analysis, one must first establish an index system composed of some attributes of the things being classified, that is, a set of variables. Each selected index must describe some aspect of the things' attributes; together, all the indicators form a complete index system that jointly describes the things' characteristics.
The so-called complete indicator system means that the selected indicators are sufficient, and any other newly added variables have no significant contribution to distinguishing the difference between things. If the selected indicator is incomplete, it will lead to classification bias. For example, to classify the parenting style of a family, there must be a series of variables describing the parenting style of the family, and these variables can fully reflect the parenting style of the children in different families.
Simply put, the results of cluster analysis depend on both the selection of variables and the acquisition of variable values. The more accurate the selection of variables and the more reliable the measurement, the more the classification results obtained can describe the essential differences between various types of things.
describe
Cluster analysis is carried out entirely according to the data. For a data file consisting of n cases and k variables, clustering the cases is equivalent to grouping n points in a k-dimensional coordinate system based on their distances; clustering the variables is equivalent to grouping k points in an n-dimensional coordinate system, likewise based on distance. So distance, or degree of similarity, is the basis of cluster analysis.
In a word, cluster analysis takes many observed indicators from a batch of samples, computes the similarity between samples (or between indicators) according to some mathematical formula, and then places similar samples or indicators into the same class and dissimilar ones into different classes.
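The merge-by-similarity process just described can be sketched with a tiny agglomerative example. This is an illustrative Python sketch (single linkage on 1-D points, an assumption for demonstration; real cluster procedures offer many distance and linkage choices): start with every point as its own cluster, then repeatedly merge the two closest clusters.

```python
# Agglomerative (single-linkage) clustering sketch on 1-D points:
# repeatedly merge the two clusters with the smallest minimum gap
# until only k clusters remain.
def single_linkage(points, k):
    clusters = [[p] for p in points]

    def gap(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

result = single_linkage([0, 1, 10, 11], 2)
print(result)
```

The nearby points 0 and 1 end up in one class and 10 and 11 in the other, which is exactly the "similar together, dissimilar apart" idea stated above.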