Summary of data analysis methods

    This blog records the summaries I made after studying "Data Analysis Methods" for a semester: the concepts, ideas, and key points of various data analysis methods, together with SAS solutions and the interpretation of their results. (Because our course covered only part of the material, I summarize only the methods I have actually studied.) Where a method is implemented, it is implemented in SAS 9.2, and the examples focus on interpreting the SAS output.

1. Linear regression analysis

Thought

The relationship between a variable Y and related variables X1, X2, …, Xk cannot be known exactly. The value of Y consists of two parts: one part is determined by X1, X2, …, Xk and can be expressed as some functional relationship Y = f(X1, X2, …, Xk); the other part is the influence of many unconsidered factors, which is regarded as a random error and denoted ε. Thus:
Y = f(X1,X2,…,Xk) + ε

     In medicine, height and weight, body temperature and pulse rate, age and blood pressure, and drug dosage and efficacy are all related. Explaining how closely objective things or phenomena are related, and expressing that closeness with appropriate statistical indicators, is the task of correlation analysis. Expressing the relationship between objective things or phenomena in functional form is the problem regression analysis solves.

     Regression analysis determines the relationship between one continuous variable and other continuous variables, for the purposes of interpretation and prediction.

    Example

  We want to understand the relationship between the annual salary Y of a researcher at a certain research institute and his paper-quality index X1, years of service X2, and funding index X3. The survey data of 24 researchers appear in the data step below.
  Assuming that the errors follow a normal distribution, establish a regression equation; then, for a person with observations (x01, x02, x03) = (5.1, 20, 7.2), predict the annual salary and give a 95% interval.

The running code is as follows:

data examp2_3;
input y x1-x3 @@;
cards;
33.2 3.5 9  6.1
40.3 5.3 20 6.4
38.7 5.1 18 7.4
46.8 5.8 33 6.7
41.4 4.2 31 7.5
37.5 6.0 13 5.9
39.0 6.8 25 6.0
40.7 5.5 30 4.0
30.1 3.1 5  5.8
52.9 7.2 47 8.3
38.2 4.5 25 5.0
31.8 4.9 11 6.4
43.3 8.0 23 7.6
44.1 6.5 35 7.0
42.8 6.6 39 5.0
33.6 3.7 21 4.4
34.2 6.2 7  5.5
48.0 7.0 40 7.0
38.0 4.0 35 6.0
35.9 4.5 23 3.5
40.4 5.9 33 4.9
36.8 5.6 27 4.3
45.2 4.8 34 8.0
35.1 3.9 15 5.0
  .  5.1 20 7.2
;
run;
proc reg data=examp2_3;
model y=x1-x3/i r cli clm;
output out=d h=f;
run;
/* model y=x1-x3 fits the linear regression of y on x1-x3
i prints the inverse matrix (X'X)^(-1)
r prints residual and influence diagnostics, including the standard errors of the fitted values, the residuals, the studentized residuals, Cook's distance, etc.
cli clm print the 95% prediction interval for an individual response and the 95% confidence interval for the mean response
output out=d h=f saves the leverage values h_i = x_i'(X'X)^(-1)x_i to data set d
*/

Main running results:

                                              The REG Procedure
                                                 Model: MODEL1
                                             Dependent Variable: y

                                              Parameter Estimates

                                           Parameter       Standard
                      Variable     DF       Estimate          Error    t Value    Pr > |t|

                      Intercept     1       17.84693        2.00188       8.92      <.0001
                      x1            1        1.10313        0.32957       3.35      0.0032
                      x2            1        0.32152        0.03711       8.66      <.0001
                      x3            1        1.28894        0.29848       4.32      0.0003

As can be seen, all the p-values are less than 0.05, so every parameter estimate in the model is significant. If one or more coefficients had failed the test, a different equation should be sought, or the model transformed based on an analysis of the residual plots (a common data-transformation method is the Box-Cox transformation).
The fitted regression model is: Y = 17.84693 + 1.10313X1 + 0.32152X2 + 1.28894X3
Confidence interval estimates:

                Dependent  Predicted     Std Error                                                            Std Error    Student
        Obs     Variable       Value  Mean Predict       95% CL Mean          95% CL Predict      Residual    Residual   Residual

  17    34.2000    34.0262        0.9062    32.1359    35.9164    29.9103    38.1420     0.1738      1.500     0.116
  18    48.0000    47.4522        0.6708    46.0530    48.8515    43.5374    51.3670     0.5478      1.619     0.338
  19    38.0000    41.2463        0.7798    39.6197    42.8729    37.2446    45.2480    -3.2463      1.570    -2.068
  20    35.9000    34.7173        0.7960    33.0568    36.3778    30.7017    38.7328     1.1827      1.562     0.757
  21    40.4000    41.2814        0.6008    40.0280    42.5347    37.4163    45.1464    -0.8814      1.647    -0.535
  22    36.8000    38.2479        0.6460    36.9005    39.5954    34.3514    42.1445    -1.4479      1.629    -0.889
  23    45.2000    44.3852        0.8309    42.6520    46.1184    40.3390    48.4313     0.8148      1.543     0.528
  24    35.1000    33.4166        0.5819    32.2029    34.6304    29.5643    37.2690     1.6834      1.653     1.018
  25          .    39.1837        0.5639    38.0073    40.3600    35.3429    43.0244          .          .         .

For (x01, x02, x03) = (5.1, 20, 7.2), the predicted value of y0 is 39.1837, and the 95% prediction interval (the 95% CL Predict columns) is (35.3429, 43.0244).

2. Analysis of variance

Thought

Separate the random error and the systematic error from the total variation of the data.
Then compare the systematic error with the random error under the given conditions. If the two are of similar size, the factor is considered to have little influence on the index; if the systematic error is much larger than the random error, the condition (factor) has a large influence.

    A frequently encountered problem: there are several different raw materials, and we want to examine whether they have a significant effect on product quality.

    Or: a new drug and several traditional drugs are tested on different groups to examine whether the cure rates of the different drugs differ significantly. Here, the conditions being examined, such as the raw materials or the drugs, are called factors.

    When only one factor is examined, we call it a single-factor (one-way) problem. If two or more factors are considered at the same time, it is called multi-factor analysis of variance (which is computationally more complex). A minimal one-way sketch in SAS is given below.
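
A minimal sketch of a one-way analysis of variance in SAS (the data set examp_anova, the factor a, and the response y are hypothetical values made up for illustration):

data examp_anova;
input a y @@;
cards;
1 25.6 1 22.2 1 28.0 1 29.8
2 24.4 2 30.0 2 29.0 2 27.5
3 25.0 3 27.7 3 23.0 3 32.2
;
run;
proc anova data=examp_anova;
class a;    /* a is the factor, e.g. which raw material was used */
model y=a;  /* splits the total variation of y into between-group (systematic) and within-group (random) parts */
run;

The F test in the output compares the between-group mean square (systematic error) with the within-group mean square (random error); a large F value indicates the factor has a significant effect.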

3. Principal component analysis

Thought

Need and possibility (dimension reduction): In practical problems, in order to obtain relevant information as completely as possible, it is often necessary to consider numerous variables. Although this avoids omitting important information, it increases the complexity of the analysis. Generally, the many variables involved in the same problem are correlated to some extent, and this correlation makes the information carried by the variables "overlap". If a small number of new variables with non-overlapping information can reflect most of the information provided by the original variables, the problem can be solved by analyzing only these few new variables.

    Principal component analysis and canonical correlation analysis are both statistical methods for processing high-dimensional data under this dimensionality-reduction idea; both extract information by constructing appropriate linear combinations of the original variables. Principal component analysis focuses on the "dispersion" (variance) information of the variables. Its main purpose is to "transform" the original variables so as to reduce their dimension as much as possible without losing too much of their information, i.e., to replace the original variables with fewer "new variables". This involves:

(1) dimensionality reduction of the variables;

(2) interpretation of the principal components.

Geometric meaning

From an algebraic point of view, the principal components are special linear combinations of the p variables; from a geometric point of view, these linear combinations define a new coordinate system obtained by rotating the original coordinate system formed by X1,…,Xp, and the new coordinate axes point in the directions of the largest sample variation (i.e., the largest sample variance).

Suppose there are n observations, each with p variables X1,…,Xp; their composite indices (the principal components) are denoted Y1,…,Yp.

In general, the p variables form a p-dimensional space, and the n sample points are n points in that space. For p-variate normally distributed variables, finding the principal components amounts to finding the principal axes of the concentration ellipsoid in the p-dimensional space.

How the principal components are found

The idea of principal component analysis is to construct a series of linear combinations of the original variables that maximize the (sample) variance.

The method is to find all the eigenvalues of the covariance matrix (or the correlation matrix) and the corresponding orthonormal eigenvectors; the variance of the kth principal component is the kth eigenvalue when sorted from largest to smallest, and its coefficients are the corresponding orthonormal eigenvector. A short SAS sketch is given below.
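
A minimal sketch of principal component analysis in SAS (the data set examp_pca and the variables x1-x4 are hypothetical):

proc princomp data=examp_pca out=pcscores;
var x1-x4;   /* by default PROC PRINCOMP works on the correlation matrix; add the cov option to use the covariance matrix */
run;

The output lists the eigenvalues (the variances of the principal components, sorted from largest to smallest), the proportion of total variance each one explains, and the eigenvectors (the coefficients of each principal component); out=pcscores saves the component scores.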

4. Canonical Correlation Analysis

Concept

Canonical correlation analysis is a statistical method for identifying and quantifying the correlation between two groups of variables; it can effectively reveal the linear dependence between the two groups.

Canonical correlation analysis is based on identifying and quantifying statistical correlations between two groups of variables.

Thought

Construct an appropriate linear combination within each group of variables, so that the correlation between the two groups of variables is reduced to the correlation between two derived variables; this completes the dimensionality reduction. A short SAS sketch follows.
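
A minimal sketch in SAS (the data set examp_cc and the two groups x1-x3 and y1-y2 are hypothetical):

proc cancorr data=examp_cc;
var  x1-x3;   /* first group of variables */
with y1-y2;   /* second group of variables */
run;

PROC CANCORR reports the canonical correlations between the pairs of linear combinations, together with the coefficients defining each canonical variable.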

5. Discriminant Analysis

Concept

Based on historically classified data and some optimality criterion, determine a discriminant rule that decides which class a new sample belongs to.

Idea and model

    There are n samples, and p items (variables) are measured on each sample. Each sample is known to belong to one of k categories (populations) G1, G2, …, Gk, whose distribution functions are F1(x), F2(x), …, Fk(x).
    We hope to use these data to find a discriminant function with certain optimal properties, one that separates sample points belonging to different categories as far as possible, so that for a new sample on which the same p indicators (variables) are measured, we can determine which class it belongs to.

Methods

  1. Discriminant analysis is rich in content and has many methods.
  2. By the number of populations to be distinguished, there are two-population discriminant analysis and multi-population discriminant analysis;
  3. by the mathematical model used to separate the populations, there are linear discrimination and nonlinear discrimination;
  4. by how the variables are handled during discrimination, there are stepwise discrimination and sequential discrimination.
  5. Discriminant analysis can pose the question from different angles, so there are different discriminant criteria, such as the minimum Mahalanobis distance criterion, the Fisher criterion, the minimum average loss criterion, the least squares criterion, the maximum likelihood criterion, and the maximum probability criterion, each leading to a different discrimination method. A short SAS sketch is given after this list.
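
A minimal sketch of (linear) discriminant analysis in SAS (the data sets train and newobs, the class variable g, and the variables x1-x2 are hypothetical):

proc discrim data=train testdata=newobs testout=classified;
class g;     /* known class labels in the training data */
var x1-x2;   /* the p measured variables */
run;

The training data build the discriminant rule, and testout=classified stores the predicted class of each new sample.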

6. Cluster analysis

Thought

Cluster analysis is a numerical classification method (i.e., based solely on relationships in the data). To carry out a cluster analysis, one must first establish an index system composed of some attributes of the things being classified, i.e., a set of variables. Each selected index must describe some aspect of the things' attributes; taken together, the indices form a complete index system that jointly describes the characteristics of the things.

A complete index system means that the selected indices are sufficient: any newly added variable would make no significant contribution to distinguishing the things. If the selected indices are incomplete, the classification will be biased. For example, to classify families by parenting style, there must be a series of variables describing parenting style, and these variables must fully reflect how children are raised in different families.

Simply put, the results of cluster analysis depend on both the selection of variables and the acquisition of variable values. The more accurate the selection of variables and the more reliable the measurement, the more the classification results obtained can describe the essential differences between various types of things.

Description

    Cluster analysis proceeds entirely from the data. For a data file consisting of n cases and k variables, clustering the cases is equivalent to grouping n points in a k-dimensional coordinate system based on their distances; clustering the variables is equivalent to grouping k points in an n-dimensional coordinate system, again based on distances. Thus distance, or degree of similarity, is the basis of cluster analysis.

    In a word, cluster analysis uses the many observed indicators of a batch of samples to compute, by some mathematical formula, the similarity between samples (or between indicators), putting similar samples or indicators into the same class and dissimilar ones into different classes. A short SAS sketch follows.
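
A minimal sketch of hierarchical clustering in SAS (the data set examp_cl, the identifier variable name, and the variables x1-x3 are hypothetical):

proc cluster data=examp_cl method=ward std outtree=tree;
var x1-x3;   /* std standardizes the variables before computing distances */
id name;     /* hypothetical case identifier */
run;
proc tree data=tree nclusters=3 out=clusout noprint;  /* cut the dendrogram into 3 clusters */
run;

method=ward is one of several linkage choices (average, centroid, single, complete, …); the appropriate number of clusters is judged from the dendrogram and the clustering history.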
