Introduction to canonical correlation analysis
- Use: a multivariate statistical method for studying the correlation between two sets of variables, each of which may contain several indicators. It reveals the intrinsic relationship between the two sets.
Canonical correlation analysis steps
- ① Within each set of variables, find a linear combination such that the two linear combinations (one per set) have the largest possible correlation coefficient;
- ② Among the linear combinations that are uncorrelated with the pair already selected, again choose the pair with the largest correlation coefficient;
- ③ Repeat until the correlation between the two sets of variables has been fully extracted.
- Deciding whether extraction is complete: test each canonical correlation coefficient in turn and stop once a coefficient is no longer significant.
- Concept introduction
- Canonical variables: the selected pairs of linear combinations
- Canonical correlation coefficient: the correlation coefficient between a pair of canonical variables. ⇒ Measures the strength of the relationship between the two sets of variables.
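The steps above can be sketched numerically. This is a minimal sketch, not the SPSS procedure: it assumes NumPy is available, the function name `cca` is made up for illustration, and the data are synthetic. Each set is whitened with a Cholesky factor and a singular value decomposition is taken, so the singular values are the canonical correlation coefficients and the columns of `A` and `B` hold the weights of the linear combinations.

```python
import numpy as np

def cca(X, Y):
    """Canonical correlations and weights for data matrices X (n x p) and Y (n x q)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Sample covariance blocks.
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    # Cholesky factors: Sxx = Lx Lx^T, Syy = Ly Ly^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    # K = Lx^{-1} Sxy Ly^{-T}; its singular values are the canonical correlations.
    K = np.linalg.solve(Lx, Sxy)
    K = np.linalg.solve(Ly, K.T).T
    P, rho, Qt = np.linalg.svd(K, full_matrices=False)
    # Map the singular vectors back to weights on the original variables.
    A = np.linalg.solve(Lx.T, P)
    B = np.linalg.solve(Ly.T, Qt.T)
    return rho, A, B

# Synthetic data (illustration only): the first column of each set shares a signal.
rng = np.random.default_rng(0)
n = 200
shared = rng.normal(size=n)
X = np.column_stack([shared + rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([shared + rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)])

rho, A, B = cca(X, Y)
U = (X - X.mean(axis=0)) @ A   # canonical variables of the first set
V = (Y - Y.mean(axis=0)) @ B   # canonical variables of the second set
print(np.round(rho, 3))
```

The first columns of `U` and `V` are the first pair from step ①. Later columns are uncorrelated with the earlier ones, which is the condition in step ②.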
Example demonstration: key steps of canonical correlation analysis
(1) Distributional assumption: the two sets of data follow a joint normal distribution.
- In the paper, it can simply be stated that the two sets of data follow a (joint) normal distribution.
(2) First, test whether the two sets of variables are correlated (by constructing a likelihood ratio statistic).
- A p-value below 0.05 (0.1) means the null hypothesis is rejected at the 5% (10%) significance level, i.e. with 95% (90%) confidence, so the two sets of variables are considered correlated.
- This test is optional, because the canonical correlation coefficients are tested later anyway (if the first canonical correlation coefficient is significant, the same conclusion follows).
(3) Determine the number of canonical variable pairs (just look at the p-value of each canonical correlation coefficient).
(4) Use the standardized canonical variables to analyze the problem.
(5) Carry out canonical loading analysis.
- Canonical loadings describe correlations in the data: the correlation between each composite indicator (canonical variable) and each original indicator.
(6) Calculate the contribution of the first r canonical variables to the total sample variance.
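The significance checks in steps (2) and (3) can be sketched as follows. This is only a sketch under stated assumptions: it uses Bartlett's chi-square approximation to the likelihood ratio (Wilks' lambda) statistic, which may differ from the approximation SPSS reports. The correlation values and sample size are made up for illustration, and the function names are hypothetical. Only the standard library is used; `scipy.stats.chi2.sf` could replace the hand-written tail probability.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi2 with integer df > x), via the incomplete-gamma recurrence."""
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)                 # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))      # closed form for df = 1
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

def sequential_tests(rhos, n, p, q):
    """Test H0: canonical correlations k+1, k+2, ... are all zero, for k = 0, 1, ..."""
    factor = n - 1 - (p + q + 1) / 2.0
    rows = []
    for k in range(len(rhos)):
        wilks = math.prod(1.0 - r * r for r in rhos[k:])
        stat = -factor * math.log(wilks)
        df = (p - k) * (q - k)
        rows.append({"first_tested": k + 1, "wilks": wilks,
                     "chi2": stat, "df": df, "p": chi2_sf(stat, df)})
    return rows

# Made-up canonical correlations for two sets of 3 variables and 30 observations.
rows = sequential_tests([0.9, 0.3, 0.1], n=30, p=3, q=3)

# Keep canonical pairs until the first non-significant test.
n_keep = 0
for row in rows:
    if row["p"] >= 0.05:
        break
    n_keep += 1

for row in rows:
    print(row)
print("pairs kept:", n_keep)
```

The first row is the overall test from step (2). The later rows give the p-values used in step (3) to decide how many pairs to keep.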
The specific SPSS operations are as follows.
Specific examples
Question analysis
- Goal: explore the relationship between the opinions of viewers and of industry insiders on some TV programs.
- The first set of variables (audience ratings): surveys of viewers with less education (led), viewers with more education (hed), and online viewers (net);
- The second set of variables (ratings by industry insiders): artists (arti), including actors and directors; distributors (com); and heads of various departments in the industry (man).
- Idea: analyzing the variables pair by pair makes it hard to get a clear picture of the relationship between the two sets (audience and industry insiders). ⇒ Reduce the correlation among many variables to the correlation between two representative variables.
- Choosing the representatives: each should summarize its own set as comprehensively as possible. ⇒ The simplest composite of a set of variables is a linear combination of those variables.
SPSS operation steps
Note: SPSS version 24 or later is required (earlier versions cannot run canonical correlation analysis directly from the menus and need programming instead).
Step 1: Import the data from Excel into SPSS.
Step 2: Check the measurement level of each variable (all are set to "Scale" here).
- Scale: continuous numeric variables (such as height or weight)
- Ordinal: ordered categorical variables (such as grades A, B, C, D, or good/bad)
- Nominal: unordered categorical variables (such as male/female)
Step 3: In the menus, select Canonical Correlation.
Step 4: Move the variables into their corresponding sets.
- Note that Python must be installed for this to run.
Step 5: Export the analysis results.
Step 6: Analyze the results.
- ① Notes on this step (these explanations do not need to be written in the paper):
- To include the output table in the paper, its five column headers need to be edited.
- Comparing the p-values shows that only the first row is usable (significant) ⇒ this gives the canonical correlation coefficient.
- After obtaining the canonical correlation coefficient, find the canonical variables.
- Unstandardized data are affected by units of measurement, so the data must be standardized before use.
- ② There are three specific parts, a to c (these are the parts that need to be written in the paper).
Step 7: Optionally analyze the canonical loadings and the variance explained.
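The remark in Step 6 about units of measurement can be checked with a small sketch. It assumes NumPy, uses synthetic data, and the helper names `cca_x_weights` and `zscore` are made up for illustration. It is not what SPSS runs internally. The idea shown: changing the units of a variable changes its unstandardized weight but not the canonical correlations, and the standardized weight equals the unstandardized weight times that variable's standard deviation.

```python
import numpy as np

def cca_x_weights(X, Y):
    """Canonical correlations and first-set weights (columns), via whitening + SVD."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Lx = np.linalg.cholesky(Xc.T @ Xc / (n - 1))
    Ly = np.linalg.cholesky(Yc.T @ Yc / (n - 1))
    Sxy = Xc.T @ Yc / (n - 1)
    K = np.linalg.solve(Ly, np.linalg.solve(Lx, Sxy).T).T
    P, rho, _ = np.linalg.svd(K, full_matrices=False)
    return rho, np.linalg.solve(Lx.T, P)

def zscore(M):
    """Standardize each column to mean 0 and sample standard deviation 1."""
    return (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)

# Synthetic data (illustration only).
rng = np.random.default_rng(1)
n = 300
shared = rng.normal(size=n)
X = np.column_stack([shared + rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([shared + rng.normal(size=n), rng.normal(size=n)])

# Pretend the first variable is recorded in a unit 1000 times smaller.
X_units = X * np.array([1000.0, 1.0])

rho_raw, A_raw = cca_x_weights(X_units, Y)
rho_std, A_std = cca_x_weights(zscore(X_units), zscore(Y))
sd = X_units.std(axis=0, ddof=1)

print("canonical correlations (raw):         ", np.round(rho_raw, 4))
print("canonical correlations (standardized):", np.round(rho_std, 4))
print("raw weights:\n", A_raw)
print("standardized weights:\n", A_std)
```

The raw weight on the rescaled variable is tiny only because of its unit, which is why the standardized weights are the ones to interpret.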
Canonical loading analysis
Canonical loadings
Definition: canonical loading analysis is the analysis of the correlations between the original variables and the canonical variables.
- Step 6 above yields the standardized canonical variables, and the absolute values of their standardized coefficients give a rough indication of how strongly each original variable is related to the canonical variable. This is not rigorous, though. The rigorous approach is to compute the correlations themselves, which is what canonical loading analysis does.
Canonical loading analysis example:
Cross loadings
- Cross loadings are rarely used. Usually only the canonical loadings are used, to analyze the structure within each set.
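As a rough illustration of the two definitions, the sketch below computes canonical loadings (correlations between the original variables and their own set's canonical variables) and cross loadings (correlations with the other set's canonical variables). It assumes NumPy, uses synthetic data rather than the TV ratings, and the function names are hypothetical.

```python
import numpy as np

def cca_scores(X, Y):
    """Canonical correlations and canonical variable scores (U for X, V for Y)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Lx = np.linalg.cholesky(Xc.T @ Xc / (n - 1))
    Ly = np.linalg.cholesky(Yc.T @ Yc / (n - 1))
    Sxy = Xc.T @ Yc / (n - 1)
    K = np.linalg.solve(Ly, np.linalg.solve(Lx, Sxy).T).T
    P, rho, Qt = np.linalg.svd(K, full_matrices=False)
    U = Xc @ np.linalg.solve(Lx.T, P)
    V = Yc @ np.linalg.solve(Ly.T, Qt.T)
    return rho, U, V

def correlations(M, S):
    """Pearson correlations between every column of M and every column of S."""
    Mz = (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)
    Sz = (S - S.mean(axis=0)) / S.std(axis=0, ddof=1)
    return Mz.T @ Sz / (M.shape[0] - 1)

# Synthetic data (illustration only): three variables per set.
rng = np.random.default_rng(2)
n = 250
shared = rng.normal(size=n)
X = np.column_stack([shared + rng.normal(size=n),
                     0.5 * shared + rng.normal(size=n),
                     rng.normal(size=n)])
Y = np.column_stack([shared + rng.normal(size=n),
                     rng.normal(size=n),
                     rng.normal(size=n)])

rho, U, V = cca_scores(X, Y)
x_loadings = correlations(X, U)        # rows: X variables, columns: canonical variables
x_cross_loadings = correlations(X, V)  # same rows, but against the other set's variables

print(np.round(x_loadings, 3))
print(np.round(x_cross_loadings, 3))
```

In this construction each cross loading equals the matching canonical loading multiplied by that pair's canonical correlation, which is one reason the cross loadings add little beyond the loadings themselves.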
Canonical redundancy analysis (rarely used)
- Canonical redundancy analysis: compute the proportion of variance explained by each of the three canonical variables, to see how strong their explanatory power is.
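A minimal sketch of this calculation follows, assuming NumPy and synthetic data. The function names are made up, and the redundancy formula used is the common "mean squared loading times squared canonical correlation" definition, which may be laid out differently in SPSS output. The proportion of a set's standardized variance explained by one of its canonical variables is the mean of the squared loadings in that column.

```python
import numpy as np

def cca_scores(X, Y):
    """Canonical correlations and canonical variable scores (U for X, V for Y)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Lx = np.linalg.cholesky(Xc.T @ Xc / (n - 1))
    Ly = np.linalg.cholesky(Yc.T @ Yc / (n - 1))
    Sxy = Xc.T @ Yc / (n - 1)
    K = np.linalg.solve(Ly, np.linalg.solve(Lx, Sxy).T).T
    P, rho, Qt = np.linalg.svd(K, full_matrices=False)
    U = Xc @ np.linalg.solve(Lx.T, P)
    V = Yc @ np.linalg.solve(Ly.T, Qt.T)
    return rho, U, V

def loadings(M, S):
    """Correlations between the columns of M (variables) and S (canonical variables)."""
    Mz = (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)
    Sz = (S - S.mean(axis=0)) / S.std(axis=0, ddof=1)
    return Mz.T @ Sz / (M.shape[0] - 1)

# Synthetic data (illustration only): three variables per set, so three canonical pairs.
rng = np.random.default_rng(3)
n = 250
shared = rng.normal(size=n)
X = np.column_stack([shared + rng.normal(size=n),
                     0.5 * shared + rng.normal(size=n),
                     rng.normal(size=n)])
Y = np.column_stack([shared + rng.normal(size=n),
                     rng.normal(size=n),
                     rng.normal(size=n)])

rho, U, V = cca_scores(X, Y)
x_load = loadings(X, U)

# Share of the first set's variance explained by each of its own canonical variables.
var_explained = (x_load ** 2).mean(axis=0)
# Redundancy: share of the first set's variance explained by the other set's canonical variables.
redundancy = var_explained * rho ** 2

print("variance explained:", np.round(var_explained, 3))
print("cumulative:        ", np.round(np.cumsum(var_explained), 3))
print("redundancy:        ", np.round(redundancy, 3))
```

With three variables and three canonical variables the explained proportions add up to 1, so the cumulative row shows the contribution of the first r canonical variables.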
After-class exercises
- Explore the relationship between the opinions of viewers and of industry insiders on some TV programs. Use canonical correlation analysis to answer this question and write a short paper.
- Audience ratings come from three surveys: viewers with less education (led), viewers with more education (hed), and online viewers (net). These form the first set of variables;
- Ratings from industry insiders come from three groups: artists (arti), including actors and directors; distributors (com); and heads of various departments in the industry (man). These form the second set of variables.
- Read the paper "Evaluation of Wine", a first-prize paper on Problem A of the 2012 mathematical modeling contest.
- Other video explanations: SPSS canonical correlation analysis
Postscript
- The reference course is by Qingfeng Mathematical Modeling on Bilibili. The above are only personal notes taken after studying it.
- The example demonstration comes from: Xiamen University, Multivariate Statistical Analysis, Chapter 9, Canonical Correlation Analysis.ppt