Unsupervised learning in R: principal component analysis (PCA) and visualization

Original link: http://tecdat.cn/?p=9839


Overview

In supervised learning, we typically have access to n observations on a set of p features, together with a response Y measured on those same observations.

Unsupervised learning refers to methods that do not involve a response variable Y. Here we focus on two techniques:

  • Principal component analysis: a tool for data visualization or for preprocessing data before applying supervised learning methods.
  • Clustering: a broad class of methods for discovering unknown subgroups in data.

Unsupervised learning challenges

In general, unsupervised learning is more challenging than supervised learning because it is more subjective: there is no simple goal for the analysis, such as predicting a response. Unsupervised learning is typically used as part of exploratory data analysis. Furthermore, since there is no generally accepted mechanism such as cross-validation for validating results, it is hard to assess the accuracy of what we obtain. In short, without supervision we cannot really check our work beyond theoretical knowledge of the process at hand or simple intuition. Nevertheless, unsupervised methods have many uses:

  • Identifying subgroups among cancer patients in order to better understand the disease.

  • Websites (especially e-commerce sites) often try to recommend products based on your previous activity.

  • Netflix's movie recommendations.

Principal component analysis

When faced with a large set of correlated variables, principal components allow us to summarize the set with a smaller number of representative variables that together explain most of the variability in the original set.

Principal component analysis (PCA) refers to the process of computing the principal components and subsequently using them to understand the data. PCA also serves as a tool for data visualization.

What is a principal component?

Suppose we wish to visualize n observations with measurements on a set of p features as part of an exploratory data analysis. Specifically, we want to find a low-dimensional representation of the data that captures as much of the information as possible. PCA provides a way to do this: it seeks a small number of dimensions that are as interesting as possible, where "interesting" is measured by the amount the observations vary along each dimension.
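Concretely, in the standard notation (which the post does not spell out), the first principal component of features $X_1, \dots, X_p$ is the normalized linear combination

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p, \qquad \sum_{j=1}^{p} \phi_{j1}^{2} = 1,$$

whose loadings $\phi_{11}, \dots, \phi_{p1}$ are chosen to maximize $\mathrm{Var}(Z_1)$; each subsequent component maximizes variance subject to being uncorrelated with the components before it.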

We can also ask how much of the information in the data is lost when we use only a few principal components. To that end, we compute the proportion of variance explained (PVE) by each principal component. This is usually best interpreted cumulatively, so that we can visualize the total variance explained by the first few components together.
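For centered variables, the PVE of the $m$-th component can be written (the standard formula, assumed here) as

$$\mathrm{PVE}_m = \frac{\sum_{i=1}^{n} z_{im}^{2}}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^{2}},$$

where $z_{im}$ is the score of observation $i$ on component $m$; the PVEs of all $p$ components sum to one.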

Determining the number of principal components to be used

Overall, we want the smallest number of principal components that still gives a good understanding of the data. Arguably the best way to decide is to visualize the PVE in a scree plot, which we demonstrate later; it is simply a plot of the PVE (and its cumulative sum) against the component number. Much as when choosing tuning parameters for our other learning techniques, we look for the point where the proportion of variance explained drops off, after which adding another principal component does not really add much explained variance. We can combine this technique with our own understanding of the data.

Most statistical methods can be adapted to use the principal components as predictors, which can sometimes lead to less noisy results.
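The post stops at that remark; as a minimal illustration of the idea (our own sketch, using the USArrests data the post analyzes below, with the object names pc and fit and the choice of Murder as response all assumed), the scores produced by prcomp can be fed straight into lm:

# Sketch: use the first two principal component scores as predictors.
# The response (Murder) and object names are illustrative choices.
pc  <- prcomp(USArrests[, c("Assault", "UrbanPop", "Rape")], scale. = TRUE)
fit <- lm(USArrests$Murder ~ pc$x[, 1:2])
summary(fit)

Regressing on a few leading scores rather than all raw predictors is what can reduce noise: the discarded low-variance directions often carry little signal.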

Visualization

We perform PCA on the USArrests data set. Its rows are named after the 50 states:

states <- rownames(USArrests)
states
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

Its columns contain four variables:

names(USArrests)
## [1] "Murder"   "Assault"  "UrbanPop" "Rape"

Let's summarize the data:

library(knitr)  # for kable()
kable(summary(USArrests))

 

We can see that the variables have very different means and variances. In addition, they are measured on different scales: UrbanPop is a percentage of the state population, while the crime variables are counts per 100,000 individuals. If we do not standardize the data before PCA, the components will be dominated by the variables with the largest variance.
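A quick check in base R (these two calls are our addition; the post only states the conclusion):

apply(USArrests, 2, mean)  # per-column means
apply(USArrests, 2, var)   # per-column variances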

We now perform PCA; the output provides the principal component loadings.
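The post never shows the call itself; reconstructed under the standard prcomp workflow (the object name pr.out is ours, and it is the object the later $sdev discussion refers to):

# scale. = TRUE standardizes each variable before the decomposition,
# which the differing scales noted above make necessary
pr.out <- prcomp(USArrests, scale. = TRUE)
pr.out$rotation  # the matrix of principal component loadings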

From the loadings we can already tell what each principal component represents. For example, the first component appears to contrast the urban population with the crime-related information; intuitively, this is the direction of greatest variation. The second component largely captures the effect of urbanization, while the third and fourth components reflect differences among the individual crime variables.

We can plot the data along the first two principal components.

Biplot
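The plotting call is not shown in the post; a one-line sketch (scale = 0 scales the arrows to represent the loadings):

biplot(pr.out, scale = 0)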

Here we can see a lot of information at once. First look at the axes: PC1 is on the x-axis and PC2 is on the y-axis. The arrows show how each variable moves in these two dimensions, and the black state names show where each state falls along each component. For example, California has both a high crime rate and one of the largest urban populations of any state.

The $sdev element of the output gives the standard deviation of each principal component, so the variance explained by each component is obtained by squaring these values.

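The computation (our reconstruction; the post prints only the result):

pr.var <- pr.out$sdev^2  # squared standard deviations = variances
pr.var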
## [1] 2.4802 0.9898 0.3566 0.1734

Then, to compute the proportion of variance explained by each principal component, we divide each component's variance by the total variance.

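Again our reconstruction of the line behind the printed output (pve is also the object used in the scree-plot code below):

pve <- pr.var / sum(pr.var)  # each variance as a share of the total
pve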
## [1] 0.62006 0.24744 0.08914 0.04336

Here we see that the first PC explains about 62% of the variance in the data, and the second PC about 24.7%. We can also plot this information.

Scree plot

par(mfrow = c(1, 2))  # show the two plots side by side

# Scree plot: proportion of variance explained by each component
plot(pve, xlab = 'Principal Component',
     ylab = 'Proportion of Variance Explained',
     ylim = c(0, 1),
     type = 'b')

# Cumulative proportion of variance explained
plot(cumsum(pve), xlab = 'Principal Component',
     ylab = 'Cumulative Proportion of Variance Explained',
     ylim = c(0, 1),
     type = 'b')

 

If you have any questions, please leave a comment below. 
