Original link: http://tecdat.cn/?p=9839
Overview
In supervised learning, we typically have access to n observations of a set of p features, together with a response Y measured on those same observations.
Unsupervised learning is a set of methods that do not involve a response variable Y. Here we focus on two techniques:
- Principal component analysis: a tool for data visualization, or for pre-processing data before supervised learning methods are applied.
- Clustering: a class of methods for discovering unknown subgroups in data.
Unsupervised learning challenges
In general, unsupervised learning is more challenging than supervised learning because it is more subjective. There is no simple goal for the analysis, such as predicting a response. Unsupervised learning is typically used as part of exploratory data analysis. Moreover, since there is no generally accepted method for cross-validation or for validating results on independent data, it is difficult to assess the accuracy of the results obtained. In short, apart from theoretical knowledge of the process at hand or simple intuition, we have no real way to check our work in the unsupervised setting. Nevertheless, unsupervised methods have many uses:
- Understanding the behavior of cancer by identifying subgroups of patients.
- Websites (especially e-commerce sites) usually try to recommend products based on your previous activity.
- Netflix movie recommendations.
Principal component analysis
When we are faced with a large number of correlated variables, principal components allow us to summarize the set with a smaller number of representative variables that together explain most of the variability in the original set.
Principal component analysis (PCA) is the process of computing the principal components and then using these components to understand the data. PCA can also serve as a tool for data visualization.
What are principal components
Suppose we want to visualize n observations with measurements on a set of p features, as part of an exploratory data analysis. Specifically, we want to find a low-dimensional representation of the data that captures as much of the information as possible. PCA provides a way to do this. PCA seeks a small number of dimensions that are as interesting as possible, where "interesting" is measured by how much the observations vary along each dimension.
We can also measure how much information is lost by using the principal components. To do this, we compute the proportion of variance explained (PVE) by each principal component. This is usually best interpreted as a cumulative plot, so that we can visualize both the PVE of each component and the total variance explained.
Determining the number of principal components to be used
Overall, we want to use the smallest number of principal components that still gives a good understanding of the data. Arguably the best way to decide is to visualize the data in a scree plot, which we demonstrate later. It is simply a plot of the cumulative PVE. As when choosing the best tuning parameter for other learning techniques, we look for the point where the change in percentage drops off, so that adding another principal component no longer adds much explained variance. We can combine this technique with some understanding of the data.
Most statistical methods can be adapted to use principal components as predictors, which can sometimes lead to less noisy results.
Visualization
We now perform PCA on a data set.
The columns of the data set contain four variables.
Let's look at the data.
We can see that the variables have very different means and variances. They are also measured on different scales: UrbanPop, for example, is a percentage, while other variables are counts per 100,000 individuals. If we do not standardize the data, the variables with the largest variances will dominate the principal components.
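The post's own code is not preserved here, so the following is only a minimal sketch of standardization, written in Python with the standard library. The column names and numbers are made up for illustration and are not the post's data.

```python
from statistics import mean, stdev, variance

# Made-up illustrative columns on very different scales (not the post's data).
data = {
    "crime_rate": [120.0, 250.0, 310.0, 90.0, 200.0],
    "urban_pct": [55.0, 70.0, 85.0, 45.0, 65.0],
    "other_rate": [4.0, 9.5, 12.0, 3.0, 7.5],
}

def standardize(column):
    """Center a column to mean 0 and scale it to sample standard deviation 1."""
    m = mean(column)
    s = stdev(column)  # sample standard deviation (n - 1 denominator)
    return [(x - m) / s for x in column]

# Raw means and variances differ by orders of magnitude between columns.
for name, col in data.items():
    print(f"{name:12s} mean={mean(col):8.2f} variance={variance(col):10.2f}")

# After standardization every column has mean 0 and variance 1,
# so no single variable dominates because of its units.
scaled = {name: standardize(col) for name, col in data.items()}
```

In R, the same effect is usually obtained by asking the PCA routine to scale the variables rather than by scaling them by hand.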
Performing PCA gives us the principal component loadings.
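To make the idea of loadings concrete, here is a rough sketch that computes the first two principal components of standardized data in plain Python. It uses power iteration with deflation on the correlation matrix, which is a swapped-in technique chosen to avoid external libraries and is not necessarily what the post's software does. The data and all function names are hypothetical.

```python
import math
from statistics import mean, stdev

# Made-up illustrative columns; these are not the post's data.
columns = {
    "crime_rate": [120.0, 250.0, 310.0, 90.0, 200.0, 160.0],
    "urban_pct": [55.0, 70.0, 60.0, 45.0, 85.0, 75.0],
    "other_rate": [4.0, 9.5, 12.0, 3.0, 7.5, 5.0],
}

def standardize(col):
    m, s = mean(col), stdev(col)
    return [(x - m) / s for x in col]

def correlation_matrix(cols):
    """Covariance of the standardized columns, i.e. the correlation matrix."""
    z = [standardize(c) for c in cols]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def mat_vec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def top_eigenpair(m, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    v = [float(i + 1) for i in range(len(m))]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    value = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return value, v

def principal_components(cols, k):
    """Return k (variance, loading vector) pairs, largest variance first."""
    m = correlation_matrix(cols)
    p = len(m)
    result = []
    for _ in range(k):
        value, vec = top_eigenpair(m)
        result.append((value, vec))
        # Deflation: remove the component just found before searching for the next.
        m = [[m[i][j] - value * vec[i] * vec[j] for j in range(p)] for i in range(p)]
    return result

components = principal_components(list(columns.values()), k=2)
for idx, (value, loadings) in enumerate(components, start=1):
    pairs = ", ".join(f"{name}={w:+.3f}" for name, w in zip(columns, loadings))
    print(f"PC{idx}: sdev={math.sqrt(value):.3f}  loadings: {pairs}")
```

Each loading vector is determined only up to sign, so another implementation may report the same component with every sign flipped.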
We can already get a sense of what each principal component represents. For example, the first component seems to capture the contrast between urban population and the crime-related variables. Because it is the first component, it is, intuitively, the direction of greatest variation. The second component clearly reflects the effect of urbanization, and the third and fourth components capture the remaining differences among the crimes.
We can plot a first view of the principal components.
Biplot
Here we can see a lot of information. First, look at the axes: PC1 is on the x axis and PC2 is on the y axis. The arrows show how the variables move in these two dimensions. The state names in black show where each state falls along the two PCs. For example, California has both a high crime rate and one of the most urban populations in the country.
The $sdev property of the output holds the standard deviation of each component. The variance explained by each component can be computed by squaring these values:
Then, to compute the proportion of variance explained by each principal component, we divide each component's variance by the total variance of all the components.
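The two steps, squaring the standard deviations and dividing by the total, can be sketched as follows. The standard deviations here are assumed values for four components of standardized data, chosen so that the first proportion is in line with the 62% quoted in the text. They are not the post's actual output.

```python
# Assumed component standard deviations (illustrative, not the post's output).
sdev = [1.5748783, 0.9948694, 0.5971291, 0.4164494]

# Variance explained by each component: the square of its standard deviation.
variances = [s ** 2 for s in sdev]

# Proportion of variance explained: each variance divided by the total.
total_variance = sum(variances)
pve = [v / total_variance for v in variances]

for i, (v, p) in enumerate(zip(variances, pve), start=1):
    print(f"PC{i}: variance={v:.4f}  PVE={p:.4f}")
```

With four standardized variables the variances sum to roughly 4, one unit per variable, which is a quick sanity check on any PCA output.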
Here we see that the first PC explains about 62% of the variance in the data, and the second PC explains about 24%. We can also plot this information.
Scree plot
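The original figure is not preserved. As a rough stand-in, the sketch below prints a text version of a scree plot together with the cumulative PVE. The PVE values are hypothetical, and the helper `smallest_k` with its 85% cutoff is an assumption added for illustration: a fixed threshold is one common alternative to judging the elbow by eye, not a rule from the post.

```python
from itertools import accumulate

# Hypothetical PVE values for four components; not the post's output.
pve = [0.62, 0.25, 0.09, 0.04]
cumulative = list(accumulate(pve))

def smallest_k(cumulative_pve, threshold):
    """Smallest number of components whose cumulative PVE reaches the threshold."""
    for k, total in enumerate(cumulative_pve, start=1):
        if total >= threshold:
            return k
    return len(cumulative_pve)

# One bar per component; the bar length is proportional to that component's PVE.
for i, (p, c) in enumerate(zip(pve, cumulative), start=1):
    bar = "#" * round(p * 40)
    print(f"PC{i} {bar:<25} PVE={p:.2f} cumulative={c:.2f}")

# The 0.85 cutoff is arbitrary and only illustrates the idea.
k = smallest_k(cumulative, 0.85)
print(f"Components needed to reach 85% of the variance: {k}")
```

The bars shrink quickly after the second component, which is the drop-off that the scree plot is meant to reveal.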