Data mining and machine learning

Machine Learning:

    Machine learning is a core research area of artificial intelligence. It is commonly defined as: using experience to improve the performance of computer systems. For a computer, "experience" exists in the form of data, so machine learning must analyze and make use of data.

 

    Improving generalization ability is one of the most important problems in machine learning. Generalization ability characterizes how well a learning system adapts to new events: the stronger the generalization ability, the more accurate the system's predictions on events it has not seen before.

    In machine learning we are essentially approximating the true model of a problem (we choose what we believe is a good approximation, and this approximate model is called a hypothesis). The true model, however, is certainly unknown (if we knew it, why would we need machine learning? We could just solve the problem directly with the real model, right? Ha ha). Since the true model is unknown, we have no way of knowing how large the gap is between our chosen hypothesis and the real solution to the problem. For instance, we believe the universe was born in a Big Bang about 15 billion years ago. This hypothesis describes many phenomena we observe, but how far is it from the true model of the universe? Nobody can say, because we simply do not know what the true model of the universe actually is.

    The error between the hypothesis and the real solution of the problem is called the risk (more strictly, the accumulated error is called the risk). After we choose a hypothesis (more concretely, after we obtain a classifier), the true error is unknown, but we can approximate it with quantities we can actually measure. The most intuitive idea is to use the difference between the classifier's results on the sample data and the true results (the samples are already annotated, so those labels are accurate). This difference is called the empirical risk Remp(w).

    Earlier machine learning methods took minimizing the empirical risk as their goal, but it was later found that many functions easily achieve a 100% correct classification rate on the sample set while performing dismally on real data outside it (that is, they have poor generalization ability). In such cases one has chosen a classification function complex enough (its VC dimension is high) to memorize every sample exactly, yet it misclassifies all data outside the sample set. Looking back at the principle of empirical risk minimization, we find that its premise is that the empirical risk really can converge to the true risk (in the jargon, that it is consistent). But can it? The answer is no, because the number of samples is a drop in the bucket compared with the real-world texts to be classified. Minimizing the empirical risk only ensures a small error on this tiny fraction of samples; it certainly cannot guarantee the absence of error on the much larger proportion of real texts.
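As a toy illustration of this failure mode (my own sketch, not from the original text), consider a "classifier" that simply memorizes every training sample. Its empirical risk is zero, yet it has no basis at all for predicting unseen points:

```python
def train_memorizer(samples):
    """Store every (x, label) pair verbatim: an extremely high-capacity model."""
    return dict(samples)

def predict(model, x, default=0):
    # Unseen inputs fall back to an arbitrary default, so the model
    # generalizes poorly no matter how well it fits the training set.
    return model.get(x, default)

train = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
model = train_memorizer(train)

train_accuracy = sum(predict(model, x) == y for x, y in train) / len(train)
print(train_accuracy)  # 1.0: zero empirical risk on the training set

# On a point outside the training set, the prediction is just the default:
print(predict(model, (2, 2)))  # 0, regardless of any underlying pattern
```

The 100% training accuracy tells us nothing about the true risk, which is exactly the gap the text describes.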

Statistical learning therefore introduces the concept of a generalization error bound, meaning that the true risk should be characterized by two parts. The first is the empirical risk, representing the classification error on the given samples; the second is the confidence risk, representing the extent to which we can trust the classifier's results on unknown texts. Obviously, the second part cannot be computed exactly; we can only give an estimated range, so the overall error can only be bounded from above rather than computed exactly (hence the name generalization error bound, not generalization error).

The confidence risk is related to two quantities. One is the number of samples: obviously, the larger the number of samples, the more likely our learned result is correct, and the smaller the confidence risk. The other is the VC dimension of the classification function: obviously, the larger the VC dimension, the worse the generalization, and the larger the confidence risk.

Generalization error bounds of the formula:

R(w) ≤ Remp(w) + Φ(n/h)

In this formula, R(w) is the true risk, Remp(w) is the empirical risk, and Φ(n/h) is the confidence risk (n being the number of samples and h the VC dimension). The goal of statistical learning thus changes from minimizing the empirical risk to minimizing the sum of the empirical risk and the confidence risk, i.e., structural risk minimization.
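A minimal sketch of what structural risk minimization looks like in practice: among nested hypothesis classes of increasing VC dimension h, pick the one minimizing empirical risk plus the confidence term. The penalty used here, sqrt(h·log(n)/n), is only an illustrative stand-in for the true Φ(n/h), and the empirical-risk values are made up for the example:

```python
import math

def structural_risk_choice(empirical_risks, n):
    """empirical_risks[h] = training error of the best hypothesis with VC dim h."""
    def total_risk(h):
        # Placeholder confidence term: grows with capacity h, shrinks with sample size n.
        penalty = math.sqrt(h * math.log(n) / n)
        return empirical_risks[h] + penalty
    return min(empirical_risks, key=total_risk)

# Typical shape: empirical risk falls as capacity h grows, while the penalty rises.
n = 1000
emp = {1: 0.30, 2: 0.15, 4: 0.08, 8: 0.05, 16: 0.04, 32: 0.04}
best_h = structural_risk_choice(emp, n)
print(best_h)  # 4: the sum is minimized at moderate capacity
```

The interesting point is that the chosen class is neither the simplest nor the most complex: the two terms trade off against each other.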

 

 

 precision = (number of true positives among those identified as positive) / (total number identified as positive)
 recall = (number of true positives identified) / (total number of actual positives in the sample)
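These two definitions can be computed directly from binary labels (a small sketch of my own; 1 marks a positive, and the example data is illustrative):

```python
def precision_recall(y_true, y_pred):
    # True positives: predicted positive AND actually positive.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    predicted_pos = sum(p == 1 for p in y_pred)
    actual_pos = sum(t == 1 for t in y_true)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)
print(p, r)  # 0.666...  0.5
```

Here 2 of the 3 predicted positives are correct (precision 2/3), while only 2 of the 4 actual positives were found (recall 1/2).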

 

Data Mining:

   "Data mining" and "knowledge discovery" are generally considered the same thing; on many occasions the two terms are used interchangeably.

    Data mining, as the name implies, means finding useful knowledge in massive data. It can be considered an applied intersection of machine learning and databases: it uses machine learning techniques to analyze massive amounts of data, and database techniques to manage them.

Statistics is involved as well: many statistical learning algorithms usually need further study within machine learning before they become effective data mining algorithms.

    From the perspective of data analysis, most data mining techniques are applications of machine learning techniques, but we cannot therefore conclude that data mining is merely applied machine learning. Traditional machine learning does not take massive data as its research subject: many of its techniques are only suitable for small and medium-scale data, and applying them directly to massive data gives very poor results. Data mining therefore requires these techniques to be specially adapted.

    Take decision trees, for example: a fine machine learning technique that not only generalizes well but whose learned results are also understandable. The traditional approach reads all the data into memory for analysis, which obviously does not work for massive data; at that point special handling is needed, for example by introducing efficient data structures and scheduling of the data.
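One such adaptation can be sketched as follows (my own illustration, not from the original text): rather than loading an entire dataset into memory, the per-class counts that a decision-tree split test needs can be accumulated by streaming the data in fixed-size chunks:

```python
from collections import Counter
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from the iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def class_counts_streaming(rows, chunk_size=2):
    counts = Counter()
    for chunk in chunked(rows, chunk_size):  # only one chunk in memory at a time
        counts.update(label for _, label in chunk)
    return counts

rows = [((1, 0), "yes"), ((0, 1), "no"), ((1, 1), "yes"),
        ((0, 0), "no"), ((1, 0), "yes")]
print(class_counts_streaming(rows))  # Counter({'yes': 3, 'no': 2})
```

Real scalable tree learners go much further (e.g., sampling or approximate split statistics), but the chunked pass captures the basic idea of not requiring the whole dataset in memory.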

   In addition, as an independent discipline, data mining has some "unique" things of its own, for example association analysis. Simply put, association analysis identifies, from large amounts of data, associations that seem strange but are meaningful, such as the classic one between diapers and beer. If, out of 100 customers, 20 buy diapers, and 16 of those 20 also buy beer, this can be written as the association rule "diapers → beer [support = 20%, confidence = 80%]".
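The diaper/beer numbers above can be reproduced with a small sketch (note: following the figures quoted in the text, support is taken here as the fraction of customers buying the antecedent; some texts instead define rule support as the fraction buying both items):

```python
def rule_stats(transactions, antecedent, consequent):
    with_a = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_a if consequent in t]
    support = len(with_a) / len(transactions)          # fraction buying antecedent
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence

# 100 customers: 20 buy diapers, and 16 of those also buy beer.
transactions = [{"diapers", "beer"}] * 16 + [{"diapers"}] * 4 + [{"milk"}] * 80
s, c = rule_stats(transactions, "diapers", "beer")
print(f"diapers -> beer [support = {s:.0%}, confidence = {c:.0%}]")
# diapers -> beer [support = 20%, confidence = 80%]
```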

Reproduced from: https://www.cnblogs.com/GuoJiaSheng/p/3851034.html
