Machine Learning Cornerstone 16: The Power of Three (Course Summary)

This section introduces three important principles in machine learning: Occam's razor, sampling bias, and data snooping. It then summarizes what the 16 lectures of the course have covered.



16. Three Learning Principles

16.1 Occam’s Razor

Entities must not be multiplied beyond necessity. - William of Occam (1287-1347)

This is the origin of Occam's razor. Summarized in one sentence: "If not necessary, do not add entities."

In machine learning, Occam's razor means we should choose the simplest model design that does the job.

Look at an example of model selection:
[Figure: the same data fit by a simple model (left, a few errors) and a complex model (right, no errors)]

Although the model on the left makes some errors, it is simple and good enough; the model on the right classifies every sample correctly but is overly complicated. By Occam's razor, the model on the left is the one to choose. This raises two questions:

  • How is "simple" defined for a model?
  • Why are simple models better than complex models? (Why does Occam's razor apply to machine learning?)

Let's look at the first question: how is "simple" defined?
[Figure: a "simple" hypothesis described by a center and radius versus a complex hypothesis with many high-order terms]
As can be seen from the above figure, "simple" has two aspects:

  • For a single hypothesis $h$, fewer parameters means simpler. In the figure above, the classification boundary on the left can be described by just a center and a radius, so it has few parameters and is "simple"; the boundary on the right contains many high-order terms and far more parameters, so it is not simple.
  • For a hypothesis space $H$, fewer effective hypotheses means simpler.

The two notions are closely related. If each hypothesis $h$ in a space $H$ can be described by $\ell$ bits of parameters, then the space contains at most $|H| = 2^\ell$ hypotheses; and if the model complexity $\Omega(H)$ of the space is small, the complexity $\Omega(h)$ of any single hypothesis in it is also small.
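
To make the counting argument concrete, here is a toy illustration (my own, not from the lecture): if every hypothesis is identified by $\ell$ binary parameters, enumerating all bit strings enumerates the whole space, giving exactly $2^\ell$ hypotheses.

```python
from itertools import product

# Toy model: each hypothesis is identified by l binary parameters,
# so listing every bit string of length l lists the whole space.
l = 10
H = list(product([0, 1], repeat=l))

print(len(H))            # 1024
assert len(H) == 2 ** l  # |H| = 2^l grows exponentially with description length
```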

Next, let's look at the second question: Why is a simple model better than a complex model?

Suppose a data set has very poor regularity, for example the samples are labeled completely at random. In that case there are few (perhaps no) hypotheses that can achieve $E_{in} \approx 0$. Conversely, if some model can separate a data set, the data set's regularity cannot be particularly bad. So when a simple model roughly separates the data, we can conclude that the data set carries some regularity; but when a complex model separates it, we cannot tell whether the data is genuinely regular or the model is merely complex enough to separate anything. Therefore, in practical applications, you should try to choose a simple model first, such as the simplest linear model.
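
A minimal sketch of this argument (my own illustration with scikit-learn and synthetic data, not from the lecture): fit the same simple linear model to labels generated by a linear rule and to purely random labels, then compare training accuracy.

```python
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Structured labels: a true linear rule generates y, so the data is regular.
y_structured = np.sign(X @ np.array([1.5, -2.0]) + 0.1)

# Random labels: no regularity at all.
y_random = rng.choice([-1, 1], size=200)

for name, y in [("structured", y_structured), ("random", y_random)]:
    acc = Perceptron(max_iter=1000).fit(X, y).score(X, y)
    print(f"{name:10s} training accuracy: {acc:.2f}")

# The simple model fits the structured labels almost perfectly but stays near
# chance level on the random labels: when a simple model separates the data,
# that by itself is evidence the data carries regularity.
```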


Exercise 1:
[Figure: quiz question]


16.2 Sampling Bias

An example: the two leading candidates in the 1948 US presidential election were Truman and Dewey. A newspaper conducted a telephone poll, asking people whether they would vote for Truman or Dewey. After tallying a large number of calls, Dewey led Truman, so the paper ran the front-page headline "Dewey Defeats Truman" just before the official results came out. But Truman won the election.

Why was the outcome the exact opposite of the telephone poll? Telephones were expensive at the time, so relatively few families owned one; most people with a phone supported Dewey, while most people without one supported Truman. In other words, the sample was biased toward the wealthy and was not broadly representative, which created the illusion that Dewey had more support.

This example shows that how the sample is drawn affects the result. In one sentence: if the data is sampled in a biased way, learning will produce a similarly biased result. This is called sampling bias. Technically, the training data and the verification (test) data must follow the same distribution, ideally independent and identically distributed, so that the trained model is representative.
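
To see the effect numerically, here is a small simulation (my own illustration; the ownership rate and support rates below are made up): estimating Dewey's support from phone owners alone badly overstates it.

```python
import numpy as np

rng = np.random.default_rng(1948)
n = 100_000

# Hypothetical population: 30% own a phone, and phone ownership is
# correlated with supporting Dewey (all numbers invented for illustration).
has_phone = rng.random(n) < 0.30
p_dewey = np.where(has_phone, 0.65, 0.35)  # phone owners lean Dewey
votes_dewey = rng.random(n) < p_dewey

print("true Dewey share:      ", votes_dewey.mean())             # ~0.44, Truman wins
print("phone-poll Dewey share:", votes_dewey[has_phone].mean())  # ~0.65, poll picks Dewey
```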


Exercise 2:
[Figure: quiz question]


16.3 Data Snooping

As mentioned earlier, when choosing a model you should try to avoid data snooping, because peeking at the data biases you toward particular models instead of letting the choice be made on independent grounds. Ideally, the choice should be made without looking at the original data at all. Any process that uses the data, even indirectly, counts as snooping; it silently adds model complexity to whatever selection or decision follows and contaminates the result. Here is an example.

Suppose we have 8 years of currency exchange data and want to use it to predict the movement of the exchange rate. The first 6 years are used as training data and the last 2 years as test data. The model takes 20 days of data as input and predicts the exchange rate on the 21st day.
[Figure: profit over the final 2 years; red curve from the model built on all 8 years of data, blue curve from the model built on the first 6 years only]
In the figure above, the red line is the profit over those 2 years for a model built using all 8 years of data; the blue line is the profit for a model built using only the first 6 years.

From the curves, the model trained on all 8 years of data earns a larger return over the last 2 years and appears to work better. But this is self-deception: the last 2 years of data were already used during training, so using this model to "predict" those 2 years is unscientific. This practice indirectly snoops the data.
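
Beyond literally training on the test period, snooping often sneaks in through preprocessing. A minimal sketch (my own example with synthetic data, not from the lecture): computing normalization statistics on the full 8 years leaks information about the test years into training; the clean version fits the scaler on the first 6 years only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for 8 "years" of daily exchange rates (a random walk).
rates = np.cumsum(rng.normal(size=8 * 365)).reshape(-1, 1)
train, test = rates[:6 * 365], rates[6 * 365:]

# Snooped: scaler statistics computed on ALL 8 years, so the test years leak in.
snooped_scaler = StandardScaler().fit(rates)

# Clean: scaler statistics come from the 6 training years only.
clean_scaler = StandardScaler().fit(train)

print("snooped mean:", snooped_scaler.mean_[0])  # the two means differ, showing
print("clean mean:  ", clean_scaler.mean_[0])    # the test years influenced the first
train_s, test_s = clean_scaler.transform(train), clean_scaler.transform(test)
```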

Avoiding data snooping is very important in machine learning, but avoiding it completely is very difficult. In practice, a few habits help:

  • "Invisible" data. When choosing a model, try to use experience and knowledge to make a selection, rather than selecting through data. Choose the model first, then look at the data.
  • Stay skeptical. Always be vigilant and skeptical of others' papers or research results, and choose models through your own research and testing so that you can get a more correct conclusion.



Exercise 3:
[Figure: quiz question]


16.4 Power of Three

This section wraps up the machine learning cornerstone series (16 lectures, 4 parts each). Because the course content keeps coming back to the number "3", the subsection is titled "Power of Three".

The course covered the following aspects of machine learning:

Three application areas


  • Data Mining
  • Artificial Intelligence
  • Statistics

Three theoretical bounds

  • Hoeffding's inequality
  • Multi-bin Hoeffding's inequality
  • VC bound



Three linear models

  • Perceptron learning algorithm / pocket algorithm (PLA/pocket)
  • Linear regression
  • Logistic regression



Three key tools

  • Feature Transform
  • Regularization
  • Cross validation



Three important principles

  • Occam's Razor
  • Sampling Bias
  • Data Snooping



Three future directions

[Figure: three future directions, previewing the machine learning techniques course]


The notes for Lin Xuantian's machine learning cornerstone course are now complete.

Next, I will continue to study and organize notes for Mr. Lin's machine learning techniques course.



