[Watermelon Book] Zhou Zhihua's "Machine Learning" study notes and exercises (1)


【Chapter 1 Introduction】

1.1 Introduction

Learning algorithm: the main object of study in machine learning, i.e., an algorithm that produces a "model" from data on a computer.
The role of a learning algorithm: 1. to generate a model from the supplied empirical data;
2. when faced with a new situation, the model can provide a corresponding judgment.
Model: Refers generally to the results learned from data.
Learner: the instantiation of a learning algorithm on a given data and parameter space.

1.2 Basic terminology

To do machine learning, you need data.
Data set: A collection of records.
Instance/sample/feature vector: each record (the description of an event or object), or equivalently each point in the sample space (corresponding to a coordinate vector).
Attribute/feature: something that reflects the performance or nature of an event or object in a certain respect.
Attribute value: the value taken on an attribute.
Attribute space/sample space/input space: the space spanned by the attributes.
Dimensionality: the number of attributes.
The model needs to be learned from the data.
Learning/training: The process of learning a model from data.
Training data: the data used in the training process.
Training sample: each sample used in training.
Training set: a collection of training samples.
Hypothesis: the learned model, since it corresponds to some underlying law about the data.
Ground truth: that underlying law itself.
The learning process is to find out or approximate the truth.
A "prediction" model can only be built once the outcome information of the training samples is available.
Label: information about the outcome of an example.
Example: a sample together with its label information.
Label space: the set of all labels.

Test: After learning the model, the process of using it to make predictions.
Test sample: the sample to be predicted.
Clustering: dividing the training samples into several groups.
Clusters: Each group is called a "cluster", and these automatically formed "clusters" may correspond to some potential concept divisions.
According to whether the training data carry label information, learning tasks fall roughly into two categories: supervised learning (such as classification and regression) and unsupervised learning (such as clustering).
The goal of machine learning is to make the learned model better applicable to "new samples."
Generalization: The ability of the learned model to apply to new samples.

1.3 Hypothesis Space

The learning process is regarded as a search process in the space composed of all hypotheses. The search goal is to find a hypothesis that "matches" the training set.
Hypothesis space: The space formed by possible functions in machine learning is called "hypothesis space".
Version space: the set of hypotheses consistent with the training set.

1.4 Inductive Preference

Inductive preference: a machine learning algorithm's preference for certain types of hypotheses during learning.
Any effective machine learning algorithm must have its inductive preferences.
"Occam's razor" principle: "If there are multiple assumptions that are consistent with observations, choose the simplest one."
Note: Occam's razor is not the only feasible principle, and Occam's razor itself admits different interpretations.
"There is no free lunch" theorem (NFL theorem): The total error has nothing to do with the learning algorithm.
Note: It is meaningless to talk about "what learning algorithm is better" without specific problems.


〖2. Difficulty analysis〗

P5: the size of the hypothesis space

  1. When any value of an attribute will do, we denote it with the wildcard "*".
  2. The hypothesis that "good melon" does not exist at all is denoted "∅".
    The example in the book is the watermelon. To judge the quality of a watermelon, three attributes are used: color, root, and knock sound.
    Each of these three attributes has 3 possible values, yet the size of the hypothesis space is 4×4×4+1 = 65.
    This is because, in the hypothesis space, "any attribute value will do" counts as one more possible value of each attribute, rather than being expanded into the set of that attribute's three concrete values.

Because the hypothesis space consists of all possible functions, "any attribute value will do" means the function simply does not depend on that attribute.

For example, if all three attribute values of the watermelon are "any value will do", then every melon is a good melon, whatever it looks like. This single hypothesis is not a collection of 3×3×3 = 27 concrete melons: no attribute needs to be checked in turn; the only judgment needed is that it is a melon.

So the hypothesis space is:
{green, jet black, light white, *} × {curled, slightly curled, stiff, *} × {muffled, crisp, dead, *} + 1 (the hypothesis that "good melon" does not exist, ∅) = 65
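
To make the count concrete, here is a minimal Python sketch (this note's own, not from the book); the English attribute values are this note's renderings of the book's terms.

```python
from itertools import product

# Enumerate the hypothesis space of the P5 example: each attribute may take
# one of its 3 concrete values or the wildcard "*", plus the empty hypothesis.
colors = ["green", "jet black", "light white", "*"]
roots = ["curled", "slightly curled", "stiff", "*"]
knocks = ["muffled", "crisp", "dead", "*"]

hypotheses = list(product(colors, roots, knocks))  # 4 * 4 * 4 = 64
hypotheses.append(("∅",))  # the hypothesis that no good melon exists

print(len(hypotheses))  # 65
```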


〖3. Discussion of exercises〗

1.1 If only two samples numbered 1 and 4 are included in Table 1.1, try to give the corresponding version space.
Solution:
(Table 1.1, restricted to samples 1 and 4: sample 1 = (color = green, root = curled, knock = muffled), a good melon; sample 4 = (color = jet black, root = slightly curled, knock = dead), not a good melon.)
As Table 1.1 shows, the good melon and the non-good melon differ on all three attribute values, so the hypothesis set consistent with the training set, i.e. the version space, is the following:
Figure 1.1 The version space
There are seven hypotheses in all; written as conjunctions they are:
(color = green) ∧ (root = *) ∧ (knock = *)
(color = *) ∧ (root = curled) ∧ (knock = *)
(color = *) ∧ (root = *) ∧ (knock = muffled)
(color = green) ∧ (root = curled) ∧ (knock = *)
(color = green) ∧ (root = *) ∧ (knock = muffled)
(color = *) ∧ (root = curled) ∧ (knock = muffled)
(color = green) ∧ (root = curled) ∧ (knock = muffled)
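
A small sketch (this note's own, using the same English attribute renderings as above) that derives this version space mechanically: enumerate every conjunction and keep those that cover sample 1 but not sample 4.

```python
from itertools import product

sample1 = ("green", "curled", "muffled")            # good melon
sample4 = ("jet black", "slightly curled", "dead")  # not a good melon

def covers(h, x):
    # A conjunction covers x iff every non-wildcard slot matches.
    return all(hv in ("*", xv) for hv, xv in zip(h, x))

values = [["green", "jet black", "light white", "*"],
          ["curled", "slightly curled", "stiff", "*"],
          ["muffled", "crisp", "dead", "*"]]

version_space = [h for h in product(*values)
                 if covers(h, sample1) and not covers(h, sample4)]

for h in version_space:
    print(h)               # the seven conjunctions listed above
print(len(version_space))  # 7
```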

1.2 Compared with using a single conjunction to represent a hypothesis, using a "disjunctive normal form" makes the hypothesis space more expressive. For example:
Good melon ↔ ((color = *) ∧ (root = curled) ∧ (knock = *)) ∨ ((color = jet black) ∧ (root = *) ∧ (knock = dead))
classifies both "(color = green) ∧ (root = curled) ∧ (knock = crisp)" and "(color = jet black) ∧ (root = stiff) ∧ (knock = dead)" as "good melon".

If a disjunctive normal form containing at most k conjunctions is used to express the hypothesis space of the watermelon classification problem in Table 1.1, estimate how many possible hypotheses there are.
Solution:
To work this out, first compute the size of the hypothesis space. The author was at first misled by the example on P5, where each of the three attributes takes three values, into computing this table the same way.

But in fact the color attribute of this table takes only two values, green and jet black, so the size of the hypothesis space is 3×4×4+1 = 49.
Hence k can be at most 49.

Using a disjunctive normal form that contains at most k conjunctions, the number of possible hypotheses is:
∑_{i=1}^{k} C(49, i)
P.S. Many bloggers have debated whether two kinds of redundancy arise here. This author's view is that, because each disjunct is reached through different derivation steps, no redundancy exists.
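
As a quick check of this count, a short sketch (assuming the no-redundancy reading above):

```python
from math import comb

# Number of disjunctive-normal-form hypotheses built from at most k
# of the 49 base conjunctions, assuming no two choices are redundant.
def num_hypotheses(k, n=49):
    return sum(comb(n, i) for i in range(1, k + 1))

print(num_hypotheses(1))   # 49
print(num_hypotheses(49))  # 2**49 - 1 = 562949953421311
```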

1.3 If the data contain noise, there may be no hypothesis in the hypothesis space consistent with all training samples. Try to design an inductive preference for choosing hypotheses in this situation.
Solution:
Inductive preference: choose the hypothesis that is consistent with the largest number of training samples.
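
A minimal sketch of this preference (the four training samples are hypothetical toy data; the noisy duplicate makes a perfectly consistent hypothesis impossible):

```python
from itertools import product

values = [["green", "jet black", "light white", "*"],
          ["curled", "slightly curled", "stiff", "*"],
          ["muffled", "crisp", "dead", "*"]]

def covers(h, x):
    return all(hv in ("*", xv) for hv, xv in zip(h, x))

# Toy, noisy training set: the first and last records conflict.
train = [(("green", "curled", "muffled"), True),
         (("jet black", "curled", "muffled"), True),
         (("green", "stiff", "crisp"), False),
         (("green", "curled", "muffled"), False)]

def agreement(h):
    # Number of training samples whose label the conjunction h reproduces.
    return sum(covers(h, x) == y for x, y in train)

best = max(product(*values), key=agreement)
print(best, agreement(best))  # agrees with 3 of the 4 samples
```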

1.4 Section 1.4 of this chapter discusses the "No Free Lunch" theorem using the "classification error rate" by default as the performance measure for evaluating classifiers. If some other performance measure is used instead, try to prove that the "No Free Lunch" theorem still holds.
Solution:

Consider the binary classification problem. The NFL argument first requires the target function f to be uniformly distributed. For a sample space X there are 2^|X| possible target functions, and on any point x exactly half of them agree with a given hypothesis h, i.e. P(f(x) = h(x)) = 0.5. At this point

∑_f ℓ(h(x), f(x)) = 0.5 · 2^|X| · ( ℓ(h(x) = f(x)) + ℓ(h(x) ≠ f(x)) ),

so ℓ(h(x) = f(x)) + ℓ(h(x) ≠ f(x)) must be a constant. The implicit (and sufficient) condition is ℓ(0,0) = ℓ(1,1) and ℓ(1,0) = ℓ(0,1); if the performance measure does not satisfy it, the NFL theorem need not hold (or at least is not so easy to prove).
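
A numeric sanity check of this argument (this note's own sketch): under the symmetric 0/1 loss, on a toy domain of three points, every hypothesis incurs the same total error once we sum over all 2^|X| target functions.

```python
from itertools import product

X = [0, 1, 2]
targets = list(product([0, 1], repeat=len(X)))  # all 2^3 = 8 functions f

def total_error(h):
    # 0/1 loss summed over every point and every possible target function.
    return sum(sum(h[x] != f[x] for x in X) for f in targets)

for h in [(0, 0, 0), (1, 0, 1), (1, 1, 1)]:
    print(h, total_error(h))  # always 0.5 * 8 * 3 = 12, independent of h
```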

