Machine Learning -- Zhou Zhihua's Exercises -- Chapter 3: Linear Models

3.1 Try to analyse under what circumstances the bias term $b$ in Eq. (3.2), $f(x)=w^{T}x+b$, need not be considered.

answer:

When some attribute $x_{i}$ of the sample $x$ is fixed, i.e. takes the same value for every sample, the term $w_{i}x_{i}+b$ is a constant; in that case $w_{i}x_{i}+b$ plays exactly the role of the bias term $b$, so $b$ need not be considered separately.


3.2 Try to prove that, with respect to the parameter $w$, the objective function (3.18) of logistic regression is non-convex, but its log-likelihood function (3.27) is convex.

answer:

3.18: $y=\frac{1}{1+e^{-(w^{T}x+b)}}$

3.27: $l(\beta)=\sum_{i=1}^{m}\left(-y_{i}\beta^{T}\tilde{x}_{i}+\ln\left(1+e^{\beta^{T}\tilde{x}_{i}}\right)\right)$ (the quantity to be minimized; minimizing it is equivalent to maximizing the log-likelihood $\sum_{i=1}^{m}\ln p(y_{i}\mid x_{i};w,b)$ of Eq. 3.25)

A function on the reals can be checked for convexity via its second derivative: if the second derivative is non-negative on the interval, the function is convex; if it is always strictly positive on the interval, the function is strictly convex (original book, p. 54).

For a multivariate function, if its Hessian matrix is positive semidefinite, then it is a convex function.
For Eq. (3.27), the second derivative with respect to $\beta$ is (original book, p. 60):
$\frac{\partial^{2}l(\beta)}{\partial\beta\,\partial\beta^{T}}=\sum_{i=1}^{m}\tilde{x}_{i}\tilde{x}_{i}^{T}\,p_{1}(\tilde{x}_{i};\beta)\left(1-p_{1}(\tilde{x}_{i};\beta)\right)=XPX^{T}$,
where the first equality is from the original book; in the second equality, $X$ is an $(n, m)$ matrix whose columns are the samples, and $P$ is a diagonal matrix with $P_{ii}=p_{1}(\tilde{x}_{i};\beta)\left(1-p_{1}(\tilde{x}_{i};\beta)\right)$.

Regarding $XPX^{T}$: for any vector $z$, $z^{T}XPX^{T}z=(X^{T}z)^{T}P(X^{T}z)=v^{T}Pv=\sum_{i}P_{ii}v_{i}^{2}\geq 0$, so the Hessian matrix is positive semidefinite and Eq. (3.27) is convex.

For Eq. (3.18), treat $y$ as a scalar and $x$ as a column vector (a column of $X$). Its first derivative is $\frac{\partial{y}}{\partial{w}}=x(y-y^{2})$,
and the second derivative is $\frac{\partial^{2}{y}}{\partial{w}\partial{w^{T}}}=xx^{T}y(1-y)(1-2y)$ (the Hessian matrix).
Here $xx^{T}$ has rank 1 with a single non-zero eigenvalue, whose sign depends on $y(1-y)(1-2y)$.
Clearly, as $y$ varies over $(0, 1)$, the sign of this eigenvalue changes, so the Hessian of Eq. (3.18) with respect to $w$ is not positive semidefinite, and Eq. (3.18) is non-convex.
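
Both claims are easy to check numerically. The minimal sketch below is my own (the random data, the sample $x$, and the parameter choices are all assumptions, not the book's code): it evaluates the Hessian of Eq. (3.18) at two points with $y<0.5$ and $y>0.5$, and the Hessian of Eq. (3.27) at a random $\beta$.

```python
import numpy as np

# Minimal numerical sketch (not the book's code): data and parameters below are
# arbitrary assumptions, chosen only to expose the sign behaviour of the Hessians.
rng = np.random.default_rng(0)
m, n = 20, 3
X = rng.normal(size=(m, n))      # rows are samples here, so the Hessian is X^T P X
x = X[0]                         # one sample, used for Eq. (3.18)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hessian of Eq. (3.18) w.r.t. w:  x x^T * y(1-y)(1-2y)
def hessian_318(w, x):
    y = sigmoid(w @ x)
    return np.outer(x, x) * y * (1.0 - y) * (1.0 - 2.0 * y)

for w in (-x, x):                # w = -x gives y < 0.5, w = x gives y > 0.5
    print("Eq.(3.18) eigenvalues:", np.round(np.linalg.eigvalsh(hessian_318(w, x)), 4))
# The single non-zero eigenvalue changes sign, so Eq. (3.18) is non-convex in w.

# Hessian of Eq. (3.27):  sum_i x_i x_i^T p1(1 - p1)
def hessian_327(beta, X):
    p1 = sigmoid(X @ beta)
    return X.T @ np.diag(p1 * (1.0 - p1)) @ X

beta = rng.normal(size=n)
print("Eq.(3.27) eigenvalues:", np.round(np.linalg.eigvalsh(hessian_327(beta, X)), 4))
# All eigenvalues are non-negative, consistent with Eq. (3.27) being convex.
```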


3.3 Implement logistic regression in code and give the results on the watermelon data set 3.0α.

answer:

See the code for 3.3 in the accompanying repository.
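
The repository's code is not reproduced here, but a minimal sketch of logistic regression trained by gradient descent could look like the following. The file name `watermelon3a.csv` is a hypothetical placeholder for the watermelon 3.0α data (columns: density, sugar content, label), not a file from the book or the repository.

```python
import numpy as np

# Sketch of logistic regression via gradient descent, not the repository's code.
# "watermelon3a.csv" is a hypothetical file assumed to hold the watermelon 3.0α
# data as three columns: density, sugar_content, label (1/0), with a header row.
data = np.loadtxt("watermelon3a.csv", delimiter=",", skiprows=1)
X, y = data[:, :2], data[:, 2]
X_hat = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 so beta = (w; b)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

beta = np.zeros(X_hat.shape[1])
lr = 0.5
for _ in range(5000):
    p1 = sigmoid(X_hat @ beta)                     # p(y = 1 | x)
    grad = X_hat.T @ (p1 - y)                      # gradient of Eq. (3.27)
    beta -= lr * grad / len(y)

pred = (sigmoid(X_hat @ beta) >= 0.5).astype(float)
print("beta =", beta)
print("training accuracy =", (pred == y).mean())
```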


3.4 Select two UCI data sets and compare the error rates of logistic regression estimated by 10-fold cross-validation and by the leave-one-out method.

answer:

See the code for 3.4 in the accompanying repository.
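
As a sketch of what the comparison could look like, the snippet below uses scikit-learn with 10-fold cross-validation and leave-one-out on one data set. The data set choice (Breast Cancer Wisconsin, bundled with scikit-learn) is my own assumption standing in for either of the two UCI data sets the exercise asks for.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sketch only: Breast Cancer Wisconsin stands in for either of the two UCI sets.
X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

err_10fold = 1 - cross_val_score(clf, X, y, cv=10).mean()
err_loo = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

print("10-fold CV error rate:   %.4f" % err_10fold)
print("leave-one-out error rate: %.4f" % err_loo)
```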


3.5 Implement linear discriminant analysis in code and give the results on the watermelon data set 3.0α.

answer:

See the code for 3.5 in the accompanying repository.
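
Again, the repository's code is not repeated here; a minimal two-class LDA sketch on the same hypothetical `watermelon3a.csv` file (an assumption, as above) computes the projection direction from the within-class scatter matrix and the class means.

```python
import numpy as np

# Two-class LDA sketch, not the repository's code.  "watermelon3a.csv" is the same
# hypothetical file assumed above (density, sugar_content, label), with a header row.
data = np.loadtxt("watermelon3a.csv", delimiter=",", skiprows=1)
X, y = data[:, :2], data[:, 2]
X0, X1 = X[y == 0], X[y == 1]

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter matrix S_w and the closed-form direction w = S_w^{-1}(mu0 - mu1)
Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
w = np.linalg.solve(Sw, mu0 - mu1)

# Classify by projecting onto w and picking the nearer projected class centre
proj = X @ w
centre0, centre1 = mu0 @ w, mu1 @ w
pred = (np.abs(proj - centre1) < np.abs(proj - centre0)).astype(float)
print("w =", w)
print("training accuracy =", (pred == y).mean())
```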


3.6 Linear discriminant analysis only obtains ideal results on linearly separable data. Try to design an improved method so that it can deal better with nonlinearly separable data.

answer:

Introduce a kernel function; p. 137 of the original book gives an introduction to kernelized linear discriminant analysis (KLDA).


3.7 Let the code length be 9 and the number of categories be 4. Try to find the theoretically optimal ECOC binary code in the sense of Hamming distance and prove it.

answer:

The original book does not explain several points clearly, so I also read the original paper, "Solving Multiclass Learning Problems via Error-Correcting Output Codes".

Let’s first explain some of the theories involved.

First of all, it is mentioned in the original book:

For codes of the same length, theoretically, the farther the coding distance between any two categories, the stronger the error-correcting capability. Therefore, when the code length is small, the theoretically optimal coding can be computed from this principle.

In fact, this point is also made in the paper: assuming the minimum Hamming distance between any two categories is $d$, the error-correcting output code can correct at least $\left[\frac{d-1}{2}\right]$ bit errors.
(Image: coding example from the paper; https://github.com/han1057578619/MachineLearning_Zhouzhihua_ProblemSets/blob/master/ch3–%E7%BA%BF%E6%80%A7%E6%A8%A1%E5%9E%8B/image/1.jpg)
Take the example from the paper shown above: in the figure, the Hamming distance between every pair of categories is 4.
Suppose the correct category of a sample is c1, so its codeword should be '0 0 1 1 0 0 1 1'.
If one classifier output is wrong and the output becomes '0 0 0 1 0 0 1 1', the closest codeword is still c1.
If two classifier outputs are wrong, e.g. '0 0 0 0 0 0 1 1', the Hamming distance to both c1 and c2 is 2 and the sample can no longer be classified correctly.
That is, if any single classifier misclassifies the sample, the final result is still correct, but with two or more classifier errors the result may be wrong.
This is where $\left[\frac{d-1}{2}\right]$ comes from.
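
The arithmetic in this example is easy to verify. A tiny sketch (the codewords are copied from the example above; the helper function is my own):

```python
def hamming(a, b):
    """Number of positions at which two equal-length codewords differ."""
    return sum(x != y for x, y in zip(a, b))

c1 = "00110011"                    # codeword of class c1 in the paper's example
one_bit_error = "00010011"         # one classifier output flipped
two_bit_error = "00000011"         # two classifier outputs flipped

print(hamming(one_bit_error, c1))  # 1 -> still closest to c1, decoded correctly
print(hamming(two_bit_error, c1))  # 2 -> can tie with another class, may fail

d = 4                              # minimum inter-class Hamming distance in the example
print("correctable errors:", (d - 1) // 2)   # 1
```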

In addition, the original paper mentions that a good error-correcting output code should satisfy two conditions:

  • Row separation. The codeword distance between any two categories should be large enough.
  • Column separation. The outputs of any two classifiers $f_{i}, f_{j}$ should be independent and uncorrelated. This can be achieved by making the Hamming distance between classifier $f_{i}$'s column and every other column large enough, and the Hamming distance to the complements of the other columns also large enough (a bit convoluted).

The first point is actually mentioned in the original book and has already been explained. Let’s talk about the second point:

If the columns of two classifiers are similar or identical, many algorithms (such as C4.5) will make the same or similar misclassifications on them;
if too many such errors occur at the same time, the error-correcting output code will fail. (Paraphrased from the original paper.)

My personal understanding: if two similar columns are added, then when a misclassification occurs, a contribution to the Hamming distance that was originally 1 becomes 3, so the distance between the output codeword and the true category's codeword grows faster.
In the extreme case where two identical columns are added, the minimum Hamming distance between any two categories does not change and stays at $d$,
while the Hamming distance between the codeword output by the classifiers and the codeword of the true category jumps sharply (from 1 to 3).
Therefore, if too many classifiers err at the same time, the error-correcting output code becomes invalid.

In addition, the columns of two classifiers should not be complements of each other, because many algorithms (such as C4.5 and logistic regression) treat the 0 and 1 classes symmetrically;
that is, swapping the 0 and 1 classes yields the same trained model. In other words, two classifiers whose columns are complementary will make mistakes at the same time,
which will also cause the error-correcting output code to fail.

Of course, when there are few categories, it is difficult to meet the above conditions.
As shown in the picture above, with three categories in total there are only $2^{3}=8$ possible classifier columns, $f_{0}$ to $f_{7}$;
the four columns $f_{4}$ to $f_{7}$ are the complements of the first four and should be removed, and the all-zero column $f_{0}$ should be removed as well,
which leaves only three usable columns, so it is hard to satisfy the conditions above. In fact, for a classification problem with $k$ categories,
after removing the complementary columns and the all-0 and all-1 columns, $2^{k-1}-1$ possible columns remain.
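
A small sketch of this counting argument (the enumeration code is my own, not the paper's construction):

```python
from itertools import product

# Enumerate all binary columns of length k, drop the all-0 and all-1 columns, and
# keep one column out of each complementary pair; 2**(k-1) - 1 columns remain.
def usable_columns(k):
    cols, seen = [], set()
    for bits in product((0, 1), repeat=k):
        comp = tuple(1 - b for b in bits)
        if len(set(bits)) == 1:        # all 0s or all 1s: useless split
            continue
        if comp in seen:               # complement already kept: same classifier
            continue
        seen.add(bits)
        cols.append(bits)
    return cols

for k in (3, 4):
    cols = usable_columns(k)
    print(k, "classes ->", len(cols), "usable columns; 2^(k-1)-1 =", 2 ** (k - 1) - 1)
```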

Several methods of constructing codes are given in the original paper. One of them is:
(Image: a code-construction method from the paper; https://github.com/han1057578619/MachineLearning_Zhouzhihua_ProblemSets/blob/master/ch3–%E7%BA%BF%E6%80%A7%E6%A8%A1%E5%9E%8B/image/2.jpg)
Back to the problem: with 4 categories there are 7 possible columns. Following the method above, they are:
(Image: the 7 columns for 4 categories; https://github.com/han1057578619/MachineLearning_Zhouzhihua_ProblemSets/blob/master/ch3–%E7%BA%BF%E6%80%A7%E6%A8%A1%E5%9E%8B/image/3.jpg)
When the code length is 9, appending any two columns after $f_{6}$ gives the optimal code,
because any column added at that point is the complement of one of the earlier columns; the minimum Hamming distance between categories stays at 4 and does not increase.
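
As a sanity check of the 7-column code for 4 categories discussed above, the sketch below (my own enumeration, not the paper's table) builds the 7 usable columns and confirms that every pair of class codewords is at Hamming distance 4.

```python
from itertools import combinations, product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# The 7 usable columns for 4 classes: all length-4 binary columns whose first bit
# is 0 (one representative per complementary pair), excluding the all-zero column.
columns = [c for c in product((0, 1), repeat=4) if c[0] == 0 and any(c)]
codewords = [tuple(col[i] for col in columns) for i in range(4)]   # row i = class i

for (i, a), (j, b) in combinations(enumerate(codewords), 2):
    print(f"d(c{i + 1}, c{j + 1}) =", hamming(a, b))               # every pair: 4
```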


3.8 An important condition for ECOC encoding to achieve ideal error correction is that the probability of error at each code bit is equal and the errors are mutually independent. Try to analyse how likely the binary classifiers produced when ECOC encoding is applied to a multi-class task are to satisfy this condition, and the resulting impact.

answer:
The condition splits into two parts: first, that the error probabilities are equal; second, that the errors are mutually independent.

Consider the first part: whether the generalization error of the classifier at each bit is the same really depends on how hard the corresponding samples are to distinguish.
If the two class subsets themselves are very similar, i.e. harder to tell apart, the trained classifier has a larger error probability. The original book also mentions on page 66:

When multiple classes are split into two "class subsets", the difficulty of distinguishing the two resulting class subsets often differs, i.e. the difficulty of the binary problems they give rise to differs.

Therefore, the more similar the between-class differences are after each bit's split (i.e. the more equal the difficulty of distinction), the more likely this condition is to hold. In practice it is actually very hard to satisfy.

The second part is mutual independence. As mentioned in 3.7, the original paper also proposes that one condition a good error-correcting output code should satisfy is that the classifiers at the different bits are mutually independent.
The more categories there are, the more likely this condition can be met; as explained in 3.7, with few categories it is difficult to satisfy.

As for the impact, the watermelon book also mentions:

A coding with good theoretical error-correction properties but which leads to harder binary classification problems, versus another coding with poorer theoretical error-correction properties but which leads to simpler binary classification problems: which of the two yields the stronger model is hard to say.


3.9 When OvR and MvM are used to decompose a multi-class task into binary classification tasks, try to explain why there is no need to handle class imbalance specially.

answer:

Page 66 of the book in fact already gives the answer:

For OvR and MvM, since every class is treated in the same way, the effects of class imbalance in the decomposed binary tasks cancel each other out, so special processing is usually not required.


3.10 Try to deduce the conditions under which multi-class cost-sensitive learning (considering only class-based misclassification costs) can obtain the theoretically optimal solution using "rescaling".

answer:

This question essentially follows Professor Zhou Zhihua's paper "On Multi-Class Cost-Sensitive Learning". I read the theoretical part of the paper and try to outline it here.

First, my personal understanding of "rescaling": whether in cost-sensitive or cost-insensitive learning, the various "rescaling" methods (oversampling, undersampling, threshold moving, etc.) all adjust the degree of influence each class has on the model, i.e. the weight of each class.

$cost_{ij}$ denotes the cost of misclassifying a sample of class $i$ as class $j$. In a binary problem,
$p\cdot cost_{11}+(1-p)\cdot cost_{21}$ is the expected cost of the classifier predicting the sample as class 1, where $p=P(class=1\mid x)$.
When $p\cdot cost_{11}+(1-p)\cdot cost_{21} < p\cdot cost_{12}+(1-p)\cdot cost_{22}$, the expected cost of predicting class 1 is smaller than that of predicting class 2,
so it is reasonable to predict the sample as class 1. Taking the equality, and assuming the cost of a correct classification is 0,
gives the optimal decision threshold: $\frac{p^{\ast}}{1-p^{\ast}}=\frac{cost_{21}}{cost_{12}}$, i.e. $p^{\ast}=\frac{cost_{21}}{cost_{12}+cost_{21}}$.
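
A small sketch of this threshold computation (the cost values below are arbitrary assumptions for illustration):

```python
# Optimal decision threshold p* = cost21 / (cost12 + cost21), assuming zero cost
# for correct classification.  The numbers below are arbitrary illustrative costs.
cost12 = 1.0    # cost of misclassifying a class-1 sample as class 2
cost21 = 5.0    # cost of misclassifying a class-2 sample as class 1

p_star = cost21 / (cost12 + cost21)
print("predict class 1 only when P(class=1 | x) >", p_star)   # 0.833...
```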
The paper "On Multi-Class Cost-Sensitive Learning" quotes a theorem from another paper, "The Foundations of Cost-Sensitive Learning":
(Image: the quoted theorem; https://github.com/han1057578619/MachineLearning_Zhouzhihua_ProblemSets/blob/master/ch3–%E7%BA%BF%E6%80%A7%E6%A8%A1%E5%9E%8B/image/7.jpg)
Through this theorem we can derive the conditions that the class weights should satisfy for the optimal "rescaling" in cost-sensitive learning.
Only after reading the original paper did I understand what the theorem means; if you are interested in its proof, see the original paper, it is not repeated here.
What the theorem says is: suppose the classifiers produced by an algorithm $L$ use $p_{0}$ as their decision threshold;
then, given a data set $S$ and an optimal decision threshold $p^{\ast}$, the theorem shows that by increasing the number of negative samples
to $\frac{p^{\ast}}{1-p^{\ast}}\cdot\frac{1-p_{0}}{p_{0}}$ times the original number, creating a new data set $S^{'}$,
running $L$ on $S^{'}$ can still produce a good enough classifier that uses $p_{0}$ as its decision threshold.
Take binary classification as an example: when the numbers of samples are balanced, $p_{0}=0.5$. According to the theorem,
the rescaling ratio of class 2 should then be $\frac{p^{\ast}}{1-p^{\ast}}=\frac{cost_{21}}{cost_{12}}$ times that of class 1,
i.e. the influence of class 1 is $\frac{cost_{12}}{cost_{21}}$ times the influence of class 2. Let $w_{i}$ denote the rescaling ratio of class $i$.
Extended to multi-class classification, "rescaling" obtains the theoretically optimal solution when the rescaling ratios of the classes satisfy the relations shown in the figures below; that is, the corresponding system of equations has a solution, which requires the rank of its coefficient matrix to be less than $c$.
(Image: https://github.com/han1057578619/MachineLearning_Zhouzhihua_ProblemSets/blob/master/ch3–%E7%BA%BF%E6%80%A7%E6%A8%A1%E5%9E%8B/image/6.jpg)
(Image: https://github.com/han1057578619/MachineLearning_Zhouzhihua_ProblemSets/blob/master/ch3–%E7%BA%BF%E6%80%A7%E6%A8%A1%E5%9E%8B/image/5.jpg)
(Image: https://github.com/han1057578619/MachineLearning_Zhouzhihua_ProblemSets/blob/master/ch3–%E7%BA%BF%E6%80%A7%E6%A8%A1%E5%9E%8B/image/4.jpg)
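
The condition can be checked numerically. The exact equations live in the figures above; in the sketch below I assume, following the binary-case relation, that the weights must satisfy $w_{i}\cdot cost_{ji}=w_{j}\cdot cost_{ij}$ for every pair of classes, and the cost matrix is an arbitrary consistent example, so treat this as an illustration rather than the paper's own code.

```python
import numpy as np
from itertools import combinations

# Sketch: check whether a multi-class cost matrix admits consistent rescaling
# weights w_i with w_i * cost_ji = w_j * cost_ij for every pair (i, j).
# cost[i, j] = assumed cost of misclassifying class i+1 as class j+1.
cost = np.array([[0.0, 1.0, 1.0],
                 [2.0, 0.0, 1.0],
                 [4.0, 2.0, 0.0]])
c = cost.shape[0]

# Build the homogeneous system A w = 0, one row per unordered pair (i, j).
rows = []
for i, j in combinations(range(c), 2):
    row = np.zeros(c)
    row[i], row[j] = cost[j, i], -cost[i, j]
    rows.append(row)
A = np.array(rows)

rank = np.linalg.matrix_rank(A)
print("rank =", rank, "< c =", c, "->", rank < c)
if rank < c:
    # A non-zero solution exists; take it from the null space of A.
    _, _, vh = np.linalg.svd(A)
    w = np.abs(vh[-1])
    print("rescaling weights (up to scale):", w / w.min())   # (1, 2, 4) here
```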





Source: blog.csdn.net/tangxianyu/article/details/123184416