Machine Learning Notes on the Watermelon Book and Pumpkin Book (1): Chapters 1-6

Here is the story: I signed up for DataWhale's group study of a machine learning course, and this round covers chapters 1-6 of the Watermelon Book and the Pumpkin Book. Although it is billed as organized learning, in practice you read the books and watch the videos on your own; DataWhale provides learning materials and discussion groups, plus deadlines for checking in. Time to take a bite of this watermelon book.

DataWhale Eating Melon Tutorial Video Address:
[Eating Melon Tutorial] "Machine Learning Formula Detailed Explanation" (Pumpkin Book) and Watermelon Book Formula Derivation Live Collection

❤ 2022.7.11 ❤

Task01: Chapters 1 and 2 of Watermelon Book + Pumpkin Book

insert image description here

Chapter 1 Introduction

〇What is machine learning
The definition given in the book is as follows:
insert image description here

insert image description here

some concepts

data set
sample (instance)
example (sample)
attribute
feature
attribute value
attribute space
sample space
feature vector
dimensionality
learning
training
training data
training sample
training set
hypothesis
ground truth
learner
prediction
label
example
label space
classification
regression
binary classification
positive class
negative class
multi-class classification
testing
testing sample
generalization
distribution
independent and identically distributed (i.i.d.)
induction
deduction
specialization
inductive learning
concept
version space
inductive bias (inductive preference)

〇 The No Free Lunch theorem
insert image description here


Chapter 2 Model Evaluation and Selection

some concepts

error rate
accuracy
error
training error (empirical error)
generalization error
overfitting
underfitting
model selection
testing set
hold-out
sampling
stratified sampling
fidelity
cross validation
k-fold cross validation
bootstrapping
bootstrap sampling
out-of-bag estimate
parameter tuning
performance measure
mean squared error
precision
recall
true positive
false positive
true negative
false negative
Break-Even Point (BEP)
classification threshold
cut point
Receiver Operating Characteristic (ROC)
True Positive Rate (TPR)
False Positive Rate (FPR)
unequal cost
cost matrix
total cost
cost-sensitive
cost curve
hypothesis test
binomial test
confidence
t-test
paired t-test
contingency table
post-hoc test
bias-variance decomposition

some knowledge

· NP-hard problems

A bit of background: what is an "NP-hard" problem? The formal explanations never quite clicked for me, but the few examples in this article make it clear at once.

· Polynomial time

polynomial time

· Parameters for machine learning

insert image description here

· Normalization

How to understand normalization?
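
To make this concrete, here is a tiny sketch of the two most common normalization schemes, min-max scaling and z-score standardization. This is my own illustration (not code from the book), using numpy and a made-up feature matrix:

```python
import numpy as np

# Toy feature matrix: 4 samples, 2 features on very different scales (made-up data).
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 500.0]])

# Min-max normalization: rescale each feature to the interval [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score standardization: zero mean and unit variance per feature.
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_zscore)
```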


△ Evaluation method

· Hold-out method

insert image description here

· Cross Validation

insert image description here

· Bootstrap method

insert image description here
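
To tie the three evaluation methods together, here is a small sketch of my own (not from the book) showing a stratified hold-out split, k-fold cross validation, and bootstrap sampling with an out-of-bag set. It assumes sklearn and numpy, and uses the iris data purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold

X, y = load_iris(return_X_y=True)

# Hold-out: a stratified split keeps the class proportions in train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# k-fold cross validation: every sample is used for testing exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    pass  # train/evaluate a model on X[train_idx] / X[test_idx]

# Bootstrap: draw m samples with replacement; roughly 36.8% stay out-of-bag.
rng = np.random.default_rng(0)
m = len(X)
boot_idx = rng.choice(m, size=m, replace=True)
oob_idx = np.setdiff1d(np.arange(m), boot_idx)
print(len(np.unique(boot_idx)), "unique in-bag samples,", len(oob_idx), "out-of-bag")
```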

· Parameter tuning and the final model

insert image description here

insert image description here


△ Performance metrics

· Error rate and accuracy

insert image description here
(Formula omitted)

· Precision, recall and F1

insert image description here
insert image description here
insert image description here

· ROC and AUC

insert image description here
insert image description here
insert image description here
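
A quick sketch of how these metrics are computed, using sklearn.metrics and some made-up labels and scores of my own (not data from the book). The ROC curve itself comes from sweeping the classification threshold over the predicted scores:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, roc_curve)

# Hypothetical labels and predicted scores for a binary problem.
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.3, 0.8, 0.4, 0.2, 0.6, 0.7, 0.1])
y_pred  = (y_score >= 0.5).astype(int)   # classification threshold = 0.5

print("error rate:", 1 - accuracy_score(y_true, y_pred))
print("precision :", precision_score(y_true, y_pred))
print("recall    :", recall_score(y_true, y_pred))
print("F1        :", f1_score(y_true, y_pred))

# ROC is traced by varying the threshold; AUC is the area under that curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC       :", roc_auc_score(y_true, y_score))
```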

· Cost-sensitive error rate and cost curve

insert image description here
insert image description here


△ Comparison test

I remember learning these things back when I studied mathematical statistics, but I have since given them all back to the teacher... (not that I really understood them at the time). I still don't understand them now, so I'll leave placeholders here and fill in the relevant notes later.

【To be perfected】

〇Hypothesis test
〇Cross-validated t-test
〇McNemar test
〇Friedman test and Nemenyi follow-up test

△ Bias and Variance

· Bias Variance Decomposition

insert image description here
insert image description here
insert image description here


Task02: Read Chapter 3 of Watermelon Book + Pumpkin Book in detail

Chapter 3 Linear Models

some concepts

linear model
nonlinear model
comprehensibility
linear regression
Euclidean distance
least square method
parameter estimation
multivariate linear regression
regularization
log-linear regression
generalized linear model
surrogate function
logistic function
link function
odds
log odds (logit)
logistic regression (logit regression)
maximum likelihood method
log-likelihood
Linear Discriminant Analysis (LDA)
within-class scatter matrix
between-class scatter matrix
generalized Rayleigh quotient

some knowledge

· Basic form of linear model

insert image description here

· Least square method

Least square method

· Regularization

Detailed regularization

· arg min

insert image description here
arg min means the values of w and b at which the expression attains its minimum

· Maximum Likelihood Estimation

One article to understand the maximum likelihood estimation

· Convex set, convex function

Machine Learning Concepts: A detailed explanation of convex functions and convex optimization, full of dry goods

insert image description here

The definition of concave and convex here is the opposite of the definition in mathematics. Here is the definition of optimization theory.

· Gradient

insert image description here

· Hessian Matrix

insert image description here

· Three elements of machine learning

insert image description here

· Information theory and information entropy

insert image description here

· Relative entropy (KL divergence)

insert image description here

· Two-norm

The 2-norm of a vector is just its modulus (length).
The 2-norm of a vector and the 2-norm of a matrix
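
A two-line numpy check of the difference (my own example, not from the book): for a vector the 2-norm is the Euclidean length, while for a matrix the 2-norm is the largest singular value (the spectral norm), which is not the same as the Frobenius norm:

```python
import numpy as np

v = np.array([3.0, 4.0])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Vector 2-norm = Euclidean length (modulus) of the vector.
print(np.linalg.norm(v))            # 5.0, same as np.sqrt(v @ v)

# Matrix 2-norm = largest singular value (spectral norm), not the Frobenius norm.
print(np.linalg.norm(A, ord=2))
print(np.linalg.norm(A, ord='fro'))
```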

· Lagrange multiplier method

insert image description here

· Generalized eigenvalues

insert image description here

· Generalized Rayleigh Quotient

insert image description here

Hermitian matrix: a matrix satisfying $\mathrm{A}^\mathrm{H} = \mathrm{A}$, i.e. a matrix equal to its own conjugate transpose.

insert image description here


△ Linear regression

○ Univariate linear regression

insert image description here
insert image description here

· Least squares estimation

insert image description here

· Maximum Likelihood Estimation

insert image description here
insert image description here

※ Textbook recommendation: Chen Xiru "Probability Theory and Mathematical Statistics"

insert image description here

Deriving the probability distribution of y from the error of a linear model

insert image description here
insert image description here

The objective obtained by maximum likelihood estimation (assuming Gaussian noise) is the same as the least squares objective


· Solve for w and b

This explains formulas (3.5)-(3.7) on page 54 of the Watermelon Book: why setting the partial derivatives to 0 gives the optimal solution for w and b

insert image description here
insert image description here
insert image description here
insert image description here
insert image description here
insert image description here
insert image description here
insert image description here
insert image description here

At this point, it is proved that the loss function is a convex function, and then the optimal solution can be obtained by applying the properties of the convex function.

insert image description here

That is, when the partial derivative is equal to 0, the results of calculating w and b are as follows. See the pumpkin book for the process.

insert image description here
insert image description here
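
To see formulas (3.7) and (3.8) in action, here is a minimal numpy sketch of my own (on made-up data) that computes w and b directly from the closed-form expressions obtained by setting the partial derivatives to zero:

```python
import numpy as np

# Toy 1-D data (hypothetical); the true relationship is roughly y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
m = len(x)

# Closed-form solution from setting the partial derivatives to zero
# (Watermelon Book eqs. 3.7 and 3.8).
w = np.sum(y * (x - x.mean())) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
b = np.mean(y - w * x)

print(w, b)   # should come out close to 2 and 1
```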


○ Multiple Linear Regression

The difference between multivariate and univariate linear regression is that each sample x is now a vector

insert image description here

Write b as w_{d+1} · 1 and absorb it into the weight vector, giving the matrix form

insert image description here

· Vectorization of least squares method for multiple linear regression

Vectorization avoids heavy use of for loops when summing: by writing the sums as vector/matrix products, computing libraries such as numpy can use their fast matrix routines

insert image description here
insert image description here
insert image description here

Regarding the variation of this equation, the details are as follows

insert image description here

insert image description here


· Solving for ŵ

insert image description here
insert image description here

The details are as follows

insert image description here

The numerator is a scalar and the denominator is a vector; differentiating a scalar with respect to a vector is a matrix-calculus operation. The derivation proceeds as follows

insert image description here
insert image description here

To verify convexity we need the Hessian matrix, so differentiate the first-order partial derivative once more

insert image description here

With the Hessian confirmed positive (semi)definite, set the first-order derivative (the gradient) to zero to obtain the optimal vector

insert image description here
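
Putting the vectorized form and the closed-form solution together, here is a short sketch of my own (with randomly generated data) of $\hat{\boldsymbol{w}} = (\hat{\mathbf{X}}^\mathrm{T}\hat{\mathbf{X}})^{-1}\hat{\mathbf{X}}^\mathrm{T}\boldsymbol{y}$. This only works when $\hat{\mathbf{X}}^\mathrm{T}\hat{\mathbf{X}}$ is invertible (full rank), and numerically it is better to solve the linear system than to form the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 samples, 3 features, generated from a known w and b.
m, d = 100, 3
X = rng.normal(size=(m, d))
true_w, true_b = np.array([1.5, -2.0, 0.5]), 3.0
y = X @ true_w + true_b + 0.01 * rng.normal(size=m)

# Absorb b into the weight vector: X_hat = [X, 1], w_hat = (w; b).
X_hat = np.hstack([X, np.ones((m, 1))])

# Normal equation: w_hat = (X_hat^T X_hat)^{-1} X_hat^T y,
# computed by solving the linear system instead of inverting the matrix.
w_hat = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
print(w_hat)   # first d entries ≈ true_w, last entry ≈ true_b
```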


○ Log odds regression

Log-odds regression is logistic regression; despite the "regression" in its name, it is essentially a classification algorithm.

insert image description here
insert image description here

· Maximum Likelihood Estimation for Log-Probability Regression

First, combine the linear regression model derived earlier with the sigmoid mapping function to build a probability mass function
※ Because the y value is 0,1, the function is a probability mass function, not a probability density function

insert image description here

Write the probability mass function in a unified form

insert image description here

Write the likelihood function, using the logarithmic function to convert the multiplication into a summation

insert image description here

Substitute in the expressions derived earlier

insert image description here
insert image description here
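
The book notes that this negative log-likelihood can be minimized by numerical optimization such as gradient descent or Newton's method. Below is a plain gradient-descent sketch of that idea, my own illustration on made-up data, using the fact that the gradient of the objective is $\hat{\mathbf{X}}^\mathrm{T}(\sigma(\hat{\mathbf{X}}\boldsymbol{\beta}) - \boldsymbol{y})$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical 2-D data with 0/1 labels (roughly linearly separable).
m = 200
X = rng.normal(size=(m, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)

# Absorb b: beta = (w; b), X_hat = [X, 1].
X_hat = np.hstack([X, np.ones((m, 1))])
beta = np.zeros(3)

# Gradient descent on the negative log-likelihood
# L(beta) = sum_i [ -y_i * beta^T x_i + ln(1 + exp(beta^T x_i)) ];
# its gradient is X_hat^T (sigmoid(X_hat beta) - y).
lr = 0.1
for _ in range(2000):
    grad = X_hat.T @ (sigmoid(X_hat @ beta) - y) / m
    beta -= lr * grad

pred = (sigmoid(X_hat @ beta) >= 0.5).astype(float)
print("training accuracy:", (pred == y).mean())
```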


· Use information theory methods (minimize relative entropy) to achieve optimization

insert image description here

First write down the ideal (true) distribution and the model (simulated) distribution, then substitute them into the cross-entropy formula

insert image description here
insert image description here
insert image description here
insert image description here


· Three elements of machine learning for logarithmic probability regression algorithm

insert image description here


○ Binary Classification Linear Discriminant Analysis

insert image description here
insert image description here

Notation from the book:
$x_i$: an $n$-dimensional feature vector
$y_i$: the sample label, taking the value 0 or 1
$X_i$: the subscript $i$ is the value of $y$ (0 or 1); $X_i$ denotes the set of samples whose label $y$ equals $i$
$\mu_i$: with $i$ taking 0 or 1, the mean vector of the samples in class $i$ (positive or negative samples)
$\Sigma_i$: with $i$ taking 0 or 1, the covariance matrix of the samples in class $i$
In general the covariance matrix has the following form; the formula in the book drops the $\frac{1}{m_i}$ factor as a simplification:
$\Sigma_0 = \frac{1}{m_0} \sum_{x \in X_0} (x - \mu_0)(x - \mu_0)^\mathrm{T}$
$\Sigma_1 = \frac{1}{m_1} \sum_{x \in X_1} (x - \mu_1)(x - \mu_1)^\mathrm{T}$

· Algorithm principle

insert image description here

See picture above

· Loss function derivation

In general calculations, use the following form

insert image description here

$\boldsymbol{w}^\mathrm{T}\boldsymbol{\mu}$ is a vector inner product: the projection of a class center onto $\boldsymbol{w}$, multiplied by the modulus of $\boldsymbol{w}$. Since both class centers are multiplied by the same modulus, this extra factor does not affect which $\boldsymbol{w}$ maximizes the distance (it just scales both projections by the same amount).
The max means we want the projected distance between the two class centers to be as large as possible

insert image description here

The "variance" here is not the strict definition, because the usual variance divides by the number of samples (see the formula above). Since that divisor would be the same constant factor, it has no effect on the optimization and is simply dropped

$(\boldsymbol{w}^\mathrm{T}\boldsymbol{x} - \boldsymbol{w}^\mathrm{T}\boldsymbol{\mu}_0)(\boldsymbol{x}^\mathrm{T}\boldsymbol{w} - \boldsymbol{\mu}_0^\mathrm{T}\boldsymbol{w})$ is the same as $(\boldsymbol{w}^\mathrm{T}\boldsymbol{x} - \boldsymbol{w}^\mathrm{T}\boldsymbol{\mu}_0)^2$, which plays the role of $(x - \bar{x})^2$ in the usual variance

So the following formula is derived

insert image description here
insert image description here

※ Knowledge to be discussed later: generalized Rayleigh quotient

Here
insert image description here
the constraint fixes the modulus of $\boldsymbol{w}$: since the modulus of $\boldsymbol{w}$ does not affect the result (numerator and denominator scale together), we are free to assign it a value, namely fix the denominator to 1, which simplifies the later calculation. Either the numerator or the denominator could be fixed; here the denominator is chosen.

min is used here because optimization problems conventionally write the loss function in minimized form; since the original expression is a maximization, a negative sign is added in front


· Use the Lagrange multiplier method to solve for $\boldsymbol{w}$

insert image description here

For the matrix differentiation, consult the matrix-calculus reference introduced earlier

insert image description here

This is a generalized eigenvalue problem (as opposed to a general eigenvalue problem).

insert image description here

discuss
insert image description here
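
To close the loop, here is a tiny numpy sketch of binary LDA, my own illustration on made-up Gaussian clusters: build the within-class scatter matrix, then take the closed-form direction $\boldsymbol{w} = \mathbf{S}_w^{-1}(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)$. Only the direction of $\boldsymbol{w}$ matters, so it can be rescaled freely:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-class 2-D data.
X0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))   # class 0
X1 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(50, 2))   # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter matrix S_w = Sigma_0 + Sigma_1 (book form, no 1/m factor).
S0 = (X0 - mu0).T @ (X0 - mu0)
S1 = (X1 - mu1).T @ (X1 - mu1)
Sw = S0 + S1

# Closed-form direction: w = S_w^{-1} (mu_0 - mu_1); normalize since only
# the direction matters.
w = np.linalg.solve(Sw, mu0 - mu1)
w /= np.linalg.norm(w)
print(w)

# The projections of the two class means onto w are well separated.
print(w @ mu0, w @ mu1)
```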


○ Multi-class learning [placeholder, to be filled in]

○ The class-imbalance problem [placeholder, to be filled in]


❤ 2022.7.21 ❤

Task03: Watermelon Book + Pumpkin Book Chapter 4

insert image description here

Chapter 4 Decision Trees

some concepts

decision tree
divide-and-conquer
information entropy
information gain
gain ratio
Gini index
pruning
pre-pruning
post-pruning
decision stump
bi-partition
multivariate decision tree
univariate decision tree
incremental learning

some knowledge

· Self-information and information entropy

insert image description here

※ "Purity" of the sample

insert image description here

· Conditional entropy

insert image description here

· Information gain

insert image description here

· Gini value and Gini index

insert image description here
insert image description here
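
Here is a small helper sketch of my own (with a made-up toy column, not an example from the book) that computes information entropy, the Gini value, and the information gain of one discrete feature directly from the definitions above:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Information entropy Ent(D) = -sum_k p_k * log2(p_k)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini value Gini(D) = 1 - sum_k p_k^2."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(labels, feature_values):
    """Gain(D, a) = Ent(D) - sum_v |D_v|/|D| * Ent(D_v)."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    cond = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        cond += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - cond

# Hypothetical toy column: does "texture" help separate good from bad melons?
y       = ["good", "good", "good", "bad", "bad", "bad"]
texture = ["clear", "clear", "clear", "blurry", "blurry", "clear"]
print(entropy(y), gini(y), information_gain(y, texture))
```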

△ Algorithm principle

insert image description here

· ID3 decision tree

insert image description here

· C4.5 decision tree

insert image description here
insert image description here

· CART decision tree

insert image description here
insert image description here

The CART decision tree must be a binary tree
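
For a quick experiment, sklearn's DecisionTreeClassifier lets you switch the splitting criterion between entropy (information-gain style, as in ID3/C4.5) and the Gini index (as in CART). Note that sklearn's trees are always binary regardless of the criterion, so this is only for intuition, not a faithful ID3/C4.5 implementation:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="entropy" splits by information gain, criterion="gini" by Gini index.
for criterion in ("entropy", "gini"):
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    clf.fit(X, y)
    print(criterion, clf.score(X, y))
```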

❤ 2022.7.25 ❤

Task04: Watermelon Book + Pumpkin Book Chapter 5

insert image description here

Chapter 5 Neural Networks

some concepts

neural networks
neuron
threshold
activation function
squashing function
perceptron
threshold logic unit
dummy node
learning rate
functional neuron
linearly separable
converge
fluctuation
hidden layer
multi-layer feedforward neural networks
connection weight
error backpropagation (BP)
gradient descent
accumulated error backpropagation
trial-by-error
early stopping
regularization
local minimum
global minimum
simulated annealing
genetic algorithms
RBF (Radial Basis Function) network
competitive learning
winner-take-all
ART (Adaptive Resonance Theory) network
plasticity-stability dilemma
best matching unit
Cascade-Correlation network
correlation
recurrent neural networks
energy
energy-based model
Restricted Boltzmann Machine (RBM)
contrastive divergence (CD)
deep learning
unsupervised layer-wise training
pre-training
fine-tuning
weight sharing
Convolutional Neural Network (CNN)
feature map
pooling layer
feature learning
representation learning
feature engineering

some knowledge

· MP neurons

insert image description here

△ Perceptron

· Perceptron model

insert image description here

Perceptron is a classification model

More information can be found in Li Hang's "Statistical Learning Methods"

insert image description here

A geometric explanation of the perceptron.
A schematic diagram of a hyperplane is shown in the figure.
insert image description here
The equation of this hyperplane is $x_1 + x_2 - 1 = 0$, with normal vector $\boldsymbol{w} = (1, 1)^\mathrm{T}$ and $b = -1$.
The normal vector is perpendicular to the hyperplane; in this diagram it points to the right of the hyperplane, so the right side is the positive half-space and the left side is the negative half-space.
Substituting a point from the positive half-space into the hyperplane equation gives a value greater than zero, and a point from the negative half-space gives a value less than zero.

· Perceptron learning strategy

insert image description here
insert image description here

Here $\theta$ is treated as a "dummy node", the same trick that was mentioned earlier in the section on multivariate linear regression.
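
Below is a minimal numpy sketch of the perceptron learning rule $\boldsymbol{w} \leftarrow \boldsymbol{w} + \eta\,(y - \hat{y})\,\boldsymbol{x}$, my own illustration on made-up linearly separable data, with the threshold handled exactly as described: as a dummy node with constant input $-1$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linearly separable data with labels in {0, 1}.
m = 100
X = rng.uniform(-1, 1, size=(m, 2))
y = (X[:, 0] + X[:, 1] - 0.2 > 0).astype(float)

# Treat the threshold theta as a "dummy node": append a constant -1 input,
# so theta becomes just another weight to learn.
X_hat = np.hstack([X, -np.ones((m, 1))])
w = np.zeros(3)
eta = 0.1   # learning rate

# Perceptron rule: w <- w + eta * (y_i - y_hat_i) * x_i, repeated over the data.
for _ in range(100):
    for xi, yi in zip(X_hat, y):
        y_hat = float(xi @ w > 0)            # step activation
        w += eta * (yi - y_hat) * xi

pred = (X_hat @ w > 0).astype(float)
print("training accuracy:", (pred == y).mean())
```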


Origin blog.csdn.net/ooorczgc/article/details/125722132