Table of contents
- Task01: Chapters 1 and 2 of Watermelon Book + Pumpkin Book
- Task02: Read Chapter 3 of Watermelon Book + Pumpkin Book in detail
- Chapter 3 Linear Models
- some concepts
- some knowledge
- △ Linear regression
- Task03: Watermelon Book + Pumpkin Book Chapter 4
- Task04: Watermelon Book + Pumpkin Book Chapter 5
Some background: I signed up for DataWhale's team study program on machine learning. This round covers chapters 1-6 of the Watermelon Book and the Pumpkin Book. Although it is called organized study, in practice you read the books and watch the videos on your own; DataWhale provides the study materials and a discussion group, and each task has a check-in deadline. Time to take a bite of the Watermelon Book.
DataWhale Eating Melon Tutorial Video Address:
→ [Eating Melon Tutorial] "Machine Learning Formula Detailed Explanation" (Pumpkin Book) and Watermelon Book Formula Derivation Live Collection
❤ 2022.7.11 ❤
Task01: Chapters 1 and 2 of Watermelon Book + Pumpkin Book
Chapter 1 Introduction
〇What is machine learning
The definition given in the book is as follows:
some concepts
data set
instance
sample
attribute
feature
attribute value
attribute space
sample space
feature vector
dimensionality
learning
training
training data
training sample
training set
hypothesis
ground-truth
learner
prediction
label
example
label space
classification
regression
binary classification
positive class
negative class
multi-class classification
testing
testing sample
generalization
distribution
independent and identically distributed (i.i.d.)
induction
deduction
specialization
inductive learning
concept
version space
inductive bias
〇 No Free Lunch theorem
Chapter 2 Model Evaluation and Selection
some concepts
error rate
accuracy
error
training error
empirical error
generalization error
overfitting
underfitting
model selection
testing set
hold-out
sampling
stratified sampling
fidelity
cross validation
k-fold cross validation
bootstrapping
bootstrap sampling
out-of-bag estimate
parameter tuning
performance measure
mean squared error
precision
recall
true positive
false positive
true negative
false negative
Break-Even Point (BEP)
classification threshold
cut point
ROC (Receiver Operating Characteristic)
True Positive Rate (TPR)
False Positive Rate (FPR)
unequal cost
cost matrix
total cost
cost-sensitive
cost curve
hypothesis test
binomial test
confidence
t-test
paired t-tests
contingency table
post-hoc test
bias-variance decomposition
some knowledge
· NP-hard problems
· Polynomial time
· Parameters for machine learning
· Normalization
→ How to understand normalization?
△ Evaluation method
· Hold-out method
· Cross validation
· Bootstrapping
· Parameter tuning and the final model
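The three splitting schemes listed above can be sketched in a few lines of plain Python. This is only an illustration, not the book's code: the helper names `hold_out_split`, `k_fold_indices` and `bootstrap_split` are hypothetical, and stratified sampling is left out for brevity.

```python
import random


def hold_out_split(n, test_ratio=0.3, seed=0):
    """Hold-out: shuffle the indices 0..n-1 once and cut them into train/test."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_ratio)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(n, k=5, seed=0):
    """k-fold cross validation: yield one (train, test) pair of index lists per fold."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def bootstrap_split(n, seed=0):
    """Bootstrapping: draw n indices with replacement; never-drawn indices are the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return train, out_of_bag


train, test = hold_out_split(10)
folds = list(k_fold_indices(10, k=5))
boot_train, oob = bootstrap_split(1000)
print(len(train), len(test))
print(len(oob) / 1000)  # for large n this fraction approaches 1/e, about 0.368
```

The out-of-bag fraction printed at the end is what makes the out-of-bag estimate possible: roughly a third of the samples never enter the bootstrap training set and can be used for testing.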
△ Performance metrics
· Error rate and accuracy
(Formula omitted)
· Precision, recall and F1
· ROC and AUC
· Cost-sensitive error rate and cost curve
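Since the formulas are omitted above, here is a small sketch of how error rate, accuracy, precision, recall and F1 follow from the four confusion counts. The function names and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    error_rate = (fp + fn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0   # P = TP / (TP + FP)
    recall = tp / (tp + fn) if tp + fn else 0.0      # R = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "error_rate": error_rate,
            "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)  # TP=3, FP=1, TN=3, FN=1, so every metric except error_rate is 0.75
```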
△ Comparison test
I vaguely remember learning these things in my mathematical statistics course, but I have since given it all back to the teacher... (not that I understood it at the time either).
I still don't understand them now, so I'm leaving placeholders here and will fill in the notes later.
[To be completed]
〇Hypothesis test
〇Cross-validated t-test
〇McNemar test
〇Friedman test and Nemenyi post-hoc test
△ Bias and Variance
· Bias Variance Decomposition
Task02: Read Chapter 3 of Watermelon Book + Pumpkin Book in detail
Chapter 3 Linear Models
some concepts
linear model
nonlinear model
comprehensibility
linear regression
Euclidean distance
least squares method
parameter estimation
multivariate linear regression
regularization
log-linear regression
generalized linear model
surrogate function
logistic function
link function
odds
log odds (also known as logit)
logistic regression (also known as logit regression)
maximum likelihood method
log-likelihood
Linear Discriminant Analysis (LDA)
within-class scatter matrix
between-class scatter matrix
generalized Rayleigh quotient
some knowledge
· Basic form of linear model
· Least squares method
→ Least squares method
· Regularization
· arg min
arg min returns the values of w and b at which the expression attains its minimum, not the minimum value itself
· Maximum Likelihood Estimation
→ One article to understand the maximum likelihood estimation
· Convex set, convex function
The definitions of concave and convex used here are the opposite of the ones in the calculus course; these are the definitions from optimization theory.
· Gradient
· Hessian Matrix
· Three elements of machine learning
· Information theory and information entropy
· Relative entropy (KL divergence)
· 2-norm
The 2-norm of a vector is the same as the modulus (length) of the vector
→ The 2-norm of a vector and the 2-norm of a matrix
· Lagrange multiplier method
· Generalized eigenvalues
· Generalized Rayleigh Quotient
Hermitian matrix: a matrix satisfying $\mathrm{A}^\mathrm{H} = \mathrm{A}$ [?]
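Two of the notes in this list, arg min and the vector 2-norm, are easy to check numerically. The function `f` and the candidate grid below are invented purely for illustration.

```python
import math


def f(w):
    """A toy objective with its minimum at w = 3."""
    return (w - 3) ** 2 + 2


candidates = [0, 1, 2, 3, 4, 5]

min_value = min(f(w) for w in candidates)  # min: the smallest value of f
arg_min = min(candidates, key=f)           # arg min: the w that attains it

v = [3.0, 4.0]
two_norm = math.sqrt(sum(x * x for x in v))  # 2-norm = length of the vector

print(min_value, arg_min, two_norm)  # 2 3 5.0
```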
△ Linear regression
○ Univariate linear regression
· Least squares estimation
· Maximum Likelihood Estimation
※ Textbook recommendation: Chen Xiru "Probability Theory and Mathematical Statistics"
Derive the probability distribution of y from the error term of the linear model
The objective obtained by maximum likelihood estimation is the same as the one obtained by the least squares method
· Solve for w and b
This explains formulas (3.5)-(3.7) on page 54 of the Watermelon Book: why the point where the partial derivatives are 0 is the optimal solution for w and b
First prove that the loss function is convex; the optimal solution then follows from the properties of convex functions.
That is, setting the partial derivatives equal to 0 and solving gives w and b as follows. See the Pumpkin Book for the derivation.
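For reference, the closed-form result of setting both partial derivatives to zero (formulas (3.7) and (3.8) in the Watermelon Book) can be checked with a few lines of plain Python. The function name `fit_univariate` and the toy data are assumptions made for this sketch.

```python
def fit_univariate(xs, ys):
    """Closed-form least squares for y = w * x + b.

    w = sum_i y_i (x_i - x_bar) / (sum_i x_i^2 - (1/m) (sum_i x_i)^2)
    b = (1/m) sum_i (y_i - w x_i)
    """
    m = len(xs)
    x_bar = sum(xs) / m
    numerator = sum(y * (x - x_bar) for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) - sum(xs) ** 2 / m
    w = numerator / denominator
    b = sum(y - w * x for x, y in zip(xs, ys)) / m
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1 without noise
w, b = fit_univariate(xs, ys)
print(w, b)  # 2.0 1.0
```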
○ Multiple Linear Regression
The difference between multiple linear regression and univariate linear regression is that x is now a vector
Write b as $w_{d+1} \cdot 1$ and absorb it into the weight vector to get the matrix form
· Vectorization of least squares method for multiple linear regression
Vectorization avoids explicit for loops in the summation: the sum is rewritten as a vector product so that the optimized matrix routines of numerical libraries such as numpy can be used
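A small numpy sketch of what vectorization means here: the same squared-error sum computed once with a for loop and once as a vector product. The data is random and the weight vector `w_hat` is arbitrary; only the equality of the two results matters.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))             # 200 samples, 3 features
y = rng.normal(size=200)
w_hat = np.array([0.5, -1.0, 2.0, 0.1])   # last entry plays the role of b

# Append a constant 1 to every sample so that b is absorbed into w_hat.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Loop version: sum_i (y_i - w_hat^T x_i)^2
loss_loop = 0.0
for i in range(X_aug.shape[0]):
    residual = y[i] - w_hat @ X_aug[i]
    loss_loop += residual ** 2

# Vectorized version: (y - X w_hat)^T (y - X w_hat)
r = y - X_aug @ w_hat
loss_vec = r @ r

print(np.isclose(loss_loop, loss_vec))  # True
```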
The transformation of this equation is detailed as follows
· Solve for $\hat{w}$
Details as follows
The numerator is a scalar and the denominator is a vector. Differentiating a scalar with respect to a vector belongs to matrix calculus. The rule is as follows
The goal is to obtain the Hessian matrix, so start by computing the first-order partial derivative
Once the Hessian matrix is shown to be positive definite (so the loss is convex), set the first-order derivative equal to 0 to obtain the optimal vector $\hat{w}$
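The two steps (check the Hessian, then set the gradient to zero) can be sketched with numpy. This assumes the usual loss $(y - X\hat{w})^T(y - X\hat{w})$ with a full-rank $X^TX$; the synthetic data and variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([1.5, -2.0])
true_b = 0.5
y = X @ true_w + true_b                    # noise-free targets

# Absorb b into the weight vector by appending a column of ones.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Hessian of the loss is 2 X^T X; all eigenvalues > 0 means positive definite.
hessian = 2 * X_aug.T @ X_aug
eigenvalues = np.linalg.eigvalsh(hessian)

# Gradient 2 X^T (X w - y) = 0  =>  X^T X w = X^T y
w_hat = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
gradient = 2 * X_aug.T @ (X_aug @ w_hat - y)

print(eigenvalues.min() > 0)  # True: the loss is strictly convex
print(w_hat)                  # close to [1.5, -2.0, 0.5]
```

`np.linalg.solve` is used instead of forming the inverse explicitly, which is the usual numerically safer choice for the same equation.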
○ Logistic regression (log-odds regression)
Log-odds regression is logistic regression; in essence it is a classification algorithm.
· Maximum likelihood estimation for logistic regression
First, combine the previously derived linear regression model with the sigmoid mapping function to build a probability mass function
※ Because y only takes the values 0 and 1, the function is a probability mass function, not a probability density function
Write the probability mass function in a unified form
Write the likelihood function and take the logarithm to turn the product into a sum
Substitute the previously derived expressions
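The steps above can be sketched for a single feature in plain Python. This is an illustration under assumptions, not the book's procedure: it writes the probability mass function as $p_1^{y} p_0^{1-y}$, minimizes the negative log-likelihood with plain gradient descent, and uses invented data, learning rate and step count.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neg_log_likelihood(w, b, xs, ys):
    """-sum_i [ y_i ln p1(x_i) + (1 - y_i) ln p0(x_i) ], with p0 = 1 - p1."""
    total = 0.0
    for x, y in zip(xs, ys):
        p1 = sigmoid(w * x + b)
        total -= y * math.log(p1) + (1 - y) * math.log(1.0 - p1)
    return total


def fit(xs, ys, lr=0.1, steps=2000):
    """Gradient descent on the negative log-likelihood."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs))
        grad_b = sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data with one overlapping pair, so the optimum is finite.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

nll_before = neg_log_likelihood(0.0, 0.0, xs, ys)
w, b = fit(xs, ys)
nll_after = neg_log_likelihood(w, b, xs, ys)
print(nll_before, nll_after)
print(sigmoid(w * 2.0 + b), sigmoid(w * -2.0 + b))
```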
· Use an information-theoretic method (minimizing relative entropy) for the optimization
First write down the ideal distribution and the model distribution, then substitute them into the cross-entropy formula
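As a sketch of that step, assuming the Watermelon Book's notation $\beta = (w; b)$ and $\hat{x} = (x; 1)$, the label distribution $p_i$ and the model distribution $q_i$ for sample $i$ lead to the same objective as maximum likelihood:

```latex
% Ideal (label) distribution for sample i:  p_i(1) = y_i,  p_i(0) = 1 - y_i
% Model distribution:  q_i(1) = p_1(\hat{x}_i;\beta),  q_i(0) = p_0(\hat{x}_i;\beta)
\begin{aligned}
\sum_{i=1}^{m} D_{KL}(p_i \,\|\, q_i)
  &= \underbrace{\sum_{i=1}^{m}\sum_{y} p_i(y)\ln p_i(y)}_{\text{does not depend on } \beta}
     \;-\; \sum_{i=1}^{m}\sum_{y} p_i(y)\ln q_i(y) \\
-\sum_{i=1}^{m}\sum_{y} p_i(y)\ln q_i(y)
  &= -\sum_{i=1}^{m}\Big[\, y_i \ln p_1(\hat{x}_i;\beta) + (1-y_i)\ln p_0(\hat{x}_i;\beta) \Big] \\
  &= \sum_{i=1}^{m}\Big( -y_i \beta^{\mathrm{T}}\hat{x}_i
     + \ln\big(1 + e^{\beta^{\mathrm{T}}\hat{x}_i}\big) \Big)
\end{aligned}
```

So minimizing the relative entropy is the same as minimizing the cross entropy, which in turn is the negative log-likelihood.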
· The three elements of machine learning for the logistic regression algorithm
○ Binary Classification Linear Discriminant Analysis
Contents in the book:
$x_{i}$: n-dimensional feature vector
$y_{i}$: sample label, taking the value 0 or 1
$X_{i}$: the subscript $i$ refers to the value of y, that is, 0 or 1; $X_{i}$ is the set of samples that share that value of y
$\mu_{i}$: $i$ takes 0 or 1; the mean vector of the samples with the same label (positive or negative samples)
$\Sigma_{i}$: $i$ takes 0 or 1; the covariance matrix of the samples with the same label (positive or negative samples)
Generally speaking, the covariance matrix has the following form; the formula in the book is somewhat simplified:
$\Sigma_{0} = \frac{1}{m_{0}} \sum_{x \in X_{0}} (x - \mu_{0})(x - \mu_{0})^T$
$\Sigma_{1} = \frac{1}{m_{1}} \sum_{x \in X_{1}} (x - \mu_{1})(x - \mu_{1})^T$
· Algorithm principle
See picture above
· Loss function derivation
In general calculations, use the following form
$\boldsymbol{w}^\mathrm{T}\boldsymbol{\mu}$: the inner product of two vectors. It gives the projection of the centers of the positive and negative samples onto $\boldsymbol{w}$, multiplied by the modulus of $\boldsymbol{w}$. Multiplying both projections by the same modulus does not affect where the distance is maximized (it is equivalent to scaling both by the same factor)
max: maximize the distance between the projected centers
The "variance" here is not strict: the usual variance is divided by the number of samples (see the formula above). Here the number of samples is taken to be the same, so this amounts to multiplying by a common coefficient, which has no effect on the calculation and is therefore omitted
$(\boldsymbol{w}^\mathrm{T}\boldsymbol{x} - \boldsymbol{w}^\mathrm{T}\boldsymbol{\mu}_{0})(\boldsymbol{x}^\mathrm{T}\boldsymbol{w} - \boldsymbol{\mu}_{0}^\mathrm{T}\boldsymbol{w})$ is equivalent to $(\boldsymbol{w}^\mathrm{T}\boldsymbol{x} - \boldsymbol{w}^\mathrm{T}\boldsymbol{\mu}_{0})^2$, which is analogous to $(x - \bar{x})^2$
So the following formula is derived
※ Knowledge to be discussed later: generalized Rayleigh quotient
Here the modulus of $\boldsymbol{w}$ is fixed, because the modulus of $\boldsymbol{w}$ does not affect the result (it cancels between the numerator and the denominator). A value is therefore assigned to it (the denominator is set to 1), which simplifies the later calculation. Either the numerator or the denominator could be fixed; here the denominator is chosen.
$\min$ is used here because optimization problems usually write the loss function in minimization form. If the original objective is a maximization, a negative sign has to be added in front of it
· Use the Lagrange multiplier method to solve for $\boldsymbol{w}$
For matrix differentiation, consult the reference manual (introduced earlier)
This is a generalized eigenvalue problem (as opposed to an ordinary eigenvalue problem).
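A numerical sketch of the whole derivation, assuming the standard two-class result $\boldsymbol{w} = \mathbf{S}_w^{-1}(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)$ from the Watermelon Book and using synthetic Gaussian data. The variable names and the helper `rayleigh_quotient` are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))  # class 0 samples
X1 = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))  # class 1 samples

mu0 = X0.mean(axis=0)
mu1 = X1.mean(axis=0)

# Within-class scatter: sum over both classes of (x - mu)(x - mu)^T
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
# Between-class scatter: (mu0 - mu1)(mu0 - mu1)^T
diff = (mu0 - mu1).reshape(-1, 1)
S_b = diff @ diff.T


def rayleigh_quotient(w):
    """Generalized Rayleigh quotient w^T S_b w / w^T S_w w."""
    return (w @ S_b @ w) / (w @ S_w @ w)


# Solve S_w w = mu0 - mu1 instead of inverting S_w explicitly.
w = np.linalg.solve(S_w, mu0 - mu1)
lam = rayleigh_quotient(w)

# w satisfies the generalized eigenvalue equation S_b w = lambda S_w w.
print(np.allclose(S_b @ w, lam * (S_w @ w)))  # True

# Any other direction gives a quotient that is no larger.
other = np.array([1.0, 0.0])
print(rayleigh_quotient(w) >= rayleigh_quotient(other))  # True
```

Rescaling `w` leaves the quotient unchanged, which is why the derivation is free to fix the denominator to 1.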
Discussion
○ Multi-class learning [placeholder, to be filled in]
○ The class imbalance problem [placeholder, to be filled in]
❤ 2022.7.21 ❤
Task03: Watermelon Book + Pumpkin Book Chapter 4
Chapter 4 Decision Trees
some concepts
decision tree
divide-and-conquer
information entropy
information gain
gain ratio
Gini index
pruning
pre-pruning
post-pruning
decision stump
bi-partition
multivariate decision tree
univariate decision tree
incremental learning
some knowledge
· Self-information and information entropy
※ "Purity" of the sample
· Conditional entropy
· Information gain
· Gini value and Gini index
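The quantities in this list can be computed directly from label counts. Below is a minimal sketch with a made-up four-sample data set; the function names are hypothetical, and entropy uses base-2 logarithms as in the Watermelon Book.

```python
import math
from collections import Counter


def entropy(labels):
    """Information entropy Ent(D) = -sum_k p_k log2 p_k."""
    m = len(labels)
    return -sum((c / m) * math.log2(c / m) for c in Counter(labels).values())


def gini(labels):
    """Gini value Gini(D) = 1 - sum_k p_k^2."""
    m = len(labels)
    return 1.0 - sum((c / m) ** 2 for c in Counter(labels).values())


def split_by(values, labels):
    """Group the labels by the value of one attribute."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return groups


def information_gain(values, labels):
    """Ent(D) minus the conditional entropy after splitting on the attribute."""
    m = len(labels)
    groups = split_by(values, labels).values()
    return entropy(labels) - sum(len(g) / m * entropy(g) for g in groups)


def gini_index(values, labels):
    """Weighted Gini value of the subsets produced by the attribute."""
    m = len(labels)
    groups = split_by(values, labels).values()
    return sum(len(g) / m * gini(g) for g in groups)


labels = ["good", "good", "bad", "bad"]
texture = ["clear", "clear", "blurry", "blurry"]  # separates the classes perfectly
color = ["green", "black", "green", "black"]      # carries no information

print(entropy(labels), gini(labels))                                    # 1.0 0.5
print(information_gain(texture, labels), information_gain(color, labels))  # 1.0 0.0
print(gini_index(texture, labels), gini_index(color, labels))              # 0.0 0.5
```

A purer split means lower entropy and lower Gini value, so ID3 picks the attribute with the largest information gain while CART picks the one with the smallest Gini index.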
△ Algorithm principle
· ID3 decision tree
· C4.5 decision tree
· CART decision tree
The CART decision tree must be a binary tree
❤ 2022.7.25 ❤
Task04: Watermelon Book + Pumpkin Book Chapter 5
Chapter 5 Neural Networks
some concepts
neural networks
neuron
threshold
activation function
squashing function
perceptron
threshold logic unit
dummy node
learning rate
functional neuron
linearly separable
converge
fluctuation
hidden layer
multi-layer feedforward neural networks
connection weight
error BackPropagation (BP)
gradient descent
accumulated error backpropagation
trial-by-error
early stopping
regularization
local minimum
global minimum
simulated annealing
genetic algorithms
RBF (Radial Basis Function) network
competitive learning
winner-take-all
ART (Adaptive Resonance Theory) network
stability-plasticity dilemma
best matching unit
Cascade-Correlation network
correlation
recurrent neural networks
energy
energy-based model
Restricted Boltzmann Machine (RBM)
Contrastive Divergence (CD)
deep learning
unsupervised layer-wise training
pre-training
fine-tuning
weight sharing
Convolutional Neural Network (CNN)
feature map
pooling
feature learning
representation learning
feature engineering
some knowledge
· MP neurons
△ Perceptron
· Perceptron model
Perceptron is a classification model
More information can be found in "Statistical Learning Methods" - Mr. Yang
Geometric interpretation of the perceptron
A schematic diagram of a hyperplane is shown in the figure.
The equation of the hyperplane is:
$x_{1} + x_{2} - 1 = 0$
and the corresponding normal vector is $\boldsymbol{w} = (1,1)^\mathrm{T}$, with $b = -1$.
The normal vector is perpendicular to the hyperplane. In this diagram the normal vector points to the right of the hyperplane, so the right side is the positive half-space and the left side is the negative half-space.
Substituting a point in the positive half-space into the hyperplane equation gives a value greater than zero; a point in the negative half-space gives a value less than zero.
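The sign rule can be checked directly for this hyperplane. The test points below are chosen arbitrarily for illustration.

```python
w = (1.0, 1.0)  # normal vector of the hyperplane x1 + x2 - 1 = 0
b = -1.0


def decision_value(x):
    """Evaluate w^T x + b for a 2-D point x."""
    return w[0] * x[0] + w[1] * x[1] + b


def side(x):
    value = decision_value(x)
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "on the hyperplane"


print(side((2.0, 2.0)))  # positive: 2 + 2 - 1 = 3 > 0
print(side((0.0, 0.0)))  # negative: 0 + 0 - 1 = -1 < 0
print(side((0.5, 0.5)))  # on the hyperplane: 0.5 + 0.5 - 1 = 0
```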
· Perceptron learning strategy
Here $\theta$ is treated as a "dummy node"; the same trick was mentioned earlier for multiple linear regression.
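A minimal sketch of the dummy-node trick in a perceptron, assuming the update rule $w_i \leftarrow w_i + \eta (y - \hat{y}) x_i$ from the Watermelon Book and a step activation with output 1 at zero. The threshold $\theta$ is stored as one extra weight whose input is fixed at -1. The function names, the AND data set and the learning rate 0.5 are choices made for this example.

```python
def step(z):
    """Step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def augment(x):
    """Append the dummy input -1 so that theta becomes an ordinary weight."""
    return list(x) + [-1.0]


def predict(w, x):
    return step(sum(wi * xi for wi, xi in zip(w, augment(x))))


def train_perceptron(samples, eta=0.5, epochs=100):
    n = len(samples[0][0])
    w = [0.0] * (n + 1)  # the last entry is theta
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            y_hat = predict(w, x)
            if y_hat != y:
                mistakes += 1
                x_aug = augment(x)
                for i in range(n + 1):
                    w[i] += eta * (y - y_hat) * x_aug[i]
        if mistakes == 0:  # every sample classified correctly
            break
    return w


# Logical AND is linearly separable, so the perceptron can learn it.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_samples)
print(w)  # the last entry is the learned threshold theta
print([predict(w, x) for x, _ in and_samples])  # [0, 0, 0, 1]
```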