Going back over the course content, this review strips out most of the formula derivations and keeps the main ideas.
Outline
Categories: supervised, unsupervised, semi-supervised, and reinforcement learning
Supervised learning: the data comes with inputs and labels. Regression, classification, and sequence labeling problems.
Generative model: models probabilities and predicts from them
Discriminative model: learns the decision function directly
Maximum likelihood estimation (MLE): directly maximize the probability of the training-set samples
Maximum a posteriori (MAP): MLE with a prior probability added on top
Representative unsupervised learning task: clustering
Decision Tree
A mapping from input variables to outputs has a truth table; write it in tree form, and each root-to-leaf path corresponds to a row of the truth table
Optimization goals: reduce the size of the tree and improve generalization
Attribute selection: choose the optimal attribute by entropy (information gain).
Pruning: pre-pruning (do not split when splitting does not help) and post-pruning (replace a subtree with a leaf node when that works better)
Handling continuous values: binarize (split at a threshold)
Handling missing values: generalize the information-gain formula with sample weights when computing the per-category statistics
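As a minimal sketch of entropy-based attribute selection (pure Python; the toy weather data and function names are invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting on attribute index `attr`."""
    base = entropy(labels)
    n = len(labels)
    # Partition the labels by the attribute's value.
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr], []).append(y)
    cond = sum(len(part) / n * entropy(part) for part in parts.values())
    return base - cond

# Toy data: attribute 0 perfectly predicts the label, attribute 1 does not.
rows = [("sunny", "high"), ("sunny", "low"), ("rain", "high"), ("rain", "low")]
labels = ["yes", "yes", "no", "no"]
best = max(range(2), key=lambda a: information_gain(rows, labels, a))
print(best)  # 0 — attribute 0 has the larger information gain
```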
Linear Regression
Given a data set, find a model that can predict the output
Linear regression: \(f(x_i) = w^T x_i + b\); fit by minimizing the mean squared error
Regularization: controls the model structure by penalizing the magnitude of the weight coefficients, scaled by \(\lambda\)
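A minimal sketch of the least-squares fit in one dimension, with an optional L2 penalty \(\lambda\) on the slope (pure Python; this closed form assumes only the slope, not the intercept, is penalized):

```python
def ridge_fit_1d(xs, ys, lam=0.0):
    """Fit y ≈ w*x + b by minimizing sum (y - w*x - b)^2 + lam * w^2.
    Setting the derivative in b to zero gives b = ybar - w*xbar;
    substituting back gives w = Sxy / (Sxx + lam)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    w = sxy / (sxx + lam)
    b = ybar - w * xbar
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # exactly y = 2x + 1
w, b = ridge_fit_1d(xs, ys)        # lam=0 recovers the plain MSE fit
print(round(w, 6), round(b, 6))    # 2.0 1.0
```

With `lam > 0` the slope is shrunk toward zero, which is the structural penalty the note refers to.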
Probability
Chebyshev's inequality: suppose the random variable \(X\) has expectation \(E(X) = \mu\) and variance \(Var(X) = \sigma^2\); then for any \(\epsilon > 0\), \(P(|X-\mu| \ge \epsilon) \le \frac{\sigma^2}{\epsilon^2}\)
Law of large numbers: the mean of \(n\) independent and identically distributed random variables converges in probability to \(\mu\)
Central limit theorem: the sum of a large number of independent and identically distributed variables converges in distribution to a normal distribution.
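A quick empirical check of Chebyshev's bound (assuming \(X\) uniform on \([0,1]\), so \(\mu = 0.5\) and \(\sigma^2 = 1/12\); the sample size and \(\epsilon\) are arbitrary choices):

```python
import random

random.seed(0)
mu, var = 0.5, 1.0 / 12   # mean and variance of Uniform[0, 1]
eps = 0.4
n = 100_000
hits = sum(1 for _ in range(n) if abs(random.random() - mu) >= eps)
empirical = hits / n      # true value is P(|X - 0.5| >= 0.4) = 0.2
bound = var / eps ** 2    # Chebyshev bound: sigma^2 / eps^2 ≈ 0.52
print(empirical <= bound) # True — the bound holds, loosely here
```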
MLE and MAP:
MLE treats the parameter as an unknown constant, to be estimated from the data
MAP treats the parameter as a random variable with its own probability distribution
MLE easily overfits on small data; MAP gives different results under different priors.
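The overfitting point can be shown with a coin-flip sketch (the Beta prior and the estimator formulas are the standard ones for a Bernoulli parameter; the specific counts are invented):

```python
# k heads out of n tosses of a coin with unknown bias theta.
def mle(k, n):
    """MLE: theta_hat = k / n."""
    return k / n

def map_beta(k, n, a, b):
    """MAP with a Beta(a, b) prior: mode of the posterior Beta(k+a, n-k+b)."""
    return (k + a - 1) / (n + a + b - 2)

k, n = 3, 3                    # three heads out of three tosses
print(mle(k, n))               # 1.0 — MLE overfits the tiny sample
print(map_beta(k, n, 2, 2))    # 0.8 — a Beta(2,2) prior pulls it toward 0.5
```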
Bayesian decision theory
Bayesian decision theory: choose the class label, based on the probabilities and the loss of misclassification, that minimizes the risk function.
Decision surface: in binary classification, the surface made up of the points where the two class probabilities are equal.
Bayes error: the probability of a classification mistake, \(P(\text{mistake}) = P(X \in L_1, Y=0) + P(X \in L_0, Y=1)\), where \(L_i\) is the region assigned to class \(i\)
Three ways to build a classifier:
- Estimate the class-conditional probability densities and the prior probabilities, then infer the posterior via Bayes' theorem (generative model)
- Model the posterior probability directly, then classify using decision theory (discriminative model)
- Learn a function that maps the input directly to a label, with no probabilities involved.
KNN (k-nearest neighbors) classifier
Label a sample by majority vote among its k nearest training samples.
Key design choices: the value of k, the distance metric, and the decision rule
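A minimal sketch of the voting rule (pure Python; Euclidean distance and plain majority vote are assumed, and the training points are invented):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to `query`.
    `train` is a list of (point, label) pairs; distance is Euclidean."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5), k=3))  # 'a'
print(knn_predict(train, (5.5, 5.5), k=3))  # 'b'
```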
Naive Bayes
Generative model
Assume the features are conditionally independent given the class, so the joint probability factorizes over the features; then apply Bayes' formula
\[ Y_{new} = \arg\max_{y_k} P(Y = y_k) \prod_{i=1}^{n} P(X_i^{new} \mid Y = y_k) \]
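A count-based sketch of the formula above for categorical features (no smoothing, and the toy data is invented for illustration):

```python
from collections import Counter, defaultdict

def nb_train(rows, labels):
    """Estimate P(Y) and P(X_i | Y) by counting."""
    n = len(labels)
    prior = {y: c / n for y, c in Counter(labels).items()}
    cond = defaultdict(Counter)   # (feature index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1
    return prior, cond

def nb_predict(prior, cond, row):
    """arg max over y of P(Y=y) * prod_i P(X_i = row[i] | Y=y)."""
    def score(y):
        p = prior[y]
        for i, v in enumerate(row):
            counts = cond[(i, y)]
            p *= counts[v] / sum(counts.values())
        return p
    return max(prior, key=score)

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
prior, cond = nb_train(rows, labels)
print(nb_predict(prior, cond, ("rain", "mild")))  # 'yes'
```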
Logistic regression
Discriminative model. Learns \(P(Y|X)\) directly
\[ P(Y=1 \mid X) = \frac{1}{1 + \exp(w_0 + w^T X)} \]
Can be extended to a multiclass classifier. So the goal is to learn \(w\)
Maximize the conditional log-likelihood (the negative of the cross-entropy) \(l(w) = \sum_l Y^l \ln P(Y^l=1 \mid X^l, W) + (1-Y^l) \ln P(Y^l=0 \mid X^l, W)\).
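A gradient-ascent sketch of maximizing \(l(w)\) for a single feature, keeping the sign convention of the formula above (\(P(Y=1|x)\) is large when \(w_0 + wx\) is very negative); the data and learning rate are invented:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Gradient ascent on l(w), with the notes' convention
    P(Y=1|x) = 1/(1+exp(w0 + w*x)) = sigmoid(-(w0 + w*x)).
    The gradient works out to dl/dw = sum_i (p1_i - y_i) * x_i."""
    w0 = w = 0.0
    for _ in range(epochs):
        g0 = g = 0.0
        for x, y in zip(xs, ys):
            p1 = sigmoid(-(w0 + w * x))
            g0 += p1 - y
            g += (p1 - y) * x
        w0 += lr * g0 / len(xs)   # step along the gradient (ascent)
        w += lr * g / len(xs)
    return w0, w

xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [1, 1, 1, 0, 0, 0]
w0, w = fit_logistic(xs, ys)
preds = [1 if sigmoid(-(w0 + w * x)) > 0.5 else 0 for x in xs]
print(preds)  # [1, 1, 1, 0, 0, 0]
```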
Support vector machine (SVM)
Find a hyperplane that separates the samples into the two classes with maximum margin
That is, every point of class \(+1\) satisfies \(w^T x + b \ge C\), and every point of class \(-1\) satisfies \(w^T x + b \le -C\)
Maximize the margin, i.e. \(2C/\|w\|\); since \(C\) can be rescaled to 1, the final formulation is
\[ \max_{w,b} \frac{1}{\|w\|_2} \quad s.t.\ \ y_i(w^T x_i + b) \ge 1 \]
This is a convex quadratic optimization problem, solved with the Lagrange multiplier method.
The above is hard-margin maximization; in practice one maximizes a soft margin: add a slack variable to each sample point and pay a cost for the slack. That is
\[ \min_{w,b} \frac{1}{2}\|w\|_2^2 + C \sum_i \xi_i \quad s.t.\ \ y_i(w^T x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \]
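A subgradient-descent sketch of the soft-margin objective, replacing each \(\xi_i\) with the equivalent hinge loss \(\max(0, 1 - y_i(w^T x_i + b))\) (the data, step size, and epoch count are invented):

```python
def svm_sgd(points, ys, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i hinge_i for 2-D inputs.
    A point contributes to the subgradient only if it violates the margin."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        gw = [w[0], w[1]]   # gradient of the 0.5*||w||^2 term
        gb = 0.0
        for (x1, x2), y in zip(points, ys):
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:   # inside the margin
                gw[0] -= C * y * x1
                gw[1] -= C * y * x2
                gb -= C * y
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        b -= lr * gb
    return w, b

points = [(2, 2), (3, 3), (-2, -2), (-3, -3)]
ys = [1, 1, -1, -1]
w, b = svm_sgd(points, ys)
signs = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for x1, x2 in points]
print(signs)  # [1, 1, -1, -1]
```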
Clustering
k-means:
Initialize k cluster centers; assign each sample to its nearest center; then update each center's coordinates; iterate repeatedly.
What it actually optimizes is \(\min_{\mu, C} \sum_i \sum_{C(j)=i} \|\mu_i - x_j\|^2\)
These are really EM-style steps: first fix \(\mu\) and optimize \(C\), then fix \(C\) and optimize \(\mu\)
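The two alternating steps as a sketch (pure Python; the toy data and initial centers are invented):

```python
import math

def kmeans(points, centers, iters=20):
    """Alternate: assign each point to its nearest center (optimize C),
    then move each center to its cluster's mean (optimize mu)."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else mu
            for cl, mu in zip(clusters, centers)
        ]
    return centers

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers = kmeans(points, centers=[(0, 0), (5, 5)])
print(sorted(centers))  # one center near (1/3, 1/3), one near (28/3, 28/3)
```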
GMM (Gaussian mixture model):
The assignment function \(C\) in k-means is too hard; replace it with a posterior probability that \(x\) belongs to each class, then proceed as in MLE; in short, this yields the iteration formulas
EM steps: first compute the posterior probabilities, then update the parameters using those posteriors
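A one-dimensional sketch of these two steps (pure Python; the toy data, initial parameters, and iteration count are invented):

```python
import math

def em_gmm_1d(xs, mus, sigmas, pis, iters=50):
    """EM for a 1-D Gaussian mixture.
    E-step: responsibility gamma[i][j] = P(component j | x_i).
    M-step: weighted MLE updates of the weights, means, and variances."""
    k = len(mus)
    for _ in range(iters):
        gamma = []
        for x in xs:
            dens = [pis[j] * math.exp(-(x - mus[j]) ** 2 / (2 * sigmas[j] ** 2))
                    / (sigmas[j] * math.sqrt(2 * math.pi)) for j in range(k)]
            s = sum(dens)
            gamma.append([d / s for d in dens])
        for j in range(k):
            nj = sum(g[j] for g in gamma)
            pis[j] = nj / len(xs)
            mus[j] = sum(g[j] * x for g, x in zip(gamma, xs)) / nj
            sigmas[j] = math.sqrt(
                sum(g[j] * (x - mus[j]) ** 2 for g, x in zip(gamma, xs)) / nj
            ) or 1e-6  # guard against variance collapsing to zero
    return mus, sigmas, pis

xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mus, sigmas, pis = em_gmm_1d(xs, mus=[0.0, 6.0], sigmas=[1.0, 1.0], pis=[0.5, 0.5])
print([round(m, 2) for m in sorted(mus)])  # approximately [1.0, 5.0]
```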
PCA (principal component analysis)
The main purpose is dimensionality reduction: remove the correlated dimensions of the original sample space, keeping the dimensions that better represent the original data.
Specific steps:
- Center the data
- Compute the covariance matrix
- Eigendecompose the covariance matrix, take the eigenvectors of the k largest eigenvalues, normalize them, and stack them into the eigenvector matrix W
- \(z_i=W^Tx_i\)
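These steps as a two-dimensional sketch with k = 1 (pure Python; the closed-form 2x2 eigendecomposition and the toy data are for illustration):

```python
import math

def pca_2d_to_1d(points):
    """The steps above for 2-D data projected to k = 1: center the data,
    form the covariance matrix, take the unit eigenvector of the largest
    eigenvalue as W, and project z_i = W^T x_i."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centered) / n
    b = sum(x * y for x, y in centered) / n
    c = sum(y * y for _, y in centered) / n
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Corresponding eigenvector (lam - c, b), normalized
    vx, vy = lam - c, b
    norm = math.hypot(vx, vy)
    w = (vx / norm, vy / norm)
    return [w[0] * x + w[1] * y for x, y in centered], w

zs, w = pca_2d_to_1d([(0, 0), (1, 1), (2, 2), (3, 3)])
print(w)  # principal direction along the diagonal: ~(0.707, 0.707)
```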
The idea is roughly to find the k unit directions of greatest variation in the sample space, keep those, and erase the other directions, i.e. project onto a k-dimensional hyperplane.
The discarded features are often associated with noise, so in a sense this is also denoising