[Autumn Recruitment] Machine Learning "Eight-Part Essay" for Algorithm Positions

Recommended links: Summary of common interview questions in Axiu's study notes
JavaGuide
Essential "eight-part essay" for machine learning interviews and written tests
Naive Bayes model (naive bayes)
Random Forest | RF

machine learning

feature engineering

  1. The significance of feature normalization: Feature normalization is an important data-preprocessing technique. Because different features may have different units (scales), normalization removes the effect of unit and scale differences, makes distance calculations in downstream tasks meaningful, and treats every feature dimension equally. [It converts absolute values into relative values, so the relative importance of each feature dimension can be seen.]
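A minimal sketch of two common normalization schemes (min-max and z-score), assuming numpy and a made-up feature matrix; the function names are illustrative, not from the original post:

```python
# Minimal sketch of two common normalization schemes (numpy assumed).
import numpy as np

def min_max_scale(X):
    """Rescale each feature column to [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min + 1e-12)  # epsilon avoids division by zero

def z_score_scale(X):
    """Standardize each feature column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

X = np.array([[180.0, 70.0], [160.0, 50.0], [170.0, 60.0]])  # e.g. height (cm), weight (kg)
print(min_max_scale(X))
print(z_score_scale(X))
```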

  2. How to calculate the distance between features/vectors

    • Euclidean distance: measures the straight-line distance between points in space. The distance calculation formula between n-dimensional vectors is as follows:
      $\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$

    • Manhattan distance: for two points $(x_1, y_1)$ and $(x_2, y_2)$ it is calculated as:
      $\left| x_1-x_2 \right| + \left| y_1-y_2 \right|$

    • Chebyshev distance: for two points $(x_1, y_1)$ and $(x_2, y_2)$ it is defined as the maximum absolute difference between their coordinates:
      $\max(\left| x_1-x_2 \right|, \left| y_1-y_2 \right|)$

    • Cosine similarity: the cosine of the angle between two vectors. A cosine value close to 1 means the angle is close to 0, i.e. the two vectors are similar; the larger the cosine value, the more similar the vectors, with values in [-1, 1]. For n-dimensional vectors the cosine is:
      $\cos\Theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
      Taking two points $(x_1, y_1)$ and $(x_2, y_2)$ as an example:
      $\cos\Theta((x_1,y_1),(x_2,y_2)) = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2+y_1^2}\,\sqrt{x_2^2+y_2^2}}$

    • Cosine distance = 1 - cosine similarity
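A minimal numpy sketch of the measures above (Euclidean, Manhattan, Chebyshev, cosine similarity/distance); the vectors are made up for illustration:

```python
# Minimal sketch of the distance/similarity measures listed above (numpy assumed).
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def chebyshev(x, y):
    return np.max(np.abs(x - y))

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 1.0])
print(euclidean(x, y), manhattan(x, y), chebyshev(x, y))
print(cosine_similarity(x, y), 1 - cosine_similarity(x, y))  # similarity and distance
```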

  3. The role of One-Hot encoding: In many machine learning tasks, features are not always continuous values; they may also be discrete (categorical) values. One-Hot encoding represents each discrete value as a binary indicator vector, so that such data can be expressed numerically and processed much more efficiently.
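A minimal sketch of one-hot encoding a categorical feature, assuming numpy; the "colors" feature is a made-up example:

```python
# Minimal sketch of one-hot encoding a discrete (categorical) feature.
import numpy as np

colors = ["red", "green", "blue", "green"]       # a discrete feature
categories = sorted(set(colors))                 # ['blue', 'green', 'red']
index = {c: i for i, c in enumerate(categories)}

one_hot = np.zeros((len(colors), len(categories)))
for row, c in enumerate(colors):
    one_hot[row, index[c]] = 1.0                 # one 1 per row, in the category's column
print(one_hot)
```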

Common computing models

Overview

In machine learning, common models include:

  • Linear Regression and Logistic Regression : linear regression is mainly used for prediction problems, predicting a numerical output based on the characteristics of the input data; logistic regression builds on it and is mainly used for classification.
  • Decision Tree : A supervised learning algorithm used for classification and regression analysis . In classification problems, decision trees will classify data into different categories based on the characteristics of the data; in regression analysis, decision trees are used to predict continuous values.
  • Random Forest model (Random Forest) : a classic bagging method, i.e. a machine learning algorithm based on an ensemble of decision trees, usually used for classification and regression problems. It improves prediction accuracy and reduces the risk of overfitting by randomly selecting sample subsets and feature subsets from the dataset, building multiple decision trees, and then combining their outputs. Its advantages include ease of implementation and interpretation, robustness to missing data and outliers, and high accuracy.
  • Support Vector Machine model : It is a supervised learning algorithm mainly used for classification problems. Its goal is to find an optimal hyperplane (which can be linear or nonlinear) that maximizes the distance between different categories to classify the data.
  • Bayesian classifier model (Naive Bayes) : applies Bayes' theorem with a conditional-independence assumption between features to predict the category. The classifier does not require large amounts of data or computing resources and can efficiently process large, high-dimensional data sets. In addition, because it is based on a probabilistic model it is easy to understand and interpret, which facilitates optimization and tuning. Naive Bayes can also handle missing data and has strong robustness and reliability.
  • K-Nearest Neighbor model : It is a classification algorithm that finds K training samples that are closest to the characteristics of the sample to be classified , and determines the category of the sample to be classified based on the majority of the categories to which these K samples belong. The main advantage of the KNN algorithm is that it is simple to understand and easy to implement, but it is time-consuming when processing large-scale data sets.
  • Neural Network model (Neural Network): a computing model based on neurons. By introducing nonlinear activation functions such as sigmoid, it gains nonlinear expressive power and can solve many complex machine learning problems such as image and speech recognition. Variants include the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), the Generative Adversarial Network (GAN), etc. Different types of neural network models have their own application scenarios and focuses; choosing the appropriate model achieves better results on specific problems.

Linear regression model and logistic regression model

linear regression model
  1. Model Assumptions: Linear models assume a linear relationship between the dependent and independent variables.
  2. Model definition: Linear regression characterizes the input data by assigning a different weight to each feature dimension, so that all features jointly produce the final decision. Note that this is the output of the fitted model; it is suitable when the predicted value is a continuous variable over the whole real line, and it cannot be used directly for classification.
  3. Definition:
    $h_\theta(x) = \theta^T X = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$
    The cost function used to solve for the parameter $\theta$ is the Mean Squared Error (MSE).
  4. Cost function:
    $J_\theta = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^i) - y^i\right)^2$
  5. Features: Since MSE is sensitive to the range of feature values, the linear regression model is very sensitive to outliers; in general, features are normalized via feature engineering. Because parameter estimation amounts to minimizing the squared error, the least squares method is used to solve for the parameters.
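A minimal sketch of fitting a linear regression by least squares (via numpy's lstsq) and evaluating the MSE cost J defined above; the synthetic data and names are assumptions of this example:

```python
# Minimal sketch: linear regression fitted by ordinary least squares, evaluated with MSE.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=100)

Xb = np.hstack([np.ones((len(X), 1)), X])          # prepend a bias column for theta_0
theta = np.linalg.lstsq(Xb, y, rcond=None)[0]      # least-squares solution

pred = Xb @ theta
mse_cost = np.sum((pred - y) ** 2) / (2 * len(y))  # J_theta = 1/(2m) * sum of squared errors
print(theta, mse_cost)
```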
logistic regression model
  1. Model hypothesis: The impact of changes in independent variables on dependent variables is reflected by a logistic function (sigmoid function).

  2. Definition: Logistic regression is theoretically underpinned by linear regression, but it introduces a nonlinear factor through the Sigmoid function (also known as the logistic, or log-odds, function), so on top of linear regression it mainly solves classification problems.

  3. General expression:
    $h_\theta(x) = g(\theta^T x), \quad g(z) = \frac{1}{1+e^{-z}}$
    where $g(z)$ is the activation function. [An activation function adds a nonlinear factor, improving the network's ability to express the model and solving problems that a linear model cannot.] Here the cost function used to solve for the parameter $\theta$ is the cross-entropy function.
    Definition of cross-entropy function:
    $J_\theta = \frac{1}{m}\sum_{i=1}^{m}\left(-y^i \log(h_\theta(x^i)) - (1-y^i)\log(1-h_\theta(x^i))\right)$
    The optimal parameters solved using **Maximum Likelihood Estimation (MLE)**:
    $\hat{w} = \operatorname*{argmax}_{w} \sum_{i=1}^{m}\left(y^i \log(h_w(x^i)) + (1-y^i)\log(1-h_w(x^i))\right)$
    Comparing the two equations, maximizing the likelihood (MLE, max) is equivalent to minimizing $J_\theta$ (min).

  4. Features: The logistic regression model can be regarded as a linear model with a Sigmoid applied. As for why the Sigmoid (log-odds) function is used, this involves the exponential-family form of the Bernoulli distribution, maximum-entropy theory, etc. Parameter estimation here determines the model parameters that best fit the data through optimization. In a binary classification problem the negative log-likelihood is exactly the cross-entropy loss function; however, the cross-entropy loss can also be constructed in ways other than through the likelihood function.
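A minimal sketch of logistic regression trained by gradient descent on the cross-entropy cost above, assuming numpy and synthetic, linearly separable data; not the post's own code:

```python
# Minimal sketch: logistic regression via gradient descent on the cross-entropy cost.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)       # linearly separable labels

Xb = np.hstack([np.ones((len(X), 1)), X])       # bias column
theta = np.zeros(Xb.shape[1])
lr, m = 0.1, len(y)

for _ in range(2000):
    h = sigmoid(Xb @ theta)
    grad = Xb.T @ (h - y) / m                   # gradient of the cross-entropy cost
    theta -= lr * grad

h = sigmoid(Xb @ theta)
cross_entropy = -np.mean(y * np.log(h + 1e-12) + (1 - y) * np.log(1 - h + 1e-12))
print(theta, cross_entropy)
```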

the difference
  • Linear regression and logistic regression are both special cases of the generalized linear model; figuratively, they are two children of the same parent, generalized linear regression.
  • Linear regression can only be used for regression problems, while logistic regression is used for classification problems (binary and multi-class).
  • Linear regression has no link function (or uses the identity); the link function of logistic regression is the log-odds (logit) function, whose inverse is the Sigmoid function.
  • Linear regression uses the least squares method for parameter estimation, while logistic regression uses maximum likelihood.

Naive Bayes classifier model (Naive Bayes)

  1. Model assumption: It is assumed that features are conditionally independent, that is, when a target value is given, the existence of one feature will not affect the existence of other features.
  2. Model definition: It is a classification method based on Bayes' theorem and specific assumptions (features are independent of each other).
  3. General expression:
    $P(y|x) = \frac{p(x|y)\cdot p(y)}{p(x)}$
    where $P(y)$ is the prior probability [the probability of a class occurring without considering any features], $P(x|y)$ is the conditional probability of sample $x$ given class $y$ [also called the likelihood; here $x = (x_1, \dots, x_n)$ usually consists of multiple features], and $p(x)$ is a normalization factor independent of $y$.
  4. Because solving the formula above directly is very complex, Naive Bayes makes a bold assumption: the features are independent of one another (given the class). This is exactly where the word "naive" comes from. Common sense says that the features of a sample are almost never truly independent, so one might expect Naive Bayes to perform poorly, yet the result is the opposite: numerous experiments have shown that Naive Bayes works well for text classification tasks.
  5. Features:
    • Naive Bayes is a typical generative model. A generative model actually builds one model per class (as many models as there are classes), computes the posterior probability of a new sample under each class, and assigns the sample to the class with the largest posterior. A discriminative model, in contrast, has only one model and directly learns $P(y|x)$ from the data to predict $y$.
    • The Naive Bayes model requires no iterative training; it uses the data set (training set) directly to estimate the probabilities, computes the posterior for a new sample, and then classifies it.
    • In Naive Bayes, continuous-valued features are assumed to follow a Gaussian distribution.
  6. Example calculation steps (reprinted from naive bayes model )
    [Figure: worked example of the Naive Bayes calculation steps]
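In the same spirit, a minimal Gaussian Naive Bayes sketch that follows the formula above (choose the class maximizing the prior times the product of per-feature Gaussian likelihoods); the class and data are illustrative assumptions, not the post's worked example:

```python
# Minimal sketch of a Gaussian Naive Bayes classifier (naive conditional independence).
import numpy as np

class SimpleGaussianNB:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors = {c: np.mean(y == c) for c in self.classes}        # P(y)
        self.mu = {c: X[y == c].mean(axis=0) for c in self.classes}
        self.var = {c: X[y == c].var(axis=0) + 1e-9 for c in self.classes}
        return self

    def predict(self, X):
        preds = []
        for x in X:
            scores = {}
            for c in self.classes:
                # log P(y) + sum_i log N(x_i; mu_c, var_c)
                log_lik = -0.5 * np.sum(np.log(2 * np.pi * self.var[c])
                                        + (x - self.mu[c]) ** 2 / self.var[c])
                scores[c] = np.log(self.priors[c]) + log_lik
            preds.append(max(scores, key=scores.get))                   # largest posterior wins
        return np.array(preds)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(SimpleGaussianNB().fit(X, y).predict(X[:5]))
```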

Decision tree model


  1. Model assumption: Each decision node only considers one feature and divides the sample based on the feature.
  2. Model definition: Decision tree is a prediction model, which represents a mapping relationship between object attributes and object values. The decision tree has a tree structure. Each leaf node corresponds to a classification, and the non-leaf nodes correspond to a division on a certain attribute. The sample is divided into several subsets according to the different values ​​​​of the attribute.
  3. Model building: The CART algorithm can generate both regression trees (for predicting continuous variables) and classification trees (for classifying discrete variables). [Core idea of regression tree generation] Minimize the prediction error: find a split point that divides the training set D into two parts D1 and D2 such that the sum of squared errors within D1 and D2 is minimized. [Core idea of classification tree generation] Rank feature importance by computing the information gain / Gini index; the core idea is to select important features that provide more information (large information gain) and yield purer subsets (low impurity) to build the tree.
  4. Features:
    • [Advantages] Simple to understand and interpret, trees can be visualized; requires little data preparation, other techniques usually require normalization.
    • [Disadvantages] Decision tree learners can create overly complex trees that do not generalize well to the data, because overfitting occurs. Random forests overcome this shortcoming.
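A minimal sketch of the split criteria mentioned above (Gini index, entropy and information gain for one candidate split), assuming numpy; it is not a full CART implementation:

```python
# Minimal sketch of decision-tree split criteria: Gini index and information gain.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    w = len(left) / len(parent)
    return entropy(parent) - (w * entropy(left) + (1 - w) * entropy(right))

y = np.array([0, 0, 0, 1, 1, 1, 1, 0])
left, right = y[:4], y[4:]                 # one candidate split point
print(gini(y), information_gain(y, left, right))
```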

random forest model


  1. Model assumptions: [similar to those of a decision tree, with some differences] The random forest model assumes a certain degree of randomness in samples and features: each decision tree is trained on only a subset of the samples and a subset of the features, which reduces the risk of overfitting. Note that the random forest's assumptions are more robust than a single decision tree's and do not require the sample features to be mutually independent, so the model can handle situations where there is some correlation between variables.
  2. Model definition: A random forest is composed of many decision trees, with no correlation between different trees. It can be used for both classification and regression problems. In a classification task, when a new input sample arrives, every decision tree in the forest classifies it independently; the class that receives the most votes among the trees is taken as the final result.
  3. Model building:
    1. From a training set with N samples, draw N samples with replacement (one at a time) to form a bootstrap sample of size N. These N samples are used as the training data at the root node of one decision tree.
    2. If each sample has M attributes, then whenever a node of the decision tree needs to split, randomly select m attributes from the M attributes (with m << M), and then use some criterion (such as information gain) to choose one of these m attributes as the splitting attribute of the node.
    3. As the tree grows, split every node according to step 2 (intuitively, if the attribute selected for a node is the same attribute its parent node just split on, the node is treated as a leaf and is not split further), until no further split is possible. Note that no pruning is performed during the entire tree-building process.
    4. Follow steps 1 to 3 to build a large number of decision trees; together they form a random forest.
  4. Features:
    • [Advantages] Can handle very high-dimensional data (many features) without dimensionality reduction or feature selection; can estimate the importance of features; can capture interactions between different features; not prone to overfitting; training is relatively fast and easy to parallelize; relatively simple to implement; can balance the error on imbalanced data sets; accuracy can be maintained even when a large portion of features are missing.
    • [Disadvantages] Random forests have been shown to overfit on some noisy classification or regression problems; for data whose attributes take different numbers of values, attributes with more distinct values have a larger influence on the random forest, so the attribute weights it produces on such data are not reliable.
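A minimal sketch of the construction steps above (bootstrap sampling, a random feature subset, majority voting), assuming numpy and scikit-learn's DecisionTreeClassifier as the base learner. Note one simplification: features are sampled once per tree here, whereas step 2 above samples them at every node:

```python
# Minimal sketch of random-forest construction: bootstrap + feature subset + majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    n, M = X.shape
    m = max(1, int(np.sqrt(M)))                      # a typical choice with m << M
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)            # step 1: draw N samples with replacement
        cols = rng.choice(M, size=m, replace=False)  # step 2 (simplified): feature subset per tree
        tree = DecisionTreeClassifier()              # step 3: grown fully, no pruning by default
        tree.fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    # majority vote across the trees (step 4: the trees together form the forest)
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
forest = fit_forest(X, y)
print((predict_forest(forest, X) == y).mean())       # training accuracy of the ensemble
```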

Support Vector Machine model (Support Vector Machine)


  1. Model assumptions:
  • It is assumed that all samples are in high-dimensional space and the samples can be correctly separated by a hyperplane.
  • It is assumed that the optimal hyperplane is a linear classifier that maximizes the nearest sample points on both sides.
  • It is assumed that linearly inseparable samples can be made linearly separable in high-dimensional space by mapping in high-dimensional space.
  2. Model definition: SVM is a binary classification model. Its basic form is a linear classifier defined by the maximum margin in feature space; the maximum margin is what distinguishes it from the perceptron (the perceptron finds any separating hyperplane for linearly separable data). The goal of the algorithm is to maximize the distance from the hyperplane to the nearest sample points (these points are called support vectors). SVM also includes the kernel trick, which maps the sample points into a high-dimensional space through a feature mapping (such as the Gaussian kernel) before computing, which makes it essentially a nonlinear classifier.
  3. Linear support vector machine learning algorithm and nonlinear SVM algorithm (see https://www.zhihu.com/tardis/zm/art/31886934?source_id=1005 for the detailed derivation).
  4. Features
  • [Advantages] Can solve high-dimensional problems, i.e. large feature spaces; suits machine learning problems with small samples; can handle interactions of nonlinear features; has no local-minimum problem (compared with algorithms such as neural networks); does not need to rely on the entire data set (finding the support vectors is the key); relatively strong generalization ability;
  • [Disadvantages] Not very efficient when there are many training samples; no universal solution for nonlinear problems, and it is sometimes difficult to find a suitable kernel function; the high-dimensional mapping of the kernel function is hard to interpret, especially for the radial basis function (RBF) kernel; a conventional SVM only supports binary classification; sensitive to missing data;
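A minimal sketch of a linear-kernel and an RBF-kernel SVM, assuming scikit-learn's SVC and synthetic data; it only illustrates the ideas of margin and kernel trick described above:

```python
# Minimal sketch: linear and RBF-kernel SVMs on synthetic data (scikit-learn assumed).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y_linear = (X[:, 0] + X[:, 1] > 0).astype(int)         # linearly separable labels
y_ring = (np.sum(X ** 2, axis=1) > 1.5).astype(int)    # not linearly separable

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y_linear)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y_ring)  # kernel trick

print(linear_svm.support_vectors_.shape)   # the support vectors define the margin
print(rbf_svm.score(X, y_ring))
```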

K nearest neighbor model


  1. Model assumption: The distance measure between samples is available, usually measured using methods such as Euclidean distance and Manhattan distance.
  2. Model definition: KNN is an instance-based learning method. With a distance measure and a value of K fixed in advance, any new sample is assigned to the class that appears most often among the K training samples closest to it. When K = 1, it becomes the nearest-neighbour algorithm.
  3. There are three basic elements of the k-nearest neighbor method: selection of k values, distance measurement and classification decision rules.
  4. **K value selection:** Use cross-validation (split the sample data into training data and validation data at a certain ratio, e.g. 6:4): start from a relatively small K, keep increasing K while computing the error (variance) on the validation set, and finally pick the most suitable K.
  5. Algorithm steps:
    1. Calculate the distance between each sample point in the training sample and the test sample (common distance measures include Euclidean distance, Mahalanobis distance, etc.);
    2. Sort all distance values ​​above;
    3. Select the first k samples with the smallest distance;
    4. Vote based on the labels of these k samples to get the final classification category;
  6. Features:
  • [Advantages] No training required; simple and easy to use. Compared with other algorithms, KNN is relatively simple and clear, and its principle can be understood even without a strong mathematical background; it is insensitive to outliers.
  • [Disadvantages] There is no explicit training process; KNN is a typical example of "lazy learning": all it does in the training phase is store the samples, so if the training set is large it requires a lot of storage space, while the training time cost is zero. At prediction time, KNN must compute the distance from each test point to every training point across all features; when the data are high-dimensional and the data volume is large, the computation suffers from the curse of dimensionality.
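A minimal sketch of the four algorithm steps above (distances, sorting, k nearest, vote), assuming numpy, Euclidean distance and synthetic data:

```python
# Minimal sketch of the KNN steps: distances -> sort -> take k nearest -> majority vote.
import numpy as np

def knn_predict(X_train, y_train, x_test, k=5):
    dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))  # step 1: distances
    nearest = np.argsort(dists)[:k]                           # steps 2-3: k smallest distances
    labels = y_train[nearest]
    return np.bincount(labels).argmax()                       # step 4: majority vote

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_predict(X_train, y_train, np.array([2.5, 2.5]), k=5))
```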

neural network model

This model contains many different network models. The following will briefly analyze the concepts according to their application focus.

Convolutional Neural Network (CNN)
  1. Applicable scenarios: image-based tasks. The characteristics of the target object are mainly reflected in the relationship between pixels. Video is a superposition of images, so it is also good at processing video content. For example, target detection, target segmentation, etc.
  2. Features:
    • Thanks to the weighted convolution kernels plus pooling, images containing large amounts of data can be effectively reduced to a much smaller representation.
    • Features of convolution: local perception, parameter sharing, multi-core
    • Translation invariance of convolutional neural networks: roughly speaking, convolution + max pooling is approximately equal to local translation invariance. It retains image features in a vision-like way, so that when the content of an image is shifted (or, to some extent, flipped or rotated), similar images can still be recognized effectively.
  3. Basic principle: A typical CNN consists of 3 parts:
  • Convolutional layer: responsible for extracting local features in the image;
  • Pooling layer: Also known as downsampling, it can significantly reduce parameter magnitude (dimensionality reduction) and prevent overfitting. The reason for this is because even after convolution, the image is still large (because the convolution kernel is relatively small), so in order to reduce the data dimension, downsampling is performed. The pooling layer can reduce the data dimension more effectively than the convolution layer. This can not only greatly reduce the amount of calculations, but also effectively avoid overfitting.
  • Fully connected layer: the part similar to a traditional neural network, used to output the desired result. The fully connected layer can only "run" on data whose dimensionality has already been reduced by the convolutional and pooling layers; otherwise the amount of data is too large, the computational cost is high, and efficiency is low.
  4. Problems:
  • The backpropagation algorithm is not an efficient algorithm in deep learning because it requires a large amount of data.
  • If the detection target moves from the upper left corner to the lower right corner of the picture and the relative position changes, the features after pooling change so drastically that they affect the neuron weights and lead to incorrect recognition.
  • The data set needs to be normalized to a consistent input size; images of different sizes mixed together are difficult to train on.
  • The existence of the pooling layer will lead to the loss of a lot of very valuable information, and will also ignore the relationship between the whole and the parts.
  • There is no memory function, and the detection of videos is based on the detection of single-frame pictures.
  5. Improvement: increase the model's awareness of pixel positions in the image, e.g. CoordConv, Transformer.
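A minimal numpy sketch of the two core operations described above, a 2D convolution (local perception with a shared kernel) and 2x2 max pooling (downsampling); a real CNN would of course use a deep learning framework:

```python
# Minimal sketch of a 2D convolution and 2x2 max pooling in plain numpy.
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # the same kernel weights are applied to every local patch (parameter sharing)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2x2(feature_map):
    h, w = feature_map.shape[0] // 2 * 2, feature_map.shape[1] // 2 * 2
    fm = feature_map[:h, :w]
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))   # keep the max of each 2x2 block

image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])    # a simple vertical-edge detector
print(max_pool2x2(conv2d(image, edge_kernel)).shape)  # (2, 2): far fewer values than 6x6
```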
Recurrent Neural Network (RNN)

[Figure: an RNN unrolled over a word sequence; the output "?" depends on all of the earlier words]

  1. Applicable scenarios: Need to process "sequence data - a series of interdependent data streams", such as text, audio and other sequence data.
  2. Features: As shown in the figure above, the features produced for "?" include the features of the past words, meaning that all previous inputs influence the future output. As the sequence progresses, earlier data has less and less influence on the current output.
  3. Problems:
    • Short-term memory has a large influence (e.g. the orange area), while long-term memory has only a small influence (e.g. the black and green areas). This is the short-term memory problem of RNNs: they cannot handle very long input sequences;
    • Training RNN requires a huge cost
  4. Improvement: LSTM (Long Short Term Memory) is suitable for processing and predicting important events with long intervals and delays in time series.
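A minimal numpy sketch of a vanilla RNN cell: the hidden state carries information from all earlier steps, which is why early inputs affect later outputs and why very long sequences fade (the short-term memory problem). Dimensions and data are illustrative assumptions:

```python
# Minimal sketch of a vanilla RNN cell stepping through a sequence.
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 10
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_forward(sequence):
    h = np.zeros(hidden_dim)
    for x_t in sequence:                          # one step per element of the sequence
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # new state mixes current input and past state
    return h                                      # final state summarizes the whole sequence

sequence = rng.normal(size=(seq_len, input_dim))
print(rnn_forward(sequence).shape)
```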

Model evaluation

  1. Overfitting and underfitting
    • Concept analysis
      • Overfitting: The model performs well on the training set but performs poorly on the test set. This results in reduced model generalization performance.
      • Underfitting: The model has not yet learned the general properties of the training samples. The performance on the training set and test set is not good.
    • Causes and solutions:
      • Overfitting: From the data perspective, the original training data lacks diversity, so one can increase the amount of data and use various data augmentation methods; from the model perspective, the model has fit all sorts of details (noise) in the data, so reducing the number of model features can alleviate it; when there are many features, introduce regularization to prevent certain features from dominating.
      • Underfitting: From the model perspective, the features have not captured the characteristics of the data, so increase the feature dimension through feature combination, or use Boosting to combine the current weak models into a strong model.
  2. Regularization: L1 regularization and L2 regularization
    $\left\| x \right\|_p = \left(\sum_{i=1}^{n}\left| x_i \right|^p\right)^{\frac{1}{p}}$
    • L1 regularization corresponds to a Laplace prior. When $p = 1$, the formula above gives the L1 norm, the sum of the absolute values of all elements of the vector.
      • Function: L1 regularization produces a sparse weight matrix, i.e. a sparse model: it guarantees sparsity, meaning some parameters become exactly 0, and can therefore be used for feature selection;
      • Practical application: linear regression with L1 regularization is usually called Lasso regression. It differs from ordinary linear regression in that an L1 regularization term is added to the loss function, with a constant coefficient alpha that balances the mean squared error term and the regularization term. Lasso regression shrinks some feature coefficients and even drives coefficients with small absolute values exactly to 0, enhancing the generalization ability of the model.
    • L2 regularization corresponds to a Gaussian prior; it keeps the model stable, i.e. the parameter values will not become too large or too small.
      • Practical application: linear regression with L2 regularization is usually called Ridge regression. It differs from ordinary linear regression in that an L2 regularization term is added to the loss function. Ridge regression shrinks the regression coefficients without discarding any feature, making the model relatively stable; however, compared with Lasso regression it keeps too many features in the model, so the model is less interpretable.
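A minimal sketch contrasting Lasso (L1) and Ridge (L2) on the same synthetic data, assuming scikit-learn; it only illustrates that L1 drives some coefficients exactly to zero while L2 merely shrinks them:

```python
# Minimal sketch comparing L1 (Lasso) and L2 (Ridge) regularization.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)   # only 2 informative features

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))   # noise-feature coefficients are typically exactly 0
print("Ridge coefficients:", np.round(ridge.coef_, 3))   # small but generally nonzero
```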


Origin blog.csdn.net/qq_42312574/article/details/131512927