The key points of learning machine learning

1. Understand the 3 elements of machine learning

  • Machine learning = model + strategy (how to judge whether a model is good or bad) + algorithm
    Model: the law to be learned, e.g. y = ax + b
    Strategy: what makes a model a good model? The loss function
    Algorithm: how to efficiently find the optimal parameters, i.e. the parameters a and b in the model
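As a minimal sketch of how the three elements map to code (assuming NumPy and a made-up toy dataset, not any particular library's API):

```python
import numpy as np

# Toy data drawn from the "law" y = 2x + 1 plus noise (hypothetical example)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 0.5, size=100)

# Model: y = a*x + b, with parameters a and b to be learned
a, b = 0.0, 0.0

# Strategy: mean squared error loss -- the smaller, the better the model
def loss(a, b):
    return np.mean((a * x + b - y) ** 2)

# Algorithm: gradient descent to search for the optimal a and b
lr = 0.01
for _ in range(2000):
    err = a * x + b - y
    a -= lr * np.mean(2 * err * x)
    b -= lr * np.mean(2 * err)

print(f"a = {a:.2f}, b = {b:.2f}")  # should recover roughly a=2, b=1
```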
1.1 Model
  • In machine learning, the first step is to decide what kind of model to learn. In supervised learning, a model such as y = kx + b is exactly what we need to learn from the data.

  • Model classification

  1. Decision function: outputs only the predicted category, e.g. a decision tree
  2. Conditional probability distribution: outputs not only the predicted category but also its probability, e.g. logistic regression (LR)
1.2 Strategy

To evaluate the quality of the model, we use a loss function to measure the difference between the value predicted by the model and the actual value.
The loss function measures how well the model predicts a single sample. Commonly used loss functions include the 0-1 loss, the squared loss, the absolute loss, and the log loss.
The smaller the value of the loss function, the better the model.
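For concreteness, here is a sketch of those common loss functions for a single prediction (the function names are our own, not from any library):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    # 0-1 loss: 1 if the prediction is wrong, 0 if it is right
    return float(y_true != y_pred)

def squared_loss(y_true, y_pred):
    # squared loss, the usual choice for regression (least squares)
    return (y_true - y_pred) ** 2

def absolute_loss(y_true, y_pred):
    # absolute loss, less sensitive to outliers than the squared loss
    return abs(y_true - y_pred)

def log_loss(y_true, p):
    # log loss for a predicted probability p of the positive class
    return -np.log(p) if y_true == 1 else -np.log(1 - p)

print(squared_loss(3.0, 2.5))  # 0.25
```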

1.3 Algorithm

A machine learning algorithm is, in essence, a procedure for solving the optimization problem posed by the strategy: minimizing the loss function.

2. How to build a machine learning system


  • 1- prepare data
  • 2- feature engineering
    • 1- Process the data
      • Sample (row) sampling
    • 2- Process the features
      • Feature selection
      • Feature dimensionality reduction
      • Feature sampling
  • 3- training model
    • Data + machine learning algorithm
    • Parameters: what machine learning learns is a model; essentially, it learns the corresponding parameters of that model
    • Hyperparameters: parameters specified in advance, before training, such as the learning rate alpha or the number of iterations
    • Example: for y = kx + b, fix the number of iterations (say 20) and the learning rate alpha in advance; training then yields the parameters, e.g. k = 5, b = 3 (the model y = 5x + 3), while different settings might yield y = 4x + 6
  • 4- choose the best model
    • The model must perform well not only on the training set but also on the test set ==> good generalization performance of the model
  • 5- make predictions on new data
    • Once the trained model meets the requirements, it can be used for prediction
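The five steps above can be sketched with scikit-learn, using its built-in diabetes dataset as a stand-in for real business data (the feature-engineering step here is just scaling, for simplicity):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1- prepare data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2- feature engineering (here just feature scaling)
# 3- training model = data + machine learning algorithm
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# 4- choose the best model: it must also do well on the test set
print("train R^2:", round(model.score(X_train, y_train), 3))
print("test  R^2:", round(model.score(X_test, y_test), 3))

# 5- make predictions on new data
predictions = model.predict(X_test[:3])
```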

3. Model selection

  • Basic concepts
    1) Generalization: a model with good generalization ability not only performs well on the training data set but also adapts well to new data.
    When we discuss the learning ability and generalization ability of a machine learning model, we usually use the concepts of overfitting and underfitting, which are the two main reasons for poor performance of machine learning algorithms.
    2) Overfitting: the model performs well on the training data, but performs poorly on unknown data or the test set.
    3) Underfitting: the model performs poorly on both the training data and unknown data.

Example: which curve fits best?
The first and second images show underfitting.

  • Underfitting
    • The model performs poorly on the training data set and the test data set
  • Time of occurrence of underfitting:
    • Early stage of model training
  • The reason for underfitting?
    • The model is too simple
      • e.g. y = t (a constant) or y = -2x + 3 (a plain line)
  • How to solve the problem of underfitting?
    • Add more features
      • e.g. when predicting housing prices from area alone, add location, number of rooms, etc.
      • y = k1x1 + b ==> y = k1x1 + k2x2 + b
    • Increase the degree of the polynomial terms
      • y = k1x1 + b ==> y = k1x1^2 + k2x1 + b
    • Reduce the regularization penalty
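A small illustration of the polynomial-feature fix (our own synthetic data, assuming scikit-learn): a plain line y = kx + b underfits quadratic data, while adding degree-2 features fixes it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data: a straight line is too simple for it
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = (x ** 2).ravel() + rng.normal(0, 0.1, 100)

line = LinearRegression().fit(x, y)                  # underfits
curve = make_pipeline(PolynomialFeatures(degree=2),
                      LinearRegression()).fit(x, y)  # fits well

print("line  R^2:", round(line.score(x, y), 3))   # poor
print("curve R^2:", round(curve.score(x, y), 3))  # near 1
```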

The fourth image shows overfitting.

  • Overfitting
    • The model works well on the training set, but it works poorly on new data or test data
  • When overfitting occurs:
    • Middle and late stages of model training
  • Reasons for overfitting:
    • The model is too complex + too little training data + noisy ("impure") data
  • Solutions to overfitting:
    • 1- Increase the regularization penalty ===> counteracts a model that is too complex
    • 2- Resample the training data
    • 3- Re-clean the data
    • 4- Dropout (in neural networks): randomly drop some units during training
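To illustrate solution 1 (assuming scikit-learn; the sine dataset and degree are our own choices): a high-degree polynomial fit to a small noisy sample overfits, and a ridge (L2) penalty reins it in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Small noisy sample of a sine curve + degree-15 polynomial: overfitting territory
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=(40, 1))
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, 40)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

def poly_model(reg):
    # same degree-15 features, different regression at the end
    return make_pipeline(PolynomialFeatures(degree=15), StandardScaler(), reg)

plain = poly_model(LinearRegression()).fit(x_tr, y_tr)  # no penalty
ridge = poly_model(Ridge(alpha=1.0)).fit(x_tr, y_tr)    # L2 penalty

print("plain train R^2:", round(plain.score(x_tr, y_tr), 3))  # near-perfect fit
print("plain test  R^2:", round(plain.score(x_te, y_te), 3))
print("ridge test  R^2:", round(ridge.score(x_te, y_te), 3))
```

On a run like this, the unpenalized model's test score typically falls well below its training score, while the ridge version generalizes better; the exact numbers vary with the data.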

Choosing a good model requires good generalization performance to avoid underfitting and overfitting.

Follow Occam's razor: among models with the same or similar generalization ability, prefer the simpler one. In essence, this prevents the model from overfitting.

3.1 Empirical risk and structural risk

The average loss of the model f(x) on the training data set is called the empirical risk or empirical loss, recorded as R_emp(f):

R_emp(f) = (1/N) * Σ_i L(y_i, f(x_i)), summed over i = 1..N

The two basic strategies of supervised learning are empirical risk minimization and structural risk minimization. Structural risk adds a penalty J(f) on model complexity, weighted by λ:

R_srm(f) = (1/N) * Σ_i L(y_i, f(x_i)) + λJ(f)

3.2 Model evaluation and model selection

When the loss function is given, the training error of the model (its average loss on the training data) and the test error of the model (its average loss on the test data) naturally become the standard for evaluating the learning method.
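As a sketch (synthetic data of our own): under the squared loss, the training error and test error are simply the average loss over each data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3*x1 - 2*x2 + noise with variance 1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1.0, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)

# Training error and test error under the squared loss
train_error = mean_squared_error(y_tr, model.predict(X_tr))
test_error = mean_squared_error(y_te, model.predict(X_te))
print(round(train_error, 3), round(test_error, 3))  # both close to the noise variance 1
```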

3.3 Regularization

The typical method of model selection is regularization. Its general form is:

min over f of (1/N) * Σ_i L(y_i, f(x_i)) + λJ(f)

A model with smaller empirical risk may be more complicated, and then the value of the regularization term J(f) will be larger. The role of regularization is to select a model with both small empirical risk and low model complexity.
The regularization term conforms to Occam's razor: among all possible models, the best one explains the known data well and is also simple. From the perspective of Bayesian estimation, the regularization term corresponds to the prior probability of the model: a complex model is assumed to have a smaller prior probability, a simple model a larger one.

4. Expand common machine learning libraries

With the help of the many powerful open-source libraries developed in recent years, now is the best time to enter the field of machine learning. With a mature machine learning library implementing the algorithms for us, we only need to understand how to tune the parameters of each model in order to apply it to actual business scenarios.

  • Python-based sklearn (scikit-learn) library
  1. Simple and efficient tools for data mining and data analysis
  2. Accessible to everyone, and reusable in various contexts
  3. Built on NumPy, SciPy and matplotlib
  4. Open source, commercially usable (BSD license)
  • Scala-based Spark MLlib
    When learning machine learning today, we only need to use the existing algorithm libraries and put more effort into understanding, processing and integrating the data.


Origin blog.csdn.net/m0_49834705/article/details/112852420