Python machine learning (1): the steps of algorithm learning, and the applications and process of machine learning (acquiring data, feature engineering, model, model evaluation)

Getting Started with Machine Learning

Machine learning requires theoretical knowledge: calculus (the derivation process, the gradient descent method for linear regression), linear algebra (multiple linear regression, high-dimensional data, matrices, etc.), probability theory (Bayesian algorithms), and statistics (which runs through the whole learning process). The algorithms are deduced step by step from this mathematical foundation.
A programming language is needed to put the learned knowledge into practice. Python's syntax is relatively simple and its third-party libraries are rich, so it is used across many industries; it is very convenient for analyzing and processing data and for training models.
The main line of machine learning algorithms is: the K-nearest neighbor algorithm, linear regression, the gradient descent strategy, logistic regression, decision trees, ensemble algorithms such as random forest, support vector machines (SVM), Bayesian algorithms, clustering algorithms, dimensionality reduction algorithms, and so on. The correlation between algorithms is relatively low, so don't give up on the whole learning stage just because you don't understand one algorithm.

The Four Steps of Algorithmic Learning

  • Step 1: Master the classic algorithms (derivation principles and code implementation).
  • Step 2: Read algorithm code, understand how the algorithm is realized mathematically, and follow the derivation process through the code.
  • Step 3: Write the classic algorithms yourself, imitating existing implementations.
  • Step 4: Modify algorithms. Usually we do not need to build an algorithm from 0 to 1; instead, understand other people's algorithm principles and adapt them based on the existing literature.

Direction of development

1. Python data analyst
2. AI fundamentals

Branches of artificial intelligence

The applications of artificial intelligence have penetrated our lives, but AI is still at the weak stage: making decisions based on data, as AI recommendation systems and AI auto-reply do, is not genuine thinking ability, whereas strong artificial intelligence would mean that machines possess human-like thinking. Machine learning is the foundation of the AI direction. Artificial intelligence will replace some repetitive labor, such as ticket inspectors at high-speed rail stations and bank tellers.
Artificial intelligence appeared as early as the early 1950s and was first applied to entertainment, in the field of board games such as backgammon, flying chess, chess, and Go. In the 1980s, machine learning appeared as the Western Internet developed rapidly; as a branch of artificial intelligence, machine learning made the automatic identification of spam possible. By around 2010, business demands had grown with the times (face-recognition clock-in systems at work, intelligent customer service on shopping websites, and so on), and deep learning appeared to solve problems such as image recognition.
The relationship between artificial intelligence, machine learning, and deep learning
Artificial intelligence is the broadest term; machine learning is one way to realize artificial intelligence; deep learning is a branch of machine learning. The development of social needs has led deep learning (neural network algorithms) to gradually grow and become capable of meeting real demands.

Three Essential Elements of Artificial Intelligence

Data : Weak artificial intelligence needs data (experience) to make judgments, just as a person waiting for a bus estimates its arrival time from experience; machines likewise need large amounts of data as support.
Algorithm : an appropriate algorithm.
Computing power : the ability to compute and process large amounts of data; real data and code can be run on servers.
CPU: office use and business notebooks, reading and writing files; light and easy to carry, but handles only small amounts of data.
GPU: gaming notebooks; computes and processes large amounts of data, on heavier machines.
Servers: cloud servers and deployed servers, which handle relatively large amounts of data.

Machine learning applications

Machine learning automatically analyzes data to obtain a model, and uses the model to predict unknown data. Application areas: image recognition, recommendation systems, autonomous driving.

Image recognition

A child can distinguish cats from other animals after seeing a kitten two or three times, but a machine needs a great deal of training before it can recognize them; people learn and recognize quickly, and human learning ability is far better than that of algorithms. When learning, the machine needs to extract the characteristics of cats, such as pointed ears, round eyes, and the shape of the mouth and nose, then summarize these features, judge whether something is a cat based on them, classify accordingly, and build a model. When a new picture is input, the machine uses the established model to judge whether it is a cat.
Image recognition is essentially a process of digitizing images, and the feature values need to be continuously debugged, screened, verified, and inspected.

Recommendation systems

Recommender systems are ubiquitous in daily life: open an app and you will be shown recommendations. It is a very successful business model. Based on data about what we have browsed, added to the cart, purchased, and so on, the system recommends products that users need. Everyone who opens the app sees different content, so the experience differs from person to person.
For example, if you buy a mobile phone on an e-commerce site such as Taobao, it will recommend products related to the phone, such as chargers, phone cases, and tempered-glass films, to encourage further spending. Offline, the recommendation system of a Walmart supermarket could reportedly judge from a customer's buying habits that she might be pregnant and then recommend baby products.

Autonomous driving

Both at home and abroad, autonomous driving has reached a certain level; in China, Baidu, Alibaba, and Tencent are all developing it. Such systems are still in an assistive role and cannot yet be popularized on a large scale. Information about the surroundings is collected through on-board cameras and radar and fed into the model for training and decision-making, together with precise positioning. Limited by network latency, it has not yet been fully applied.

The Process of Machine Learning

Historical data is like human experience. The data must be cleaned and standardized, because not every piece of data is useful. Features need to be extracted and the preferred features summarized, and then the model is trained. The quality of the model is tested through model evaluation; a model that evaluates well is then given new data for testing and predicts the corresponding results. No matter which algorithm is used, the process is basically the same.
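As a minimal sketch of this end-to-end process (assuming scikit-learn is available; the data is synthetic and purely for illustration):

```python
# A minimal sketch of the workflow described above; the data is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. "Historical data": here generated artificially
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# 2. Split into training data (past experience) and test data (new data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 3. Basic processing / feature preprocessing: standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4. Train a model (any algorithm follows the same pattern)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# 5. Model evaluation on data the model has not seen
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```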

Get data (historical data)

  1. The first type of data has a target value and is continuous.
    For example, a building's location, floor, size, orientation, and so on affect the price of a house. The house price is the target value (label value), and it changes continuously, for example with the floor number.
  2. The second type of data has a target value and is discrete.
    Action movies and romance movies can be distinguished by the number of fight scenes. The genre is a clear target, either action or romance, which is discrete data.
    In statistics, continuous data is usually numerical and discrete data is usually categorical.
  3. The third type of data has no clear target value. We can group the data together and divide it into whatever types we want.
    The first two types have both feature values and a target value: they can be reasoned about through the features, and the target value provides an evaluation criterion for measuring the final result. The target values are labels and can be either continuous or discrete. The third type has only feature values and no clear target value; it can be divided in any way, and there is no right or wrong as long as similar items end up grouped together.

We call a row of data a sample, such as each movie to be analyzed, and a column of data a feature, such as the number of fight scenes, the number of kiss scenes, and the genre of each movie; together these make up the historical data.
According to the above analysis, the data can be roughly divided into two types: the first has feature values plus a target value, where the target value is used to measure the final result and can be continuous or discrete; the second has only feature values and no clear target value, so it can be divided in any way and there is no right-or-wrong criterion. A small example of this layout follows.
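A small illustration of this layout, using pandas and made-up movie data:

```python
# Hypothetical movie data: each row is a sample, each column a feature,
# and "genre" is the (discrete) target value.
import pandas as pd

movies = pd.DataFrame({
    "fight_scenes": [45, 3, 50, 1],   # feature (count)
    "kiss_scenes":  [2, 30, 1, 25],   # feature (count)
    "genre": ["action", "romance", "action", "romance"],  # target (categorical)
})

X = movies[["fight_scenes", "kiss_scenes"]]  # feature values
y = movies["genre"]                          # target values
print(movies)
```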

Basic Data Processing

When the data is first acquired, it often has many problems, so we need to perform the most basic processing on it.
(figure: a sample of raw data containing NaN values, missing entries, inconsistent units, and unlabeled columns)
In the data shown above, it is clear that NaN values, missing values, units, and the meaning of each column all need to be handled.
The column names lack annotation, so the meaning of each column is unknown; there are many missing values; and the units of the data are inconsistent. For example, in restaurant reviews, taste might be rated out of 100 while service is rated out of 5; to produce a combined score, the scales must be unified, which requires normalization.
The principles we adhere to are: completeness, comprehensiveness, legitimacy, and uniqueness.

  • Integrity: whether a single record contains null values and whether the recorded fields are complete.
  • Comprehensiveness: observe all the values of a column, for example by checking its mean, maximum, and minimum in Excel, and judge from this whether the data has problems such as large discrepancies, inconsistent units, unclear data definitions, missing unit labels, and so on.
  • Legitimacy: the legality of data types, contents, and ranges, for example non-ASCII characters where they should not appear, genders other than male or female (entries such as "unknown" or "unclear"), ages above 150 or below 0, and so on. If such illegal data is not handled, it will interfere with the normal data.
  • Uniqueness: whether there are duplicate records. Both rows and columns must be unique; for example, a person cannot be recorded multiple times, and a person's weight cannot appear repeatedly among the column indicators. A rough sketch of these checks follows.
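A rough pandas sketch of these four checks; the column names and thresholds are invented for illustration:

```python
# Hypothetical raw data illustrating the four checks; column names are invented.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name":   ["Tom", "Tom", "Ann", None],
    "gender": ["male", "male", "female", "unknown"],
    "age":    [25, 25, 170, 30],          # 170 is an illegal age
    "weight": [70.0, 70.0, np.nan, 55.0], # NaN is a missing value
})

# Integrity: look for null values and incomplete fields
print(df.isnull().sum())

# Comprehensiveness: inspect the overall distribution of each column
print(df.describe(include="all"))

# Legitimacy: remove rows with impossible values
df = df[(df["age"] >= 0) & (df["age"] <= 150)]
df = df[df["gender"].isin(["male", "female"])]

# Uniqueness: drop duplicate records
df = df.drop_duplicates()
print(df)
```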

Feature engineering (very important)

Feature engineering is the process of using professional background knowledge and skills to process data so that the features work better in machine learning algorithms: extracting the more important features and then processing and selecting the best ones.
For example, a biology student learning Python and machine learning might need to observe white blood cells and red blood cells and distinguish between them. Based on observation, the characteristics of each are summarized and turned into features: the larger cells are white blood cells and the smaller ones are red blood cells; white blood cells are irregular in shape while red blood cells are relatively round.
The red and white blood cells are then divided according to these features. That does not mean every feature we establish can be used by the algorithm; we need to extract the more preferable features to train it.
Feature engineering includes the following:

  • Feature extraction: the computer cannot directly understand the analyzed features (it only understands numbers), so text and categorical information must be converted into numbers the computer can work with. This applies to categorical and discrete data.
  • Feature preprocessing: converting feature data through transformation functions into feature data that better suits the algorithm and model. For example, two scores of 80 are not comparable if one is out of a full mark of 100 and the other out of 720; such data should be normalized or standardized. This applies to continuous data.
  • Feature dimensionality reduction: under certain constraints, reducing the number of random variables (features) to obtain a set of "uncorrelated" principal variables: going from many features to few, from high-dimensional to low-dimensional, and from correlated to uncorrelated.

If the features are extracted well and the data is good, even a poorly built model can still give good predictions; if the feature data is poor, then no matter how well the model is built, the final result will not be good. A small sketch of the three steps above follows.
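A hedged scikit-learn sketch of the three kinds of feature engineering listed above, with toy data:

```python
# Toy examples of the three feature-engineering steps, using scikit-learn.
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

# Feature extraction: turn categorical text into numbers
records = [{"city": "Beijing", "temp": 20}, {"city": "Shanghai", "temp": 25}]
vec = DictVectorizer(sparse=False)
print(vec.fit_transform(records), vec.get_feature_names_out())

# Feature preprocessing: rescale scores with different full marks to [0, 1]
scores = [[80, 80], [100, 720], [50, 360]]   # columns: 100-point exam, 720-point exam
print(MinMaxScaler().fit_transform(scores))

# Feature dimensionality reduction: project 3 correlated features down to 2
data = [[2, 8, 4], [6, 3, 0], [5, 4, 9], [7, 1, 2]]
print(PCA(n_components=2).fit_transform(data))
```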

Model

The model is the machine learning algorithm. To choose an appropriate algorithm, you must understand how machine learning algorithms are classified. According to the composition of the data set, they are mainly divided into:

Supervised learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning

This course mainly covers supervised learning and unsupervised learning, with the goal of mastering the classic algorithms.

supervised learning

Supervised learning, simply understood, means there is a target value; it is mainly divided into regression problems and classification problems.
House prices and movie data have clear target values. For a regression problem, the trained model has the form y = kx + b: x and y are passed in as data, and training might yield k = 2 (with b = 0), so when x is 3 the prediction is y = 6. Regression algorithms include linear regression (simple and multiple) and ridge regression.
There are also classification problems, such as predicting red and white blood cells: label 0 is white blood cells and label 1 is red blood cells. There is a clear target value, either red or white, and the two classes are separated by a line; dots or crosses that fall on the wrong side of the line are errors. Algorithms for classification problems include the K-nearest neighbor algorithm, Bayesian classification, decision trees and random forests, logistic regression, and neural networks.
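A minimal sketch of both supervised settings, assuming scikit-learn; the data points are made up:

```python
# Regression: learn y = kx + b from a few points, then predict for x = 3.
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

X = [[1], [2], [4]]
y = [2, 4, 8]                      # target follows y = 2x
reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)   # roughly k = 2, b = 0
print(reg.predict([[3]]))          # roughly 6

# Classification: two labelled groups (e.g. label 0 = white cells, label 1 = red cells)
Xc = [[10, 10], [9, 11], [1, 2], [2, 1]]
yc = [0, 0, 1, 1]
clf = KNeighborsClassifier(n_neighbors=1).fit(Xc, yc)
print(clf.predict([[8, 9], [1, 1]]))   # expected: [0, 1]
```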

unsupervised learning

Unsupervised learning means the input data has no labels and there is no definitive outcome (no specific target). The categories of the samples are unknown, and the sample set must be grouped according to the similarity between samples (a clustering process), trying to minimize the differences within a class and maximize the differences between classes. For example, unlabeled shapes can be divided according to the distance between them, with shapes close to one another grouped into one category. Algorithms used here include K-means, DBSCAN, and PCA dimensionality reduction.

Summary:
Supervised learning has labels (target values) and can be used for classification or regression.
Unsupervised learning has no labels (no target values) and can be used for clustering.
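A minimal clustering sketch with K-means (one of the algorithms named above); the points are invented:

```python
# Unlabelled points grouped purely by distance, using K-means.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],   # one group
                   [8, 8], [8.2, 7.9], [7.8, 8.1]])  # another group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # centre of each cluster
```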

semi-supervised learning

Semi-supervised learning can be simply understood as the case where part of the data has a target and the other part does not. It is mainly used when supervised learning alone cannot meet the requirements, and semi-supervised learning is then used to improve the result.

reinforcement learning

Reinforcement learning is mainly used for automatic decision-making and can make continuous decisions. The whole process is dynamic: the output of the previous step is the input of the next step.
A well-known example is AlphaGo. When playing Go, decisions must be made continuously and quickly: the model is told the rules of the game and plays against itself, and after only about three days of self-play it could defeat top human players, improving further the longer it trains.

Model evaluation (very important)

After the model is trained it is not used directly; it must first be evaluated, and only a model that performs well is put into use. Model evaluation is very important.
Model evaluation is an integral part of the model development process. It helps find the model that best represents the data and indicates how well the chosen model will perform in future use.
Model evaluation measures the size of the model's error: put the test data into the model, and the difference between the predicted results and the real results is the error. The smaller the error, the better the model.

Errors are mainly divided into:

  • Empirical error : the error on the training set, comparable to doing the chapter exercises after class
  • Generalization error : the error on unknown data, comparable to the final exam at the end of the course

Thinking : suppose there is only one data set containing m samples, where x is the feature and y is the target: $D=\{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\}$. This data would have to be used both to train the model and to test its quality. Is that reliable?
It is not. It is like being handed a set of exam questions identical to the practice questions before the exam: the exam then measures nothing. If the same data is used for both training and testing, the evaluation loses its meaning. We need to divide the data into a training set and a test set, using part of the data for training and part for testing.

empirical error

Methods for setting aside a test set

  • Hold-out method: use 10 years of stock history for model training, with the first 8 years as the training set and the last 2 years as the test set. The drawback is that the data distribution may not be uniform: for example, the stock lost money in the first 8 years and made profits in the last 2, which breaks the original continuity of the data distribution. The split can also be 7/3, and so on.

Implementations of the hold-out method:
1. Split directly into a training set of N samples and a test set of M samples.
2. Randomly draw N samples into the training set and M samples into the test set from each stratum (layer) of the data, i.e. stratified sampling. Both variants are sketched below.
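Both variants can be sketched with scikit-learn's train_test_split; the arrays below are placeholders:

```python
# Hold-out split: a direct random split, and a stratified split that keeps
# the class proportions the same in the training and test sets.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# 1. Direct split: 80% training, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Stratified split: sample from each class ("layer") separately
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(y_train, y_test)
```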

  • K-fold cross-validation: each part of the data is used once as the test set, and the test results are averaged to give the returned result. It is suitable when the amount of data is small, and it demands relatively more computing power. For example, data set D is divided into 10 equal parts: the first 9 parts are the training set and the last part the test set, giving test result 1; parts 1-8 plus part 10 are the training set and part 9 the test set, giving test result 2; and so on, until finally parts 2-10 are the training set and part 1 the test set, giving test result 10. The 10 results are then averaged to give the returned result.

In other words: divide the data set into 10 parts; each time, select 9 parts as the training set and 1 part as the test set, train the model, and record the test result; finally average the 10 test results to obtain the returned result. This method suits cases where the amount of data is relatively small; when the data volume is large, the demands on the computer's computing power become high. A sketch follows.
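A sketch of 10-fold cross-validation with scikit-learn; the model and data are placeholders:

```python
# 10-fold cross-validation: each fold is used once as the test set,
# and the 10 scores are averaged into one result.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=10)
print(scores)           # one accuracy per fold
print(np.mean(scores))  # averaged result returned to the caller
```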

  • Bootstrap method (sampling with replacement): given a data set D containing m samples, sample from it to obtain a data set D'. Each time, randomly select one sample from D, copy it into D', and then put it back into D, so the same sample may be drawn again in later rounds. After m draws we obtain a data set D' containing m samples, which is the result of bootstrap sampling.

Defects of the hold-out method and K-fold cross-validation: part of the sample data is held back from the original sample as a test set, so the training set is smaller than the original set, and this difference in sample size introduces bias. The bootstrap method constructs the training set by random sampling with replacement, which preserves the randomness of the data and keeps the training set the same size as the original set, thereby addressing this defect of the hold-out and K-fold methods.

For example, with 10 numbers, the original sample is {1,2,3,4,5,6,7,8,9,10}, which could be used for training directly.
Random sampling with replacement might give {2,2,3,5,4,6,7,8,2,1}; duplicate values can appear.
The probability of a given sample being selected in one draw is 1/10.
The probability of not being selected in one draw is 1 - 1/10.
The probability of never being selected in 10 draws is (1 - 1/10)^10.
For m draws this becomes (1 - 1/m)^m.

During bootstrap sampling, some samples of D appear multiple times in D' while others never appear. The probability that a given sample is never drawn in m draws is $(1-\frac{1}{m})^m$; taking the limit gives $\lim_{m \to +\infty}(1-\frac{1}{m})^m=\frac{1}{e} \approx 0.368$.
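A small numpy simulation of bootstrap sampling, checking that roughly 1/e of the original samples are never drawn:

```python
# Bootstrap sampling: draw m samples with replacement from a data set of size m,
# and measure how many original samples never appear (theory: about 1/e ≈ 36.8%).
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
D = np.arange(m)                               # original data set
D_prime = rng.choice(D, size=m, replace=True)  # bootstrap sample D'

never_picked = np.setdiff1d(D, D_prime).size / m
print(never_picked)   # close to 1/e ≈ 0.368
```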

All three methods are ways of dividing the data into training and test sets.
The training set is like textbook knowledge; the validation set is like homework or mock exams, used for parameter tuning, such as finding the optimal parameters of an ensemble algorithm; the test set is like the exam. Parameters could be tuned on the training set directly, but it is relatively large and tuning on it would be slow; extracting a separate validation set for tuning greatly reduces this cost. A common split among the three is 6:2:2, as sketched below.
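A sketch of a 6:2:2 split using two calls to train_test_split; the data is a placeholder:

```python
# Split one data set into training (60%), validation (20%) and test (20%) sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)

# First take 20% out as the test set ("the exam")
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Then split the remaining 80% into 60%/20% of the original (0.25 of the rest)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60, 20, 20
```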

performance metrics

A performance measure is an index used to evaluate the model. In a prediction task, given a data set $D=\{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\}$, where x is the feature and y the target value, $y_i$ (the y value of the i-th sample) is the true value for $x_i$ and $f(x_i)$ is the predicted value. To evaluate the performance of the model, we compare the true value $y_i$ with the predicted value $f(x_i)$; the error is the difference between the predicted value and the true value.
In regression tasks, the most commonly used performance measure is the mean squared error: $MSE=\frac{1}{m}\sum_{i=1}^{m}(f(x_i)-y_i)^2$
For example:
the data set is {(1,2), (2,4), (3,5)}
the predicted values are {(1,3), (2,4), (3,6)}
the mean squared error is $((3-2)^2+(4-4)^2+(6-5)^2)/3 = 2/3$
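The same calculation in code, using the toy values above:

```python
# Mean squared error for the toy example: true values 2, 4, 5
# and predicted values 3, 4, 6.
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([2, 4, 5])
y_pred = np.array([3, 4, 6])

print(np.mean((y_pred - y_true) ** 2))     # 0.666...
print(mean_squared_error(y_true, y_pred))  # same result
```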

More generally, for a data distribution D and probability density function p(·), the mean squared error can be written as $MSE=\int_{x \sim D}(f(x)-y)^2\,p(x)\,dx$
Often the data involves not only true and predicted values but also probabilities. If the three points above occur with probabilities 0.3, 0.5, and 0.2, the probability-weighted mean squared error is $(3-2)^2 \times 0.3+(4-4)^2 \times 0.5+(6-5)^2 \times 0.2 = 0.5$

  • Error rate :
    the proportion of misclassified samples among all samples. If there are 5 samples and 2 are misclassified, the error rate is 2/5.
    $E(f;D)=\frac{1}{m}\sum_{i=1}^{m}\mathbb{I}(f(x_i) \neq y_i)$
    When the predicted value is not equal to the true value the classification is wrong and the indicator scores 1; a correct classification scores 0. The formula sums these scores, i.e. the total number of errors divided by the total number of samples.
  • Accuracy :
    the proportion of correctly classified samples among all samples: $acc(f;D)=\frac{1}{m}\sum_{i=1}^{m}\mathbb{I}(f(x_i) = y_i)=1-E(f;D)$
    If the predicted value equals the true value the indicator scores 1; the sum of the scores divided by the total number of samples is the accuracy, which can also be written as 1 minus the error rate.
    The error rate and accuracy alone cannot meet all requirements. For example, consider the following data:
    data set D = [1,2,3,4,5,6,7,8,9,10], used for a binary classification task: a sample is positive (result 1) if its value is 5, and negative (result 0) otherwise. A model that simply predicts "not 5" for every sample reaches an accuracy of 9/10 (an error rate of only 1/10), yet it never finds the single positive sample.
    Judged by these two indicators alone, a very crude model can look very good. Other indicators are therefore needed, such as precision and recall.
    (figure: confusion matrix of true/false positives and negatives)
    Both the ground truth and the predicted results are divided into positive and negative examples (see the confusion matrix above).
    Suppose there are 100 samples. The ground truth y contains 60 positive examples and 40 negative examples; the predictions ŷ contain 70 positive examples and 30 negative examples. If 50 of the predicted positives are truly positive (TP = 50), then the false positives are FP = 70 - 50 = 20, the false negatives are FN = 60 - 50 = 10, and the true negatives are TN = 30 - 10 = 20.
    The precision is then 50/(50+20) and the recall is 50/(50+10).
  • Precision :
    the proportion of predicted positives that are truly positive: $P=\frac{TP}{TP+FP}$
    That is, how many of the positive predictions are actually correct.
  • Recall :
    the proportion of all true positives that are correctly predicted as positive: $R=\frac{TP}{TP+FN}$
    That is, how much of the real positive class the predictions capture. A quick check of the worked example follows.
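A quick check of the worked example above in plain Python (TP, FP, FN taken from those numbers):

```python
# Precision and recall for the worked example: 60 real positives, 70 predicted
# positives, of which TP = 50 are correct.
TP, FP, FN = 50, 20, 10

precision = TP / (TP + FP)   # 50 / 70
recall    = TP / (TP + FN)   # 50 / 60
print(round(precision, 3), round(recall, 3))   # 0.714 0.833
```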

generalization error

Generalization error is the error when predicting on new data, and it is evaluated by how well the model fits. Fit describes the performance of the trained model and can be roughly divided into two problem cases: underfitting and overfitting.

underfitting

For example, suppose a face recognition model is trained: face pictures are passed in, features are extracted, and data features are established. If the human features the machine learns are only eyes, nose, and mouth, then any picture with similar features is recognized as a person. Because the machine has learned too few human features, anything with eyes, a nose, and a mouth is recognized as human, so even an orangutan is recognized as a person.

overfitting

Overfitting means the machine learning or deep learning model performs too well on the training samples, and as a result performs poorly on the validation and test data sets.
Because the extracted features are too numerous and too narrowly fitted to the training data, the model learns, for example, that people have a particular skin tone; when a student with a different skin tone is presented for identification, they are classified as unknown. This is overfitting.

summary

Preparatory work for model training:
1. Have historical data

  • Divide the data
    • continuous data (e.g. values in a range such as 1-10) with target values (a clear goal, such as a label)
    • discrete data (categories) with target values
    • data without target values
  • The meaning of the rows and columns in the data table
    • a row represents a sample, and many rows together form the sample set
    • a column represents a feature; for example, to predict housing prices you need to know that the area, location, and floor of a house are all features of the house

The historical data obtained cannot be used directly for model training. For example, some content of data obtained by a crawler may not be fully displayed, so after analysis, processing, and saving there will be missing values; manually recorded data also has missing values. The data must therefore be given basic processing.
2. Data processing
The data should satisfy the principles of completeness, comprehensiveness, legitimacy, and uniqueness:

  • Completeness: the data as a whole is complete (the row and column labels are present).
  • Comprehensiveness: there are no missing values, or the missing values have been handled (data cleaning).
  • Legitimacy: the data must be legal, reasonable, and valid (for example, age cannot be negative).
  • Uniqueness: the data must be unique; deduplicate it, normalize it when the dimensions differ, and unify the units when they differ.

3. Feature engineering (selecting the preferred features)
Not all features of the data are what we want, and not all of them have an effect on the algorithm and model; we need to reduce the dimensionality and choose the optimal features that are useful to the model. This step is the most important and is best done by people with domain expertise.
4. Algorithm model

  • Supervised: has a target and always works under the supervision of that target
  • Unsupervised: no target
  • Semi-supervised: a mix of the two
  • Reinforcement: learns through continuous, rapid decision-making, such as AlphaGo's algorithm; covered here only as background

5. Model evaluation
Once a model exists, do not immediately feed in data to predict results; first conduct model evaluation to see whether the existing model meets the requirements.

  • Empirical evaluation: like training a student by teaching the knowledge points and then giving exercises and assessments, so that continuous testing reveals the student's progress. The overall data set is divided into a training set (textbook knowledge), a test set (the exam), and a validation set (practice), and there are three ways to make the division.
    • Hold-out method: set aside part of the overall data as a test or validation set. The first way is to split directly, for example 6/2/2 out of 10 parts, which can break up periodic patterns when the data varies over time; the second is stratification, drawing a fixed proportion (for example 60%) from each stratum of the data as training data.
    • K-fold cross-validation: 1. divide the data into k parts, 2. use each part in turn as the test set, 3. average the results.
      Disadvantage: time-consuming when the amount of data is relatively large.
    • Bootstrap method: random sampling with replacement.
  • Generalization evaluation
    • Underfitting: too few features extracted
    • Overfitting: the extracted features are too specific to the training data

Origin blog.csdn.net/hwwaizs/article/details/124783556