Machine learning literacy: an overview of machine learning


Compiled at 20:14 on September 14, 2020, from Tianchi AI's machine learning literacy material.

1. Machine Learning

A way to realize artificial intelligence

Artificial intelligence mainly consists of several parts: first, perception, including vision, speech, and language; second, decision-making, such as making predictions and judgments; and finally, feedback, which is needed to build a complete system such as a robot or an autonomous-driving car. Among the many capabilities of artificial intelligence, one of the most important is the ability to learn. Machine learning is the core of artificial intelligence and the key to making computers intelligent; if it cannot learn on its own, artificial intelligence is of little use.


To understand artificial intelligence, we need to clarify the relationship between several concepts. Artificial intelligence is the broad concept of making machines think like humans or even surpass them. Machine learning is one way to realize artificial intelligence: algorithms are used to analyze data, learn from it, and then make decisions and predictions about real-world events. Deep learning is one implementation of machine learning that trains networks modeled on the human brain's neural networks. Statistics provides the foundational knowledge behind both machine learning and neural networks.


The biggest feature of machine learning is that it uses data instead of explicit instructions to perform tasks. The learning process mainly includes: data feature extraction, data preprocessing, model training, model testing, and model evaluation and improvement. Next, we focus on the common algorithms used in this process.


Feature extraction -> data preprocessing -> model training -> model testing -> model evaluation and improvement

2. Machine learning algorithms: the key to making computers smart

Machine learning algorithms can be divided into traditional machine learning algorithms and deep learning. Traditional machine learning algorithms mainly include the following five categories:

  • Regression: establish a regression equation to predict a continuously distributed target value
  • Classification: given a large amount of labeled data, predict the label of an unlabeled sample
  • Clustering: group unlabeled data into clusters by distance, so that the data in each cluster share common characteristics
  • Association analysis: find frequent item sets in the data
  • Dimensionality reduction: map data points from the original high-dimensional space into a low-dimensional space


Below we will select several common algorithms and introduce them one by one.

1. Linear regression: find a straight line to predict the target value

A simple scenario: Given the historical data of house price and size, what is the selling price when the area is 2000?


Such problems can be solved by regression algorithms. Regression is a statistical analysis method for determining the quantitative relationship between two or more mutually dependent variables: a regression equation (function) is established to estimate the likely value of the target variable for a given feature value. The most common form is linear regression (Y = aX + b), that is, finding a straight line that predicts the target value. Solving a regression problem means finding the regression coefficients (a, b) of the equation that minimize the error. In the housing-price scenario, the relationship between house area and selling price yields a regression equation that predicts the selling price for a given house area.
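
To make this concrete, here is a minimal sketch of the house-price scenario using scikit-learn's LinearRegression; the areas and prices are invented for illustration, not data from the article.

# A minimal sketch of the house-price scenario with scikit-learn.
# The areas and prices below are made-up illustrative numbers.
import numpy as np
from sklearn.linear_model import LinearRegression

areas = np.array([[1000], [1500], [1800], [2200], [2500]])   # house size
prices = np.array([200, 290, 360, 430, 500])                 # selling price (e.g. in thousands)

model = LinearRegression()
model.fit(areas, prices)                 # solves for a (coef_) and b (intercept_) by least squares

print(model.coef_[0], model.intercept_)  # the fitted a and b
print(model.predict([[2000]]))           # predicted selling price for an area of 2000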


The application of linear regression is very extensive, for example:

Predicting customer lifetime value: based on the relationship between historical data on old customers and their life cycles, build a linear regression model to predict the lifetime value of new customers and then run targeted campaigns.

Airport passenger flow prediction: use massive airport WiFi data and security check-in data to analyze and predict passenger flow in airport terminals.

Money fund inflow and outflow forecasting: use basic user information, subscription and redemption records, yield tables, interbank lending rates, and other information to accurately predict future daily capital inflows and outflows.

Movie box office prediction: Based on Internet public data such as historical box office data, movie review data, and public opinion data, predict the movie box office.

Linear regression: find the straight line that best fits the scattered data points

2. Logistic regression: find a straight line to classify data

Although logistic regression has "regression" in its name, it is a classification algorithm. It maps the output of a linear function through the Sigmoid function, predicts the probability that an event occurs, and classifies accordingly.
Sigmoid is a normalizing function that maps continuous values into the range 0 to 1, providing a way to discretize continuous data.
Intuitively, logistic regression draws a classification line: data on one side of the line, with probability > 0.5, belong to category A; data on the other side, with probability < 0.5, belong to category B.
For example, in the figure, the probability of having a tumor is calculated and the results fall into two classes, one on each side of the classification line.
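
As a minimal sketch of this idea, the snippet below fits a logistic regression on made-up one-dimensional "tumor size" data and reads off the Sigmoid probabilities; the numbers are invented for illustration.

# A minimal sketch of logistic regression on hypothetical one-dimensional data.
import numpy as np
from sklearn.linear_model import LogisticRegression

sizes = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])  # hypothetical "tumor size" feature
labels = np.array([0, 0, 0, 1, 1, 1])                          # 0 = benign, 1 = malignant

clf = LogisticRegression()
clf.fit(sizes, labels)

# predict_proba applies the Sigmoid to the linear function a*x + b,
# so each prediction is a probability between 0 and 1
print(clf.predict_proba([[2.0]]))   # probability of each class
print(clf.predict([[2.0]]))         # the class whose probability exceeds 0.5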



3. K-nearest neighbor: measure the nearest neighbor classification label by distance

A simple scenario: given the number of fighting scenes and kissing scenes in a movie, determine whether it is a romance or an action movie. When there are more kissing scenes, we judge it to be a romance based on experience. How does a computer make the same judgment?


K-nearest neighbor algorithm can be used, and its working principle is as follows:

  • Calculate the distance between each point in the sample data set and the current point
  • Take the classification labels of the samples most similar to the current point (its nearest neighbors)
  • Count how often each category appears among the first k points; generally only the k most similar samples are used, which is where the k in k-nearest neighbors comes from, and k is usually an integer no greater than 20
  • Return the most frequent category among those k points as the predicted category of the current point

In the movie classification scene, k is 3, and the three nearest points by distance are two action movies, (108, 5) and (115, 8), and one romance movie, (5, 89). Among these three points, action movies appear with frequency two-thirds and romance movies with frequency one-third, so the movie marked by the red dot is classified as an action movie.
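
A minimal sketch of the movie example with scikit-learn's KNeighborsClassifier; the training points beyond the three mentioned above are invented for illustration.

# Features are (number of fighting scenes, number of kissing scenes).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[108, 5], [115, 8], [120, 10], [5, 89], [8, 95], [12, 72]])
y = np.array(['action', 'action', 'action', 'romance', 'romance', 'romance'])

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3, as in the example
knn.fit(X, y)

print(knn.predict([[90, 18]]))   # a hypothetical new movie with many fighting scenes -> 'action'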

A common application of the k-nearest neighbor algorithm is handwritten digit recognition. To the human brain, a handwritten digit is an image; to the computer, it is a two-dimensional or three-dimensional array. How can the computer recognize the digit?

The specific steps for recognition with the k-nearest neighbor algorithm are as follows (a runnable sketch follows the steps):

  • First, process each picture so that it has the same color depth and size: a width and height of 32 × 32 pixels.
  • Convert each 32 × 32 binary image matrix into a 1 × 1024 test vector.
  • Store the training samples in a training matrix with m rows and 1024 columns; each row of the matrix stores one image.
  • Calculate the distance between the target sample and the training samples, and take the most frequent digit among the first k points as the predicted class of the current handwritten digit.
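
A minimal sketch of the idea, using scikit-learn's built-in digits (8 × 8 images rather than the 32 × 32 images described above); the flatten-then-classify approach is the same.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X = digits.images.reshape(len(digits.images), -1)   # flatten each 8x8 image into a 64-long vector
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # fraction of test digits recognized correctly
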
K-nearest neighbor: shortest distance

4. Naive Bayes: Choose the class with the largest posterior probability as the classification label

A simple scenario: bowl 1 (C1) has 30 fruit candies and 10 chocolate candies, and bowl 2 (C2) has 20 fruit candies and 20 chocolate candies. Now choose a bowl at random, take out a candy, and find that it is a fruit candy.

Which bowl is the fruit candy (X) most likely to have come from? This type of problem can be solved with Bayes' formula, without building a model of the target variable. For classification, the probability that the sample belongs to each category is computed, and the category with the larger probability is chosen as the classification result.

P(X|C): the conditional probability, the probability that X appears given C
P(C): the prior probability, the probability that C appears
P(C|X): the posterior probability, the probability that X belongs to class C

Suppose there are two classes, C1 and C2. Since P(X) is the same for both, it does not need to be considered; only the following comparison matters:
If P(X|C1)P(C1) > P(X|C2)P(C2), then P(C1|X) > P(C2|X) and X belongs to C1;
if P(X|C1)P(C1) < P(X|C2)P(C2), then P(C1|X) < P(C2|X) and X belongs to C2.

For example, in the scenario above:
P(X): the probability of drawing a fruit candy is 5/8
P(X|C1): the probability of a fruit candy in the first bowl is 3/4
P(X|C2): the probability of a fruit candy in the second bowl is 2/4
P(C1) = P(C2): each bowl is equally likely to be chosen, so both are 1/2

Then the probability that the fruit candy came from the first bowl is:
P(C1|X) = P(X|C1)P(C1)/P(X) = (3/4)(1/2)/(5/8) = 3/5
The probability that it came from the second bowl is:
P(C2|X) = P(X|C2)P(C2)/P(X) = (2/4)(1/2)/(5/8) = 2/5
Since P(C1|X) > P(C2|X), this candy most likely came from the first bowl.
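
The same arithmetic can be checked in a few lines of Python; this is just the candy-bowl calculation above, not a general classifier.

# Compute both posteriors directly from Bayes' formula.
p_c1 = p_c2 = 1 / 2          # prior: each bowl is equally likely to be chosen
p_x_given_c1 = 30 / 40       # fruit candy in bowl 1
p_x_given_c2 = 20 / 40       # fruit candy in bowl 2
p_x = p_x_given_c1 * p_c1 + p_x_given_c2 * p_c2   # probability of drawing a fruit candy = 5/8

p_c1_given_x = p_x_given_c1 * p_c1 / p_x   # 3/5
p_c2_given_x = p_x_given_c2 * p_c2 / p_x   # 2/5
print(p_c1_given_x, p_c2_given_x)          # the candy most likely came from bowl 1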

The main applications of Naive Bayes include text classification, spam filtering, sentiment analysis, and multi-class real-time prediction.

Naive Bayes: posterior probability

5. Decision tree: construct a classification tree with the fastest drop in entropy

A simple scenario: when going on a blind date, one might first check whether the prospective partner owns a house. If so, consider further contact; if not, then check whether the person is ambitious. If not, say goodbye; if so, add them to the candidate list.

This is a simple decision tree model. A decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents a test outcome, and each leaf node represents a category. It is built with a top-down recursive method, choosing the feature with the largest information gain as the current splitting feature.
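
A minimal sketch of the blind-date tree with scikit-learn; the 0/1 encoding and the toy samples are invented for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# features: [has_house, is_ambitious]
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = ['contact', 'contact', 'candidate', 'goodbye']

tree = DecisionTreeClassifier(criterion='entropy')   # split on information gain (entropy)
tree.fit(X, y)

print(export_text(tree, feature_names=['has_house', 'is_ambitious']))
print(tree.predict([[0, 1]]))   # no house but ambitious -> 'candidate'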

The decision tree can be applied to: user classification evaluation, loan risk evaluation, stock selection, bidding decision, etc.


6. Support vector machine (SVM): construct a hyperplane to classify nonlinear data

A simple scenario: separate balls of two colors with a line, in a way that still works after more balls are added. Both line A and line B satisfy the condition at first, but if more balls are added, line A still separates the balls well while line B does not.


To increase the difficulty further: when the balls no longer have a clear dividing line, they cannot be separated by a straight line. How can this be solved?


Two issues related to support vector machines in this scenario:

  1. When the data in a classification problem are linearly separable, simply place the line at the position that maximizes the distance between the balls and the line. The process of finding this maximum margin is called optimization.
  2. In general, data are not linearly separable. A kernel function can map the data from two dimensions to a higher dimension, where a hyperplane can separate them.

Decision planes in different directions usually have different classification margins, and the decision plane with the maximum margin is the optimal solution the SVM looks for. The sample points that the dashed lines on both sides of the optimal solution pass through are the supporting sample points, called support vectors.
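
A minimal sketch of the two situations with scikit-learn's SVC, using its synthetic "circles" data as an example of balls with no straight dividing line.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)   # a straight line cannot separate the circles well
rbf_svm = SVC(kernel='rbf').fit(X, y)         # the kernel maps the data to a higher dimension

print(linear_svm.score(X, y))            # low accuracy
print(rbf_svm.score(X, y))               # close to 1.0
print(len(rbf_svm.support_vectors_))     # the support vectors that define the maximum-margin plane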

The application of SVM is very wide, and it can be applied to spam recognition, handwriting recognition, text classification, stock selection, etc.

SVM: optimization, maximum margin

7. K-means: Calculate centroid, cluster unlabeled data

In the classification algorithms described above, the data set to be classified is already labeled, for example as ○ or ×, and the two classes are separated by learning a hypothesis function. For unlabeled data sets, we would like an algorithm that automatically gathers similar elements into closely related subsets or clusters. This is what clustering algorithms do.


To give a concrete example: suppose we have age data for a group of people, and we know roughly that there are a group of children, a group of young people, and a group of elderly people mixed together.

Clustering automatically discovers these three groups and gathers similar data into the same group. To cluster into three groups, the input is the pile of age data. Note that the ages carry no class labels: we only know that there are roughly three groups of people, not who belongs to which group. The output is the cluster label each data point belongs to; after clustering, we know who ended up in the same group.

Classification, in contrast, is told in advance what the ages of children, young people, and the elderly are. When a new age arrives, we input it and output the class it belongs to; generally, a classifier must be trained before it can recognize new data.

K-Means algorithm is a common clustering algorithm, and its basic steps are:

  1. Randomly generate k initial points as centroids;
  2. Sort the data in the data set into clusters according to the distance from the centroid;
  3. Take the average of the data in each cluster as the new centroid, and repeat the previous step until the clusters no longer change. The farther apart the resulting clusters are, the better the clustering effect (a minimal sketch follows these steps).
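
A minimal sketch of these steps with scikit-learn's KMeans on the age example; the ages are invented for illustration and carry no labels.

import numpy as np
from sklearn.cluster import KMeans

ages = np.array([3, 5, 8, 10, 25, 28, 32, 35, 65, 70, 72, 80]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)   # k = 3 groups
labels = kmeans.fit_predict(ages)

print(labels)                     # the cluster label assigned to each age
print(kmeans.cluster_centers_)    # the centroid (average age) of each cluster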

A typical application of the K-means algorithm is customer value segmentation for precision marketing. Take airlines as an example: because of fierce competition, the focus of corporate marketing has shifted from the product to the customer, so building a reasonable customer value evaluation model, classifying customers, and marketing precisely are the keys to the problem.

Customer value is identified through five indicators: the most recent consumption interval R, the consumption frequency F, the average flight mileage M, the discount coefficient C, and the length of the customer relationship L (the LRFMC model). The K-Means algorithm groups the customer data into five categories (the number of categories is determined together with business understanding and analysis), and a radar chart of the characteristics of each customer group is drawn.

Customer value analysis:

  • Important customers to keep: C, F, and M are high and R is low. Resources should be devoted to these customers first, with differentiated management to improve their loyalty and satisfaction.
  • Important customers to develop: C is high while R, F, and M are low. These customers joined recently (small L) and have low current value but great development potential; they should be encouraged to spend more with the company and its partners.
  • Important customers to retain: C, F, or M is high, but R is high or L is becoming small, so the change in customer value is highly uncertain. The latest customer information should be tracked and interaction with these customers maintained.
  • General and low-value customers: C, F, M, and L are low and R is high. Such customers may only buy during discount promotions.

An interesting application of the K-means algorithm is image compression. In a color image, each pixel occupies 3 bytes (RGB), so the total number of representable colors is 256 × 256 × 256. The K-means algorithm groups similar colors into K clusters, so only the cluster label of each pixel and the color code of each cluster are needed to compress the image.
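
A minimal sketch of this color quantization, using one of scikit-learn's bundled sample images.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

image = load_sample_image("china.jpg")            # shape (height, width, 3), RGB bytes
pixels = image.reshape(-1, 3).astype(float)       # one row per pixel

kmeans = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)   # K = 16 colors

# The compressed representation: one small label per pixel plus 16 color codes.
labels = kmeans.labels_
palette = kmeans.cluster_centers_.astype(np.uint8)

compressed = palette[labels].reshape(image.shape)  # rebuild the image from labels + palette
print(image.shape, compressed.shape, len(palette))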

K-means:important!

8. Association analysis: mining beer and diapers (frequent item sets) association rules

At a Wal-Mart supermarket in the United States in the 1990s, managers analyzing sales data found that two seemingly unrelated products, "beer" and "diapers", often appeared in the same shopping basket. An investigation revealed that this phenomenon occurred among young fathers. In American families with babies, the mother usually takes care of the baby at home, so the young father goes to the supermarket to buy diapers and often buys beer for himself at the same time. If a store sells only one of the two items, he is likely to give up and go to another store where he can buy beer and diapers together. Wal-Mart noticed this unique phenomenon and began placing beer and diapers in the same area of the store, so that young fathers could find both products at once, which produced good sales results.

The "beer + diaper" story uses an association algorithm; a common association algorithm is FP-growth.

Several related concepts in the algorithm:

  • Frequent item sets: sets of items that frequently appear together in the database. For example, {'beer'}, {'diaper'}, and {'beer','diaper'} appear frequently in shopping-list data.
  • Association rules: from set A, set B can be derived with a certain confidence; that is, if A happens, B is also likely to happen. For example, people who bought {'diaper'} are likely to buy {'beer'}.
  • Support: the proportion of the whole data set in which a given frequent item set appears. If the data set has 10 records and 5 of them contain {'beer','diaper'}, then the support of {'beer','diaper'} is 5/10 = 0.5.
  • Confidence: for an association rule such as {'diaper'} -> {'beer'}, the confidence is the support of {'diaper','beer'} divided by the support of {'diaper'}.

Assuming that the support of {'diaper','beer'} is 0.45 and the support of {'diaper'} is 0.5, the confidence of {'diaper'} -> {'beer'} is 0.45 / 0.5 = 0.9.
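
The same support and confidence can be computed directly from a made-up list of transactions:

# The transactions below are invented for illustration.
transactions = [
    {'diaper', 'beer', 'milk'},
    {'diaper', 'beer'},
    {'diaper', 'bread'},
    {'beer', 'bread'},
    {'diaper', 'beer', 'bread'},
]

def support(itemset):
    """Fraction of transactions that contain every item in the set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({'diaper', 'beer'}))        # 3/5 = 0.6
print(confidence({'diaper'}, {'beer'}))   # 0.6 / 0.8 = 0.75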

Buying beer given that diapers were already bought

Association analysis is widely used. For example, it can be used to formulate marketing strategies: just as in the beer-and-diaper example, placing beer and diapers next to each other increases the sales of both. It can also be used to find co-occurring words: when "Puyuan" is typed into a browser, the browser automatically suggests entries such as "Puyuan Platform" and "Puyuan EOS". A simple case for the FP-growth algorithm: analyze the relationships between products from shopping cart data.


The analysis steps are:

  1. Mining frequent itemsets from shopping cart data
  2. Generate association rules from frequent itemsets and calculate support
  3. Output confidence


According to the results, it can be analyzed that if you buy shoes, you are most likely to buy socks at the same time; if you buy eggs and bread, you are most likely to buy milk.


9. PCA dimensionality reduction: reduce data dimensions and reduce data complexity

Dimensionality reduction maps data points from the original high-dimensional space into a low-dimensional space. Because the number of high-dimensional features is huge and distance computation is difficult, classifier performance drops as the number of features increases; reducing the error caused by redundant high-dimensional information improves recognition accuracy.


The most commonly used method is principal component analysis (PCA). It uses a linear projection to map high-dimensional data into a low-dimensional space, choosing the projection so that the variance of the data in the projected dimensions is as large as possible; in this way fewer dimensions are used while more of the characteristics of the original data points are retained.
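
A minimal sketch of PCA with scikit-learn, reducing the 64-dimensional digit images to two dimensions:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                      # shape (1797, 64)

pca = PCA(n_components=2)                   # keep the 2 directions with the largest variance
X_low = pca.fit_transform(X)

print(X_low.shape)                          # (1797, 2)
print(pca.explained_variance_ratio_)        # how much of the original variance each component keeps
X_approx = pca.inverse_transform(X_low)     # map back: the more components kept, the closer to the original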


For example, when reducing the dimensionality of digit images, even a single eigenvector preserves the basic outline of the digit 3; the more eigenvectors are used, the closer the reconstruction is to the original data.


10. Artificial neural network: abstract layer by layer, approximating any function

The previous sections introduced nine traditional machine learning algorithms; now we introduce the foundation of deep learning: the artificial neural network. It is a model designed to imitate the neural network of the human brain. It is composed of many interconnected nodes (artificial neurons) and can be used to model complex relationships in data. The connections between nodes carry different weights, and each weight represents the influence of one node on another. Each node represents a specific function and combines the information coming from other nodes according to the corresponding weights. The network as a whole is a learnable function: trained on different data, it continuously adjusts its weights to obtain a model that fits reality. A three-layer neural network can approximate any function.


For example, a single neuron can implement a logic AND gate, and combining such neurons into a two-layer network yields an XNOR gate.
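
A minimal sketch of the AND gate as a single neuron, using hand-picked weights (a common textbook choice) rather than trained ones:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def and_gate(x1, x2):
    # weights 20, 20 and bias -30: the output is near 1 only when both inputs are 1
    return sigmoid(20 * x1 + 20 * x2 - 30)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(and_gate(x1, x2)))   # prints the AND truth table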


The neurons in each layer of a multilayer neural network learn a more abstract representation of the neuron values in the previous layer, extracting ever more abstract features in order to distinguish things and thereby obtaining better discrimination and classification ability. For example, in image recognition, the first hidden layer learns "edge" features, the second layer learns "shape" features composed of edges, the third layer learns "pattern" features composed of shapes, and the last hidden layer learns "target" features composed of patterns.



11. Deep learning: Give artificial intelligence a bright future

Deep learning is a branch of machine learning and the development of artificial neural networks. Deep learning is the core driver of today's artificial intelligence explosion, giving artificial intelligence a bright future.

Consider the difference between deep learning and traditional machine learning. In traditional machine learning, feature processing and prediction are separate, and feature processing generally requires manual intervention. Such models are called shallow models, or shallow learning: they do not involve feature learning, and their features are mainly extracted from manual experience or by feature-conversion methods.


To improve the expressive power of a representation, the key is to build a multi-level feature representation with a certain depth. The advantage of a deep structure is that it increases the reusability of features, which exponentially increases expressive power. Starting from low-level features, multiple non-linear transformations are generally needed to obtain more abstract high-level semantic features. This method of automatically learning effective features is called "representation learning".

Deep learning is a method based on representation learning of data. Using a multi-layer network, it learns abstract concepts on its own, gradually abstracting related concepts from a large number of samples layer by layer, forming an understanding, and finally making judgments and decisions. By building a model with a certain "depth", the model can automatically learn good feature representations (from low-level features to mid-level features and then to high-level features), which ultimately improves the accuracy of prediction or recognition.

At present, deep learning is widely used, such as image recognition, speech recognition, machine translation, autonomous driving, financial risk control, intelligent robots, etc.



3. Anaconda: the first choice for beginners in Python and introductory machine learning

Now that you understand the algorithms used in the machine learning process, how do you put them into practice? Anaconda is the first choice for beginners in Python and machine learning. It is a Python distribution for scientific computing that provides package management and environment management, conveniently solving the problems of having multiple Python versions coexist, switching between them, and installing third-party packages.


Integrated packages include:

  • NumPy: provides matrix operations. It is generally used together with SciPy and Matplotlib and is the foundation of the higher-level tools built with Python; it does not itself provide advanced data analysis functions.
  • SciPy: depends on NumPy and provides convenient, fast N-dimensional array operations, with modules for optimization, linear algebra, integration, and other common tasks in data science.
  • Pandas: a tool built on NumPy, created for data analysis tasks; it includes advanced data structures and tools that make data analysis fast and simple.
  • Matplotlib: Python's most famous plotting library

In addition, Scikit-Learn is an open-source machine learning toolkit integrated with Anaconda. It mainly covers classification, regression, and clustering algorithms, so traditional machine learning algorithms can be called directly. Anaconda is also compatible with TensorFlow, the second-generation artificial intelligence system developed by Google for deep learning.

Finally, a Python-based decision tree case gives an intuitive view of the machine learning process: a decision tree for loan applications is built and then used to classify future loan applications.

Specific implementation process:

  1. Prepare the data set: From the loan application sample data table, select the features that have the ability to classify the training data
  2. Build a tree: select the feature with the largest information gain as the split feature to build a decision tree
  3. Data visualization: Use Matplotlib to visualize data
  4. Perform classification: use the tree to classify actual data. For example, input the test data [0, 1], meaning no house but a job; the classification result is to grant the loan (a minimal sketch of these steps follows).
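
A minimal sketch of these steps with scikit-learn; the sample data below is invented to mirror the described features (has a house, has a job) and is not the article's data set.

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# features: [has_house, has_job];  labels: 1 = grant the loan, 0 = refuse
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(criterion='entropy')   # split on the feature with the largest information gain
tree.fit(X, y)

plot_tree(tree, feature_names=['has_house', 'has_job'], class_names=['refuse', 'grant'])
plt.show()                     # visualize the tree with Matplotlib

print(tree.predict([[0, 1]]))  # no house but has a job -> [1], i.e. grant the loan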



4. Summary

This is only a basic overview.

For a more hands-on feel, please refer to the attached code below.

# -*- coding: utf-8 -*-
#
# Basic exercise: fitting data with a straight line
#

import numpy as np
import matplotlib.pyplot as plt

# raw data
X = [1, 2, 3, 4, 5, 6]
Y = [2.6, 3.4, 4.7, 5.5, 6.47, 7.8]

# fit with a first-degree polynomial, i.e. a linear fit
z1 = np.polyfit(X, Y, 1)   # the last argument is the degree; change it for higher-order fits
p1 = np.poly1d(z1)         # build the polynomial from the coefficients, highest power first
print(z1)   # approximately [1.029 1.477]: the coefficients a and b
print(p1)   # approximately 1.029 x + 1.477

# plot the result
x = np.arange(1, 7)
y = z1[0] * x + z1[1]
plt.figure()
plt.scatter(X, Y)   # the scattered data points
plt.plot(x, y)      # the fitted straight line
plt.show()

# np.polyfit and np.poly1d are introduced at:
# https://www.cnblogs.com/qi-yuan-008/p/12323535.html



Origin blog.csdn.net/weixin_44063734/article/details/108587687