Artificial Intelligence: Machine Learning and Artificial Neural Networks

Machine learning

The main content of the machine learning part is the naive Bayes algorithm and the decision tree algorithm.

Machine learning studies how computers can simulate human learning behavior: acquiring new knowledge or skills and reorganizing existing knowledge to improve their own performance, so that computers gain human-like learning ability and thereby realize artificial intelligence.

Machine learning is often classified as:

  • Supervised learning: learns a mapping from inputs to outputs when both the inputs and outputs are known; often used for classification and regression tasks.
  • Unsupervised learning: there is no correct output, only inputs; the model summarizes the characteristic structure of the data by itself. Often used for clustering tasks.
  • Reinforcement learning: the agent chooses an action to act on the environment, and the environment returns a reinforcement signal and a new state. Based on the reinforcement signal and the current state of the environment, the agent selects the next action, preferring actions that increase the probability of receiving positive reinforcement.

Decision tree algorithm

The decision tree algorithm is a commonly used classification algorithm in machine learning. In a decision tree, an attribute is used to split the samples at each node, and there are many ways to select the splitting attribute. The simplest is the ID3 algorithm, which uses information gain as the criterion for selecting attributes.

  • Entropy: Entropy measures the degree of disorder of a system; the larger the entropy, the greater the disorder, and the smaller the entropy, the smaller the disorder. For a random event S with N possible outcomes, where outcome i occurs with probability p_i, the entropy of the event is:
    H(S) = −∑_{i=1}^{N} p_i log₂ p_i
  • Information gain: Information gain measures how much feature A reduces the uncertainty about class D. It is the entropy of the classes before splitting on an attribute minus the weighted average entropy after splitting on that attribute's values. It is defined as:
    Gain(D, A) = H(D) − ∑_{i=1}^{V} (|D_i| / |D|) · H(D_i)
    where D_i is the subset of samples taking the i-th value of attribute A, and V is the number of distinct values of attribute A.

When building a decision tree, select the attribute with the largest information gain to split the samples. After each split, recompute the information gain of each remaining attribute and continue splitting until all attributes have been used or the samples are completely separated.
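
To make this concrete, here is a minimal sketch in Python of the entropy and information-gain computations defined above (the tiny dataset and attribute names are invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, attribute):
    """Gain(D, A) = H(D) - sum over values of |D_i|/|D| * H(D_i)."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(s[attribute] for s in samples):
        subset = [l for s, l in zip(samples, labels) if s[attribute] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# ID3 picks the attribute with the largest information gain at each split.
samples = [{"temperature": "hot", "taste": "sweet"},
           {"temperature": "hot", "taste": "sour"},
           {"temperature": "cold", "taste": "sweet"}]
labels = ["yes", "no", "yes"]
print(max(samples[0], key=lambda a: information_gain(samples, labels, a)))
```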

Question
[Figure: training set with attributes temperature, taste, and serving size, and the class label attraction]
a) Concept question

b)
Calculate the information gain of each attribute to construct a decision tree:
H(attraction) = −(1/2)log₂(1/2) − (1/2)log₂(1/2) = 1
Temperature takes two values, each accounting for half of the samples; 2/5 of the hot samples and 3/5 of the cold samples are attractive.

H(attraction | temperature) = (1/2)(−(2/5)log₂(2/5) − (3/5)log₂(3/5)) + (1/2)(−(2/5)log₂(2/5) − (3/5)log₂(3/5)) = 0.971
Taste takes three values, sweet, sour, and salty, in the ratio 4:3:3; the attractive fractions within these groups are 1/2, 1, and 0 respectively.

H(attraction | taste) = 0.4(−0.5 log₂ 0.5 − 0.5 log₂ 0.5) + 0 + 0 = 0.4
Serving size takes two values, each accounting for half of the samples; the attractive ones account for 1/5 (large) and 4/5 respectively.

H(attraction | serving size) = 0.5(−0.2 log₂ 0.2 − 0.8 log₂ 0.8) + 0.5(−0.2 log₂ 0.2 − 0.8 log₂ 0.8) = 0.722
We can then calculate the information gain of each attribute:

Gain(attraction, temperature) = 1 − 0.971 = 0.029
Gain(attraction, taste) = 1 − 0.4 = 0.6
Gain(attraction, serving size) = 1 − 0.722 = 0.278
Therefore, taste is selected as the attribute for the first split. After this split, the sour and salty groups are already pure, and the sweet group can clearly be separated into two classes by serving size, so serving size is selected next. This gives the final decision tree:
[Figure: the resulting decision tree, with taste at the root and serving size splitting the sweet branch]
For the given samples, no third layer using temperature is needed; the tree already classifies them all.
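
The entropies and gains above can be verified with a few lines of Python (a quick check, using the class fractions read off earlier):

```python
import math

def H(*ps):
    """Entropy of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

h = H(1/2, 1/2)                                         # H(attraction) = 1
h_temp = 0.5 * H(2/5, 3/5) + 0.5 * H(2/5, 3/5)          # 0.971
h_taste = 0.4 * H(1/2, 1/2) + 0.3 * H(1) + 0.3 * H(1)   # 0.4
h_size = 0.5 * H(1/5, 4/5) + 0.5 * H(1/5, 4/5)          # 0.722
print(h - h_temp, h - h_taste, h - h_size)              # ~0.029, 0.6, 0.278
```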

c)
Querying the decision tree yields the prediction "no".

Naive Bayes algorithm

The Naive Bayes algorithm is commonly used to build classifiers. From the training samples it estimates the probability of each feature value within each class, and from these it computes the probability that a sample belongs to a given class given its features. For a sample to be classified, the probability of each class is computed from the sample's features, and the class with the highest probability is chosen as the classification result.
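
A minimal sketch of such a classifier in Python, estimating all probabilities by frequency counts as described (no smoothing; the class and data here are invented for illustration):

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Frequency-count naive Bayes for discrete features (no smoothing)."""

    def fit(self, X, y):
        n = len(y)
        self.class_counts = Counter(y)
        self.priors = {c: cnt / n for c, cnt in self.class_counts.items()}
        # (feature index, feature value, class) -> count
        self.feature_counts = defaultdict(int)
        for xs, c in zip(X, y):
            for j, v in enumerate(xs):
                self.feature_counts[(j, v, c)] += 1

    def predict(self, xs):
        # choose the class maximizing P(C) * product over j of P(x_j | C)
        def score(c):
            p = self.priors[c]
            for j, v in enumerate(xs):
                p *= self.feature_counts[(j, v, c)] / self.class_counts[c]
            return p
        return max(self.priors, key=score)

clf = NaiveBayes()
clf.fit([(1, "S"), (1, "P"), (2, "Q"), (3, "S")], [-1, 1, 1, -1])
print(clf.predict((1, "S")))  # -> -1
```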

Question
[Figure: training set of 15 samples with feature X1 ∈ {1, 2, 3}, feature X2 ∈ {S, P, Q}, and class C ∈ {1, -1}]
First calculate the probability of each feature value under each class (following the answer, you only need to compute the ones that are actually used):

P(C=1) = 9/15
P(C=-1) = 6/15
P(X1=1|C=1) = 2/9
P(X1=2|C=1) = 3/9
P(X1=3|C=1) = 4/9
P(X1=1|C=-1) = 3/6
P(X1=2|C=-1) = 2/6
P(X1=3|C=-1) = 1/6
P(X2=S|C=1) = 1/9
P(X2=P|C=1) = 4/9
P(X2=Q|C=1) = 4/9
P(X2=S|C=-1) = 3/6
P(X2=P|C=-1) = 2/6
P(X2=Q|C=-1) = 1/6

Then compute the desired posterior probabilities:
P(C=1 | X1=3, X2=S) = αP(C=1)P(X1=3, X2=S | C=1) = αP(C=1)P(X1=3|C=1)P(X2=S|C=1) = α · (9/15)(4/9)(1/9) ≈ 0.0296α
P(C=-1 | X1=3, X2=S) = αP(C=-1)P(X1=3|C=-1)P(X2=S|C=-1) = α · (6/15)(1/6)(3/6) ≈ 0.0333α
Since 0.0333α > 0.0296α, the classifier predicts the class of this sample to be C = -1.

Here P(X1=3, X2=S | C) can be written directly as P(X1=3|C)P(X2=S|C) because the naive Bayes algorithm adopts the conditional independence assumption, which simplifies the computation by assuming that all features are mutually independent given the class.
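
Plugging the estimated probabilities into this factored form reproduces the numbers above, as a quick check:

```python
# posterior scores up to the normalizing constant α
p_pos = (9/15) * (4/9) * (1/9)   # P(C=1)·P(X1=3|C=1)·P(X2=S|C=1)
p_neg = (6/15) * (1/6) * (3/6)   # P(C=-1)·P(X1=3|C=-1)·P(X2=S|C=-1)
print(round(p_pos, 4), round(p_neg, 4))  # 0.0296 0.0333 -> predict C = -1
```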

Artificial neural networks

The main content of the artificial neural network part is the principle of artificial neural networks, the perceptron algorithm, and multi-layer neural networks.

Perceptron algorithm

The goal of the perceptron algorithm is to minimize a loss function, learning the model parameters w and b to find a separating hyperplane that divides the samples.

[Figure: a two-input perceptron with inputs x1 and x2, weights w1 and w2, and threshold θ]
The mapping relationship from input to output is as follows:
Y = F(w₁x₁ + w₂x₂ − θ)
F is the activation function, taken here to be a step function: its output is 0 when the input is less than 0 and 1 when the input is greater than 0.

When training the perceptron, let i denote the training round and α the learning rate; in each round the error E(i) is computed and the parameters are adjusted:

E(i) = d(i) − y(i)
w₁(i+1) = w₁(i) + αE(i)x₁(i)
w₂(i+1) = w₂(i) + αE(i)x₂(i)
θ(i+1) = θ(i) + αE(i)(−1)
where d(i) is the desired output in round i. The adjustment term is, in fact, the learning rate times the negative gradient of the loss function with respect to the parameter, and this gradient is obtained by the chain rule:
∂L/∂w₁ = (∂L/∂O) × (∂O/∂y) × (∂y/∂w₁)
The loss function is usually taken as:
LOSS = (1/2)(correct output − actual output)²
Differentiating gives ∂L/∂O = −(correct output − actual output)
The step function is not differentiable, so the perceptron algorithm simply omits its derivative when differentiating the weighted sum, leaving only:
y = w₁x₁ + w₂x₂ − θ, and differentiating gives ∂y/∂w₁ = x₁
Therefore, the final parameter update is
w₁(i+1) = w₁(i) − α · ∂L/∂w₁ = w₁(i) + α(correct output − actual output) × x₁(i) = w₁(i) + αE(i)x₁(i)
The factor multiplying α is the negative of the gradient, which is why the update reduces to adding αE(i)x₁(i). The update rule for θ is derived in the same way.
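
Putting the update rules together, a minimal perceptron training loop might look like the following sketch (the learning rate, round count, and toy AND dataset are arbitrary choices for illustration):

```python
def step(v):
    """Step activation: 1 if the input is greater than 0, else 0."""
    return 1 if v > 0 else 0

def train_perceptron(samples, alpha=0.1, rounds=20):
    w1 = w2 = theta = 0.0
    for _ in range(rounds):
        for (x1, x2), d in samples:
            y = step(w1 * x1 + w2 * x2 - theta)
            e = d - y                  # E(i) = d(i) - y(i)
            w1 += alpha * e * x1       # w1(i+1) = w1(i) + αE(i)x1(i)
            w2 += alpha * e * x2       # w2(i+1) = w2(i) + αE(i)x2(i)
            theta += alpha * e * (-1)  # θ is adjusted with a fixed input of -1
    return w1, w2, theta

# toy example: learn the logical AND function
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1, w2, theta = train_perceptron(data)
print([step(w1 * x1 + w2 * x2 - theta) for (x1, x2), _ in data])  # [0, 0, 0, 1]
```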

Multi-layer neural networks

The core of this part is the principle of gradient descent and the computation of error backpropagation. It is actually explained more clearly in the deep learning slides; the main thing to learn is deriving the error backpropagation formulas.
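
As a concrete illustration (not the exam question itself), here is a sketch of one backpropagation step for a tiny 2-2-1 network with sigmoid activations and squared-error loss; all weights, inputs, and the learning rate are made-up values:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# forward pass: input -> hidden layer -> output
x = [0.5, 0.8]
w_h = [[0.1, 0.2], [0.3, 0.4]]   # w_h[j][i]: weight from input i to hidden unit j
w_o = [0.5, 0.6]                 # weights from hidden units to the output
d = 1.0                          # desired output
alpha = 0.5                      # learning rate

h = [sigmoid(sum(w_h[j][i] * x[i] for i in range(2))) for j in range(2)]
o = sigmoid(sum(w_o[j] * h[j] for j in range(2)))

# backward pass with L = (d - o)^2 / 2
delta_o = (d - o) * o * (1 - o)  # output error term
delta_h = [h[j] * (1 - h[j]) * w_o[j] * delta_o for j in range(2)]

# gradient-descent updates (step in the negative gradient direction)
for j in range(2):
    w_o[j] += alpha * delta_o * h[j]
    for i in range(2):
        w_h[j][i] += alpha * delta_h[j] * x[i]
print(o, w_o, w_h)
```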

Question
[Figures: question statement and worked solution, including part (2)]

Source: blog.csdn.net/Aaron503/article/details/130953711