Artificial Intelligence - Experiment 3


1. Experimental purpose

Master the ideas behind classification algorithms, including the Naive Bayes algorithm and the decision tree algorithm, and implement the Naive Bayes algorithm to perform classification. This experiment focuses on implementing the Naive Bayes algorithm; while completing it, I also reviewed the decision tree classification algorithm covered in the course.

2. Experimental principles

1. Conditional probability and Bayes' formula

Conditional probability is the probability that event A occurs given that another event B has occurred, denoted P(A|B). For two events A and B:
$$P(A\mid B) = \frac{P(AB)}{P(B)}$$
Bayes' formula can be derived from the conditional probability formula:
$$P(AB) = P(A\mid B)P(B), \qquad P(B\mid A) = \frac{P(AB)}{P(A)} = \frac{P(A\mid B)P(B)}{P(A)}$$
Bayes' formula is often used to find the probability of each possible cause given that an event is known to have occurred.
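As a quick numerical illustration (the numbers here are made up): suppose 10% of melons are bad, a bad melon has a fuzzy texture with probability 0.8, and a good melon has a fuzzy texture with probability 0.1. Expanding the denominator with the law of total probability gives
$$P(\text{bad}\mid\text{fuzzy}) = \frac{P(\text{fuzzy}\mid\text{bad})P(\text{bad})}{P(\text{fuzzy}\mid\text{bad})P(\text{bad}) + P(\text{fuzzy}\mid\text{good})P(\text{good})} = \frac{0.8\times 0.1}{0.8\times 0.1 + 0.1\times 0.9} \approx 0.47$$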

2. Naive Bayes classification algorithm

The Naive Bayes classification algorithm is a classification method based on Bayes' theorem and the assumption of conditional independence among features. For a given training set, the joint probability distribution of input and output (the prior probabilities and the conditional probability distributions) is learned under the conditional independence assumption. Once these distributions have been learned, Bayes' theorem is used to find the output with the largest posterior probability, which is taken as the prediction result.

Model training amounts to learning P(Ai|B), where B is a specific category and Ai is a feature. When computing these probabilities, the features are assumed not to affect one another, i.e., each feature is conditionally independent given the class. This conditional independence assumption simplifies the Naive Bayes method, but it may sacrifice some classification accuracy.

Model prediction uses Bayes' formula to compute P(B|Ai), i.e., the probability that the input belongs to each category given its feature values; the category with the highest probability is taken as the predicted class of the input sample. According to Bayes' formula, P(B|Ai) can be written as:
$$P(B\mid A_i) = \frac{P(A_i\mid B)P(B)}{P(A_i)}$$
Since the denominator P(Ai) is the same for every category, it suffices to compute the numerator and compare the values. Under the conditional independence assumption, the numerator becomes:
$$P(B)\,P(A_1\mid B)\,P(A_2\mid B)\cdots P(A_n\mid B)$$
where each P(Ai|B) is a conditional probability computed during model training.
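A minimal sketch of this decision rule, assuming the priors and conditional probabilities have already been estimated and stored in dictionaries shaped like those used later in the implementation (the function name and the default of 0 for unseen values are illustrative):

def nb_predict_one(sample, label_prob, condition_prob):
    """Return the class B maximizing P(B) * prod_i P(A_i | B).

    sample:         list of feature values [A_1, ..., A_n]
    label_prob:     {class: prior probability P(B)}
    condition_prob: {class: {feature index: {feature value: P(A_i | B)}}}
    """
    best_class, best_score = None, -1.0
    for c, prior in label_prob.items():
        score = prior
        for i, value in enumerate(sample):
            # unseen feature values get probability 0 here; smoothing is discussed later
            score *= condition_prob[c][i].get(value, 0.0)
        if score > best_score:
            best_class, best_score = c, score
    return best_class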

3. Decision tree algorithm

A decision tree is a prediction model. Each node in the tree represents an object, each branch represents a possible attribute value, and each leaf node corresponds to the path taken from the root node to that leaf; such a path represents a complete process of making decisions based on attribute values.

Common decision tree algorithms include ID3, C4.5, and CART. This course focuses on the ID3 algorithm, which uses information gain as the attribute selection criterion to determine the attribute tested at each node.

Information entropy measures the purity of a data set and is defined as
$$H(D) = -\sum_{k} p_k \log_2 p_k$$
where p_k is the proportion of samples in D belonging to class k; for a set with p positive and n negative examples this is written H([p/(p+n), n/(p+n)]).

Assume A is an attribute with v possible values, so the training set is divided into v subsets, where the i-th subset contains p_i positive examples and n_i negative examples. The conditional entropy is then computed as:
$$H(D\mid A) = \sum_{i=1}^{v}\frac{p_i+n_i}{p+n}\,H\!\left(\left[\frac{p_i}{p_i+n_i},\frac{n_i}{p_i+n_i}\right]\right)$$
Information gain is the difference between the information entropy of set D and the conditional entropy of D given attribute A:
$$Gain(D,A) = H(D) - H(D\mid A)$$
When building a decision tree, we want each chosen decision attribute to separate the classes as well as possible, i.e., the resulting subsets should be as pure as possible, which means the smaller H(D|A) the better; therefore the attribute with the largest information gain is selected as the decision attribute of the current node. After the attribute for the current level has been chosen, the information gains of the remaining attributes are computed and the process is repeated until all attributes have been used or all samples have been classified, at which point the construction of the decision tree is complete. To classify a sample, start from the root node of the decision tree and follow the branch matching the sample's value of the node's attribute until a leaf node is reached, which gives the classification result.
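A minimal sketch of the entropy and information-gain computation used by ID3 (the function names are illustrative; the data is assumed to be given as a list of attribute values and a parallel list of class labels):

import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum_k p_k * log2(p_k), where p_k is the proportion of class k in `labels`."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(D, A) = H(D) - H(D|A); `values` holds attribute A's value for each sample."""
    total = len(labels)
    cond = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        cond += len(subset) / total * entropy(subset)   # weight each subset's entropy by its size
    return entropy(labels) - cond

At each node, ID3 would compare information_gain for every remaining attribute and split on the largest one.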

3. Experimental content

1. Naive Bayes algorithm

Suppose A represents the category and B, C, and D represent three features. According to the principle of the Naive Bayes algorithm, the required probability is:
$$P(A\mid BCD) = \alpha\,P(A)\,P(B\mid A)\,P(C\mid A)\,P(D\mid A)$$
Training

The agreed-upon storage structure for the probabilities computed by the Naive Bayes algorithm is:

# prior probability of each class: {class: probability}
self.label_prob = {}
# conditional probability of each feature value under each class
# internal structure: {class (good = 1 / bad = 0): {feature index: {feature value: conditional probability}}}
self.condition_prob = {}
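For instance, on a small data set with two discrete features, the trained dictionaries might look like this (all numbers are made up for illustration):

# illustrative contents after training (made-up numbers)
self.label_prob = {0: 0.4, 1: 0.6}              # P(bad), P(good)
self.condition_prob = {
    0: {0: {1: 0.50, 2: 0.50},                  # P(feature 0 = value | bad)
        1: {1: 0.25, 2: 0.75}},                 # P(feature 1 = value | bad)
    1: {0: {1: 0.90, 2: 0.10},
        1: {1: 0.70, 2: 0.30}},
}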

To compute the required probabilities, first count how many samples of each class appear in the training set and divide by the total number of samples to obtain P(A):

        # compute label_prob: the prior probability of each class
        cnt = 0     # number of samples with label 1 (good)
        num = 0     # total number of samples
        for item in label:
            num += 1
            if item == 1:
                cnt += 1
        self.label_prob[0] = (num - cnt) / num
        self.label_prob[1] = cnt / num

Next, initialize the dictionaries that store the probability of each feature value under the two classes:

        self.condition_prob[0] = {}
        self.condition_prob[1] = {}
        # initialize a dictionary for each feature's values
        for item in self.condition_prob:
            for feat in range(len(feature[0])):     # len(feature[0]) is the number of features
                self.condition_prob[item][feat] = {}

Traverse the features of all samples and count the occurrences of each feature value:

        i = 0           # sample index
        for data in feature:
            j = 0       # feature index
            for feat in data:
                if self.condition_prob[0][j].get(feat) is None:
                    self.condition_prob[0][j][feat] = 0
                if self.condition_prob[1][j].get(feat) is None:
                    self.condition_prob[1][j][feat] = 0
                if label[i] == 0:
                    self.condition_prob[0][j][feat] += 1
                else:
                    self.condition_prob[1][j][feat] += 1
                j += 1
            i += 1

Finally, divide by the number of samples in each category to get the conditional probability:

        for feat in range(len(feature[0])):
            for item in self.condition_prob[0][feat]:
                self.condition_prob[0][feat][item] /= (num-cnt)
            for item in self.condition_prob[1][feat]:
                self.condition_prob[1][feat][item] /= cnt

Prediction

The prediction step simply applies Bayes' formula. Since there are only two categories (good/bad), the two scores are computed separately:
$$P(A_{good}\mid BCD) = \alpha\,P(A_{good})\,P(B\mid A_{good})\,P(C\mid A_{good})\,P(D\mid A_{good})$$
$$P(A_{bad}\mid BCD) = \alpha\,P(A_{bad})\,P(B\mid A_{bad})\,P(C\mid A_{bad})\,P(D\mid A_{bad})$$
The values of B, C, and D are the feature values of the input sample. The prediction is computed as follows:


    def predict(self, feature):
        '''
        Predict the class of each sample and return the results.
        :param feature: ndarray containing all features of the test set
        :return: list of predicted labels
        '''
        res = []
        for item in feature:
            # start from the class priors and multiply in each conditional probability
            P_good = self.label_prob[1]
            P_bad = self.label_prob[0]
            feat_idx = 0
            for feat in item:
                P_good *= self.condition_prob[1][feat_idx][feat]
                P_bad *= self.condition_prob[0][feat_idx][feat]
                feat_idx += 1
            if P_good > P_bad:
                res.append(1)
            else:
                res.append(0)
        return res
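One possible way to exercise the class on a small hand-made data set (the class name NaiveBayesClassifier, the method name fit for the training code shown above, and the data below are all assumed for illustration; the grading platform supplies the real data):

import numpy as np

# made-up toy data: three discrete features per sample, label 1 = good melon, 0 = bad melon
train_feature = np.array([[1, 2, 1],
                          [1, 1, 1],
                          [2, 2, 2],
                          [2, 1, 2],
                          [1, 2, 2]])
train_label = np.array([1, 1, 0, 0, 1])

clf = NaiveBayesClassifier()            # assumed class wrapping the code above
clf.fit(train_feature, train_label)     # assumed name of the training method
print(clf.predict(np.array([[1, 2, 2]])))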

2. Laplace smoothing

If there are not enough samples, some feature values may never appear together with a particular class. In that case, when the Naive Bayes algorithm is used for classification, a test sample containing such a feature value gets an unreasonable prediction. For example:

(Figure: example watermelon training set)

In the bad-melon class there are no samples whose texture is fuzzy, so P(fuzzy | bad melon) = 0. If a test sample has a fuzzy texture, the computed probability of it being a bad melon is therefore 0 no matter what its other features are. Smoothing is needed to avoid this, and the most commonly used method is Laplace smoothing.

Laplace smoothing works as follows: let N be the number of categories in the training set and N_i be the number of distinct values taken by the i-th feature. During training, when computing the probability of a category, add 1 to the numerator and N to the denominator; when computing a conditional probability, add 1 to the numerator and N_i to the denominator.

That is, P(Ai) is corrected to:
$$P(A_i) = \frac{\text{number of samples in category } A_i + 1}{\text{total number of samples} + N}$$
and P(Bj|Ai) is corrected to:
$$P(B_j\mid A_i) = \frac{\text{number of samples in category } A_i \text{ with feature } B = B_j + 1}{\text{number of samples in category } A_i + N_B}$$
where N_B is the number of distinct values of feature B. The corrected priors and conditional probabilities can no longer be 0, which avoids the unreasonable situation described above.
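For example, with made-up counts: suppose the bad-melon class contains 9 samples, none of which has a fuzzy texture, and the texture feature takes 3 distinct values. Without smoothing, P(fuzzy | bad) = 0/9 = 0, while with Laplace smoothing
$$P(\text{fuzzy}\mid\text{bad}) = \frac{0+1}{9+3} = \frac{1}{12}$$
so a fuzzy-textured test sample no longer automatically receives a score of 0 for the bad-melon class.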

According to the Laplace smoothing principle, the process of calculating category probability is modified as:

        cnt = 0
        num = 0
        for item in label:
            num += 1
            if item == 1:
                cnt += 1
        types = 2   # number of classes
        # Laplace smoothing: add 1 to the numerator and the number of classes to the denominator
        self.label_prob[0] = ((num - cnt) + 1) / (num + types)
        self.label_prob[1] = (cnt + 1) / (num + types)

The initialization of the dictionaries that store the conditional probabilities does not need to change:

        self.condition_prob[0] = {}
        self.condition_prob[1] = {}
        # initialize a dictionary for each feature's values
        for item in self.condition_prob:
            for feat in range(len(feature[0])):
                self.condition_prob[item][feat] = {}

When calculating conditional probability, adjust the numerator and denominator to achieve Laplace smoothing:

        # count the occurrences of each feature value in each class
        i = 0           # sample index
        for data in feature:
            j = 0       # feature index
            for feat in data:
                if self.condition_prob[0][j].get(feat) is None:
                    self.condition_prob[0][j][feat] = 1     # add 1 to the numerator: initialize the count to 1
                if self.condition_prob[1][j].get(feat) is None:
                    self.condition_prob[1][j][feat] = 1     # add 1 to the numerator: initialize the count to 1
                if label[i] == 0:
                    self.condition_prob[0][j][feat] += 1
                else:
                    self.condition_prob[1][j][feat] += 1
                j += 1
            i += 1
        # compute the conditional probabilities: divide each count by
        # (number of samples with label 0 or 1) + (number of distinct values of the feature)
        for feat in range(len(feature[0])):
            for item in self.condition_prob[0][feat]:
                self.condition_prob[0][feat][item] /= (num - cnt) + len(self.condition_prob[0][feat])
            for item in self.condition_prob[1][feat]:
                self.condition_prob[1][feat][item] /= cnt + len(self.condition_prob[1][feat])
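As a quick sanity check (an optional addition, not part of the required implementation): with this smoothing, the probabilities over all observed values of a feature sum to 1 within each class, which can be verified at the end of the training method:

        # optional sanity check: each smoothed distribution sums to 1
        for c in self.condition_prob:
            for j in self.condition_prob[c]:
                assert abs(sum(self.condition_prob[c][j].values()) - 1.0) < 1e-9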

The prediction step is unchanged: it still applies Bayes' formula and takes the class with the highest score given the sample's features as the classification result.

3. Experimental results

Without Laplace smoothing, the prediction accuracy on the given data is above 0.8; with Laplace smoothing, the prediction accuracy on the given data is above 0.9.

4. Thinking questions

The Naive Bayes algorithm is simple to implement and has essentially no parameters to tune, but the conditional independence assumption may sacrifice prediction accuracy.

The decision tree algorithm (ID3) has no pruning strategy. Its information gain criterion prefers attributes with many distinct values (which reduce the conditional entropy); in the extreme case, an ID-like attribute that takes a distinct value for almost every sample drives the information gain close to 1. ID3 can also only handle discrete features. Building on ID3, the C4.5 algorithm uses the information gain ratio as the attribute selection criterion, which avoids ID3's bias toward attributes with many values. It also introduces a pruning strategy and can handle continuous features by discretizing them (taking the midpoint between adjacent sample values as a candidate split point and computing the information gain at each candidate split).
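For reference, the information gain ratio used by C4.5 divides the information gain by the intrinsic (split) information of the attribute:
$$GainRatio(D,A) = \frac{Gain(D,A)}{IV(A)}, \qquad IV(A) = -\sum_{i=1}^{v}\frac{|D_i|}{|D|}\log_2\frac{|D_i|}{|D|}$$
so attributes with many values, which tend to have a large IV(A), are penalized.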

For algorithms such as support vector machines and artificial neural networks, the effectiveness of model training and the resulting performance can be improved by increasing the number of iterations or decreasing the learning rate.
