Ensemble methods - AdaBoost code

     An ensemble algorithm combines different classifiers; the result of the combination is called an ensemble method or meta-algorithm. Ensembling can take many forms: combining different algorithms, combining the same algorithm under different settings, or assigning different parts of the data set to different classifiers and combining their results.
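As a rough illustration of the first form (combining different algorithms), the sketch below uses scikit-learn's VotingClassifier; scikit-learn, the toy dataset, and the choice of base estimators are assumptions for demonstration only, not part of this post's code.

# Hypothetical illustration: combine different algorithms by majority vote.
# Assumes scikit-learn is installed; the post itself only uses numpy.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('dt', DecisionTreeClassifier()),
                ('knn', KNeighborsClassifier())],
    voting='hard')          # 'hard' = plain majority vote over the three classifiers
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))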

Two forms:

Bagging method: S new data sets are built by sampling from the original data set with replacement S times, and a learning algorithm is then applied to each of them, yielding S classifiers. To classify a new example, every classifier is applied and the category that receives the most votes is taken as the final result. The classifiers are trained independently of one another, and each classifier's vote carries equal weight.

Example: random forest
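A minimal sketch of the bagging idea, assuming a hypothetical train(data, labels) function that returns a callable classifier and a two-class {-1, +1} labelling; neither is part of the post's code.

import numpy as np

def bagging_predict(dataArr, labels, train, query, S=10):
    # Train S classifiers on bootstrap samples and combine them by equal-weight majority vote.
    # dataArr is assumed to be a numpy array; `train` is a hypothetical placeholder.
    m = len(labels)
    classifiers = []
    for _ in range(S):
        idx = np.random.randint(0, m, m)             # draw m samples with replacement
        classifiers.append(train(dataArr[idx], np.array(labels)[idx]))
    votes = sum(clf(query) for clf in classifiers)   # each vote is +1 or -1
    return np.sign(votes)                            # majority vote, equal weights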

Boosting method: multiple classifiers are trained sequentially; each new classifier is obtained by focusing on the data that the existing classifiers have misclassified. The final boosting result is a weighted sum of all the classifiers' outputs, and the weights are not equal: each weight reflects how well its classifier performed in the previous round of iteration.

Examples: AdaBoost, GBDT

The idea of AdaBoost:

    1. Each sample in the training data is given a weight; the weights are initialized to equal values and collected into a vector D.

    2. A weak classifier is first trained on the training data and its error rate is computed; then a weak classifier is trained again on the same data set. For this second round, the weight of every sample is re-adjusted: the weights of samples that were classified correctly in the first round are decreased, and the weights of samples that were misclassified are increased.

    3. To combine the results of all the weak classifiers into a final classification, AdaBoost assigns a weight alpha to each classifier. Each alpha value is computed from that weak classifier's error rate ε:

        alpha = (1/2) * ln((1 - ε) / ε)

    4. After alpha has been computed, the weight vector D can be updated so that the weights of correctly classified samples decrease and the weights of misclassified samples increase.

Correct classification:

        D_i^(t+1) = D_i^(t) * e^(-alpha) / sum(D)

Misclassification:

        D_i^(t+1) = D_i^(t) * e^(alpha) / sum(D)
After D has been computed, the next iteration repeats the training-and-weight-adjustment process, until either the error rate on the training set reaches 0 or the number of weak classifiers reaches the user-specified limit. A small numeric sketch of one such round follows.
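A minimal sketch of one round of these updates; the labels, predictions, and weights below are made-up illustration values, not taken from the post.

import numpy as np

# Toy quantities for one boosting round (illustrative values only).
labels      = np.array([ 1.0,  1.0, -1.0, -1.0,  1.0])
predictions = np.array([-1.0,  1.0, -1.0, -1.0,  1.0])   # first sample misclassified
D = np.ones(5) / 5                                        # step 1: equal initial weights

error = D[predictions != labels].sum()                    # weighted error rate epsilon
alpha = 0.5 * np.log((1.0 - error) / max(error, 1e-16))   # step 3: classifier weight

D = D * np.exp(-alpha * labels * predictions)             # step 4: shrink correct, grow wrong
D = D / D.sum()                                           # renormalize so the weights sum to 1
print(alpha, D)   # the misclassified sample now carries more weight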

Code:

import numpy as np
import matplotlib.pyplot as plt
def loadSimpData():
    dataMat=np.matrix([[1.,2.1],
                  [1.5,1.6],
                   [1.3,1.],
                   [1.,1.],
                   [2.,1.]
                  ])
    classLabels=[1.0 , 1.0 , -1.0 ,-1.0 ,1.0]
    return dataMat,classLabels
# Array filtering: split the data into two opposite classes around a threshold
def  stumpClassify(dataMatrix,dimen,threshVal,threshIneq): # dimen: feature index, threshVal: threshold, threshIneq: 'lt' or 'gt'
    retArray=np.ones((np.shape(dataMatrix)[0],1))  # initialize all elements to 1
    if threshIneq=='lt':
        retArray[dataMatrix[:,dimen]<= threshVal]=-1.0
    else:
        retArray[dataMatrix[:,dimen]> threshVal]=-1.0
    return retArray
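For example, a quick sanity check of stumpClassify on the toy data (a hypothetical snippet, not in the original post):

dataMat, classLabels = loadSimpData()
# Threshold the first feature at 1.3: samples whose feature 0 is <= 1.3 are labelled -1.
print(stumpClassify(dataMat, 0, 1.3, 'lt').T)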

Build a single-level decision tree (a decision stump): find the feature, threshold, and inequality with the smallest weighted error rate.

def buildStump(dataArr, classLabels,D): # the best stump is defined with respect to the data weight vector D
    dataMatrix=np.mat(dataArr);labelMat=np.mat(classLabels).T
    m,n=np.shape(dataMatrix)
    numSteps=10.0;bestStump={};bestClasEst= np.mat(np.zeros((m,1)))  # bestStump starts as an empty dict
    minError = float('inf') # initialize to infinity; used to track the smallest weighted error found
    for i in range(n): # iterate over all features
        # compute the step size from the feature's value range
        rangeMin = dataMatrix[:,i].min();rangeMax = dataMatrix[:,i].max()
        stepSize = (rangeMax-rangeMin)/numSteps # largest step size
        #
        for j in range(-1,int(numSteps)+1):
            # try both "less than" and "greater than" thresholding
            for inequal in ['lt','gt']:
                threshVal=(rangeMin+float(j)*stepSize) # candidate threshold
                predictedVals = stumpClassify(dataMatrix,i,threshVal,inequal)
                # compute the weighted error rate
                errArr= np.mat(np.ones((m,1)))
                errArr[predictedVals== labelMat]=0
                weightedError=D.T*errArr  
                print("split:dim %d, thresh %.2f ,thresh ineqal : %s, the weighted error is %.3f" % (i, threshVal,inequal,weightedError))
                # keep the stump with the lowest weighted error so far
                if weightedError<minError:
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump['dim']=i
                    bestStump['thresh']=threshVal
                    bestStump['ineq']=inequal
    return bestStump,minError,bestClasEst
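For example, running buildStump on the toy data with uniform weights (again a hypothetical check, not part of the original post):

dataMat, classLabels = loadSimpData()
D = np.mat(np.ones((5,1)) / 5)                              # uniform initial weights
bestStump, minError, bestClasEst = buildStump(dataMat, classLabels, D)
print(bestStump, minError)                                  # the stump with the lowest weighted error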
The classifier's error rate is computed on the first training pass; training then continues, with the sample weights adjusted after each round.
def adaBoostTrains(dataArr,classLabels,numIt=40):
    weakClassArr = []
    m=np.shape(dataArr)[0]
    D=np.mat(np.ones((m,1))/m)             # start with equal sample weights
    aggClassEst=np.mat(np.zeros((m,1)))    # running weighted class estimate for every sample
    for i in range (numIt):
        # use buildStump() to find the best single-level decision tree for the current weights D
        bestStump,error,classEst = buildStump(dataArr,classLabels,D)
        print("D: ",D.T)
        alpha=float(0.5*np.log((1.0-error)/max(error,1e-16)))  # alpha formula; max() keeps the division from blowing up when there is no error
        bestStump['alpha']=alpha
        weakClassArr.append(bestStump)     # store this weak classifier in the list
        print("classEst:", classEst.T)     # this stump's class estimates
        # update the weight distribution D
        expon=np.multiply(-1*alpha*np.mat(classLabels).T,classEst) # same sign if classified correctly, opposite sign if not, matching the update formula
        D=np.multiply(D,np.exp(expon))
        D=D/D.sum()   
        # accumulate alpha_i * classEst_i
        aggClassEst += alpha*classEst
        print("aggClassEst :" ,aggClassEst.T)
        # sign() turns aggClassEst into an m*1 matrix of [1,-1,...]; comparing it with the labels gives [1,0,...],
        # where 1 marks a misclassified sample, so summing gives the number of errors
        aggErrors=np.multiply(np.sign(aggClassEst)!= np.mat(classLabels).T,np.ones((m,1)))
        # compute the error rate
        errorRate = aggErrors.sum()/m
        print("total error:",errorRate,"\n")
        if errorRate == 0.0 :break
    return weakClassArr,aggClassEst
dataArr,classLabels=loadSimpData()
weakClassArr,aggClassEst = adaBoostTrains(dataArr,classLabels)
print(weakClassArr)
print(aggClassEst)
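The post stops at training and does not include a classification function. A minimal sketch of how the trained stumps could be applied to new data, built only on stumpClassify and the stump dictionaries returned by adaBoostTrains (the function name adaClassify and the query points are assumptions):

def adaClassify(dataToClass, classifierArr):
    # Apply every trained stump, weight its vote by alpha, and take the sign of the sum.
    dataMatrix = np.mat(dataToClass)
    m = np.shape(dataMatrix)[0]
    aggClassEst = np.mat(np.zeros((m,1)))
    for stump in classifierArr:
        classEst = stumpClassify(dataMatrix, stump['dim'], stump['thresh'], stump['ineq'])
        aggClassEst += stump['alpha']*classEst
    return np.sign(aggClassEst)

print(adaClassify([[5., 5.], [0., 0.]], weakClassArr))  # classify two new points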

Output result: each iteration prints D, classEst, aggClassEst, and the total error, and training stops once the total error reaches 0.

Origin: blog.csdn.net/qq_28409193/article/details/79609040