AdaBoost machine learning algorithm and code for binary classification (Part 2)

Brief introduction

The previous article (see the link there) described the process of training the strong classifier;
this article adds another piece of code that classifies data with that strong classifier and verifies the classification results.


1. Create the data set and labels
from numpy import *  # all snippets below rely on NumPy (mat, ones, zeros, shape, inf, log, exp, sign, ...)

def create_dataMat():
    dataMat = mat([[0],[1],[2],[3],[4],[5]]) # x: one feature per sample
    labels = [1, 1, -1, -1, 1, -1] # y: class labels
    return dataMat, labels
2. Classify data with a single-level decision tree (a decision stump)

dataMatrix: the data to classify
dimen: the feature dimension (column) used for the split
threshVal: the threshold
threshIneq: comparison mode; 'lt' marks samples at or below the threshold as -1, 'gt' marks samples above the threshold as -1

def stumpClassify(dataMatrix,dimen,threshVal,threshIneq):
    retArray = ones((shape(dataMatrix)[0],1))
    if threshIneq == 'lt': # classify by the chosen comparison direction
        retArray[dataMatrix[:,dimen] <= threshVal] = -1.0   # 'lt': samples at or below the threshold become -1, the rest stay +1
    else:
        retArray[dataMatrix[:,dimen] > threshVal] = -1.0    # 'gt': samples above the threshold become -1, the rest stay +1
    return retArray
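
As a quick check (an illustrative run, using the toy data from step 1), a threshold of 2.5 on dimension 0 shows how the two modes simply flip which side of the threshold gets -1:

dataMat, labels = create_dataMat()
print(stumpClassify(dataMat, 0, 2.5, 'lt').T)  # x = 0,1,2 -> -1 and x = 3,4,5 -> +1
print(stumpClassify(dataMat, 0, 2.5, 'gt').T)  # x = 0,1,2 -> +1 and x = 3,4,5 -> -1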
3. Build the best single-level tree, i.e., the weak classifier

Unlike the previous article, the candidate thresholds are no longer the midpoints between adjacent data points; instead they are generated from the feature's minimum value as minimum + j * stepSize for each step j (a small sketch follows below).
The function returns the weak classifier: the best stump, its weighted error rate, and its class estimates.
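
For example (a minimal sketch of the threshold generation, assuming the toy feature values 0-5 and the numSteps = 10 used below), the candidate thresholds run from one step below the minimum up to the maximum:

rangeMin, rangeMax, numSteps = 0.0, 5.0, 10.0
stepSize = (rangeMax - rangeMin) / numSteps  # 0.5
print([rangeMin + j * stepSize for j in range(-1, int(numSteps) + 1)])  # [-0.5, 0.0, 0.5, ..., 5.0]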

def buildStump(dataMat,labels,D):
    labelMat = mat(labels).T # transpose the labels into a column vector
    m,n = shape(dataMat) # m = number of samples, n = number of features per sample
    numSteps = 10.0
    bestStump = {} # stores the best decision stump found
    bestClasEst = mat(zeros((m,1))) # best class estimates, initialized to all zeros: [[0],[0],[0],...]
    minError = inf # minimum weighted error, initialized to infinity
    for i in range(n): # iterate over all features
        rangeMin = dataMat[:,i].min()   # minimum value of this feature column
        rangeMax = dataMat[:,i].max()   # maximum value of this feature column
        stepSize = (rangeMax-rangeMin)/numSteps  # step size

        for j in range(-1,int(numSteps)+1): # let the thresholds start just below the feature's minimum and end at its maximum
            for inequal in ['lt', 'gt']: # try both 'lt' and 'gt', i.e., whether samples above the threshold get +1 or -1
                threshVal = (rangeMin + float(j) * stepSize)   # threshold = minimum + j * stepSize
                predictedVals = stumpClassify(dataMat,i,threshVal,inequal) # classify with the decision stump

                errArr = mat(ones((m,1)))  # 1 = misclassified, 0 = correct; initialized to all wrong
                errArr[predictedVals == labelMat] = 0  # vectorized comparison: correctly classified samples are set to 0
                weightedError = D.T*errArr  # weight the errors by the sample weights D

                if weightedError < minError: # if the weighted error decreases, update the best stump
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump,minError,bestClasEst
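
A quick sanity check (an illustrative call, using the toy data from step 1 and the uniform weights AdaBoost starts with):

dataMat, labels = create_dataMat()
m = shape(dataMat)[0]
D = mat(ones((m, 1)) / m)  # uniform weights, 1/6 each
bestStump, minError, bestClasEst = buildStump(dataMat, labels, D)
print(bestStump, minError)  # should report dim 0, a threshold of 1.0 with 'gt', and weighted error 1/6 (only x = 4 is misclassified)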
4. Use AdaBoost to build the strong classifier; numIt is the maximum number of iterations

Returns the set of weak classifiers. Each weak classifier's parameters are: dim (feature dimension), thresh (threshold), ineq (comparison mode, 'lt' or 'gt') and alpha (its weight).
The weak classifiers, combined by their alpha weights, form the strong classifier.

def adaBoostTrainDS(dataMat,classLabels,numIt=40):
    weakClassArr = []
    m = shape(dataMat)[0]
    D = mat(ones((m,1)) / m) # initialize the sample weights D with equal weight for every sample: [[1/m],[1/m],[1/m],...]
    aggClassEst = mat(zeros((m,1)))   # aggregated class estimate for each sample, initialized to 0
    for i in range(numIt):
        bestStump,error,classEst = buildStump(dataMat,classLabels,D) # build a decision stump; returns the best stump, its weighted error and its class estimates
        #print("D:",D.T)
        alpha = float(0.5*log((1.0-error)/max(error,1e-16)))  # classifier weight alpha; max(error,1e-16) guards against division by zero
        bestStump['alpha'] = alpha   # store alpha in the stump dictionary as well
        weakClassArr.append(bestStump) # save the weak classifier
        #print("classEst: ",classEst.T)
        expon = multiply(-1*alpha*mat(classLabels).T,classEst) # exponent -alpha * y_i * h(x_i), used to update the weights D
        D = multiply(D,exp(expon)) # update D for the next iteration
        D = D/D.sum()

        # accumulate the estimates until the training error is 0 or the iteration limit is reached
        aggClassEst += alpha*classEst # vector addition of the weighted estimates
        #print("aggClassEst: ",aggClassEst.T)
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T,ones((m,1))) # 1 where the aggregated prediction is wrong, 0 where it is correct
        errorRate = aggErrors.sum()/m # training error rate
        #print("total error: ",errorRate)
        if errorRate == 0.0: break # stop early once the training error rate reaches 0
    return weakClassArr  # the set of weak classifiers
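
To make the weight update concrete, here is the arithmetic of the first boosting round on the 6-sample toy set (an illustrative calculation that reuses the NumPy log/exp imported in step 1 and assumes the first stump misclassifies only x = 4, with weighted error 1/6):

error = 1.0 / 6                          # weighted error of the first stump
alpha = 0.5 * log((1 - error) / error)   # 0.5 * ln(5), roughly 0.80
w_wrong = (1.0 / 6) * exp(alpha)         # the misclassified sample's weight is multiplied by e^alpha
w_right = (1.0 / 6) * exp(-alpha)        # each correctly classified sample's weight is multiplied by e^-alpha
Z = w_wrong + 5 * w_right                # normalization constant, i.e. D.sum()
print(w_wrong / Z, w_right / Z)          # 0.5 and 0.1: the misclassified sample now carries half the total weight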
5. Use the AdaBoost classifier to classify data

The result is the weighted sum of the weak classifiers; the class is decided by its sign: a positive sum returns +1, a negative sum returns -1.

def adaClassify(datToClass, classifierArr):
    dataMatrix = mat(datToClass) # convert the data to classify into a matrix
    m = shape(dataMatrix)[0] # number of samples to classify
    aggClassEst = mat(zeros((m,1))) # aggregated class estimates, initialized to 0

    # the result is the weighted sum of the weak classifiers
    for i in range(len(classifierArr)): # apply each weak classifier (all vectorized operations)
        classEst = stumpClassify(dataMatrix, classifierArr[i]['dim'],classifierArr[i]['thresh'],classifierArr[i]['ineq'])
        aggClassEst += classifierArr[i]['alpha']*classEst   # add this weak classifier's weighted estimate

        print(aggClassEst)
    return sign(aggClassEst)   # the class comes from the sign of the result: +1 for positive, -1 for negative
6. Call the functions: train the classifier, then use it to classify data
if __name__ == '__main__':
    # 1. Train the classifier
    dataMat, labels = create_dataMat()
    weakClassArr = adaBoostTrainDS(dataMat, labels)
    # 2. Test data
    for i in range(6):
        res = adaClassify([i], weakClassArr)
        print('data: %d, class: %2d' % (i, res))

Suppose the full data set and labels are:

dataMat = mat([[0],[1],[2],[3],[4],[5],[6],[7]])
labels = [1, 1, -1, -1, 1, -1, 1, 1]


Example 1: use the first 6 samples, i.e. x = 0-5, to verify that the classification is correct

Data set:
dataMat = mat([[0],[1],[2],[3],[4],[5]])
labels = [1, 1, -1, -1, 1, -1]

The predicted classes match the original labels, so the classification is correct. (The original post shows a console screenshot here.)
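
For reference (reconstructed from the claim above and the print format in step 6, ignoring the intermediate aggClassEst values that adaClassify also prints), the final lines should look roughly like:

data: 0, class:  1
data: 1, class:  1
data: 2, class: -1
data: 3, class: -1
data: 4, class:  1
data: 5, class: -1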


Example 2: use all 8 samples, i.e. x = 0-7, and verify the classification again

Data set:
dataMat = mat([[0],[1],[2],[3],[4],[5],[6],[7]])
labels = [1, 1, -1, -1, 1, -1, 1, 1]

Result: the sample x = 5 has label -1 and is also predicted as -1; the classification is correct. (The original post shows a console screenshot here.)
As the output shows, it takes seven iterations for the training error rate to reach its minimum, so seven weak classifiers are obtained;
their alpha-weighted sum gives the final prediction.

Source: blog.csdn.net/gm_Ergou/article/details/90731551