The idea behind meta-algorithms is to combine other algorithms; AdaBoost is the most popular meta-algorithm and one of the most powerful tools in machine learning. The complete code for this chapter (a toy data set, a tab-delimited file loader, the decision-stump classifier, the AdaBoost training loop, the ensemble classifier, and an ROC-plotting routine) is listed below:

```python
from numpy import *

def loadSimpData():
    datMat = matrix([[1. , 2.1],
                     [2. , 1.1],
                     [1.3, 1. ],
                     [1. , 1. ],
                     [2. , 1. ]])
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
    return datMat, classLabels

def loadDataSet(fileName):      # general function to parse tab-delimited floats
    numFeat = len(open(fileName).readline().split('\t'))  # get number of fields
    dataMat = []; labelMat = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr = []
        curLine = line.strip().split('\t')
        for i in range(numFeat - 1):
            lineArr.append(float(curLine[i]))
        dataMat.append(lineArr)
        labelMat.append(float(curLine[-1]))
    return dataMat, labelMat

def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):  # just classify the data
    retArray = ones((shape(dataMatrix)[0], 1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:, dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:, dimen] > threshVal] = -1.0
    return retArray

def buildStump(dataArr, classLabels, D):
    dataMatrix = mat(dataArr); labelMat = mat(classLabels).T
    m, n = shape(dataMatrix)
    numSteps = 10.0; bestStump = {}; bestClasEst = mat(zeros((m, 1)))
    minError = inf                      # init error sum, to +infinity
    for i in range(n):                  # loop over all dimensions
        rangeMin = dataMatrix[:, i].min(); rangeMax = dataMatrix[:, i].max()
        stepSize = (rangeMax - rangeMin) / numSteps
        for j in range(-1, int(numSteps) + 1):  # loop over all thresholds in current dimension
            for inequal in ['lt', 'gt']:        # go over less than and greater than
                threshVal = (rangeMin + float(j) * stepSize)
                predictedVals = stumpClassify(dataMatrix, i, threshVal, inequal)
                errArr = mat(ones((m, 1)))
                errArr[predictedVals == labelMat] = 0
                weightedError = D.T * errArr    # calc total error multiplied by D
                # print("split: dim %d, thresh %.2f, thresh inequal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError))
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump, minError, bestClasEst

def adaBoostTrainDS(dataArr, classLabels, numIt=40):
    weakClassArr = []
    m = shape(dataArr)[0]
    D = mat(ones((m, 1)) / m)           # init D to all equal
    aggClassEst = mat(zeros((m, 1)))
    for i in range(numIt):
        bestStump, error, classEst = buildStump(dataArr, classLabels, D)  # build stump
        # print("D:", D.T)
        alpha = float(0.5 * log((1.0 - error) / max(error, 1e-16)))  # calc alpha, max(error, eps) guards against error=0
        bestStump['alpha'] = alpha
        weakClassArr.append(bestStump)  # store stump params in array
        # print("classEst: ", classEst.T)
        expon = multiply(-1 * alpha * mat(classLabels).T, classEst)  # exponent for D calc
        D = multiply(D, exp(expon))     # calc new D for next iteration
        D = D / D.sum()
        # calc training error of all classifiers, if this is 0 quit for loop early (use break)
        aggClassEst += alpha * classEst
        # print("aggClassEst: ", aggClassEst.T)
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T, ones((m, 1)))
        errorRate = aggErrors.sum() / m
        print("total error: ", errorRate)
        if errorRate == 0.0:
            break
    return weakClassArr, aggClassEst

def adaClassify(datToClass, classifierArr):
    dataMatrix = mat(datToClass)        # do stuff similar to last aggClassEst in adaBoostTrainDS
    m = shape(dataMatrix)[0]
    aggClassEst = mat(zeros((m, 1)))
    for i in range(len(classifierArr)):
        classEst = stumpClassify(dataMatrix, classifierArr[i]['dim'],
                                 classifierArr[i]['thresh'],
                                 classifierArr[i]['ineq'])   # call stump classify
        aggClassEst += classifierArr[i]['alpha'] * classEst
        print(aggClassEst)
    return sign(aggClassEst)

def plotROC(predStrengths, classLabels):
    import matplotlib.pyplot as plt
    cur = (1.0, 1.0)                    # cursor
    ySum = 0.0                          # variable to calculate AUC
    numPosClas = sum(array(classLabels) == 1.0)
    yStep = 1 / float(numPosClas); xStep = 1 / float(len(classLabels) - numPosClas)
    sortedIndicies = predStrengths.argsort()   # get sorted index, it's reversed
    fig = plt.figure()
    fig.clf()
    ax = plt.subplot(111)
    # loop through all the values, drawing a line segment at each point
    for index in sortedIndicies.tolist()[0]:
        if classLabels[index] == 1.0:
            delX = 0; delY = yStep
        else:
            delX = xStep; delY = 0
            ySum += cur[1]
        # draw line from cur to (cur[0]-delX, cur[1]-delY)
        ax.plot([cur[0], cur[0] - delX], [cur[1], cur[1] - delY], c='b')
        cur = (cur[0] - delX, cur[1] - delY)
    ax.plot([0, 1], [0, 1], 'b--')
    plt.xlabel('False positive rate'); plt.ylabel('True positive rate')
    plt.title('ROC curve for AdaBoost horse colic detection system')
    ax.axis([0, 1, 0, 1])
    plt.show()
    print("the Area Under the Curve is: ", ySum * xStep)
```
The combination can take several forms: combining different algorithms, combining the same algorithm trained under different settings, or assigning different parts of the data set to different classifiers and combining their results.
Pros: low generalization error, easy to code, works with most classifiers, no parameters to adjust
Cons: sensitive to outliers
Works with: numeric values and nominal values
Bagging is a technique that builds S new data sets by sampling S times from the original data set. Each new data set is the same size as the original and is built by randomly drawing samples from the original data set with replacement, so a given sample may be selected repeatedly while other samples may not appear at all.
After the S data sets are built, a learning algorithm is applied to each one to obtain S classifiers. To classify new data, we apply all S classifiers and take a vote among their outputs; the class that receives the most votes is the final classification result.
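The bootstrap-and-vote procedure just described can be sketched in a few lines. This is a minimal illustration rather than part of this chapter's listing; bagging_train, bagging_predict, and the train_fn interface (a routine that fits a classifier on NumPy arrays with +1/-1 labels and returns a prediction function) are assumed names.

```python
import numpy as np

def bagging_train(X, y, train_fn, S=10, seed=None):
    # Build S bootstrap samples (same size as the original, drawn with
    # replacement) and fit one classifier on each.
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(S):
        idx = rng.integers(0, n, size=n)          # repeated indices are allowed
        models.append(train_fn(X[idx], y[idx]))   # train_fn returns a predict function
    return models

def bagging_predict(models, X):
    # Majority vote: with +1/-1 labels, the sign of the summed votes is the
    # class chosen by most of the S classifiers.
    votes = sum(m(X) for m in models)
    return np.sign(votes)
```

With +1/-1 labels a tied vote would give a sign of 0, so in practice an odd S or an explicit tie-breaking rule is used.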
A more advanced version of bagging is the random forest.
Boosting is a technique similar to bagging, but while the classifiers in bagging are trained independently of one another, boosting trains them serially: each new classifier is obtained by focusing on the data that the existing classifiers have misclassified.
The output of boosting is a weighted sum of the results of all the classifiers. In bagging the classifier weights are equal, while in boosting they differ: each weight reflects how well the corresponding classifier performed in the previous round of training.
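This weighted combination is exactly what adaClassify does in the listing above. Stripped to its essence it looks like the sketch below, where each weak classifier is assumed to be a pair of its weight alpha and a +1/-1 prediction function (boosted_predict is an illustrative name, not part of the chapter's code):

```python
import numpy as np

def boosted_predict(weak_classifiers, X):
    # weak_classifiers: list of (alpha, predict_fn) pairs.
    # Classifiers with larger alpha (lower training error) contribute more to
    # the aggregated score; the final label is the sign of that score.
    agg = np.zeros(len(X))
    for alpha, predict in weak_classifiers:
        agg += alpha * predict(X)
    return np.sign(agg)
```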
AdaBoost is one kind of boosting.
The AdaBoost algorithm can be briefly described in three steps:
(1) First, initialize the weight distribution D1 over the training data. Assuming there are N training samples, each sample is given the same weight at the start: w1 = 1/N.
(2) Then, train a weak classifier hi. During training, if a sample is classified correctly by hi, its weight is decreased when building the training set for the next round; conversely, if it is misclassified, its weight is increased. The sample set with the updated weights is then used to train the next classifier, and the whole training process continues iteratively.
(3) Finally, combine the weak classifiers obtained in each round into a strong classifier. Once training is complete, increase the weight of each weak classifier with a small classification error rate so that it plays a larger role in the final classification function, and decrease the weight of each weak classifier with a large classification error rate so that it plays a smaller role.
In other words, a weak classifier with a low error rate receives a larger weight in the final classifier, and one with a high error rate receives a smaller weight.
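Steps (2) and (3) correspond to the alpha and D updates inside adaBoostTrainDS in the listing above. A minimal sketch of a single round of that update (update_weights is an illustrative name; labels and pred are +1/-1 NumPy arrays and error is the weak classifier's weighted error):

```python
import numpy as np

def update_weights(D, labels, pred, error):
    # Classifier weight: the lower the weighted error, the larger alpha.
    alpha = 0.5 * np.log((1.0 - error) / max(error, 1e-16))
    # labels * pred is +1 for correctly classified samples and -1 for
    # misclassified ones, so correct samples are scaled by exp(-alpha)
    # and misclassified samples by exp(+alpha).
    D = D * np.exp(-alpha * labels * pred)
    return D / D.sum(), alpha            # renormalize so D stays a distribution
```

The listing at the top of the section can be exercised on the toy data set it defines, for example:

```python
datMat, classLabels = loadSimpData()
classifierArr, aggClassEst = adaBoostTrainDS(datMat, classLabels, numIt=9)
print(adaClassify([[0, 0], [5, 5]], classifierArr))
```

Here numIt=9 is an arbitrary cap; training stops earlier if the aggregate training error reaches zero.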