decision tree classification
Decision tree classification is a supervised learning method that classifies a data set layer by layer according to feature values. Its advantages are low computational complexity and results that can be presented intuitively, but it is also prone to overfitting. With the ID3 algorithm, the first step of decision tree classification is to select the feature that best splits the data set, and then recursively build the classification tree. Information gain is used to select the best splitting feature.
information gain
After a data set is divided according to some feature, the change in information is called information gain. The information gain is calculated for every feature, and the feature that yields the largest information gain is the best feature for dividing the current data set. The calculation is based on the Shannon entropy of a data set, given by the following formula:

H(D) = -(p1*log2(p1) + p2*log2(p2) + ... + pn*log2(pn))

where pi is the probability (proportion) of the i-th class. For example, if a data set has 10 records, 3 belonging to class A and 7 to class B, then according to the formula its entropy is -(0.3*log2(0.3) + 0.7*log2(0.7)) ≈ 0.8813.
define a function that computes this entropy value
from math import log

def calcuShannon(dataset):
    dataset_row = len(dataset)      # number of rows in the dataset
    labels = {}                     # dictionary counting each class label
    for featVec in dataset:
        label = featVec[-1]         # the class label is the last column of each row
        if label not in labels:
            labels[label] = 0
        labels[label] += 1          # count of records per class
    shannon = 0.0
    for key in labels:
        pro = float(labels[key]) / dataset_row
        shannon -= pro * log(pro, 2)
    return shannon                  # return the Shannon entropy
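The entropy function can be sanity-checked against the 3-of-10 / 7-of-10 example above. It is redefined here so the snippet runs on its own; the dummy feature column (all zeros) is made up, since only the last-column labels matter:

```python
from math import log

def calcuShannon(dataset):
    # Shannon entropy of the class labels stored in the last column of each row
    dataset_row = len(dataset)
    labels = {}
    for featVec in dataset:
        label = featVec[-1]
        labels[label] = labels.get(label, 0) + 1
    shannon = 0.0
    for key in labels:
        pro = float(labels[key]) / dataset_row
        shannon -= pro * log(pro, 2)
    return shannon

# 3 records of class 'A' and 7 of class 'B', as in the worked example
data = [[0, 'A']] * 3 + [[0, 'B']] * 7
print(round(calcuShannon(data), 4))  # 0.8813
```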
As a practical example, consider the following data set with four features X1, X2, X3, X4.
Next, calculate the entropy after splitting on the X2 feature.

X2 takes the values 1 and 0. Selecting the rows where X2 = 1 gives a sub-matrix of 3 records, 2 in one class and 1 in the other.

Calculate S1 = -((2/3)*log2(2/3) + (1/3)*log2(1/3)) ≈ 0.9183

Selecting the rows where X2 = 0 gives a sub-matrix of 3 records that all belong to the same class.

Calculate S2 = -(1*log2(1)) = 0

The weighted entropy after the split is S = (3/6)*S1 + (3/6)*S2 ≈ 0.4591, and the information gain of X2 is the entropy of the whole data set minus S.
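The arithmetic for the X2 split can be checked directly; the fractions below are taken from the example above:

```python
from math import log

# Branch X2 == 1: three records, split 2-to-1 between the classes
s1 = -((2/3) * log(2/3, 2) + (1/3) * log(1/3, 2))

# Branch X2 == 0: three records, all in the same class, so the entropy is 0
s2 = -(1 * log(1, 2))

# Entropy after the split, weighted by the size of each branch
s = (3/6) * s1 + (3/6) * s2
print(round(s1, 4), round(s, 4))  # 0.9183 0.4591
```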
The above is the entropy obtained by splitting on X2. The information gain of every feature is computed in a loop, and the feature with the largest gain is chosen as the best splitting feature; the data set is then divided into sub-matrices by that feature's values. Repeat this process until all features have been used, or until all records in a subset belong to the same class.
# Split out the rows of the data matrix whose column `axis` equals `value`
def splitdataset(dataset, axis, value):
    redataset = []
    for featVec in dataset:
        if featVec[axis] == value:
            reducedfeatVec = featVec[:axis]            # columns before the split feature
            reducedfeatVec.extend(featVec[axis + 1:])  # columns after it (the feature column is dropped)
            redataset.append(reducedfeatVec)
    return redataset
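A quick usage check of splitdataset (redefined here so the snippet is self-contained; the three toy rows are made up for illustration):

```python
def splitdataset(dataset, axis, value):
    # Keep the rows whose `axis` column equals `value`, dropping that column
    redataset = []
    for featVec in dataset:
        if featVec[axis] == value:
            reducedfeatVec = featVec[:axis] + featVec[axis + 1:]
            redataset.append(reducedfeatVec)
    return redataset

data = [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitdataset(data, 0, 1))  # [['yes'], ['yes']]
```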
# Given a data set, return the index of the best splitting feature
def getbestfeature(dataset):
    numfeature = len(dataset[0]) - 1        # number of features (the last column is the label)
    baseEntropy = calcuShannon(dataset)     # entropy of the whole dataset
    bestinfoGain = 0.0
    bestfeature = -1
    for i in range(numfeature):
        featlist = [data_row[i] for data_row in dataset]  # values of the i-th column
        uniqueVals = set(featlist)          # remove duplicate values
        newEntropy = 0.0
        for value in uniqueVals:
            # divide the dataset according to column i and this value
            subdataset = splitdataset(dataset, i, value)  # get the sub-dataset
            prob = len(subdataset) / float(len(dataset))
            newEntropy += prob * calcuShannon(subdataset)
        infoGain = baseEntropy - newEntropy  # information gain of the i-th feature
        if infoGain > bestinfoGain:
            bestinfoGain = infoGain
            bestfeature = i
    return bestfeature                       # column index of the best splitting feature
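The same selection logic can be sketched compactly. The shorter names entropy/split/best_feature and the four toy rows are mine; the first column perfectly separates the classes, so it should be chosen:

```python
from math import log

def entropy(rows):
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    return -sum(c / len(rows) * log(c / len(rows), 2) for c in counts.values())

def split(rows, axis, value):
    return [r[:axis] + r[axis + 1:] for r in rows if r[axis] == value]

def best_feature(rows):
    base = entropy(rows)
    gains = []
    for i in range(len(rows[0]) - 1):
        new = sum(len(split(rows, i, v)) / len(rows) * entropy(split(rows, i, v))
                  for v in set(r[i] for r in rows))
        gains.append(base - new)    # information gain of column i
    return gains.index(max(gains))

data = [[1, 1, 'yes'], [1, 0, 'yes'], [0, 1, 'no'], [0, 0, 'no']]
print(best_feature(data))  # 0
```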
Recursively generate a sub-classification tree:
import operator

# When all features have been used but the remaining labels do not all belong
# to the same class, call this function to return the majority class
def majorityCnt(classlist):
    classcount = {}
    for vote in classlist:
        if vote not in classcount:
            classcount[vote] = 0
        classcount[vote] += 1
    sortedclasscount = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedclasscount[0][0]

def createTree(dataset, labels):
    classList = [example[-1] for example in dataset]     # the label column
    if classList.count(classList[0]) == len(classList):  # all labels belong to the same class
        return classList[0]
    if len(dataset[0]) == 1:                             # only the label column is left
        return majorityCnt(classList)
    bestfeat = getbestfeature(dataset)                   # index of the best splitting feature
    bestfeatlabel = labels[bestfeat]                     # its feature name
    del labels[bestfeat]                                 # remove the used feature from the label list
    classifyTree = {bestfeatlabel: {}}
    bestfeatColValue = [example[bestfeat] for example in dataset]
    uniqueValue = set(bestfeatColValue)
    for value in uniqueValue:
        sublabels = labels[:]
        # recursive call: build the subtree for each value of the best feature
        classifyTree[bestfeatlabel][value] = createTree(splitdataset(dataset, bestfeat, value), sublabels)
    return classifyTree
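Putting it all together on a made-up four-row data set (the helper functions are compact re-implementations of the ones above so the snippet runs standalone; the feature names X1/X2 follow the earlier example):

```python
from math import log
from collections import Counter

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log(c / len(rows), 2) for c in counts.values())

def split(rows, axis, value):
    return [r[:axis] + r[axis + 1:] for r in rows if r[axis] == value]

def best_feature(rows):
    base = entropy(rows)
    gains = [base - sum(len(split(rows, i, v)) / len(rows) * entropy(split(rows, i, v))
                        for v in set(r[i] for r in rows))
             for i in range(len(rows[0]) - 1)]
    return gains.index(max(gains))

def create_tree(rows, labels):
    classes = [r[-1] for r in rows]
    if classes.count(classes[0]) == len(classes):   # pure subset: stop splitting
        return classes[0]
    if len(rows[0]) == 1:                           # no features left: majority vote
        return Counter(classes).most_common(1)[0][0]
    best = best_feature(rows)
    name = labels[best]
    sublabels = labels[:best] + labels[best + 1:]
    return {name: {v: create_tree(split(rows, best, v), sublabels)
                   for v in set(r[best] for r in rows)}}

data = [[1, 1, 'yes'], [1, 0, 'yes'], [0, 1, 'no'], [0, 0, 'no']]
tree = create_tree(data, ['X1', 'X2'])
print(tree)  # {'X1': {0: 'no', 1: 'yes'}}
```

Here X1 alone determines the class, so the tree has a single split on X1 and each branch is a leaf.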