decision tree classification

  Decision tree classification is a supervised learning method that splits the data set layer by layer according to feature values. Its advantages are low computational complexity and results that can be presented intuitively, but it is also prone to overfitting. Using the ID3 algorithm, the first step is to select the feature that best splits the data set, and then recursively build the classification tree. Information gain is used to find the best splitting feature.

  information gain

  After the data set is divided according to a certain feature, the change in information is called information gain. The information gain is computed for every feature, and the feature with the largest information gain is the best feature for splitting the current data set. Information gain is based on the Shannon entropy of a data set, which is calculated with the following formula:

  H = -(p1*log2(p1) + p2*log2(p2) + ... + pn*log2(pn))

  where pi is the probability of the i-th class. For example, if a data set has 10 records, 3 records belong to class A and 7 records belong to class B, then according to the formula the entropy is -(0.3*log2(0.3) + 0.7*log2(0.7)).

  

Define a function that computes the Shannon entropy of a data set:

from math import log

def calcuShannon(dataset):
    dataset_row = len(dataset)     #get the number of rows in the data set
    labels = {}                    #dictionary counting how many rows belong to each class
    for featVec in dataset:
        label = featVec[-1]        #the class label is the last column of each row
        if label not in labels:
            labels[label] = 0
        labels[label] += 1         #count the number of rows in each class
    shannon = 0.0

    for key in labels:
        prob = float(labels[key]) / dataset_row
        shannon -= prob * log(prob, 2)
    return shannon   #return the Shannon entropy of the data set
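
As a quick check, the function can be called on a small data set. The records below are hypothetical (only the class label in the last column matters here) and reproduce the 3-to-7 split from the example above:

toy = [[1, 'A'], [0, 'A'], [1, 'A'],
       [1, 'B'], [0, 'B'], [1, 'B'], [0, 'B'], [1, 'B'], [0, 'B'], [1, 'B']]
print(calcuShannon(toy))   #about 0.881, the same as -(0.3*log2(0.3) + 0.7*log2(0.7))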

 As a practical example, consider a data set of six records with four features X1, X2, X3, X4 and a class label column.

Next, calculate the entropy after splitting on the X2 feature.

X2 takes the values 1 and 0. Splitting out the rows where X2 = 1 leaves a subset of three records, two of one class and one of the other, so

S1 = -((2/3)*log2(2/3) + (1/3)*log2(1/3))

Splitting out the rows where X2 = 0 leaves three records that all belong to the same class, so

S2 = -(1*log2(1)) = 0

The weighted entropy after the split is S = (3/6)*S1 + (3/6)*S2, and the information gain of X2 is the entropy of the whole data set minus S.
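
As a minimal sketch, assuming only the class proportions given above, the same numbers can be reproduced directly:

from math import log

S1 = -((2/3) * log(2/3, 2) + (1/3) * log(1/3, 2))   #entropy of the X2 = 1 subset, about 0.918
S2 = -(1 * log(1, 2))                               #entropy of the X2 = 0 subset, exactly 0
S = (3/6) * S1 + (3/6) * S2                         #weighted entropy after splitting on X2, about 0.459
#the information gain of X2 is the entropy of the whole data set minus S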

The above is the information gain obtained by splitting on X2. The same value is computed for every feature in a loop, and the feature with the largest information gain is the best splitting feature; the data set is then divided into sub-data-sets by that feature. This process is repeated until all features have been used, or all records in a subset belong to the same class. For example:

#Split the data set: keep the rows whose value in column `axis` equals `value`, and remove that column
def splitdataset(dataset, axis, value):
    redataset = []
    for featVec in dataset:
        if featVec[axis] == value:
            reducedfeatVec = featVec[:axis]            #features before the split column
            reducedfeatVec.extend(featVec[axis + 1:])  #features after the split column
            redataset.append(reducedfeatVec)
    return redataset
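
A short usage sketch with a hypothetical six-record data set (the original table is not reproduced here, so the rows below are made up; the last column is the class label):

dataset = [[1, 1, 0, 1, 'A'],
           [0, 1, 1, 0, 'A'],
           [1, 1, 1, 1, 'B'],
           [1, 0, 0, 0, 'B'],
           [0, 0, 1, 1, 'B'],
           [0, 0, 0, 0, 'B']]
#keep only the rows where X2 (column index 1) equals 1, and drop that column
print(splitdataset(dataset, 1, 1))   #[[1, 0, 1, 'A'], [0, 1, 0, 'A'], [1, 1, 1, 'B']]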

 

#Take the data set and return the best splitting feature
def getbestfeature(dataset):
    numfeature = len(dataset[0]) - 1       #number of features (the last column is the label)
    baseEntropy = calcuShannon(dataset)    #entropy of the whole data set
    bestinfoGain = 0.0
    bestfeature = -1

    for i in range(numfeature):
        featlist = []
        for data_row in dataset:
            featlist.append(data_row[i])   #collect the values of the i-th column
        uniqueVals = set(featlist)         #remove duplicate values
        newEntropy = 0.0
        for value in uniqueVals:           #split the data set on column i and each of its values
            subdataset = splitdataset(dataset, i, value)   #get the sub-data-set
            prob = len(subdataset) / float(len(dataset))
            newEntropy += prob * calcuShannon(subdataset)  #weighted entropy after splitting on column i

        infoGain = baseEntropy - newEntropy   #information gain of the i-th feature

        if infoGain > bestinfoGain:
            bestinfoGain = infoGain
            bestfeature = i
    return bestfeature   #return the column index of the best splitting feature
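
With the same hypothetical data set, the best splitting feature can be found like this:

print(getbestfeature(dataset))   #1, i.e. the X2 column, matching the hand calculation above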

 

Recursively generate a sub-classification tree:

 

import operator

#When all features have been used but the remaining labels do not all belong to the same class,
#this function returns the class that occurs most often
def majorityCnt(classlist):
    classcount = {}
    for vote in classlist:
        if vote not in classcount:
            classcount[vote] = 0
        classcount[vote] += 1

    sortedclasscount = sorted(classcount.items(),
                              key=operator.itemgetter(1),
                              reverse=True)
    return sortedclasscount[0][0]
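
For example, with a small list of labels:

print(majorityCnt(['A', 'B', 'B', 'A', 'B']))   #'B', the most frequent class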

def createTree(dataset, labels):
    classList = [example[-1] for example in dataset]      #class label of every row
    if classList.count(classList[0]) == len(classList):   #all labels belong to the same class
        return classList[0]

    if len(dataset[0]) == 1:                               #only the label column is left
        return majorityCnt(classList)

    bestfeat = getbestfeature(dataset)      #index of the best splitting feature

    bestfeatlabel = labels[bestfeat]        #name of the best splitting feature

    del(labels[bestfeat])                   #remove the used feature name
    classifyTree = {bestfeatlabel: {}}


    bestfeatColValue = [example[bestfeat] for example in dataset]

    uniqueValue = set(bestfeatColValue)

    for value in uniqueValue:
        sublabels = labels[:]
        classifyTree[bestfeatlabel][value] = createTree(splitdataset(dataset, bestfeat, value),
                                                        sublabels)   #recursively build each branch

    return classifyTree
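
A short usage sketch with the hypothetical six-record data set from above (the feature names X1 to X4 are assumed):

labels = ['X1', 'X2', 'X3', 'X4']
tree = createTree(dataset, labels[:])   #pass a copy, since createTree deletes used feature names
print(tree)   #a nested dict, e.g. {'X2': {0: 'B', 1: {...}}} for this data set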
