Machine learning: decision trees (case study: predicting whether an online-shop user buys a computer)

What is a decision tree?

A decision tree is a flowchart-like tree structure: each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class or a class distribution.
(Figure: an example decision tree that decides whether to play, based on the weather.)

The decision tree is an important classification algorithm in machine learning.

It is very intuitive and reliable on small-scale data sets. It is generally used on discrete data; continuous data (for example, ages from 1 to 100) must first be segmented, say 1-25 as "young", 25-50 as "middle-aged", and 50+ as "old". In practice this segmentation behaves much like a hyperparameter and has a large impact on the results, so discretizing continuous data is a topic in its own right. A small sketch of such a segmentation is shown below.
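As an illustration, here is a minimal sketch of one way to segment an age column before building the tree (the bin edges and labels are just the example thresholds above, and pandas is assumed to be available):

import pandas as pd

# Example ages; in practice this would be a column of the training data.
ages = pd.Series([12, 23, 31, 47, 55, 68])

# Segment the continuous ages into the three coarse groups described above.
# The bin edges behave like hyperparameters: moving them changes the resulting tree.
age_groups = pd.cut(ages, bins=[0, 25, 50, 120],
                    labels=["young", "middle_aged", "old"])
print(age_groups.tolist())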

Basic algorithm for constructing a decision tree

The concept of entropy:

The amount of information carried by a message is directly related to its uncertainty: to figure out something highly uncertain, or something we know nothing about, we need a great deal of information, so the amount of information can be measured by the uncertainty. This is quantified by the entropy

Info(D) = -∑ p_i · log2(p_i)

where p_i is the probability of the i-th class: the greater the uncertainty of the variable, the greater the entropy.
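As a minimal sketch (plain Python, standard library only; the label lists are toy examples, not yet the shop data), the formula can be coded directly:

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

# The more mixed the labels, the higher the entropy.
print(entropy(["yes"] * 9 + ["no"] * 5))   # ~0.940 bits
print(entropy(["yes"] * 14))               # 0.0 bits: no uncertainty at all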

The concept of Information Gain:

Information gain: Gain(A) = Info(D) - Info_A(D), where Info(D) is the entropy of the whole data set D and Info_A(D) is the expected entropy that remains after splitting D on attribute A.
Still looks abstract? Let's work through an example.

The problem: determine whether a user browsing an online shop will buy a computer

Collect user data to obtain a small-scale sample, which is exactly what decision trees are suited to. (A table of the 14 sample records was shown here.)

The first step is to compute the entropy of the class label, i.e. whether the user purchases. With 9 "yes" and 5 "no" samples:

Info(D) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.940 bits

We then compute the expected entropy after splitting on age. Among the 14 samples there are 5 young, 4 middle-aged and 5 old users, so

Info_age(D) = (5/14)·Info(D_young) + (4/14)·Info(D_middle-aged) + (5/14)·Info(D_old)

where each term is the entropy of the class label within that age group, weighted by the group's share of the sample.

Next, compute the information gain of age: Gain(age) = Info(D) - Info_age(D). Computing the gain of every attribute in the same way, we select the attribute with the largest information gain as the root node. (Figures showing the gains of the individual attributes were shown here.)

Repeating the same root-node selection on each branch yields the ID3 decision tree. (A figure of the final tree was shown here.)
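To make the numbers concrete, here is a small Python sketch of the gain-of-age calculation. The yes/no counts inside each age group are the ones from the classic AllElectronics example (young: 2 yes / 3 no, middle-aged: 4 yes / 0 no, old: 3 yes / 2 no); substitute your own counts if your sample differs:

from math import log2

def entropy(pos, neg):
    """Entropy of a two-class node containing `pos` yes and `neg` no samples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # treat 0 * log2(0) as 0
            p = count / total
            result -= p * log2(p)
    return result

info_d = entropy(9, 5)                 # entropy of the whole sample, ~0.940 bits

# (Assumed) yes/no counts inside each age group, as in the classic example.
groups = {"young": (2, 3), "middle_aged": (4, 0), "old": (3, 2)}

info_age = sum((pos + neg) / 14 * entropy(pos, neg) for pos, neg in groups.values())
gain_age = info_d - info_age           # ~0.246: the largest gain, so age becomes the root
print(round(info_d, 3), round(info_age, 3), round(gain_age, 3))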

Algorithm idea:

The tree starts with a single node that represents the training sample (step 1).
If the samples are all in the same class, the node becomes a leaf and is labeled with that class (steps 2 and 3).
Otherwise, the algorithm uses an entropy-based measure called information gain as its heuristic and selects the attribute that best separates the samples into classes (step 6). This attribute becomes the "test" or "decision" attribute of the node (step 7). In this version of the algorithm,
all attributes are categorical, i.e. take discrete values; continuous attributes must be discretized first.
For each known value of the test attribute, create a branch and divide the sample accordingly (steps 8-10).
The algorithm applies the same process recursively to form a decision tree for the samples in each partition. Once an attribute has appeared at a node, it need not be considered again at any of that node's descendants (step 13).
The recursive division step only stops when one of the following conditions is true:
(a) All samples at a given node belong to the same class (steps 2 and 3).
(b) There are no remaining attributes that can be used to further divide the sample (step 4). In this case, majority voting is used (step 5).
This involves converting the given node into a leaf and labeling it with the class to which most of its samples belong; alternatively, the class distribution of the node's samples can be stored.
(c) The branch test_attribute = a_i contains no samples (step 11). In this case, a leaf is created and labeled with the majority class of the samples (step 12).
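The steps above can be condensed into a short recursive sketch. This is only an illustration of the idea (the list-of-dicts data format and the helper names are my own choices, not anything from the original post):

from collections import Counter
from math import log2

def entropy(rows, label):
    """Entropy of the class label over a list of dict-shaped samples."""
    counts = Counter(r[label] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def info_gain(rows, attr, label):
    """Gain(attr) = Info(D) - Info_attr(D)."""
    total = len(rows)
    split = Counter(r[attr] for r in rows)
    remainder = sum(n / total * entropy([r for r in rows if r[attr] == v], label)
                    for v, n in split.items())
    return entropy(rows, label) - remainder

def id3(rows, attrs, label):
    classes = [r[label] for r in rows]
    if len(set(classes)) == 1:                 # (a) all samples in one class -> leaf
        return classes[0]
    if not attrs:                              # (b) no attributes left -> majority vote
        return Counter(classes).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, label))   # step 6
    tree = {best: {}}
    for value in set(r[best] for r in rows):   # steps 8-10: one branch per known value
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, [a for a in attrs if a != best], label)
    return tree

Calling id3(rows, attributes, label_column) on the sample table returns a nested dict whose top-level key is the root attribute (age in this example).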

Other algorithms (in essence, they select a suitable split node by changing how the attribute-selection measure is calculated):

C4.5 (Quinlan)
Classification and Regression Trees, CART (L. Breiman, J. Friedman, R. Olshen, C. Stone)
Common points: all are greedy, top-down algorithms.
Difference: they use different attribute-selection measures: ID3 uses information gain, C4.5 uses the gain ratio, and CART uses the Gini index.
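As a small illustration of how the measures differ on the same class distribution (the 9-yes / 5-no split from the example; this only shows the impurity of a node, not a full gain-ratio or Gini-gain computation):

from math import log2

def entropy(probs):
    """Impurity used by ID3 / C4.5 (in bits)."""
    return -sum(p * log2(p) for p in probs if p)

def gini(probs):
    """Impurity used by CART."""
    return 1 - sum(p * p for p in probs)

# Class distribution of the example sample: 9 "yes" and 5 "no".
probs = [9 / 14, 5 / 14]
print(round(entropy(probs), 3))   # ~0.940
print(round(gini(probs), 3))      # ~0.459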

Tree pruning (avoid overfitting)

Pre-pruning: stop growing the tree early, according to some criterion that limits how fine-grained the tree may become.
Post-pruning: first grow the complete tree (using all attributes), then prune away the leaves or subtrees that are too fine-grained.
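In scikit-learn (the library used in the code below), pre-pruning corresponds to growth limits such as max_depth or min_samples_leaf, and post-pruning to cost-complexity pruning via ccp_alpha (available in scikit-learn 0.22+). A minimal sketch on the built-in iris data, not the shop data:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: constrain the tree while it is being grown.
pre_pruned = DecisionTreeClassifier(criterion="entropy",
                                    max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning: grow the full tree, then cut it back by cost-complexity pruning.
path = DecisionTreeClassifier(criterion="entropy").cost_complexity_pruning_path(X, y)
post_pruned = DecisionTreeClassifier(criterion="entropy",
                                     ccp_alpha=path.ccp_alphas[-2]).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())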

Summary

Advantages of decision trees:

Intuitive, easy to understand, effective for small-scale data sets

Disadvantages of decision trees:

It does not handle continuous variables well.
When there are many classes, the error grows quickly.
Scalability is limited (it works best with a small amount of data and few attributes).

I hope you will give this a thumbs up after reading. If I have made any mistakes, please point them out so that I can correct them and improve.

# This ID3 example is built directly on scientific libraries (scikit-learn), so the detailed
# logic of the code and of the data conversion is not explained at length here.
# Ctrl+hover over any method you do not understand in the IDE to see its detailed usage help.
# If anything is still unclear, leave me a comment and I will reply.
from sklearn.feature_extraction import DictVectorizer
import csv
from sklearn import tree
from sklearn import preprocessing
# from sklearn.externals.six import StringIO
import numpy

# Read in the csv file and put features into list of dict and list of class label
allElectronicsData = open('AllElectronics.csv', 'rt')
reader = csv.reader(allElectronicsData)
headers = next(reader)

print(headers)

featureList = []
labelList = []

for row in reader:
    labelList.append(row[len(row)-1])  # the last value in each row is the class label
    rowDict = {}
    for i in range(1, len(row)-1):
        rowDict[headers[i]] = row[i]
    featureList.append(rowDict)

print(featureList)

# Vectorize the categorical features into one-hot (dummy) columns
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()

print("dummyX: " + str(dummyX))
print(vec.get_feature_names())

print("labelList: " + str(labelList))

# vectorize class labels
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY: " + str(dummyY))

# Using decision tree for classification
# clf = tree.DecisionTreeClassifier()
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf: " + str(clf))


# Visualize model
with open("allElectronicInformationGainOri.dot", 'w') as f:
    tree.export_graphviz(clf, feature_names=vec.get_feature_names(), out_file=f)

# Take the first training sample and modify it to build a new query sample
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))

newRowX = oneRowX.copy()   # copy so the original training row is not modified in place
newRowX[0] = 1             # flip two of the one-hot encoded feature columns
newRowX[2] = 0
print("newRowX: " + str(newRowX))
newRowX = numpy.array(newRowX).reshape(1, -1)   # predict() expects a 2-D array
predictedY = clf.predict(newRowX)
print("predictedY: " + str(predictedY))




Origin blog.csdn.net/LiaoNailin/article/details/108638477