Binary Classification with Decision Trees in Spark MLlib (Part 2)

Model Assessment

At the end of the last chapter we touched on model evaluation. A simple percentage of correct predictions only gives a rough picture of accuracy. For a binary classification algorithm, a better measure is the AUC (Area Under the ROC Curve). Before computing the AUC, a few concepts need to be understood:

Predicted \ Actual    actual 1    actual 0
predicted 1           TP          FP
predicted 0           FN          TN
  • True Positives (TP): predicted 1, actually 1.
  • False Positives (FP): predicted 1, actually 0.
  • True Negatives (TN): predicted 0, actually 0.
  • False Negatives (FN): predicted 0, actually 1.
  • TPR (true positive rate): the fraction of actual positives that are correctly predicted as 1.
    TPR = TP / (TP + FN)
  • FPR (false positive rate): the fraction of actual negatives that are incorrectly predicted as 1.
    FPR = FP / (FP + TN)
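These counts and rates can be sketched in plain Python; the following is a toy example with made-up (prediction, label) pairs, just to illustrate the definitions (Spark is not needed here):

```python
def confusion_counts(pairs):
    """Count TP/FP/TN/FN from (prediction, label) pairs with 0/1 values."""
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    return tp, fp, tn, fn

# Made-up predictions vs. labels
pairs = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1)]
tp, fp, tn, fn = confusion_counts(pairs)
tpr = tp / (tp + fn)   # fraction of actual 1s predicted as 1
fpr = fp / (fp + tn)   # fraction of actual 0s predicted as 1
```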

Plotting TPR against FPR yields the ROC curve, as shown below:

[Figure: ROC curve]
The AUC is the area enclosed between the ROC curve and the positive x-axis.

1. Training the model

from pyspark.mllib.tree import DecisionTree

model = DecisionTree.trainClassifier(train_data, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     impurity="entropy", maxDepth=5, maxBins=5)

2. Zipping the predictions together with the true labels

score = model.predict(validation_data.map(lambda p:p.features))
score_and_labels = score.zip(validation_data.map(lambda p:p.label))
score_and_labels.take(5)
[(0.0, 1.0), (0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

3. Importing the binary classification evaluation package

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Compute the model's AUC (the area under the ROC curve)
metrics = BinaryClassificationMetrics(score_and_labels)
metrics.areaUnderROC
0.6199946682707425

4. A helper to compute the AUC of a binary classifier

def evaluationBinaryClassification(model, validation_data):
    # Feed the validation features through the model to get predictions
    score = model.predict(validation_data.map(lambda p: p.features))
    # Build an RDD of (prediction, label) pairs
    score_and_labels = score.zip(validation_data.map(lambda p: p.label))
    # Build the evaluation metrics
    metrics = BinaryClassificationMetrics(score_and_labels)
    # Return the area under the ROC curve, i.e. the AUC
    return metrics.areaUnderROC

evaluationBinaryClassification(model,validation_data)
0.6769676269676269

5. Factors affecting model accuracy

Three parameters of the training call can be varied: impurity (the split criterion), maxDepth (the maximum depth of the tree), and maxBins (the maximum number of bins per node). Simply enumerating every combination of the three, training a model for each, and taking the best would certainly be accurate, but it is far too resource-intensive. Instead we follow the control-variable (one-variable-at-a-time) principle and explore the optimal value of each parameter separately.
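To see why full enumeration is expensive, count the candidate models for the ranges explored later in this section (2 impurities, depths 1-20, bins 2-20; treating these as the search ranges is an assumption based on the experiments below):

```python
from itertools import product

impurityList = ["gini", "entropy"]   # 2 values
maxDepthList = list(range(1, 21))    # 20 values
maxBinsList = list(range(2, 21))     # 19 values

# Full grid search: one trained model per combination
full_grid = list(product(impurityList, maxDepthList, maxBinsList))
print(len(full_grid))       # 2 * 20 * 19 = 760 models

# One-variable-at-a-time: one sweep per parameter
one_at_a_time = len(impurityList) + len(maxDepthList) + len(maxBinsList)
print(one_at_a_time)        # 2 + 20 + 19 = 41 models
```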

5.1 Influence of the impurity parameter

5.1.2 Defining the train-and-evaluate function

import time

# Train a model and evaluate it
def evaluationTrainModel(train_data, validation_data, impurity, maxDepth, maxBins):
    # Record the start time
    start_time = time.time()
    # Train a model with the given parameters
    model = DecisionTree.trainClassifier(train_data, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         impurity=impurity, maxDepth=maxDepth,
                                         maxBins=maxBins)
    # Record how long training took
    duration = time.time() - start_time
    # Compute the AUC on the validation data with the trained model
    AUC = evaluationBinaryClassification(model, validation_data)
    return (model, duration, AUC, impurity, maxDepth, maxBins)

evaluationTrainModel(train_data,validation_data,"entropy",10,10)
(DecisionTreeModel classifier of depth 10 with 591 nodes,
 0.641793966293335,
 0.6459731773005045,
 'entropy',
 10,
 10)
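The start_time / duration pattern inside the function can be factored out generically; a toy sketch (the `timed` helper is illustrative, not part of the original post):

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds).

    Same start_time / duration pattern as evaluationTrainModel above."""
    start_time = time.time()
    result = fn(*args)
    return result, time.time() - start_time

result, duration = timed(sum, range(1000))
print(result)        # 499500
```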

5.1.3 Building the list of evaluation results

maxBinsList = [10]
maxDepthList = [10]
impurityList = ["gini","entropy"]
# Enumerate the parameter combinations to build the evaluation results
metrics = [
        evaluationTrainModel(train_data,validation_data,impurity,maxDepth,maxBins)
        for maxBins in maxBinsList
        for maxDepth in maxDepthList
        for impurity in impurityList]
metrics
[(DecisionTreeModel classifier of depth 10 with 699 nodes,
  0.7073895931243896,
  0.6269453519453518,
  'gini',
  10,
  10),
 (DecisionTreeModel classifier of depth 10 with 673 nodes,
  0.7021915912628174,
  0.63500891000891,
  'entropy',
  10,
  10)]

5.1.4 Converting the results to a DataFrame

import pandas as pd
# Convert to a pandas DataFrame
df = pd.DataFrame(data=metrics, index=impurityList,
                  columns=["Model","Duration","AUC","Impurity","maxDepth","maxBins"])
df
                                                     Model  Duration       AUC Impurity  maxDepth  maxBins
gini     DecisionTreeModel classifier of depth 10 with ...  0.707390  0.626945     gini        10       10
entropy  DecisionTreeModel classifier of depth 10 with ...  0.702192  0.635009  entropy        10       10

5.1.5 Plotting the results

For some common pyplot drawing methods, see this blogger's other article:
Common pyplot drawing methods

from matplotlib import pyplot as plt
# Plot the training-time curve
plt.plot(df["Duration"], ls="-", lw=2, c="r", label="Duration", marker="o")
# Plot the AUC scores as bars
plt.bar(df["Impurity"], df["AUC"], ls="-", lw=2, color="c", label="AUC")
# Show the legend
plt.legend()
# Set the x-axis range
plt.xlim(-1, 2)
# Set the y-axis range
plt.ylim(0.6, 0.75)
# Draw the grid
plt.grid(ls=":", lw=1, c="gray")
# impurity = 'entropy'

[Figure: Duration curve and AUC bars by impurity]
The chart clearly shows that using entropy as the split criterion beats "gini" in both training time and model accuracy, so we choose "entropy" as the impurity.
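For reference, the two split criteria for a binary node with positive-class fraction p have these standard definitions (general formulas, not specific to this dataset):

```python
import math

def gini(p):
    """Gini impurity for a binary node with positive-class fraction p."""
    return 2 * p * (1 - p)

def entropy(p):
    """Entropy (base 2) for a binary node with positive-class fraction p."""
    if p in (0, 1):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(gini(0.5))     # 0.5  (maximum Gini impurity)
print(entropy(0.5))  # 1.0  (maximum entropy)
```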

5.2 Influence of the maxDepth parameter

maxBinsList = [10]
maxDepthList = [i for i in range(1,21)]
impurityList = ["entropy"]
metrics = [
        evaluationTrainModel(train_data,validation_data,impurity,maxDepth,maxBins)
        for maxBins in maxBinsList
        for maxDepth in maxDepthList
        for impurity in impurityList]
df = pd.DataFrame(data=metrics,index=maxDepthList,columns=["Model","Duration","AUC","Impurity","maxDepth","maxBins"])

# Plot the training-time curve
plt.plot(df["Duration"], ls="-", lw=2, c="r", label="Duration")
# Plot the AUC scores as bars
plt.bar(df["maxDepth"], df["AUC"], ls="-", lw=2, color="c", label="AUC", tick_label=df["maxDepth"])
# Show the legend
plt.legend()
# Set the x-axis range
plt.xlim(0, 21)
# Set the y-axis range
plt.ylim(0.4, 0.7)
# Draw the grid
plt.grid(ls=":", lw=1, c="gray")
# maxDepth = 8

[Figure: Duration curve and AUC bars by maxDepth]
The figure shows that training time grows as the tree gets deeper, but between depths 4 and 10 there is a dip in training time, and the AUC also reaches its maximum within that range, so we pick maxDepth = 8 as the optimum.

5.3 Influence of the maxBins parameter

maxBinsList = [i for i in range(2,21)]
maxDepthList = [8]
impurityList = ["entropy"]
metrics = [
        evaluationTrainModel(train_data,validation_data,impurity,maxDepth,maxBins)
        for maxBins in maxBinsList
        for maxDepth in maxDepthList
        for impurity in impurityList]
df = pd.DataFrame(data=metrics,index=maxBinsList,columns=["Model","Duration","AUC","Impurity","maxDepth","maxBins"])

# Plot the training-time curve
plt.plot(df["Duration"], ls="-", lw=2, c="r", label="Duration")
# Plot the AUC scores as bars
plt.bar(df["maxBins"], df["AUC"], ls="-", lw=2, color="c", label="AUC", tick_label=df["maxBins"])
# Show the legend
plt.legend()
# Set the x-axis range
plt.xlim(0, 21)
# Set the y-axis range
plt.ylim(0.6, 0.7)
# Draw the grid
plt.grid(ls=":", lw=1, c="gray")
# maxBins = 7

[Figure: Duration curve and AUC bars by maxBins]
The figure shows that the maximum number of bins per node has little effect on training time, which stays around 0.6 s per model, so we only need to find the point with the maximum AUC, which is maxBins = 7.
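Instead of reading the best point off the chart, pandas can return it directly; a sketch with made-up AUC values (not the actual numbers from the run above):

```python
import pandas as pd

# Hypothetical evaluation results, indexed by maxBins
df = pd.DataFrame({
    "Duration": [0.61, 0.60, 0.62],
    "AUC":      [0.63, 0.67, 0.65],
}, index=[6, 7, 8])

best_bins = df["AUC"].idxmax()   # index label of the row with the highest AUC
print(best_bins)                 # 7
```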

6. Final model results

6.1 Loading the test data file

def loadTestData(sc):
    raw_data_and_header = sc.textFile("file:/home/zh123/.jupyter/workspace/stumbleupon/test.tsv")
    # The header line of the file
    header_data = raw_data_and_header.first()
    # Drop the header
    raw_non_header_data = raw_data_and_header.filter(lambda l: l != header_data)
    # Strip the quotation marks
    raw_non_quot_data = raw_non_header_data.map(lambda s: s.replace("\"", ""))
    # The final test data, split into fields
    data = raw_non_quot_data.map(lambda l: l.split("\t"))
    # Build the category-to-index dictionary
    categories_dict = data.map(lambda fields: fields[3]).distinct().zipWithIndex().collectAsMap()
    # The category dictionary from training was not saved, so pad it with two
    # extra entries here, otherwise an error is raised
    categories_dict.update({"t1": -1, "t2": -1})

    label_point_rdd = data.map(lambda fields: (
                        fields[0],
                        extract_features(fields, categories_dict, len(fields))
    ))
    return label_point_rdd
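The distinct().zipWithIndex().collectAsMap() chain simply assigns a running index to each distinct category; a pure-Python equivalent with made-up rows (note that Spark does not guarantee the same ordering):

```python
# Toy rows mimicking the parsed TSV fields; column 3 holds the category
rows = [["id1", "u", "b", "recreation"],
        ["id2", "u", "b", "business"],
        ["id3", "u", "b", "recreation"]]

# dict.fromkeys keeps first-seen order (Python 3.7+); enumerate indexes it
categories_dict = {c: i for i, c in enumerate(dict.fromkeys(r[3] for r in rows))}
print(categories_dict)   # {'recreation': 0, 'business': 1}
```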

test_file_data = loadTestData(sc)
print(test_file_data.take(1))
print(test_file_data.count())
[('http://www.lynnskitchenadventures.com/2009/04/homemade-enchilada-sauce.html', array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 4.43906000e-01, 2.55813954e+00,
       3.89705882e-01, 2.57352941e-01, 4.41176470e-02, 2.20588240e-02,
       4.89572471e-01, 0.00000000e+00, 0.00000000e+00, 6.71428570e-02,
       0.00000000e+00, 2.30285215e-01, 1.99438202e-01, 1.00000000e+00,
       1.00000000e+00, 1.50000000e+01, 0.00000000e+00, 5.64300000e+03,
       1.36000000e+02, 3.00000000e+00, 2.42647059e-01, 8.05970150e-02]))]
3171

6.2 Training the final model

# Train a model with the finally confirmed parameters
model = evaluationTrainModel(train_data, validation_data, impurity="entropy", maxDepth=8, maxBins=7)[0]

6.3 Predicting 10 randomly selected rows

# Randomly sample about 10 rows from the test set
for f in test_file_data.randomSplit([10, 3171-10])[0].collect():
    # Print the page URL and the predicted result
    print(f[0], bool(model.predict(f[1])))
http://culinarycory.com/2008/12/30/pear-apple-crumb-pie/ True
http://mobile.seriouseats.com/2011/03/ramen-hacks-30-easy-ways-to-upgrade-your-instant-noodles-japanese-what-to-do-with-ramen.html False
http://blogs.babble.com/family-kitchen/2011/10/03/halloween-hauntingly-beautiful-candied-apples/ True
http://www.ivillage.com/paprika-potato-frittata-0/3-r-60973 True
http://www.goodlifeeats.com/2011/06/chocolate-covered-brownie-ice-cream-sandwich-recipe.html False
http://redux.com/stream/item/2071196/Two-Women-Fight-for-a-Parking-Spot-Then-Brilliance-Happens False
http://news.boloji.com/2008/10/25084.htm False
http://azoovo.com/a-corporate-web-design/ True
http://funstuffcafe.com/need-to-want-less False
http://www.insidershealth.com/article/nestle_cookie_dough_recall/3601 False
http://bakingbites.com/category/recipes/bar-cookies/ True

The output above is our prediction, in the format (url, whether the page is a long-term recommendation). You can click the links yourself to judge whether each page really looks like long-term content and check it against the predicted result.

That wraps up this page-classification project!
If there are problems in the post, corrections from the experts are welcome; questions can also go to the comments or a private message!
Most importantly: like, comment, and favorite, thank you!



Origin blog.csdn.net/qq_42359956/article/details/105417692