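Spam Email Detection with the Naive Bayes Algorithm: Supplement

We originally set max_features = 5000. As a tuning exercise, we now look at how the bag-of-words vocabulary limit max_features affects the result by measuring accuracy as max_features goes from 1000 up to 20000 in steps of 2000. The function is as follows: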

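import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def show_diffrent_max_features():
    # get_features_by_wordbag() (from the previous post) reads the global max_features
    global max_features
    a = []  # max_features values tried
    b = []  # accuracy obtained for each value
    for i in range(1000, 20000, 2000):
        max_features = i
        print("max_features=%d" % i)
        x, y = get_features_by_wordbag()
        # 60% of the data for training, 40% for testing, fixed seed for repeatability
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
        gnb = GaussianNB()
        gnb.fit(x_train, y_train)
        y_pred = gnb.predict(x_test)
        score = metrics.accuracy_score(y_test, y_pred)
        a.append(max_features)
        b.append(score)
    # Draw the curve once, after all points have been collected
    plt.plot(a, b, 'r', label="accuracy")
    plt.xlabel("max_features")
    plt.ylabel("metrics.accuracy_score")
    plt.title("metrics.accuracy_score VS max_features")
    plt.legend()
    plt.show()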

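Add this function to the code from the previous post, "垃圾邮件识别-朴素贝叶斯算法" (Spam Email Detection with the Naive Bayes Algorithm), and call it from main():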

show_diffrent_max_features()

Output:

(Figure: metrics.accuracy_score VS max_features)

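The plot shows that accuracy generally rises as max_features grows, and so does the running time of the whole system. Once max_features exceeds 13000, accuracy starts to drop instead, so setting max_features to about 15000 gives an accuracy close to 96.7%. In practice, however, computation becomes noticeably slow once max_features goes above 5000; at max_features=5000 the accuracy is 95.5%.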
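
One likely reason for the longer running time is the size of the feature matrix. The sketch below is only a rough estimate and rests on assumptions not stated in the post: that get_features_by_wordbag() returns a dense array (GaussianNB needs dense input), that each count is stored as a 64-bit integer (the dtype shown in the CountVectorizer output), and that the data set holds about 5172 emails, a figure inferred from the 2069 test samples at test_size=0.4. The helper name dense_matrix_mib is hypothetical.

```python
def dense_matrix_mib(n_samples, n_features, bytes_per_value=8):
    """Approximate size of a dense n_samples x n_features matrix, in MiB."""
    return n_samples * n_features * bytes_per_value / 2**20

# Assumption: roughly 5172 emails in total (inferred, not stated in the post).
N_SAMPLES = 5172

for max_features in (5000, 15000):
    size = dense_matrix_mib(N_SAMPLES, max_features)
    print("max_features=%d -> about %.0f MiB" % (max_features, size))
```

Under these assumptions the matrix grows linearly with max_features, so tripling the vocabulary from 5000 to 15000 triples the memory that every fit and predict call has to scan.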

With max_features set to 15000, the program prints:

Hello spam-mail
get_features_by_wordbag
Load C:/Users/Administrator/PycharmProjects/tensortflow快速入门/tensorflow_study\MNIST_data_bak/enron1/ham/
Load C:/Users/Administrator/PycharmProjects/tensortflow快速入门/tensorflow_study\MNIST_data_bak/enron1/spam/
CountVectorizer(analyzer='word', binary=False, decode_error='ignore',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=15000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents='ascii', token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
NB and wordbag
0.9671338811019816
[[1419   44]
 [  24  582]]
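
The accuracy printed above can be checked by hand from the confusion matrix. The sketch below assumes the matrix follows the scikit-learn convention (rows are true classes, columns are predicted classes) and that class 0 is ham and class 1 is spam; neither is stated in the post, so the precision and recall figures it prints are only meaningful under that labeling.

```python
# Confusion matrix copied from the run above.
# Assumption: rows = true class, columns = predicted class, 0 = ham, 1 = spam.
cm = [[1419, 44],
      [24, 582]]

tn, fp = cm[0]  # true ham: predicted ham, predicted spam
fn, tp = cm[1]  # true spam: predicted ham, predicted spam
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all test emails classified correctly
precision = tp / (tp + fp)     # share of predicted spam that really is spam
recall = tp / (tp + fn)        # share of real spam that was caught

print("test emails:", total)
print("accuracy:  %.4f" % accuracy)
print("precision: %.4f" % precision)
print("recall:    %.4f" % recall)
```

The diagonal entries sum to 2001 out of 2069 test emails, which reproduces the reported accuracy of about 0.9671.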


Reposted from blog.csdn.net/zqzq19950725/article/details/86607730