Tumor prediction (AdaBoost)-Python machine learning

Experiment content

This experiment uses the AdaBoost algorithm to predict whether a tumor is malignant or benign, based on the Wisconsin breast cancer data set.

Experimental requirements

1. Load the data set that comes with sklearn and explore the data in DataFrame form.
2. Split the data into a training set and a test set, and check the average cancer incidence rate in each.
3. Configure the models, then train, predict, and evaluate.
(1) Construct a decision tree weak learner with a maximum depth of 2, then train, predict, and evaluate.
(2) Construct an AdaBoost ensemble classifier containing 50 trees, then train, predict, and evaluate.
Reference: increase the number of decision trees from 1 to 50 in steps of 3, and output the accuracy of each ensemble.
(3) Compare the performance of (2) with that of the weak learner.
4. Draw a line chart of accuracy. The x-axis is the number of decision trees and the y-axis is accuracy.
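
Requirement 2 can be sketched on its own as follows. This is a minimal sketch under stated assumptions: the helper name malignant_rate, the stratify=y argument, and random_state=0 are choices made for this example and are not part of the original experiment. In this data set the label 0 means malignant and 1 means benign, so the incidence rate is the share of zeros.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

breast = load_breast_cancer()
X, y = breast.data, breast.target


def malignant_rate(labels):
    """Share of malignant cases (label 0 in this data set). Hypothetical helper."""
    return float(np.mean(labels == 0))


# stratify=y keeps the class proportions nearly equal in both subsets (assumption:
# the original experiment used a plain random split without stratification).
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("Malignant rate, full data:", round(malignant_rate(y), 3))
print("Malignant rate, training set:", round(malignant_rate(y_tr), 3))
print("Malignant rate, test set:", round(malignant_rate(y_te), 3))
```

Without stratification the two rates can drift apart from run to run, which makes accuracy figures on the test set harder to compare.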

AdaBoostClassifier parameter explanation

  • base_estimator: the weak classifier. The default is a CART classification tree (DecisionTreeClassifier) with max_depth=1. In scikit-learn 1.2 this parameter was renamed to estimator, and the old name was removed in 1.4.
  • algorithm: scikit-learn has implemented two AdaBoost classification algorithms, SAMME and SAMME.R. SAMME is Discrete AdaBoost: each weak classifier contributes its predicted class label. SAMME.R is Real AdaBoost: each weak classifier contributes a real value derived from its predicted class probabilities instead of a discrete label. SAMME.R generally converges in fewer iterations than SAMME and was the default algorithm when this experiment was written, so base_estimator must be a classifier that supports probabilistic predictions. Recent scikit-learn releases have deprecated SAMME.R, so check the documentation for your installed version.
  • n_estimators: the maximum number of boosting iterations, default 50. When tuning, n_estimators and the learning rate learning_rate are usually considered together.
  • learning_rate: the shrinkage coefficient $\nu$ applied to the weight of each weak classifier: $f_k(x) = f_{k-1}(x) + \nu \alpha_k G_k(x)$, where $G_k$ is the k-th weak classifier and $\alpha_k$ is its weight. A smaller $\nu$ means more iterations are needed to reach the same fit. The default is 1, which means no shrinkage is applied.
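
The additive update controlled by learning_rate can be made concrete with a small from-scratch sketch of Discrete AdaBoost for two classes. This is an illustration under stated assumptions, not scikit-learn's implementation: the helper names fit_discrete_adaboost and decision_function, the -1/+1 label encoding, the depth-1 stumps, and the fixed random_state values are all choices made for this example. Each round adds nu * alpha_k * G_k(x) to the running score, with alpha_k = 0.5 * ln((1 - err_k) / err_k).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_discrete_adaboost(X, y_signed, n_rounds=20, nu=1.0):
    """Return a list of (step, stump) pairs; y_signed must contain only -1 and +1."""
    n_samples = X.shape[0]
    sample_weight = np.full(n_samples, 1.0 / n_samples)
    ensemble = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y_signed, sample_weight=sample_weight)
        pred = stump.predict(X)
        # Weighted error of this weak learner (the weights sum to 1).
        err = np.sum(sample_weight * (pred != y_signed))
        err = np.clip(err, 1e-10, 1 - 1e-10)  # avoid division by zero / log(0)
        alpha = 0.5 * np.log((1 - err) / err)
        step = nu * alpha  # shrunken weight nu * alpha_k
        ensemble.append((step, stump))
        # Up-weight misclassified samples, down-weight correct ones, renormalize.
        sample_weight *= np.exp(-step * y_signed * pred)
        sample_weight /= sample_weight.sum()
    return ensemble


def decision_function(ensemble, X):
    """f_K(x) = sum over k of nu * alpha_k * G_k(x)."""
    return sum(step * stump.predict(X) for step, stump in ensemble)


X, y = load_breast_cancer(return_X_y=True)
y_signed = 2 * y - 1  # map labels {0, 1} to {-1, +1}
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y_signed, test_size=0.2, stratify=y_signed, random_state=0
)

ensemble = fit_discrete_adaboost(X_tr, y_tr, n_rounds=20, nu=1.0)
pred_te = np.where(decision_function(ensemble, X_te) >= 0, 1, -1)
test_accuracy = float(np.mean(pred_te == y_te))
print("Test accuracy of the hand-written ensemble:", round(test_accuracy, 3))
```

Calling fit_discrete_adaboost with a smaller nu, for example nu=0.1, shrinks every step, which is why a lower learning_rate is usually paired with a larger n_estimators.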

Experimental code

# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier  # AdaBoost ensemble classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# 1. Load the data set that ships with sklearn and explore it as a DataFrame.
breast=load_breast_cancer()
data=breast['data']
target=breast['target']
feature_names=breast['feature_names']
df=pd.DataFrame(data,columns=feature_names)
print(df.head())      # first five rows
df.info()             # data set summary
print(df.describe())  # descriptive statistics
 
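# 2. Split into a training set and a test set, and check the average cancer incidence rate in each.
train_X,test_X,train_y,test_y=train_test_split(data,target,test_size=0.2)
# In this data set, target 0 = malignant and 1 = benign.
print("Malignant rate (training set):",round(np.mean(train_y==0),3))
print("Malignant rate (test set):",round(np.mean(test_y==0),3))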
 
# 3. Configure the models, then train, predict, and evaluate.

'''
# Baseline: predict with a default AdaBoost model and print the evaluation report
AdaBoost1=AdaBoostClassifier()
AdaBoost1.fit(train_X,train_y)
pred1=AdaBoost1.predict(test_X)
print("Model accuracy:",metrics.accuracy_score(test_y,pred1))
print("Model evaluation report:\n",metrics.classification_report(test_y,pred1))
'''

# (1) Build a single decision tree weak learner with a maximum depth of 2; train, predict, and evaluate.
weak_tree=DecisionTreeClassifier(max_depth=2)
weak_tree.fit(train_X,train_y)
pred2=weak_tree.predict(test_X)
print("Weak learner accuracy:",metrics.accuracy_score(test_y,pred2))
print("Weak learner evaluation report:\n",metrics.classification_report(test_y,pred2))
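# (2) Build an AdaBoost ensemble classifier with 50 trees; train, predict, and evaluate.
# The weak learner is passed positionally because its keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2.
# The "step size of 3" refers to the tree-count sweep in step 4, not to learning_rate,
# so learning_rate keeps its default of 1.
AdaBoost3=AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),n_estimators=50)
AdaBoost3.fit(train_X,train_y)
pred3=AdaBoost3.predict(test_X)
print("AdaBoost accuracy:",metrics.accuracy_score(test_y,pred3))
print("AdaBoost evaluation report:\n",metrics.classification_report(test_y,pred3))
# Reference: increase the number of decision trees from 1 to 50 in steps of 3
# and output the accuracy of each ensemble (done in step 4).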
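# (3) Compare the performance of (2) with the weak learner.
# With 0/1 labels, the mean squared error equals the misclassification rate.
print("Weak learner mean squared error:",round(metrics.mean_squared_error(test_y,pred2),2))
print("AdaBoost ensemble mean squared error:",round(metrics.mean_squared_error(test_y,pred3),2))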
# 4. Plot a line chart of accuracy: the x-axis is the number of decision trees, the y-axis is accuracy.
tree_counts=range(1,51,3)  # 1, 4, 7, ..., 49
score_all=[]
for i in tree_counts:
    AdaBoost4=AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),n_estimators=i)
    AdaBoost4.fit(train_X,train_y)
    pred4=AdaBoost4.predict(test_X)
    score_all.append(metrics.accuracy_score(test_y,pred4))
    print("Trees:",i,"accuracy:",round(score_all[-1],4))

plt.figure(figsize=(10,6))
plt.plot(tree_counts,score_all)
plt.xlabel('Number of decision trees')
plt.ylabel('Accuracy')
plt.title('Accuracy versus number of decision trees')
plt.show()
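
Refitting a separate ensemble for every tree count works, but a single fitted AdaBoostClassifier can also report its accuracy after each boosting round through staged_score. The following is a minimal sketch of that alternative, not the method used in the experiment above; the variable names, stratify=y, and the random_state values are assumptions made for this example. Boosting can stop early, so the number of stages may be smaller than n_estimators.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The weak learner is passed positionally so the call works whether the
# keyword is named base_estimator (older releases) or estimator (newer ones).
model = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2), n_estimators=50, random_state=0
)
model.fit(X_tr, y_tr)

# staged_score yields the test accuracy after 1, 2, 3, ... boosting rounds.
staged_acc = list(model.staged_score(X_te, y_te))

# Report every third stage, mirroring the 1-to-50 sweep in steps of 3.
for n_trees in range(1, len(staged_acc) + 1, 3):
    print(n_trees, round(staged_acc[n_trees - 1], 4))
```

The list staged_acc can be passed to plt.plot in place of score_all. Because all points come from one fit, the curve shows how the same ensemble evolves rather than comparing independently trained ensembles.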

Origin blog.csdn.net/weixin_48434899/article/details/124146908