Predicting benign and malignant tumors with logistic regression and stochastic gradient descent (random parameter estimation)
I have downloaded the dataset locally. The source code and dataset are available in my Git repository: https://github.com/linyi0604/kaggle
import numpy as np
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import classification_report

'''
Linear classifiers
The most basic and commonly used machine learning models.
They are limited by the assumption of a linear relationship between the data features and the classification target.
Logistic regression takes longer to compute, and its model performance is slightly higher.
Stochastic gradient descent (random parameter estimation) computes faster, and its model performance is slightly lower.
'''

'''
1 Data preprocessing
'''
# Create the list of feature names
column_names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
                'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell size',
                'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
# Read the dataset with pandas.read_csv
data = pd.read_csv('./data/breast-cancer-wisconsin.data', names=column_names)
# Replace "?" with the standard missing-value marker
data = data.replace(to_replace='?', value=np.nan)
# Drop rows with missing values: a row is discarded if any one dimension is missing
data = data.dropna(how='any')
# Print the number of rows and dimensions of the data
# print(data.shape)


'''
2 Prepare the training and test data for benign and malignant tumors
'''
# Randomly sample 25% of the data for testing and 75% for training
x_train, x_test, y_train, y_test = train_test_split(data[column_names[1:10]],
                                                    data[column_names[10]],
                                                    test_size=0.25,
                                                    random_state=33)
# Check the number and class distribution of the training and test samples
# print(y_train.value_counts())
# print(y_test.value_counts())
'''
512 training samples in total: 344 benign and 168 malignant
2    344
4    168
Name: Class, dtype: int64
171 test samples in total: 100 benign and 71 malignant
2    100
4     71
Name: Class, dtype: int64
'''


'''
3 Prediction with machine learning models
'''
# Standardize the data so that each feature dimension has variance 1 and mean 0;
# this keeps features with large values from dominating the predictions
ss = StandardScaler()
x_train = ss.fit_transform(x_train)  # Fit the scaler on x_train and standardize it
x_test = ss.transform(x_test)  # Standardize x_test with the rules learned from x_train; do not refit

# Use logistic regression and stochastic gradient descent (random parameter estimation) separately to learn and predict

lr = LogisticRegression()  # Initialize the logistic regression model
sgdc = SGDClassifier()  # Initialize the stochastic gradient descent classifier

# Train logistic regression on the training set
lr.fit(x_train, y_train)
# After training, predict on the test set and store the predictions in lr_y_predict
lr_y_predict = lr.predict(x_test)

# Train the stochastic gradient descent classifier on the training set
sgdc.fit(x_train, y_train)
# After training, predict on the test set and store the predictions in sgdc_y_predict
sgdc_y_predict = sgdc.predict(x_test)


'''
4 Performance analysis
'''
# The logistic regression model's built-in score function gives its accuracy on the test set
print("Logistic regression accuracy:", lr.score(x_test, y_test))
# Other metrics for logistic regression
print("Other metrics for logistic regression:\n",
      classification_report(y_test, lr_y_predict, target_names=["Benign", "Malignant"]))

# Performance analysis of stochastic gradient descent
print("Stochastic gradient descent accuracy:", sgdc.score(x_test, y_test))
# Other metrics for stochastic gradient descent
print("Other metrics for stochastic gradient descent:\n",
      classification_report(y_test, sgdc_y_predict, target_names=["Benign", "Malignant"]))
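
'''
recall      share of the actual samples of a class that are correctly identified
precision   share of the samples predicted as a class that really belong to it
f1-score    harmonic mean of precision and recall
support     number of true samples of each class in the test set

Logistic regression accuracy: 0.9707602339181286
Other metrics for logistic regression:
              precision    recall  f1-score   support

      Benign       0.96      0.99      0.98       100
   Malignant       0.99      0.94      0.96        71

 avg / total       0.97      0.97      0.97       171

Stochastic gradient descent accuracy: 0.9649122807017544
Other metrics for stochastic gradient descent:
              precision    recall  f1-score   support

      Benign       0.97      0.97      0.97       100
   Malignant       0.96      0.96      0.96        71

 avg / total       0.96      0.96      0.96       171
'''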
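
As a rough cross-check of the report above, the per-class metrics for logistic regression can be recomputed by hand. The sketch below assumes confusion counts (99 of 100 benign and 67 of 71 malignant samples classified correctly) that I inferred from the reported accuracy and recall; the script itself never prints them, and the helper name prf is purely illustrative.

```python
# Assumed confusion counts for logistic regression, inferred from the
# reported accuracy (166 of 171 correct) and per-class recall. They are
# a reconstruction, not output of the script above.
benign_correct, benign_total = 99, 100
malignant_correct, malignant_total = 67, 71


def prf(tp, fp, fn):
    """Return (precision, recall, f1) for one class (hypothetical helper)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Samples of one class that were missed are false positives for the other class.
benign_missed = benign_total - benign_correct           # benign predicted as malignant
malignant_missed = malignant_total - malignant_correct  # malignant predicted as benign

benign = prf(tp=benign_correct, fp=malignant_missed, fn=benign_missed)
malignant = prf(tp=malignant_correct, fp=benign_missed, fn=malignant_missed)
accuracy = (benign_correct + malignant_correct) / (benign_total + malignant_total)

print("accuracy:", accuracy)
print("Benign    (precision, recall, f1):", [round(v, 2) for v in benign])
print("Malignant (precision, recall, f1):", [round(v, 2) for v in malignant])
```

Rounded to two decimals, these values agree with the logistic regression rows of the report, which suggests the model missed 4 malignant cases and raised 1 false alarm on the test set.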