The road to machine learning: Python linear classifiers for predicting benign and malignant tumors

Benign and malignant tumors are predicted with two linear classifiers: logistic regression and a stochastic gradient descent (SGD) classifier.

I downloaded the dataset locally. You can get the source code and the dataset from my GitHub repository: https://github.com/linyi0604/kaggle
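Both models in this post are linear classifiers: each computes a weighted sum of the features plus a bias and decides the class from that score. The sketch below illustrates the idea in plain Python. The weights, bias, and sample values are made up for illustration and are not taken from the trained models; the labels 2 (benign) and 4 (malignant) follow the dataset's `Class` column. Logistic regression additionally passes the score through a sigmoid to get a probability, while `SGDClassifier` with its default settings (hinge loss) only uses the sign of the score.

```python
import math


def decision_score(weights, bias, features):
    """Linear score z = w . x + b, the quantity both models rely on."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def sigmoid(z):
    """Map a score to a value in (0, 1), as logistic regression does."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(weights, bias, features, labels=(2, 4)):
    """Return labels[1] when the score is positive, otherwise labels[0]."""
    z = decision_score(weights, bias, features)
    return labels[1] if z > 0 else labels[0]


# Hypothetical parameters and one hypothetical standardized sample
weights = [0.8, 0.5, -0.2]
bias = -0.3
sample = [1.2, 0.4, -0.5]

z = decision_score(weights, bias, sample)
p = sigmoid(z)
print("score:", z)
print("probability of class 4 under the logistic model:", p)
print("predicted class:", predict_class(weights, bias, sample))
```

Because the decision depends only on the sign of a linear score, the boundary between the two classes is a hyperplane in feature space. That is the linear assumption mentioned in the script's opening comment.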

 

import numpy as np
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import classification_report

'''
Linear classifiers
The most basic and commonly used machine learning models
Limited by the assumption of a linear relationship between the data features and the classification target
Logistic regression takes longer to compute, and its model performance is slightly higher
SGD (stochastic gradient descent) parameter estimation takes less time to compute, and its model performance is slightly lower
'''

'''
1 Data preprocessing
'''
# Create the list of column names
column_names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
                'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell size',
                'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
# Read the dataset with pandas.read_csv
data = pd.read_csv('./data/breast-cancer-wisconsin.data', names=column_names)
# Replace ? with the standard missing-value marker
data = data.replace(to_replace='?', value=np.nan)
# Drop rows with missing values: a row is discarded if any dimension is missing
data = data.dropna(how='any')
# Print the number of rows and the number of dimensions
# print(data.shape)


'''
2 Prepare the training and test data for benign and malignant tumors
'''
# Randomly sample 25% of the data for testing and use the remaining 75% for training
x_train, x_test, y_train, y_test = train_test_split(data[column_names[1:10]],
                                                    data[column_names[10]],
                                                    test_size=0.25,
                                                    random_state=33)
# Check the number and class distribution of the training and test samples
# print(y_train.value_counts())
# print(y_test.value_counts())
'''
512 training samples in total: 344 benign tumors and 168 malignant tumors
2    344
4    168
Name: Class, dtype: int64
171 test samples in total: 100 benign tumors and 71 malignant tumors
2    100
4     71
Name: Class, dtype: int64
'''


'''
3 Prediction with machine learning models
'''
# Standardize the data so that every feature dimension has variance 1 and mean 0;
# the predictions are then not dominated by features with overly large values
ss = StandardScaler()
x_train = ss.fit_transform(x_train)   # Fit the scaler on x_train and standardize it
x_test = ss.transform(x_test)         # Standardize x_test with the rules learned from x_train; do not refit

# Learn and predict with two methods: logistic regression and SGD parameter estimation

lr = LogisticRegression()   # Initialize the logistic regression model
sgdc = SGDClassifier()      # Initialize the SGD classifier

# Train logistic regression on the training set
lr.fit(x_train, y_train)
# After training, predict on the test set and store the result in lr_y_predict
lr_y_predict = lr.predict(x_test)

# Train the SGD classifier on the training set
sgdc.fit(x_train, y_train)
# After training, predict on the test set and store the result in sgdc_y_predict
sgdc_y_predict = sgdc.predict(x_test)

'''
4 Performance analysis
'''
# The logistic regression model's built-in score function gives its accuracy on the test set
print("Logistic regression accuracy:", lr.score(x_test, y_test))
# Other metrics for logistic regression
print("Other metrics for logistic regression:\n",
      classification_report(y_test, lr_y_predict, target_names=["Benign", "Malignant"]))

# Performance analysis of the SGD classifier
print("SGD classifier accuracy:", sgdc.score(x_test, y_test))
# Other metrics for the SGD classifier
print("Other metrics for the SGD classifier:\n",
      classification_report(y_test, sgdc_y_predict, target_names=["Benign", "Malignant"]))

'''
Metrics in the report:
recall
precision
f1-score
support

Logistic regression accuracy: 0.9707602339181286
Other metrics for logistic regression:
              precision    recall  f1-score   support

     Benign       0.96      0.99      0.98       100
  Malignant       0.99      0.94      0.96        71

avg / total       0.97      0.97      0.97       171

SGD classifier accuracy: 0.9649122807017544
Other metrics for the SGD classifier:
              precision    recall  f1-score   support

     Benign       0.97      0.97      0.97       100
  Malignant       0.96      0.96      0.96        71

avg / total       0.96      0.96      0.96       171
'''
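To make the report columns concrete, here is a small worked example of how precision, recall, and f1-score follow from prediction counts. The counts are not printed by the script; they are inferred from the logistic regression figures above (accuracy 0.9707... on 171 samples means 166 correct, and the recall values with supports of 100 and 71 imply 99 benign and 67 malignant samples were classified correctly). Treat them as an assumption derived from rounded output.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its prediction counts."""
    precision = tp / (tp + fp)   # of everything predicted as this class, how much was right
    recall = tp / (tp + fn)      # of everything truly in this class, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Counts inferred from the logistic regression report (an assumption, see above)
benign_correct = 99        # benign predicted as benign
benign_missed = 1          # benign predicted as malignant
malignant_correct = 67     # malignant predicted as malignant
malignant_missed = 4       # malignant predicted as benign

# A missed sample of one class is a false positive for the other class
benign = class_metrics(tp=benign_correct, fp=malignant_missed, fn=benign_missed)
malignant = class_metrics(tp=malignant_correct, fp=benign_missed, fn=malignant_missed)

total = benign_correct + benign_missed + malignant_correct + malignant_missed
accuracy = (benign_correct + malignant_correct) / total

print("Benign    precision/recall/f1:", [round(v, 2) for v in benign])
print("Malignant precision/recall/f1:", [round(v, 2) for v in malignant])
print("Accuracy:", accuracy)
```

The support column is simply the number of true samples of each class in the test set (100 benign, 71 malignant). For a tumor classifier, malignant recall is usually the figure to watch, since a missed malignant case is the costlier error.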

 
