A summary of simple linear regression, logistic regression, SVM, and other algorithms with sklearn

Basic usage of some machine learning algorithms with sklearn

Related sklearn imports

The following imports bring in the algorithms and utilities used below

import pandas as pd  # for reading csv data
from sklearn.model_selection import train_test_split  # dataset splitting
from sklearn.model_selection import GridSearchCV  # cross-validated grid search
from sklearn.neighbors import KNeighborsClassifier  # the KNN algorithm
from sklearn.linear_model import LinearRegression  # the linear regression algorithm
from sklearn.linear_model import LogisticRegression  # the logistic regression algorithm
from sklearn.svm import SVC  # the SVC algorithm
import matplotlib.pyplot as plt  # plotting for visualization
import warnings  # for suppressing warnings; warnings.filterwarnings("ignore") ignores them
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler  # feature preprocessing: one-hot encoding and min-max normalization
import numpy as np

The basic idea:

Define the feature columns and the target label -> read the whole data set -> read the features and labels into the X and y arrays -> partition the data (training set, test set) -> declare the model/algorithm -> train, then compute the accuracy on the test set

Define the feature columns and the target label

fruit_label = {'apple': 0, 'mandarin': 1, 'orange': 2, 'lemon': 3}  # a dict mapping fruit names to numeric labels; needed for supervised classification

feature_data = ['mass', 'width', 'height', 'color_score']  # a list naming the feature columns

Read the entire data set

data_fruit = pd.read_csv(file_path)  # read all the data from the csv file (file_path is wherever the file lives)

Generate the label column with map (the labels in the raw data are text; this turns them into numbers) (needed for supervised classification such as KNN)

data_fruit['Label'] = data_fruit['fruit_name'].map(fruit_label)
# creates a new label column named Label whose 0-3 values are generated from the fruit_name column (the 0-3 mapping is defined above)
# it is needed for supervised classification; supervised regression apparently does not need it...

Read the feature and label data sets

X = data_fruit[feature_data].values  # the previously defined feature columns, taken as the X array
y = data_fruit['Label'].values  # the label column, taken as the y array

# data[key] is a pandas Series; data[key].values is a numpy ndarray
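A quick way to see the type difference (a small sketch reusing data_fruit from above):

print(type(data_fruit['mass']))         # <class 'pandas.core.series.Series'>
print(type(data_fruit['mass'].values))  # <class 'numpy.ndarray'>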

Data Partitioning (training set, test set)

# split the whole data set into four parts: training and test sets of the features, and training and test sets of the target labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 / 3, random_state=10)
 
# the statement above splits X and y, with the test set taking 1/3 of the data, shuffled at random (so that ordered source data cannot leave some labels out of training)
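A quick sanity check of the split (a sketch; the row count of the fruit csv is not given here):

print(X_train.shape, X_test.shape)  # the test set holds roughly 1/3 of the rows
print(y_train.shape, y_test.shape)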

Declare the model, train it, and compute the accuracy

Declare the KNN model, train, and compute accuracy

knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)  # train on the training set
accuracy = knn_model.score(X_test, y_test)  # evaluate the accuracy on the test set

Declare the linear regression model, train, and compute the score

linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train, y_train)
r2_score = linear_reg_model.score(X_test, y_test)  # for a regressor, score() returns R^2

# taking a single sample: X_test[i, :] is all the data in row i of the test set
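A sketch of scoring a single sample this way (the index i = 0 is a hypothetical choice): X_test[i, :] is one-dimensional, so it must be reshaped into a single-row 2D array before predict will accept it.

i = 0
y_pred_i = linear_reg_model.predict(X_test[i, :].reshape(1, -1))  # reshape the 1D row into a 2D array
print(y_pred_i[0], y_test[i])  # the predicted value vs. the true value for sample i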
  

Declare the logistic regression model, train, and compute accuracy

LogisticRegression_model = LogisticRegression()
LogisticRegression_model.fit(X_train, y_train)
accuracy = LogisticRegression_model.score(X_test, y_test)  # for a classifier, score() returns accuracy, not R^2
 
 

Declare the SVM model, train, and compute accuracy

SVM_model = SVC()
SVM_model.fit(X_train, y_train)
accuracy = SVM_model.score(X_test, y_test)

Finding the optimal hyperparameters (K in KNN, C in logistic regression, C in SVM)

The approach for determining K in KNN:

Define a list of K values to try, then loop over them, repeating the basic steps above (read the features and labels from the data set, split the data, train and test)
 
k_sets = [3, 5, 8]

def round_function(fruit_data, k_set):
    X = fruit_data[feature_data].values
    y = fruit_data['Label'].values

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 / 3, random_state=10)

    knn_model = KNeighborsClassifier(n_neighbors=k_set)  # pass in the K to validate; the default is 5
    knn_model.fit(X_train, y_train)
    accuracy = knn_model.score(X_test, y_test)

    print('Accuracy with K={}: {:.2f}%'.format(k_set, accuracy * 100))

for k_set in k_sets:
    round_function(data_fruit, k_set)  # the function needs the fruit data, so it takes it as a parameter
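To keep the winning K instead of only printing each accuracy, a small variation (my own sketch; it assumes round_function is changed to end with return accuracy):

best_k, best_acc = None, 0.0
for k_set in k_sets:
    acc = round_function(data_fruit, k_set)  # assumes the function returns its accuracy
    if acc > best_acc:
        best_k, best_acc = k_set, acc
print('Best K is {} with accuracy {:.2f}%'.format(best_k, best_acc * 100))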

Cross-validation (for grid search)

Cross-validation: when tuning parameters (running tests to determine the optimal values), the training set is divided into N parts; each of the N parts in turn serves as the validation fold while the rest is used for training, the accuracy of each round is computed, and finally the N accuracies are averaged. N is the number of folds.
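The N-fold idea can be seen directly with sklearn's cross_val_score; a minimal sketch with 5 folds, reusing the training set from above:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(KNeighborsClassifier(), X_train, y_train, cv=5)
print(scores)         # the accuracy of each of the 5 folds
print(scores.mean())  # the accuracies averaged over the folds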

When several parameters have to be determined at once, grid search over the known candidate values is done in sklearn with GridSearchCV(). For example, the KNN case below needs two parameters tuned, p and n_neighbors (k).

Steps: define a dict of the models and parameter grids to validate --> loop over it, test, and print the accuracies
model_dict = {
    'KNN': (KNeighborsClassifier(), {'n_neighbors': [3, 5, 7], 'p': [1, 2]}),
    'Logistic': (LogisticRegression(), {'C': [1e-2, 1, 1e2]}),
    'SVM': (SVC(), {'C': [1e-2, 1, 1e2]})
}
 
# loop and compare
for model_name, (model, model_param) in model_dict.items():

    # train the model and pick out the best parameters
    clf = GridSearchCV(estimator=model, param_grid=model_param, cv=5)  # pass in the model, the parameters to determine, and 5-fold cross-validation
    clf.fit(X_train, y_train)
    best_model = clf.best_estimator_

    # compute the accuracy
    acc = best_model.score(X_test, y_test)

    # print for comparison
    print('Best estimator for {}: {}; accuracy: {:.2f}%'.format(model_name, best_model, acc * 100))
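GridSearchCV also exposes the winning parameter values and their cross-validated score directly; these two lines could be added inside the loop above:

print(clf.best_params_)  # the winning parameter values, e.g. something like {'n_neighbors': 5, 'p': 2} for KNN
print(clf.best_score_)   # the mean cross-validated accuracy achieved with those values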

Finding the optimal algorithm

Define a dict of the algorithms, then loop over it

# define an algorithm dict, pairing each algorithm name with its model
model_dict = {'KNN': KNeighborsClassifier(n_neighbors=7), 'Logistic': LogisticRegression(C=1), 'SVM': SVC(C=1)}

# loop and compare
for model_name, model in model_dict.items():
    # train the model
    model.fit(X_train, y_train)
    # compute the accuracy
    acc = model.score(X_test, y_test)
    # print for comparison
    print('Accuracy of {}: {:.2f}%'.format(model_name, acc * 100))

Note

items() is used when iterating over a dict; it yields the corresponding key-value pairs.
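A tiny illustration:

d = {'a': 1, 'b': 2}
for key, value in d.items():
    print(key, value)  # prints: a 1, then b 2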

Visualization and plotting

Using the matplotlib plotting module, define a plotting function. The basic plotting flow: create a figure instance (plt.figure()) -> draw -> display (plt.show())

def plot_fitting_line(linear_reg_model, X, y, feat):
    """
    Plot the fitted linear regression line
    """
    w = linear_reg_model.coef_  # coef_ gives the weights
    b = linear_reg_model.intercept_  # intercept_ gives the bias term

    plt.figure()  # create a figure instance, which acts as the canvas

    # scatter plot of the true values
    plt.scatter(X, y, alpha=0.5)  # scatter the true values against the feature on the x axis, 50% opacity

    # the fitted line
    plt.plot(X, w * X + b, c='red')  # draw the predicted values against the feature on the x axis, as a red line
    plt.title(feat)  # the figure title, here the feature name
    plt.show()  # display the figure

When calling the plotting function, pass in X, the y values, and the other defined parameters

Note

Pass in whichever parameters you need to use; the parameters above are just the ones this example passes.
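For example, a sketch of calling the function once per feature, fitting a separate single-feature regression each time (it reuses the fruit columns from above purely for illustration):

for feat in feature_data:
    X_feat = data_fruit[[feat]].values  # one feature column, kept two-dimensional
    y = data_fruit['Label'].values
    X_train, X_test, y_train, y_test = train_test_split(X_feat, y, test_size=1 / 3, random_state=10)
    model = LinearRegression()
    model.fit(X_train, y_train)
    plot_fitting_line(model, X_train, y_train, feat)  # one scatter plot plus fitted line per feature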

Feature preprocessing

Features can be divided into numeric features, ordinal features, and categorical features (sex and the like). Numeric features can be normalized, and categorical features can be one-hot encoded.
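To make the two transforms concrete, a tiny sketch on made-up values (1 and 2 could be the two sex codes; 10/20/30 an arbitrary numeric feature):

enc = OneHotEncoder(sparse=False)
print(enc.fit_transform(np.array([[1], [2], [1]])))
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]]

scaler = MinMaxScaler()
print(scaler.fit_transform(np.array([[10.], [20.], [30.]])))
# [[0. ]
#  [0.5]
#  [1. ]]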

Steps: declare the feature column types -> preprocess -> after preprocessing, finish training and testing on the processed features in the main function

# the feature columns used
NUM_FEAT_COLS = ['AGE', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6']  # numeric features
CAT_FEAT_COLS = ['SEX']  # categorical features

# define the preprocessing function, taking the training-set and test-set features
def process_features(X_train, X_test):

    # 1. one-hot encode the categorical features
    encoder = OneHotEncoder(sparse=False)  # note: in sklearn >= 1.2 this parameter is named sparse_output
    encoded_tr_feat = encoder.fit_transform(X_train[CAT_FEAT_COLS])
    encoded_te_feat = encoder.transform(X_test[CAT_FEAT_COLS])
    # the feature data is selected by column name here, so do not call .values when splitting X and y
    # in the main function (that would give a numpy ndarray instead of a DataFrame); note also that the
    # training and test features are processed with different methods (fit_transform vs transform)

    # 2. normalize the numeric features
    scaler = MinMaxScaler()
    scaled_tr_feat = scaler.fit_transform(X_train[NUM_FEAT_COLS])
    scaled_te_feat = scaler.transform(X_test[NUM_FEAT_COLS])

    # 3. concatenate the features
    X_train_proc = np.hstack((encoded_tr_feat, scaled_tr_feat))
    X_test_proc = np.hstack((encoded_te_feat, scaled_te_feat))

    return X_train_proc, X_test_proc
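A sketch of calling it from the main function (the DataFrame data_diabetes and its target column 'Y' are assumed names, not taken from this post):

FEAT_COLS = CAT_FEAT_COLS + NUM_FEAT_COLS
X = data_diabetes[FEAT_COLS]  # no .values here, so process_features can select columns by name
y = data_diabetes['Y'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 / 3, random_state=10)
X_train_proc, X_test_proc = process_features(X_train, X_test)

model = LinearRegression()
model.fit(X_train_proc, y_train)
print(model.score(X_test_proc, y_test))  # R^2 on the processed test features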
