[Radiomics] Classifier Model Design - Random Forest + Support Vector Machine


1. Random Forest Classification

  • Decision tree (Decision Tree)
    • Decision-tree analysis starts from the known probabilities of the possible outcomes and builds a tree to evaluate a quantity of interest, for example the probability that the expected net present value is greater than or equal to zero.
    • A decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class.

  • Implementing Decision Trees in Python (a minimal runnable sketch follows this list)
    • Function: sklearn.tree.DecisionTreeClassifier (from sklearn.tree import DecisionTreeClassifier)
    • Model initialization: dt_model = DecisionTreeClassifier()
    • Training data: dt_model.fit(X, y)
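
  • A minimal runnable sketch; the built-in breast-cancer dataset here is only an illustrative stand-in for a real radiomics feature table:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    
    # Toy data standing in for a radiomics feature table
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=15)
    
    dt_model = DecisionTreeClassifier(max_depth=3, random_state=20)  # shallow tree to limit overfitting
    dt_model.fit(X_train, y_train)  # train on the training set
    print(dt_model.score(X_test, y_test))  # test-set accuracy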

  • Random Forest
    • A random forest is an ensemble classifier that trains many decision trees on random subsets of the data and combines their votes to predict a sample's class.

  • The main advantages of random forests
    • It has excellent accuracy
    • It can run efficiently on large data sets
    • It can handle input samples with high-dimensional features
    • It can evaluate the importance (weight) of each feature in the classification problem

  • Implementing Random Forest Classification in Python
    • Function: from sklearn.ensemble import RandomForestClassifier
    • Model initialization: model_rf = RandomForestClassifier()
    • Training data: model_rf.fit(X, y)

  • Random forest classification Python code:
    R code reference: R language - random forest classification

# Import packages
    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from scipy.stats import ttest_ind, levene
    from sklearn.linear_model import LassoCV
    from sklearn.utils import shuffle
    from sklearn.ensemble import RandomForestClassifier # random forest classifier
    from sklearn.model_selection import train_test_split # train/test split
    
# Load the data
    xlsx_a = 'data/featureTable/aa.xlsx'
    xlsx_b = 'data/featureTable/bb.xlsx'
    data_a = pd.read_excel(xlsx_a)
    data_b = pd.read_excel(xlsx_b)
    print(data_a.shape,data_b.shape)
    # (212, 30) (357, 30)
    
# Feature selection by independent-samples t-test
    index = []
    for colName in data_a.columns[:]: 
        # Levene's test checks equality of variances; p > 0.05 means equal variances can be assumed
        if levene(data_a[colName], data_b[colName])[1] > 0.05: 
            # equal variances: standard independent-samples t-test
            if ttest_ind(data_a[colName], data_b[colName])[1] < 0.05: 
                index.append(colName)
        else: 
            # unequal variances: Welch's t-test (equal_var=False)
            if ttest_ind(data_a[colName], data_b[colName],equal_var=False)[1] < 0.05: 
                index.append(colName)
    print(len(index))  # 25
    print(index)
    # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'M', 'N', 'P', 'Q', 'R', 'U', 'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB', 'AC', 'AD']
    
    # Data preparation after the t-test
    data_a = data_a[index]
    data_b = data_b[index]
    rows_a,cols_a = data_a.shape
    rows_b,cols_b = data_b.shape
    labels_a = np.zeros(rows_a)  # class a -> label 0
    labels_b = np.ones(rows_b)   # class b -> label 1
    data_a.insert(0, 'label', labels_a)
    data_b.insert(0, 'label', labels_b)
    data = pd.concat([data_a,data_b])
    data = shuffle(data)
    data.index = range(len(data))
    X = data[data.columns[1:]]
    y = data['label']
    X = X.apply(pd.to_numeric, errors='ignore')  # convert columns to numeric where possible
    colNames = X.columns
    X = X.fillna(0)
    X = X.astype(np.float64)
    X = StandardScaler().fit_transform(X)  # z-score standardization
    X = pd.DataFrame(X)
    X.columns = colNames
    print(data.shape)  # (569, 26)
    
    # Feature selection by LASSO
    alphas = np.logspace(-4,1,50)  # candidate regularization strengths
    model_lassoCV = LassoCV(alphas = alphas,max_iter = 100000).fit(X,y)
    coef = pd.Series(model_lassoCV.coef_, index = X.columns)
    print(model_lassoCV.alpha_)  # alpha chosen by cross-validation
    print('%s %d'%('Lasso picked',sum(coef != 0)))
    print(coef[coef != 0])
    index = coef[coef != 0].index  # keep only features with non-zero LASSO coefficients
    X = X[index]  
    
    np.set_printoptions(threshold=np.inf)  # print full arrays without truncation
    
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state = 15)
    # X: features; y: labels; test_size=0.3 gives train:test = 7:3; random_state=15 fixes the random seed
    
    # Random forest classification
    model_rf = RandomForestClassifier(
      n_estimators = 200  # number of trees in the forest (default 100)
      , criterion = 'entropy' # split criterion: 'gini' (default, Gini index) or 'entropy'
      , random_state = 20  # random seed (default None)
      , class_weight = 'balanced' # reweight classes to counter class imbalance (default None)
      )
    
    model_rf.fit(X_train,y_train)  # fit on the training set
    # print(model_rf.score(X_test,y_test))  # test-set accuracy
    # print(model_rf.predict(X_test))  # predicted class for each test case
    # print(model_rf.predict_proba(X_test)) # predicted class probabilities for the test set
    # print(model_rf.n_features_in_)  # number of features used to fit the model
    # print(model_rf.feature_importances_)  # feature weights; they sum to 1
    print(model_rf.get_params())  # parameters used to build the model
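
  • Since feature weighting is one of the advantages listed above, a possible follow-on sketch (reusing model_rf and X from the block above) ranks the selected features by importance:

    # Feature importances sum to 1; larger values mean more influence on the forest's splits
    importances = pd.Series(model_rf.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False))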
    

2. Support Vector Machine Classification

  • Support vector machine (support vector machine, SVM)
    • A binary classification model (multi-class problems can also be handled, e.g. by combining several binary SVMs)
    • Basic model: the linear classifier with the largest margin in the feature space (extendable to the non-linear case)
    • Basic idea: find the separating hyperplane that correctly divides the training set with the largest geometric margin, formalized below (the analogue of a straight line in two dimensions or a plane in three dimensions separating the two classes)
    • The kernel trick makes it, in effect, a non-linear classifier
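
  • Formally, given training pairs (x_i, y_i) with y_i ∈ {−1, +1}, the hard-margin linear SVM solves

    \min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2
    \quad \text{s.t.} \quad y_i \left( w^\top x_i + b \right) \ge 1, \quad i = 1, \dots, n

    Maximizing the margin 2/‖w‖ between the two classes is equivalent to minimizing ‖w‖².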

  • Kernel function
    • In practical applications, most data are not linearly separable, i.e. no hyperplane satisfies the separation conditions
    • A kernel function implicitly maps the data into a higher-dimensional space, resolving the linear inseparability of the original space
    • Commonly used kernel functions: linear (linear), polynomial (poly), radial basis function (rbf), sigmoid (S-shaped); a comparison sketch follows this list
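
  • A minimal sketch comparing the four kernels on one split; the built-in breast-cancer dataset is again only a stand-in, and features are standardized because SVMs are scale-sensitive:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    
    # Toy stand-in for a radiomics feature table
    X, y = load_breast_cancer(return_X_y=True)
    X = StandardScaler().fit_transform(X)  # SVMs are sensitive to feature scale
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=15)
    
    # Fit one SVC per kernel and report test-set accuracy
    for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
        acc = SVC(kernel=kernel).fit(X_train, y_train).score(X_test, y_test)
        print(kernel, acc)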

  • Important parameters of the rbf radial basis kernel function
    • Parameter γ (gamma): defines the influence range of a single training sample; the larger γ is, the shorter each sample's reach, so more support vectors are needed and the model overfits more easily (see the sweep sketch after this list)
    • Penalty factor C: defines the tolerance for misclassified ("foul") samples; the larger C is, the more heavily margin violations are penalized
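
  • One way to see the effect of γ is to sweep it and watch the support-vector count and test accuracy; this reuses X_train, X_test, y_train, y_test from the kernel sketch above:

    # Larger gamma shrinks each sample's influence radius, which typically
    # increases the number of support vectors and the risk of overfitting
    for gamma in [0.001, 0.01, 0.1, 1.0]:
        svc = SVC(kernel='rbf', gamma=gamma, C=1.0).fit(X_train, y_train)
        print(gamma, svc.n_support_.sum(), svc.score(X_test, y_test))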

  • Implement support vector machine classification in Python
    • Function: from sklearn.svm import SVC
    • Model initialization: model_svc = SVC(kernel='rbf', gamma=0.05, C=1) sets the kernel function, the parameter γ, and the penalty factor C
    • Data training: model_svc.fit(X_train, y_train)
    • Accuracy: model_svc.score(X, y)
    • Obtaining parameters: model_svc.get_params()

  • References:
    Support Vector Machines for Classification Algorithms: SVM (Theory)
    Support Vector Machines for Classification Algorithms: SVM (Applications)

  • Support vector machine classification Python code:
    Except for importing modules, the steps before SVM classification are the same as random forest classification.

# Import packages
    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from scipy.stats import ttest_ind, levene
    from sklearn.utils import shuffle
    from sklearn.linear_model import LassoCV
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC  # support vector classifier
    
# Load the data
    xlsx_a = 'data/featureTable/aa.xlsx'
    xlsx_b = 'data/featureTable/bb.xlsx'
    data_a = pd.read_excel(xlsx_a)
    data_b = pd.read_excel(xlsx_b)
    print(data_a.shape,data_b.shape)
    # (212, 30) (357, 30)
    
# Feature selection by independent-samples t-test
    index = []
    for colName in data_a.columns[:]: 
        # Levene's test checks equality of variances; p > 0.05 means equal variances can be assumed
        if levene(data_a[colName], data_b[colName])[1] > 0.05: 
            # equal variances: standard independent-samples t-test
            if ttest_ind(data_a[colName], data_b[colName])[1] < 0.05: 
                index.append(colName)
        else: 
            # unequal variances: Welch's t-test (equal_var=False)
            if ttest_ind(data_a[colName], data_b[colName],equal_var=False)[1] < 0.05: 
                index.append(colName)
    print(len(index))  # 25
    print(index)
    # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'M', 'N', 'P', 'Q', 'R', 'U', 'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB', 'AC', 'AD']
    
    # Data preparation after the t-test
    data_a = data_a[index]
    data_b = data_b[index]
    rows_a,cols_a = data_a.shape
    rows_b,cols_b = data_b.shape
    labels_a = np.zeros(rows_a)  # class a -> label 0
    labels_b = np.ones(rows_b)   # class b -> label 1
    data_a.insert(0, 'label', labels_a)
    data_b.insert(0, 'label', labels_b)
    data = pd.concat([data_a,data_b])
    data = shuffle(data)
    data.index = range(len(data))
    X = data[data.columns[1:]]
    y = data['label']
    X = X.apply(pd.to_numeric, errors='ignore')  # convert columns to numeric where possible
    colNames = X.columns
    X = X.fillna(0)
    X = X.astype(np.float64)
    X = StandardScaler().fit_transform(X)  # z-score standardization
    X = pd.DataFrame(X)
    X.columns = colNames
    print(data.shape)  # (569, 26)
    
    # Feature selection by LASSO
    alphas = np.logspace(-4,1,50)  # candidate regularization strengths
    model_lassoCV = LassoCV(alphas = alphas,max_iter = 100000).fit(X,y)
    coef = pd.Series(model_lassoCV.coef_, index = X.columns)
    print(model_lassoCV.alpha_)  # alpha chosen by cross-validation
    print('%s %d'%('Lasso picked',sum(coef != 0)))
    print(coef[coef != 0])
    index = coef[coef != 0].index  # keep only features with non-zero LASSO coefficients
    X = X[index]  
    
    np.set_printoptions(threshold=np.inf)  # print full arrays without truncation
    
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state = 15)
    # X: features; y: labels; test_size=0.3 gives train:test = 7:3; random_state=15 fixes the random seed
    
    # SVM classification
    model_svm = SVC(kernel='rbf', gamma = 'scale', probability=True)
    # kernel='rbf' selects the radial basis kernel
    # gamma='scale' = 1/(n_features * X.var()); 'auto' = 1/n_features; a float can also be given and should be tuned
    # probability=True is required to query predicted probabilities later
    
    # Fit the model on the training set
    model_svm.fit(X_train,y_train)
    # print(model_svm.score(X_test,y_test))  # test-set accuracy
    # print(model_svm.predict(X_test))   # predicted class for each test case
    # print(model_svm.predict_proba(X_test))  # predicted class probabilities for the test set
    print(model_svm.get_params())  # parameters used to build the model
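
  • Radiomics studies usually report AUC alongside accuracy; a possible follow-on, reusing model_svm, X_test, and y_test from the block above (and relying on probability=True set there):

    # AUC computed from the predicted probability of the positive class (label 1)
    from sklearn.metrics import roc_auc_score
    y_prob = model_svm.predict_proba(X_test)[:, 1]
    print(roc_auc_score(y_test, y_prob))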
    
