Implementation of Fisher's two-class discriminant model in Python

1. How the code was completed

The code for this Fisher two-class discriminant model was written independently in Python, based mainly on the material from the previous class and without reference to code found online.

2. Algorithm design and implementation

- Dataset selection and loading initialization

In the power industry, the dataset best suited to a Fisher two-class discriminant model would be user-profile classification. However, because the power industry is under strict national control, very few open-source datasets can be found online. The dozen or so energy customer profile datasets on the Dataju platform were all taken down in the second half of this year, for reasons such as copyright and the confidentiality of customer information. After searching for some time, I gave up on finding a ready-made open-source classification dataset in that domain. This code therefore uses the breast cancer dataset from the Scikit-Learn package.

import math
import numpy as np

# Data: a binary classification problem.
# Load the breast cancer dataset directly from sklearn.datasets.
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()

The following are the more important parts of the description of the Breast cancer wisconsin (diagnostic) dataset:

Data Set Characteristics:

:Number of Instances: 569

:Number of Attributes: 30 numeric, predictive attributes and the class

:Attribute Information:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

    The mean, standard error, and "worst" or largest (mean of the three
    worst/largest values) of these features were computed for each image,
    resulting in 30 features.  For instance, field 0 is Mean Radius, field
    10 is Radius SE, field 20 is Worst Radius.

    - class:
            - WDBC-Malignant
            - WDBC-Benign

:Summary Statistics:

===================================== ====== ======
                                       Min    Max
===================================== ====== ======
radius (mean):                        6.981  28.11
texture (mean):                       9.71   39.28
perimeter (mean):                     43.79  188.5
area (mean):                          143.5  2501.0
smoothness (mean):                    0.053  0.163
compactness (mean):                   0.019  0.345
concavity (mean):                     0.0    0.427
concave points (mean):                0.0    0.201
symmetry (mean):                      0.106  0.304
fractal dimension (mean):             0.05   0.097
radius (standard error):              0.112  2.873
texture (standard error):             0.36   4.885
perimeter (standard error):           0.757  21.98
area (standard error):                6.802  542.2
smoothness (standard error):          0.002  0.031
compactness (standard error):         0.002  0.135
concavity (standard error):           0.0    0.396
concave points (standard error):      0.0    0.053
symmetry (standard error):            0.008  0.079
fractal dimension (standard error):   0.001  0.03
radius (worst):                       7.93   36.04
texture (worst):                      12.02  49.54
perimeter (worst):                    50.41  251.2
area (worst):                         185.2  4254.0
smoothness (worst):                   0.071  0.223
compactness (worst):                  0.027  1.058
concavity (worst):                    0.0    1.252
concave points (worst):               0.0    0.291
symmetry (worst):                     0.156  0.664
fractal dimension (worst):            0.055  0.208
===================================== ====== ======

Examining the dataset, we find that it consists of 30 numeric features and a binary target indicating whether the patient is sick. The min/max table shows that the area feature has a particularly large variance. Since its influence on the disease is uncertain, we need to consider whether the dataset should be standardized.
After loading the dataset, 20% of the data is held out as test samples.

from sklearn.model_selection import train_test_split
x = breast_cancer['data']
y = breast_cancer['target']
# Random sampling: hold out 20% of the data as test samples.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0, test_size=0.2)

The standardization code is as follows. When actually running the model, whether to standardize the data is decided according to the final results.

# Standardization
from sklearn.preprocessing import StandardScaler
ss_x = StandardScaler()
# Standardize the features of the training and test data
# (the binary target does not need to be standardized).
x_train = ss_x.fit_transform(x_train)
x_test = ss_x.transform(x_test)

- Function design

Following the material from class, the functions below were designed:

| Function | Purpose |
| --- | --- |
| get_mean_vector(target) | compute the mean vector of a class |
| get_dispersion_matrix(target, mean_vector) | compute the within-class dispersion matrix |
| get_sample_divergence(mean_vector1, mean_vector2) | compute the between-class dispersion |
| get_w_star(dispersion_matrix, mean_vector1, mean_vector2) | solve the Fisher criterion function for w_star |
| get_sample_projection(w_star, x) | project a feature vector onto w_star |
| get_segmentation_threshold(w_star, way_flag) | compute the segmentation threshold |
| test_single_sample(w_star, y0, test_sample, test_target) | single-sample test |
| test_single_sample_check(w_star, y0, test_sample, test_target) | single-sample test (for statistics) |
| test_check(w_star, y0) | evaluate the whole test set |

The specific implementations are as follows:

def get_mean_vector(target):
    '''
    Compute the mean vector of the class labeled `target`.
    :param target: class label (0 or 1)
    :return: mean vector as a list
    '''
    m_target_list = [0 for i in range(x_train.shape[1])]
    count = 0
    for i in range(x_train.shape[0]):
        if y_train[i] == target:
            count = count + 1
            temp = x_train[i].tolist()
            m_target_list = [m_target_list[j] + temp[j] for j in range(x_train.shape[1])]
    m_target_list = [x / count for x in m_target_list]
    # This could also be done with an axis-wise sum, similar to torch's
    # dimension-reduction functions.
    return m_target_list

The mean vector of the class labeled target is computed by selecting the value of target.
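As a side note, the same computation can be written in one line with NumPy (a sketch, assuming x_train and y_train are NumPy arrays as above; get_mean_vector_vectorized is a hypothetical name, not used elsewhere in this code):

def get_mean_vector_vectorized(target):
    # Select the rows of the chosen class, then average column-wise.
    return x_train[y_train == target].mean(axis=0)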

def get_dispersion_matrix(target, mean_vector):
    '''
    Compute the within-class dispersion matrix of the class labeled `target`.
    :param target: class label (0 or 1)
    :param mean_vector: mean vector of that class, shape (n_features, 1)
    :return: within-class dispersion matrix, shape (n_features, n_features)
    '''
    s_target_matrix = np.zeros((x_train.shape[1], x_train.shape[1]))
    for i in range(x_train.shape[0]):
        if y_train[i] == target:
            # Outer product (x - m)(x - m)^T of the centered column vector.
            # (np.multiply with broadcasting would not form the outer product here.)
            d = x_train[i].reshape(-1, 1) - mean_vector
            s_target_matrix = s_target_matrix + np.matmul(d, d.transpose())
    return s_target_matrix

The within-class dispersion matrix is computed from target and the matching mean_vector.
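For reference, the loop above can be vectorized (a sketch under the same assumptions; get_dispersion_matrix_vectorized is a hypothetical alternative, not called by the main routine): for the matrix D of centered class rows, the within-class dispersion is D^T D, the sum of the per-sample outer products.

def get_dispersion_matrix_vectorized(target, mean_vector):
    # Centered rows of the chosen class, shape (n_samples_in_class, n_features).
    d = x_train[y_train == target] - mean_vector.reshape(1, -1)
    # D^T D accumulates the outer products (x - m)(x - m)^T over the class.
    return d.T @ d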

def get_sample_divergence(mean_vector1, mean_vector2):
    '''
    Compute the between-class dispersion matrix.
    :param mean_vector1: mean vector of class 1, shape (n_features, 1)
    :param mean_vector2: mean vector of class 2, shape (n_features, 1)
    :return: between-class dispersion matrix (m1 - m2)(m1 - m2)^T
    '''
    d = mean_vector1 - mean_vector2
    return np.matmul(d, d.transpose())

Compute the between-class dispersion of two mean vectors.
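Equivalently (a sketch; get_sample_divergence_alt is a hypothetical alternative), the between-class dispersion is the outer product of the mean-difference vector with itself:

def get_sample_divergence_alt(mean_vector1, mean_vector2):
    d = (mean_vector1 - mean_vector2).ravel()
    return np.outer(d, d)  # (m1 - m2)(m1 - m2)^T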

def get_w_star(dispersion_matrix, mean_vector1, mean_vector2):
    '''
    Solve the Fisher criterion function for w_star.
    :param dispersion_matrix: total within-class dispersion matrix Sw
    :param mean_vector1: mean vector of class 1
    :param mean_vector2: mean vector of class 2
    :return: w_star = Sw^{-1} (m1 - m2)
    '''
    return np.matmul(np.linalg.inv(dispersion_matrix), (mean_vector1 - mean_vector2))

From the total within-class dispersion matrix and the two mean vectors, the optimal projection direction w_star is obtained according to the Fisher criterion.
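Numerically it is usually preferable to solve the linear system Sw·w = m1 − m2 rather than forming the explicit inverse; a sketch (get_w_star_solve is a hypothetical alternative with the same interface):

def get_w_star_solve(dispersion_matrix, mean_vector1, mean_vector2):
    # Solving is more stable and cheaper than inverting Sw explicitly.
    return np.linalg.solve(dispersion_matrix, mean_vector1 - mean_vector2)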

def get_sample_projection(w_star, x):
    '''
    Project a feature vector onto w_star.
    :param w_star: projection direction, shape (n_features, 1)
    :param x: feature vector
    :return: scalar projection w_star^T x
    '''
    return np.matmul(w_star.transpose(), x)

Use the obtained w_star to find the projection value of a feature vector on w_star.

def get_segmentation_threshold(w_star, way_flag):
    '''
    Compute the segmentation threshold on the projected line.
    :param w_star: projection direction
    :param way_flag: 0 for the weighted mean of the projected class means,
                     1 for the midpoint corrected by the log prior ratio
    :return: threshold y0
    '''
    y0_list = []
    y1_list = []
    for i in range(x_train.shape[0]):
        if y_train[i] == 0:
            y0_list.append(get_sample_projection(w_star, x_train[i]))
        else:
            y1_list.append(get_sample_projection(w_star, x_train[i]))
    ny0 = len(y0_list)
    ny1 = len(y1_list)
    my0 = sum(y0_list) / ny0
    my1 = sum(y1_list) / ny1
    if way_flag == 0:
        return (ny0 * my0 + ny1 * my1) / (ny0 + ny1)
    elif way_flag == 1:
        py0 = ny0 / (ny0 + ny1)
        py1 = ny1 / (ny0 + ny1)
        # Textbook correction term: ln(P0/P1) / (N0 + N1 - 2).
        return (my0 + my1) / 2 + math.log(py0 / py1) / (ny0 + ny1 - 2)
    else:
        return 0

The threshold is found from the projections of the labeled training vectors onto w_star. This function provides two ways of computing the segmentation threshold.
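In formula form, writing N0, N1 for the per-class sample counts, m̃0, m̃1 for the projected class means, and P(ω0), P(ω1) for the class priors, the two thresholds computed above are:

    y0 = (N0·m̃0 + N1·m̃1) / (N0 + N1)                        (way 0: weighted mean)
    y0 = (m̃0 + m̃1)/2 + ln(P(ω0)/P(ω1)) / (N0 + N1 − 2)      (way 1: prior-corrected midpoint)

Way 1 follows the common textbook form of the prior-corrected threshold.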

def test_single_sample(w_star, y0, test_sample, test_target):
    '''
    Single-sample test: print the prediction for one feature vector.
    :param w_star: projection direction
    :param y0: segmentation threshold
    :param test_sample: feature vector to classify
    :param test_target: true label, for display
    '''
    y_proj = get_sample_projection(w_star, test_sample)
    prediction = 1
    if y_proj > y0:
        prediction = 0
    print("This x_vector's target is {}, and the prediction is {}".format(test_target, prediction))

Test functions: the single-sample test lets the user enter a new feature vector, and the model's prediction for it is printed.
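A usage sketch (assuming w_star and y0 have been computed as in the main routine below): classify the first held-out sample.

test_single_sample(w_star, y0, x_test[0], y_test[0])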

def test_single_sample_check(w_star, y0, test_sample, test_target):
    '''
    Single-sample test (for statistics): return whether the prediction
    matches the true label.
    '''
    y_proj = get_sample_projection(w_star, test_sample)
    prediction = 1
    if y_proj > y0:
        prediction = 0
    return test_target == prediction

This single-sample test function is used to collect statistics.

def test_check(w_star, y0):
    '''
    Evaluate the whole test set.
    :param w_star: projection direction
    :param y0: segmentation threshold
    :return: (number of test samples, number predicted correctly, accuracy)
    '''
    right_count = 0
    for i in range(x_test.shape[0]):
        if test_single_sample_check(w_star, y0, x_test[i], y_test[i]):
            right_count = right_count + 1
    return x_test.shape[0], right_count, right_count / x_test.shape[0]

The statistics over the test set are obtained by comparing each prediction with the actual label, yielding the number of test samples, the number of correctly predicted samples, and the accuracy.
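As an optional sanity check (not part of the original assignment), scikit-learn's LinearDiscriminantAnalysis solves the same two-class Fisher problem and can be used to cross-check the accuracy; a sketch:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def sklearn_baseline():
    lda = LinearDiscriminantAnalysis()
    lda.fit(x_train, y_train)
    return lda.score(x_test, y_test)  # mean accuracy on the test set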

- Algorithm implementation

if __name__ == "__main__":
    m0 = np.array(get_mean_vector(0)).reshape(-1, 1)
    m1 = np.array(get_mean_vector(1)).reshape(-1, 1)
    s0 = get_dispersion_matrix(0, m0)
    s1 = get_dispersion_matrix(1, m1)
    sw = s0 + s1
    sb = get_sample_divergence(m0, m1)
    w_star = np.array(get_w_star(sw, m0, m1)).reshape(-1, 1)
    y0 = get_segmentation_threshold(w_star, 0)
    print("The segmentation_threshold is ", y0)
    test_sum, right_sum, accuracy = test_check(w_star, y0)
    print("Total specimen number:{}\nNumber of correctly predicted samples:{}\nAccuracy:{}\n".format(test_sum, right_sum, accuracy))

From the split training set, the mean vectors and within-class dispersion matrices of the positive and negative classes are computed, along with the between-class dispersion matrix. The within-class dispersion matrices of the two classes are summed to obtain the total within-class dispersion matrix, from which the projection vector is derived. The original feature vectors are then projected onto a one-dimensional line along this vector, and a weighted threshold method is used to obtain the segmentation threshold. Finally, the test set is used to evaluate the quality of the threshold.

- Experimental results

Results were recorded for four configurations (screenshots of the program output omitted):
  • data standardization + modified weighted threshold
  • data standardization + ordinary weighted threshold
  • no data standardization + modified weighted threshold
  • no data standardization + ordinary weighted threshold

- Analysis of experimental conclusions

  • The area feature may have little effect on the disease, while standardization weakens the differences among the other features; this loss of feature contrast leads to a decline in prediction quality.
  • Without data standardization and without the modified weighted threshold, the model's prediction accuracy reaches 93.8596%, which exceeded the author's expectations for this model.
  • The modified weighted threshold function may work better when the samples are evenly distributed between the two classes, but this dataset may not be balanced. To check this, the following function was added:
def analysis_train_set():
    '''Analyze the class balance of the training set.'''
    train_positive_count = 0
    train_negative_count = 0
    sum_count = 0
    for i in range(x_train.shape[0]):
        if y_train[i] == 0:
            train_negative_count = train_negative_count + 1
        else:
            train_positive_count = train_positive_count + 1
        sum_count = sum_count + 1
    print("Train Set Analysis:\nTotal number: {}\nNumber of positive samples: {}\t"
          "Proportion of positive samples: {}\nNumber of negative samples: {}\t"
          "Proportion of negative samples: {}\nPositive and negative sample ratio: {}\n"
          .format(sum_count, train_positive_count, train_positive_count / sum_count,
                  train_negative_count, train_negative_count / sum_count,
                  train_positive_count / train_negative_count))

Observing the output, we find that in this dataset the proportion of positive samples reaches 63.7%, and the positive-to-negative ratio is 1.76, far exceeding the ratio threshold of 1.2. This biases the weighting of the modified weighted threshold function and greatly reduces prediction quality. Solving this would require modifying the dataset or doing data augmentation, which is not covered here.
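One simple remedy would be to randomly undersample the majority class in the training set before fitting; a sketch (undersample_majority is a hypothetical helper, not part of the code above):

def undersample_majority(x, y, seed=0):
    rng = np.random.default_rng(seed)
    idx0 = np.where(y == 0)[0]
    idx1 = np.where(y == 1)[0]
    n = min(len(idx0), len(idx1))
    # Keep equal-sized random subsets of both classes,
    # then shuffle the combined indices.
    keep = np.concatenate([rng.choice(idx0, n, replace=False),
                           rng.choice(idx1, n, replace=False)])
    rng.shuffle(keep)
    return x[keep], y[keep]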

Origin: blog.csdn.net/qq_44459787/article/details/109755888