Python Machine Learning: The SMOTE Oversampling Algorithm

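SMOTE stands for Synthetic Minority Oversampling Technique. It is an improvement on random oversampling. Because random oversampling enlarges the minority class by simply duplicating existing samples, it tends to make the model overfit: what the model learns is too specific and not general enough. The basic idea of SMOTE is to analyze the minority-class samples and synthesize new samples from them, which are then added to the dataset. The algorithm proceeds as follows.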
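To make the contrast concrete, here is a minimal sketch of random oversampling in plain Python. The function name random_oversample, the toy points, and the fixed seed are illustrative assumptions and do not come from the original post. It shows that duplication raises the sample count without adding any new distinct point, which is the weakness SMOTE tries to address.

```python
import random


def random_oversample(minority, n_extra, seed=0):
    """Duplicate randomly chosen minority samples (sampling with replacement)."""
    rng = random.Random(seed)
    return minority + [rng.choice(minority) for _ in range(n_extra)]


# Hypothetical toy minority class with three 2-D points
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2)]
oversampled = random_oversample(minority, n_extra=6)

print(len(oversampled))       # 9 samples in total
print(len(set(oversampled)))  # never more than 3 distinct points
```

Every extra sample is an exact copy of an existing one, so a classifier sees the same points again with more weight.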

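1. For each sample x in the minority class, compute its Euclidean distance to every other sample in the minority-class set, sort the distances, and take its k nearest neighbours.

           For example, if the sample set holds 200,000 samples, compute the Euclidean distance from this sample to every sample in the set and sort in ascending order to obtain a list, list.

2. Set a sampling ratio according to the class imbalance to determine the sampling multiplier N. For each minority-class sample x, randomly select several samples from its k nearest neighbours; call a selected neighbour xn.

        For example, with 500 minority samples and 5,000,500 majority samples, balancing the classes means taking the first 10,000 entries of list, so that each minority sample generates 10,000 new samples (500 × 10,000 synthetic samples plus the 500 originals gives 5,000,500).

3. For each randomly selected neighbour xn, build a new sample from it and the original sample using the following formula: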

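                         x_{new} = x + \mathrm{rand}(0,1) \times (\tilde{x} - x)

where x_{new} is the newly generated sample, x is the selected minority sample, \tilde{x} is the chosen neighbour (xn above), rand(0,1) is a random number between 0 and 1, and \tilde{x} - x is the difference vector from x to that neighbour.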
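The three steps above can be sketched in dependency-free Python as follows. This is only an illustration under stated assumptions: the helper names k_nearest and smote_sketch, the toy points, and the fixed seed are hypothetical, and each sample is excluded from its own neighbour list. The first lines redo the multiplier arithmetic from the example in step 2.

```python
import math
import random

# Step 2 arithmetic from the example: 500 minority vs 5,000,500 majority samples
n_minority, n_majority = 500, 5_000_500
N = (n_majority - n_minority) // n_minority  # synthetic samples per minority sample
print(N)  # 10000


def k_nearest(minority, i, k):
    """Step 1: indices of the k minority samples closest to minority[i]."""
    others = [j for j in range(len(minority)) if j != i]
    others.sort(key=lambda j: math.dist(minority[i], minority[j]))
    return others[:k]


def smote_sketch(minority, n_per_sample, k, seed=0):
    """Steps 2 and 3: interpolate between each sample and random neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        neighbors = k_nearest(minority, i, k)
        for _ in range(n_per_sample):
            xn = minority[rng.choice(neighbors)]  # randomly chosen neighbour
            gap = rng.random()                    # rand(0, 1)
            # x_new = x + rand(0, 1) * (xn - x), applied per coordinate
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, xn)))
    return synthetic


# Hypothetical toy minority class: the corners of the unit square
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote_sketch(minority, n_per_sample=3, k=2)
print(len(synthetic))  # 12
```

Because every new point lies on the segment between a sample and one of its neighbours, the synthetic points here all stay inside the unit square spanned by the originals.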

The implementation and a usage example follow:


from sklearn.neighbors import NearestNeighbors
import numpy as np


class Smote:
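    """
    SMOTE oversampling algorithm.

    Parameters
    ----------
    k : int
        Number of nearest neighbours to consider.
    sampling_rate : int
        Number of synthetic samples generated per minority sample.
        Intended to satisfy sampling_rate < k; not enforced by this class.

    Attributes
    ----------
    newindex : int
        Index of the next synthetic sample to be written.
    """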
    def __init__(self, sampling_rate=5, k=5):
        self.sampling_rate = sampling_rate
        self.k = k
        self.newindex = 0

    def fit(self, X, y=None):
        if y is not None:
            negative_X = X[y == 0]
            X = X[y == 1]

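        n_samples, n_features = X.shape
        # Reset the write position so that fit can be called more than once
        self.newindex = 0
        # Allocate the matrix that will hold the synthetic samples
        self.synthetic = np.zeros((n_samples * self.sampling_rate, n_features))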

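        # For every minority sample in X, find its k nearest neighbours within X.
        # One extra neighbour is requested because each sample's closest match is itself.
        knn = NearestNeighbors(n_neighbors=min(self.k + 1, len(X))).fit(X)
        for i in range(len(X)):
            # Drop the first hit (normally the sample itself) and keep the rest
            k_neighbors = knn.kneighbors(X[i].reshape(1, -1), return_distance=False)[0][1:]
            # Generate sampling_rate new samples from this sample's neighbours
            self.synthetic_samples(X, i, k_neighbors)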

        if y is not None:
            return (np.concatenate((self.synthetic, X, negative_X), axis=0),
                    np.concatenate(([1] * (len(self.synthetic) + len(X)), y[y == 0]), axis=0))

        return np.concatenate((self.synthetic, X), axis=0)

    # For one minority sample X[i], generate sampling_rate new samples from its k nearest neighbours
    def synthetic_samples(self, X, i, k_neighbors):
        for j in range(self.sampling_rate):
            # Randomly pick one of the k nearest neighbours
            neighbor = np.random.choice(k_neighbors)
            # Difference vector between the chosen neighbour and X[i]
            diff = X[neighbor] - X[i]
            # Interpolate a new sample at a random position between X[i] and the neighbour
            self.synthetic[self.newindex] = X[i] + np.random.rand() * diff
            self.newindex += 1

# ------ Oversample to obtain 284302 rows with Class == 1
import pandas as pd
import csv

df = pd.read_csv('creditcard_data.csv')
# Keep only the minority-class rows (Class == 1) as a NumPy array
X = df[df['Class'] == 1].to_numpy()
smote = Smote(sampling_rate=588, k=483)

data1 = smote.fit(X).tolist()

# Save as CSV
# newline='' prevents a blank line between rows
with open('data_Over sampling.csv', 'w', newline='') as csvFile2:
    writer = csv.writer(csvFile2)
    writer.writerows(data1)


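# Alternative: write the same rows through pandas (a NumPy array has no to_csv method)
arr1 = np.array(data1)
pd.DataFrame(arr1).to_csv('data_Over sampling.csv', index=False, header=False)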


# ---------------------------
# Extract the rows with Class == 1
import csv

data = df[df['Class'] == 1].to_numpy().tolist()
# newline='' prevents a blank line between rows
with open('csvFile2.csv', 'w', newline='') as csvFile2:
    writer = csv.writer(csvFile2)
    writer.writerows(data)

import csv
import pandas as pd

df = pd.read_csv('creditcard_data.csv')
# Take the first 284487 rows with Class == 0
data = df[df['Class'] == 0].head(284487).to_numpy().tolist()
# Append to over.csv; newline='' prevents a blank line between rows
with open('over.csv', 'a', newline='') as csvFile2:
    writer = csv.writer(csvFile2)
    writer.writerows(data)
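As a quick sanity check on the numbers used above, the following sketch works out how many rows Smote.fit returns when it is called without labels. It assumes the minority class has 483 rows, which is only a guess suggested by k=483; the post never states the count. The helper name rows_after_smote is hypothetical.

```python
def rows_after_smote(n_minority, sampling_rate):
    """Rows returned by Smote.fit(X) without labels: synthetic rows plus originals."""
    n_synthetic = n_minority * sampling_rate
    return n_synthetic + n_minority


# Assumption: 483 minority rows (not stated in the post, inferred from k=483)
total = rows_after_smote(n_minority=483, sampling_rate=588)
print(total)  # 284487
```

Under that assumption the oversampled minority class has the same number of rows as the 284487 majority rows collected in the last script, so the two files would be balanced.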



Reposted from blog.csdn.net/qq_41686130/article/details/87103856