Generating datasets with sklearn

sklearn.datasets provides several functions for generating synthetic datasets.

1. Generating normally distributed cluster data

sklearn.datasets.make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None)

Parameters
n_samples: int, optional (default=100). The total number of points, divided equally among clusters.

n_features: int, optional (default=2). The number of features (dimensions) for each sample.

centers: int or array of shape [n_centers, n_features], optional (default=3). The number of centers to generate (that is, the number of classes), or the fixed center locations.

cluster_std: float or sequence of floats, optional (default=1.0). The standard deviation of the clusters (not the variance). For example, to generate two classes where one is more spread out than the other, set cluster_std to [1.0, 3.0].

center_box: pair of floats (min, max), optional (default=(-10.0, 10.0)). The bounding box for each cluster center when centers are generated at random.

shuffle: boolean, optional (default=True). Whether to shuffle the samples.

random_state: int, RandomState instance or None, optional (default=None). If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random. In practice it plays the role of a random seed.

Returns
X: array of shape [n_samples, n_features]. The generated samples.

y: array of shape [n_samples]. The integer labels for cluster membership of each sample.
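As a quick check of the return values and of random_state described above, here is a minimal sketch. The seed value 0 is an arbitrary choice for illustration; any fixed integer should behave the same way.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Calling make_blobs twice with the same integer seed should give identical data.
X1, y1 = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
X2, y2 = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)

print(X1.shape, y1.shape)        # shapes of the samples and of the labels
print(np.array_equal(X1, X2))    # whether the two calls produced the same samples
print(np.bincount(y1))           # number of samples assigned to each cluster
```

Fixing the seed is useful when a figure or an experiment needs to be reproduced; leaving random_state=None gives different data on each run.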

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

plt.figure(figsize=(12, 12))
seed = np.random.randint(int(1e9))
x, y = make_blobs(n_samples=1500, n_features=2, centers=5, random_state=seed)

print(type(x), 'length(X)=', len(x),'='*2, type(y), 'length(y)=', len(y))
plt.scatter(x[:,0], x[:,1], c=y, s=6)
plt.show()

<class 'numpy.ndarray'> length(X)= 1500 == <class 'numpy.ndarray'> length(y)= 1500

[Figure: scatter plot of the make_blobs output]
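The centers and cluster_std parameters also accept explicit values. The sketch below passes fixed center locations and a per-cluster standard deviation, then checks the empirical mean and spread of each cluster. The center coordinates and the seed are hypothetical values picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Hypothetical fixed centers: one row per cluster, one column per feature.
centers = np.array([[-5.0, 0.0], [5.0, 0.0]])

# One standard deviation per cluster: the second cluster is more spread out.
X, y = make_blobs(n_samples=2000, centers=centers,
                  cluster_std=[1.0, 3.0], random_state=42)

for label in (0, 1):
    pts = X[y == label]
    # The empirical mean should be near the given center,
    # and the empirical std near the requested cluster_std.
    print(label, pts.mean(axis=0).round(2), pts.std(axis=0).round(2))
```

When centers is an array, the number of features is taken from its second dimension, so n_features does not need to be passed.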

2. Generating concentric-circle sample points

datasets.make_circles(n_samples=100, shuffle=True, noise=None, random_state=None, factor=0.8)

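Parameters
n_samples: the number of sample points
noise: the standard deviation of the Gaussian noise added to the points, which controls how far points drift from their circle
factor: the scale factor between the inner and outer circle; the larger it is, the closer the two circles are, and it must stay below 1
Returns: same as above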

import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

x, y = make_circles(n_samples=15000, shuffle=True, 
                    noise=0.03, random_state=None, factor=0.6)
plt.scatter(x[:,0], x[:,1], c=y, s=7)
plt.savefig('circles.png')
plt.show()

[Figure: scatter plot of the make_circles output]
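To see what factor does, the following sketch turns noise off and measures the distance of every point from the origin. Without noise, the outer circle (label 0) should have radius 1 and the inner circle (label 1) should have radius equal to factor. The sample count and seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles

factor = 0.6
# noise=None places every point exactly on its circle.
X, y = make_circles(n_samples=200, noise=None, factor=factor, random_state=0)

# Distance of each point from the origin.
radius = np.linalg.norm(X, axis=1)

print(radius[y == 0].mean())  # mean radius of the outer circle
print(radius[y == 1].mean())  # mean radius of the inner circle
```

Adding noise (for example noise=0.03 as in the example above) blurs both rings around these radii.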

3. Generating a simulated classification dataset

datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)

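Parameters
n_samples: the number of sample points to generate
n_features: the total number of features; it is made up of n_informative informative features, n_redundant redundant features, n_repeated repeated features, and the remainder filled with random noise
n_informative: the number of features that actually carry class information
n_classes: the number of classes in the generated data
Returns: same as above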

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

x,y = make_classification(n_samples=500, n_features=20, n_informative=2,
                          n_redundant=2, n_repeated=0, n_classes=2,
                          n_clusters_per_class=2, weights=None,
                          flip_y=0.01, class_sep=1.0, hypercube=True,
                          shift=0.0, scale=1.0, shuffle=True, random_state=None)
plt.scatter(x[:,0], x[:,1], c=y, s=7) # 20 feature dimensions in total; only the first two columns are plotted here for demonstration
plt.savefig('make_classification.png')
plt.show()

[Figure: scatter plot of the make_classification output]
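With shuffle=True the feature columns are shuffled, so the first two columns in the plot above are not necessarily the informative ones. A minimal sketch, assuming shuffle=False so that the columns keep their documented order (informative features first, then redundant ones), shows how to pick out the informative columns and confirm that the redundant columns are linear combinations of them. The seed is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification

# shuffle=False keeps the column order: informative, redundant, repeated, noise.
# It also leaves the samples unshuffled, so the labels come out grouped by class.
X, y = make_classification(n_samples=500, n_features=20, n_informative=2,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           shuffle=False, random_state=0)

informative = X[:, :2]   # the two informative features
redundant = X[:, 2:4]    # linear combinations of the informative features

# Fit the redundant columns from the informative ones by least squares.
coef, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
print(np.allclose(informative @ coef, redundant))  # whether the fit is exact
print(np.bincount(y))                              # samples per class
```

Plotting informative[:, 0] against informative[:, 1] usually gives a clearer picture of the class structure than two arbitrary columns.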

4. Generating non-convex, taiji-shaped (two interleaving half-moons) sample points

datasets.make_moons(n_samples,shuffle,noise,random_state)

Parameters: essentially the same as above
Returns: same as above

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

x,y = make_moons(n_samples=1500, shuffle=True,
                 noise=0.06, random_state=None)
plt.scatter(x[:,0], x[:,1], c=y, s=7)
plt.savefig('moons.png')
plt.show()

[Figure: scatter plot of the make_moons output]
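The shape of the two moons can be checked numerically. In the sketch below noise is turned off, in which case the points with label 0 lie on a unit half-circle centered at (0, 0) and the points with label 1 lie on a unit half-circle centered at (1, 0.5). The sample count and seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons

# noise=None places every point exactly on its half-circle.
X, y = make_moons(n_samples=200, noise=None, random_state=0)

upper = X[y == 0]   # upper moon
lower = X[y == 1]   # lower moon, shifted right and down

# Distance of each point from the center of its own half-circle.
r_upper = np.linalg.norm(upper - np.array([0.0, 0.0]), axis=1)
r_lower = np.linalg.norm(lower - np.array([1.0, 0.5]), axis=1)

print(r_upper.min(), r_upper.max())
print(r_lower.min(), r_lower.max())
```

Because the two half-circles interleave, no straight line separates the classes, which is why this dataset is often used to try out non-linear classifiers and clustering methods.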