Data mining algorithms and practices (20): general datasets in sklearn

As a data mining toolkit, sklearn provides not only algorithm implementations but also ready-to-use datasets through the sklearn.datasets module. There are three kinds of dataset API, to be chosen according to your needs: load, fetch, and generate. The load functions provide small, commonly used toy datasets; the fetch functions provide larger real-world datasets; and the generate functions produce customized synthetic datasets on demand;

Learning the general datasets in sklearn.datasets is best done before studying algorithms and models: once you can use them fluently, you can test and tune models quickly. For more detail, you can refer to the sklearn Chinese community article on the sklearn.datasets module;

1. load functions load small standard datasets (toy datasets)

load refers to the family of functions whose names start with "load_"; each returns a different dataset depending on its parameter settings. Setting return_X_y=True makes these functions return only the data and the labels, namely X and y, where X is the data matrix and y is the target values. For example, the following lines directly obtain the Boston house-price dataset and use it to train a model;

# Use load_boston() to get the Boston house-price dataset
import sklearn.datasets as dataset
import pandas as pd
from sklearn.linear_model import LinearRegression

X, y = dataset.load_boston(return_X_y=True)
datas = pd.DataFrame(X)
prices = pd.DataFrame(y)

modelOne = LinearRegression()
modelOne.fit(datas, prices)
modelOne.score(datas, prices)   # R^2 score on the training data
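
Note that scoring on the same data used for fitting gives an optimistic estimate. A minimal sketch of a fairer evaluation with sklearn's standard train_test_split utility (not part of the original example):

# Hold out a test set for a fairer score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # R^2 on unseen data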

The detailed documentation of load_boston()'s parameters and return types is shown below. With return_X_y=True, the function returns the data matrix of shape (506, 13) together with the separate target vector of shape (506,), instead of a Bunch object. The returned data is of ndarray type by default and can be converted to a pandas DataFrame manually. The docstring also records version notes and examples, as follows:

Parameters
----------
return_X_y : bool, default=False.
    If True, returns ``(data, target)`` instead of a Bunch object.
    See below for more information about the `data` and `target` object.

    .. versionadded:: 0.18

Returns
-------
data : :class:`~sklearn.utils.Bunch`
    Dictionary-like object, with the following attributes.

    data : ndarray of shape (506, 13)
        The data matrix.
    target : ndarray of shape (506, )
        The regression target.
    filename : str
        The physical location of boston csv dataset.

        .. versionadded:: 0.20

    DESCR : str
        The full description of the dataset.
    feature_names : ndarray
        The names of features

(data, target) : tuple if ``return_X_y`` is True

    .. versionadded:: 0.18

Notes
-----
    .. versionchanged:: 0.20
        Fixed a wrong data point at [445, 0].

Examples
--------
>>> from sklearn.datasets import load_boston
>>> X, y = load_boston(return_X_y=True)
>>> print(X.shape)
(506, 13)
File:      d:\anaconda3\lib\site-packages\sklearn\datasets\_base.py
Type:      function
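
When return_X_y is left at its default, load_boston() returns the Bunch object described above, whose attributes can be accessed directly. A short illustration based on the attributes listed in the docstring:

# Inspect the Bunch returned by the default call
boston = dataset.load_boston()
print(boston.data.shape)       # (506, 13)
print(boston.target.shape)     # (506,)
print(boston.feature_names)    # names of the 13 features
print(boston.DESCR[:200])      # start of the full dataset description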

The load functions cover the Boston, iris, wine, breast cancer datasets, and more:

Function                          Description
load_boston([return_X_y]) Load and return the boston house-prices dataset (regression).
load_iris([return_X_y]) Load and return the iris dataset (classification).
load_diabetes([return_X_y]) Load and return the diabetes dataset (regression).
load_digits([n_class, return_X_y]) Load and return the digits dataset (classification).
load_linnerud([return_X_y]) Load and return the linnerud dataset (multivariate regression).
load_wine([return_X_y]) Load and return the wine dataset (classification).
load_breast_cancer([return_X_y]) Load and return the breast cancer wisconsin dataset (classification).
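
Some load functions take extra parameters, as the table shows for load_digits with n_class. A short sketch of how that looks in practice:

# load_digits supports restricting the number of digit classes
from sklearn.datasets import load_digits
X_digits, y_digits = load_digits(n_class=10, return_X_y=True)
print(X_digits.shape)   # (1797, 64): 8x8 pixel images flattened to 64 features
print(set(y_digits))    # digit labels 0-9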

2. fetch functions provide larger real-world datasets

The fetch functions are used much like the load functions. The difference is that fetch datasets must be downloaded from the internet on first use and are then cached locally, whereas the load datasets ship with the sklearn package itself;

Function                                      Description
fetch_olivetti_faces([data_home, shuffle, …]) Load the Olivetti faces data-set from AT&T (classification).
fetch_20newsgroups([data_home, subset, …]) Load the filenames and data from the 20 newsgroups dataset (classification).
fetch_20newsgroups_vectorized([subset, …]) Load the 20 newsgroups dataset and vectorize it into token counts (classification).
fetch_lfw_people([data_home, funneled, …]) Load the Labeled Faces in the Wild (LFW) people dataset (classification).
fetch_lfw_pairs([subset, data_home, …]) Load the Labeled Faces in the Wild (LFW) pairs dataset (classification).
fetch_covtype([data_home, …]) Load the covertype dataset (classification).
fetch_rcv1([data_home, subset, …]) Load the RCV1 multilabel dataset (classification).
fetch_kddcup99([subset, data_home, shuffle, …]) Load the kddcup99 dataset (classification).
fetch_california_housing([data_home, …]) Load the California housing dataset (regression).
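
As a sketch of typical fetch usage: the first call downloads the data and caches it under data_home (by default ~/scikit_learn_data), and later calls load it from the cache:

# fetch_california_housing downloads on first use, then loads from cache
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
print(housing.data.shape)      # (20640, 8)
print(housing.feature_names)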

3. generate functions meet customized data requirements

load and fetch provide only a fixed set of datasets. In most scenarios where the data dimensionality, sparsity ratio, noise distribution, and so on need to be adjusted, the generate method (a family of functions whose names start with "make_") is used to meet customized data requirements. Take the make_blobs function for clustering data as an example: by default it generates 100 Gaussian-distributed data points (2 features, 3 clusters). make_moons generates two interleaving half circles and is often used to assess the strengths and weaknesses of clustering algorithms. Besides single-label data, the generate functions can also produce multi-label data; see the documentation and the sketch after the figures below;

# make_blobs produces a separable clustering dataset
from sklearn.datasets import make_blobs
import pandas as pd
import seaborn

X, y = make_blobs(random_state=1)
datas = pd.DataFrame(X)
clusters = pd.DataFrame(y)
seaborn.scatterplot(x=datas[0], y=datas[1])

# Cluster the points with k-means
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(datas)
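
The fitted cluster assignments are available as kmeans.labels_. A quick way to inspect them visually (an illustrative addition, not from the original post):

# Color the scatter plot by the clusters k-means found
seaborn.scatterplot(x=datas[0], y=datas[1], hue=kmeans.labels_)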

# make_moons generates two interleaving half circles
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
datas = pd.DataFrame(X)
clusters = pd.DataFrame(y)
seaborn.scatterplot(x=datas[0], y=datas[1])
[Figure: data generated by make_blobs]
[Figure: two half-moon clusters generated by make_moons]
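
For the multi-label case mentioned above, a minimal sketch using make_multilabel_classification (the parameter values here are illustrative):

# Generate a multi-label dataset: each sample can carry several labels
from sklearn.datasets import make_multilabel_classification
X_ml, Y_ml = make_multilabel_classification(n_samples=100, n_features=20,
                                            n_classes=5, random_state=0)
print(X_ml.shape)   # (100, 20)
print(Y_ml.shape)   # (100, 5): one indicator column per label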

Official documentation of make_blobs:

Signature: dataset.make_blobs(n_samples=100, n_features=2, *, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, return_centers=False)
Docstring:
Generate isotropic Gaussian blobs for clustering.

Read more in the :ref:`User Guide <sample_generators>`.

Parameters
----------
n_samples : int or array-like, optional (default=100)
    If int, it is the total number of points equally divided among
    clusters.
    If array-like, each element of the sequence indicates
    the number of samples per cluster.

    .. versionchanged:: v0.20
        one can now pass an array-like to the ``n_samples`` parameter

n_features : int, optional (default=2)
    The number of features for each sample.

centers : int or array of shape [n_centers, n_features], optional
    (default=None)
    The number of centers to generate, or the fixed center locations.
    If n_samples is an int and centers is None, 3 centers are generated.
    If n_samples is array-like, centers must be
    either None or an array of length equal to the length of n_samples.

cluster_std : float or sequence of floats, optional (default=1.0)
    The standard deviation of the clusters.

center_box : pair of floats (min, max), optional (default=(-10.0, 10.0))
    The bounding box for each cluster center when centers are
    generated at random.

shuffle : boolean, optional (default=True)
    Shuffle the samples.

random_state : int, RandomState instance, default=None
    Determines random number generation for dataset creation. Pass an int
    for reproducible output across multiple function calls.
    See :term:`Glossary <random_state>`.

return_centers : bool, optional (default=False)
    If True, then return the centers of each cluster

    .. versionadded:: 0.23

Returns
-------
X : array of shape [n_samples, n_features]
    The generated samples.

y : array of shape [n_samples]
    The integer labels for cluster membership of each sample.

centers : array, shape [n_centers, n_features]
    The centers of each cluster. Only returned if
    ``return_centers=True``.

Examples
--------
>>> from sklearn.datasets import make_blobs
>>> X, y = make_blobs(n_samples=10, centers=3, n_features=2,
...                   random_state=0)
>>> print(X.shape)
(10, 2)
>>> y
array([0, 0, 1, 0, 2, 2, 2, 1, 1, 0])
>>> X, y = make_blobs(n_samples=[3, 3, 4], centers=None, n_features=2,
...                   random_state=0)
>>> print(X.shape)
(10, 2)
>>> y
array([0, 1, 2, 0, 2, 2, 2, 1, 1, 0])

See also
--------
make_classification: a more intricate variant
File:      d:\anaconda3\lib\site-packages\sklearn\datasets\_samples_generator.py
Type:      function
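
The return_centers parameter documented above (added in 0.23) also exposes the true cluster centers, which is handy for checking a clustering result against the ground truth:

# return_centers=True additionally returns the true centers of the blobs
X, y, centers = make_blobs(n_samples=10, centers=3, n_features=2,
                           random_state=0, return_centers=True)
print(centers.shape)   # (3, 2): one center per cluster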

In addition to the three methods above, external data is most often read in with pandas' data-reading API;
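
A minimal sketch of that last route, assuming a hypothetical local file my_data.csv with a "label" column for the target:

# Read external data with pandas instead of sklearn.datasets
import pandas as pd
df = pd.read_csv("my_data.csv")    # hypothetical file name
X = df.drop(columns=["label"])     # assumes a "label" column holds the target
y = df["label"]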

Origin: blog.csdn.net/yezonggang/article/details/112762971