sklearn datasets: the iris dataset

Reference book: Python Machine Learning Basic Tutorial (Introduction to Machine Learning with Python)

1. A first look at the data

The Iris data set is a classic data set in machine learning and statistics. It is included in the datasets module of scikit-learn.

We can call the load_iris function to load the data:

from sklearn.datasets import load_iris
iris_dataset = load_iris()

The iris_dataset object returned by load_iris is a Bunch object, which behaves much like a dictionary, containing keys and values:

# View the dataset's keys
print('Keys of iris_dataset: \n{}'.format(iris_dataset.keys()))

output:

Keys of iris_dataset: 
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

As you can see, the dataset has many keys.

The value corresponding to the DESCR key is a brief description of the dataset. Let's look at the beginning of it:

print(iris_dataset['DESCR'][:193] + "\n...")

output:

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, pre
...

From this description we can see that the dataset contains 150 samples split evenly across three classes (50 per class), and that each sample has four features.

The value corresponding to the target_names key is a string array, which contains the species of flowers we want to predict:

print("Target names: {}".format(iris_dataset['target_names']))

output:

Target names: ['setosa' 'versicolor' 'virginica']

So the iris dataset contains three species of iris: Iris setosa, Iris versicolor, and Iris virginica.

The value corresponding to the feature_names key is a list of strings describing each feature:

print("Feature names: \n{}".format(iris_dataset['feature_names']))

output:

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

So each sample has four features: sepal length, sepal width, petal length, and petal width (all in centimeters).

The data itself lives in the data and target fields. data holds the measurements of sepal length, sepal width, petal length, and petal width as a NumPy array:

print("Type of data: {}".format(type(iris_dataset['data'])))

output:

Type of data: <class 'numpy.ndarray'>

Each row of the data array corresponds to a flower, and the columns represent four measurements for each flower:

print("Shape of data: {}".format(iris_dataset['data'].shape))

output:

Shape of data: (150, 4)

So the data array has 150 rows, one per flower (150 samples), and 4 columns, one per feature.

Let's look at the first five samples:

print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))

output:

First five rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

The first row describes the first flower: sepal length 5.1 cm, sepal width 3.5 cm, petal length 1.4 cm, and petal width 0.2 cm.
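To make that correspondence explicit, here is a small illustrative snippet pairing each feature name with the first row's values:

# Pair each feature name with the first sample's measurements
for name, value in zip(iris_dataset['feature_names'], iris_dataset['data'][0]):
    print('{}: {}'.format(name, value))

output:

sepal length (cm): 5.1
sepal width (cm): 3.5
petal length (cm): 1.4
petal width (cm): 0.2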

The target array contains the species of each measured flower. It is a one-dimensional NumPy array with one entry per flower:

print("Shape of target: {}".format(iris_dataset['target'].shape))

output:

Shape of target: (150,)

As the target_names key showed, there are three species; in the target array they are encoded as the integers 0, 1, and 2:

print("Target:\n{}".format(iris_dataset['target']))

output:

Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

We can also see that the iris data is not shuffled: the first 50 samples belong to one class (setosa), the middle 50 to another (versicolor), and the last 50 to a third (virginica).
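A quick way to confirm both the class sizes and the label-to-name mapping is with NumPy (a small sketch):

import numpy as np
# Count how many samples carry each label: prints [50 50 50]
print(np.bincount(iris_dataset['target']))
# Index target_names with a label to recover the species name, e.g. for the first sample
print(iris_dataset['target_names'][iris_dataset['target'][0]])  # setosa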

2. Splitting the data

Typically, we cannot evaluate a model on the same data used to build it. Because the model has effectively memorized the entire training set, it will always predict the correct label for any point in that set, and this "memory" tells us nothing about how well the model generalizes (in other words, whether it also predicts correctly on new data). So we need new data to evaluate the model's performance, and the usual way to get it is to split the dataset into a training set and a test set.

The train_test_split function in scikit-learn can shuffle the dataset and split it:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
        iris_dataset['data'], iris_dataset['target'], test_size=0.25, random_state=0)

The first argument to train_test_split is the data to split and the second is the corresponding labels. The test_size parameter sets the size of the test set and defaults to 0.25. random_state is the random seed; fixing it makes every split produce the same result. By default, train_test_split shuffles the data before splitting. The function returns four arrays: training data, test data, training labels, and test labels.

We can look at the shape of the training and test sets:

print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

output:

X_train shape: (112, 4)
y_train shape: (112,)
X_test shape: (38, 4)
y_test shape: (38,)
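Because the split shuffles the data, the class proportions in the two sets are close to, but not exactly, equal. If you need the proportions preserved exactly, train_test_split also accepts a stratify parameter; a small sketch:

import numpy as np
# Class counts after the default shuffled split
print(np.bincount(y_train), np.bincount(y_test))
# Stratify on the labels so each class keeps the same proportion in both sets
X_train, X_test, y_train, y_test = train_test_split(
        iris_dataset['data'], iris_dataset['target'],
        test_size=0.25, random_state=0, stratify=iris_dataset['target'])
print(np.bincount(y_train), np.bincount(y_test))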

3. Visualizing the data

A pandas DataFrame lets us view the data in an Excel-like table. We can convert the iris data from NumPy format to a DataFrame:

import pandas as pd
# Use the iris feature names as the column names
df = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
# Add a column named label containing the label data
df['label'] = iris_dataset.target
df
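Once the data is in a DataFrame, the usual pandas inspection methods apply, for example:

# First five rows in tabular form
print(df.head())
# Summary statistics (count, mean, std, quartiles) for every column
print(df.describe())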

Next, let's visualize the data with graphs, but there is a catch: each iris sample has four features, and datasets with more than three features are hard to plot directly. One way around this is to draw a scatterplot matrix (pair plot), which shows all features in pairs.

pandas has a function for drawing scatterplot matrices called scatter_matrix. The diagonal of the matrix shows a histogram of each feature, and the off-diagonal positions are scatterplots of each pair of features:

# Convert the data to DataFrame format, labelling the columns
# with the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
# Create a scatterplot matrix from the DataFrame, colored by the target labels
grr = pd.plotting.scatter_matrix(iris_dataframe, c=iris_dataset.target,
                                 figsize=(10, 10), marker='o',
                                 hist_kwds={'bins': 50}, s=50, alpha=.8)

The c parameter colors the data points by label, so points from different classes get different colors; figsize sets the figure size to 10×10; marker='o' draws each data point as a dot; hist_kwds passes keyword arguments to the diagonal histograms (here, 50 bins); s sets the area of each point; and alpha sets the points' transparency.
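One caveat: in a plain Python script (outside a Jupyter notebook) the figure will not appear on its own; a matplotlib show call is needed:

import matplotlib.pyplot as plt
plt.show()  # render the scatterplot matrix when running as a script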

4. Supplement: other datasets

As we saw above, the iris dataset is a three-class dataset. sklearn also ships with some other commonly used datasets:

Wisconsin Breast Cancer Dataset

The Wisconsin breast cancer dataset (cancer for short) records clinical measurements of breast cancer tumors.

Each tumor is labeled "benign" (a harmless tumor) or "malignant" (a cancerous tumor), and the task is to learn to predict whether a tumor is malignant based on the tissue measurements.

from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print("cancer.keys(): \n{}".format(cancer.keys()))

output:

cancer.keys():
dict_keys(['feature_names', 'data', 'DESCR', 'target', 'target_names'])

Datasets included in scikit-learn are usually stored as Bunch objects, which contain the actual data plus some information about the dataset. All you need to know about a Bunch object is that it is very similar to a dictionary, with the bonus that you can access values with dot notation (for example, bunch.key instead of bunch['key']).
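For example, the two access styles return the same object:

# Dot access and dictionary access on a Bunch are interchangeable
print(cancer.target_names)      # ['malignant' 'benign']
print(cancer['target_names'])   # same array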

Dataset information:

  • This is a binary classification dataset
  • This dataset contains a total of 569 data points, each with 30 features
  • Of the 569 data points, 212 were flagged as malignant and 357 as benign
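The counts in the last item can be verified directly from the target array (a small sketch using NumPy):

import numpy as np
# Count samples per class and pair the counts with the class names
print({n: v for n, v in zip(cancer.target_names, np.bincount(cancer.target))})
# {'malignant': 212, 'benign': 357}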

Boston House Price Dataset

The task associated with this dataset is to predict the median house price in the Boston area in the 1970s using information such as crime rate, proximity to the Charles River, and road accessibility.

from sklearn.datasets import load_boston
boston = load_boston()

Dataset information:

  • This is a regression dataset
  • This dataset contains 506 data points and 13 features
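One caveat: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2 (over ethical concerns with one of its features), so the snippet above only runs on older versions. On newer releases the same data can still be fetched from OpenML; a sketch, assuming the OpenML dataset name "boston":

from sklearn.datasets import fetch_openml
# Fetch the Boston housing data from OpenML instead of using the removed loader
boston = fetch_openml(name="boston", version=1, as_frame=True)
print(boston.data.shape)  # expected: (506, 13)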


Source: blog.csdn.net/lyb06/article/details/130021321