Getting started with machine learning: scikit-learn

  1. Overview
  2. Sample datasets and simulated data generation functions
  3. Classification
  4. Regression analysis
  5. Clustering
  6. Model evaluation and optimization
  7. Data preprocessing

Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) are three popular terms at present. What is the difference between them?

Artificial intelligence refers to using computers to do intelligent work that previously only humans could do. The goal is for machines to think and solve problems like humans, and even to do better than humans.

Machine learning is a branch of artificial intelligence, also known as statistical machine learning. Its basic idea is to build a statistical model based on data, and use the model to analyze and predict the data.

Deep learning is a machine learning method based on multi-layer neural networks; it is a particular way of implementing machine learning.
[Figure: the relationship between artificial intelligence, machine learning, and deep learning]

Machine learning models are divided into regression models and classification models.
A regression model produces continuous values, such as predicting tomorrow's temperature.
A classification model produces discrete values, such as predicting tomorrow's weather (cloudy, sunny, rainy).
A regression model can be converted into a classification model, for example by judging whether it is high-temperature weather according to the predicted temperature; conversely, a classification problem can be handled with a regression model, for example by predicting the probability that an event occurs.

The process of machine learning can be summarized as:

  • Organize data, that is, process the data as required and convert it into a specific format.
  • Build a model, that is, apply a learning algorithm to the sample data to obtain a predictive model.
  • Predict results, that is, predict the results for input data according to the model, and further evaluate and adjust the model according to the predictions.

Machine learning methods can be divided into two categories: supervised learning and unsupervised learning.
Supervised learning has training samples: the algorithm is trained on the feature (attribute) information and result information of the training samples, and the trained algorithm is then used to classify or regress new input data.
Unsupervised learning has no training samples: the input samples are grouped directly according to their feature (attribute) information. The most common form of unsupervised learning is clustering. A short code contrast between the two appears at the end of section 1.2.

1.2 scikit-learn

scikit-learn is a Python package for machine learning. It is built on top of NumPy, SciPy, and Matplotlib, and originated as a toolkit (scikit) for SciPy.
[Figure: the scikit-learn official website (http://scikit-learn.org/stable/)]
scikit-learn has 6 main functions:

  • Classification
  • Regression
  • Clustering
  • Dimensionality reduction
  • Model selection
  • Preprocessing

scikit-learn includes a large number of models. For classification alone it offers decision trees, SVM, KNN, naive Bayes, random forests, AdaBoost, GradientBoosting, Bagging, ExtraTrees, and more.
Users need to choose an appropriate model according to the data characteristics and the task goal.
[Figure: model selection based on data characteristics and task goals]

The import name of the scikit-learn package is sklearn, and the package contains many modules (subpackages). In practice, the needed components are usually imported from the corresponding module, for example: from sklearn.cluster import DBSCAN imports DBSCAN from the cluster subpackage.
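
As a concrete illustration of the supervised/unsupervised distinction described in the overview, here is a minimal sketch following this import convention (KNeighborsClassifier and KMeans are only representative models, and the toy data is made up):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# toy data: four samples with two features each
X = [[0, 0], [0, 1], [10, 10], [10, 11]]
y = [0, 0, 1, 1]   # labels, used only by the supervised model

# supervised: fit() receives both features and labels
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn.predict([[9, 9]]))   # predicts class 1

# unsupervised: fit() receives features only; groups are found from the data
km = KMeans(n_clusters=2, random_state=0).fit(X)
print(km.labels_)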

2. Sample datasets and simulated data generation functions

The scikit-learn package contains some sample datasets for machine learning experiments, including:

  • the iris dataset
  • the digits (handwritten digits) dataset
  • the Boston house price dataset
  • the diabetes dataset
  • ...
The datasets module provides functions to load these datasets, such as load_iris(), load_digits(), and load_boston().

The iris dataset contains 150 samples in total. Each sample records 4 features of an iris flower (sepal length and width, petal length and width) and the corresponding category (three categories in all). There are 50 samples per category, recorded in order of category. The dataset is used for classification.
The load_iris() function of the datasets module returns the iris dataset object, which has the following attributes:
DESCR, which returns the description of the dataset.
data, which returns a 150×4 two-dimensional array recording the 4 feature values of each sample.
target, which returns a one-dimensional array of 150 elements recording the corresponding category (0, 1, or 2).


import pandas as pd
from sklearn import datasets

# load the iris dataset and view it as a DataFrame
iris = datasets.load_iris()
X = iris.data      # 150x4 feature array
y = iris.target    # 150-element category array
frame = pd.DataFrame(X)
frame['class'] = y
frame

The digits (handwritten digits) dataset includes 1797 grayscale images of 8×8 pixels. Each image shows a handwritten digit; the actual digit of each image is its classification category, and the value of each pixel (raster cell) of an image serves as a feature. This dataset is used for classification.
The load_digits() function of the datasets module returns the digits dataset object, which has the following attributes:
DESCR, which returns the description of the dataset.
images, which returns a 1797×8×8 array; each element of the array is one image.
data, which returns a 1797×64 two-dimensional array; each row records the raster values of one 8×8 image.
target, which returns a one-dimensional array of 1797 elements recording the digit (image category) that each image actually represents.

from sklearn import datasets

# load the digits dataset and inspect the array shapes
digits = datasets.load_digits()
images = digits.images   # (1797, 8, 8)
data = digits.data       # (1797, 64)
target = digits.target   # (1797,)
print(images.shape)
print(data.shape)
print(target.shape)

Output:
(1797, 8, 8)
(1797, 64)
(1797,)

%matplotlib inline
from sklearn import datasets
import matplotlib.pyplot as plt

# display the first ten digit images and print their categories
digits = datasets.load_digits()
fig = plt.figure()
for i in range(1, 11):
    ax = fig.add_subplot(3, 4, i)          # 3x4 grid of subplots
    image = digits.images[i - 1]
    ax.imshow(image, cmap=plt.cm.gray_r)   # grayscale, dark digits on light
    print(digits.target[i - 1])

[Figure: the first ten handwritten digit images]
The Boston housing price dataset records the average housing price and region-related data for 506 regions (13 values per region, such as per-capita crime rate, proportion of residential land, proportion of non-retail business, etc.). The data is used for regression analysis: the regional average housing price is the dependent variable, and the region-related data are the factors (features) that affect housing prices.

The load_boston() function of the datasets module returns the boston dataset object (note: this function was deprecated in recent scikit-learn versions and removed in version 1.2). The object has the following attributes:
DESCR, which returns the description of the dataset.
data, which returns a 506×13 two-dimensional array; each row records the feature values of one region.
target, which returns a one-dimensional array of 506 elements recording the housing price of each region.
feature_names, which returns a one-dimensional array of feature names.

import pandas as pd
from sklearn import datasets

# load the Boston dataset and view it as a DataFrame
# (requires scikit-learn < 1.2, where load_boston is still available)
boston = datasets.load_boston()
frame = pd.DataFrame(boston.data,
                     columns=boston.feature_names)
frame['MEDV'] = boston.target   # MEDV: median house value, the target
frame


2.2 Simulated data generation functions

The scikit-learn package also provides functions for generating simulated data, including:

  • data for classification
  • data for regression
  • data for clustering
  • ...
The make_classification() function generates classification data. The function returns a tuple: the first element is an m×n two-dimensional array (m is the number of samples, n is the number of features) representing the data; the second element is a one-dimensional array of m elements representing the target (classification category). The number of classification categories k is less than the number of samples m.

In real classification problems, not all features of the data are related to the classification category; features unrelated to the classification are called noise. Furthermore, there can be duplication and correlation among features. All of these characteristics can be simulated when generating data.
The sum of n_informative, n_redundant, and n_repeated is the number of non-noise features; the remaining features are noise.
n_classes * n_clusters_per_class must be less than or equal to 2**n_informative.

import pandas as pd
from sklearn import datasets

# generate 10 samples with 5 features and 2 classes
dataset = datasets.make_classification(n_samples=10,
                                       n_features=5,
                                       n_classes=2,
                                       random_state=1)
frame = pd.DataFrame(dataset[0])   # dataset[0]: features
frame['class'] = dataset[1]        # dataset[1]: categories
frame


import pandas as pd
from sklearn import datasets

# control the feature composition: no redundant features,
# one repeated feature, and a single cluster per class
dataset = datasets.make_classification(n_samples=10,
                                       n_features=5,
                                       n_redundant=0,
                                       n_repeated=1,
                                       n_classes=2,
                                       n_clusters_per_class=1,
                                       random_state=1)
frame = pd.DataFrame(dataset[0])
frame['class'] = dataset[1]
frame

The default value of the random_state parameter is None, which means the generated data differ on every call. To obtain the same result each time, set random_state to a fixed value, such as random_state=1.
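
A quick check of this behavior (a small sketch; the parameter values are arbitrary): two calls with the same random_state return identical arrays.

import numpy as np
from sklearn import datasets

# two calls with the same seed produce exactly the same data
X1, y1 = datasets.make_classification(n_samples=5, random_state=1)
X2, y2 = datasets.make_classification(n_samples=5, random_state=1)
print(np.array_equal(X1, X2))   # True
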
The make_regression() function generates regression data. The function returns a tuple: the first element is an m×n two-dimensional array (m is the number of samples, n is the number of features) representing the data; the second element is a one-dimensional array of m elements, or an m×k two-dimensional array (k is the number of targets), representing the target values.
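
A minimal sketch of make_regression, in the same style as the classification examples above (the parameter values are arbitrary):

import pandas as pd
from sklearn import datasets

# 10 samples, 3 features, one target, with some noise added
dataset = datasets.make_regression(n_samples=10,
                                   n_features=3,
                                   noise=1.0,
                                   random_state=1)
frame = pd.DataFrame(dataset[0])   # dataset[0]: features
frame['target'] = dataset[1]       # dataset[1]: target values
frame
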
The make_blobs() function generates clustering data. The number of clusters or the position of each cluster center can be set through the centers parameter, and the standard deviation of the clusters can be set through the cluster_std parameter.
The function returns a tuple: the first element is an m×n two-dimensional array (m is the number of samples, n is the number of features) representing the data; the second element is a one-dimensional array of m elements indicating the cluster number of each sample.

# sklearn.datasets.samples_generator was removed in newer versions;
# make_blobs is imported directly from sklearn.datasets
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt

# generate 750 points in (by default) 3 clusters and plot them
dataset = make_blobs(n_samples=750, cluster_std=0.5, random_state=0)
x = dataset[0][:, 0]
y = dataset[0][:, 1]
plt.scatter(x, y, color='yellowgreen', marker='.')

[Figure: scatter plot of the 750 generated sample points]
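
To also show the centers and cluster_std parameters described above, a variant of the example with three explicit cluster centers (the center coordinates are arbitrary):

from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt

# three clusters around explicit centers, all with standard deviation 0.8
centers = [[0, 0], [5, 5], [0, 5]]
dataset = make_blobs(n_samples=300, centers=centers,
                     cluster_std=0.8, random_state=0)
plt.scatter(dataset[0][:, 0], dataset[0][:, 1],
            c=dataset[1], marker='.')   # color points by cluster number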

3. Classification

Classification establishes a classification model through a learning method and then uses the model to classify input data; this is also known as supervised classification.
Classification can be multi-class, such as land use/land cover classification using remote sensing data, or binary, such as object detection (yes or no).

The scikit-learn package provides multiple classes for supervised classification, including support vector machine classification, K-nearest neighbor classification, decision tree classification, etc.

[Table: classes provided by the scikit-learn package for supervised classification]
The steps are the same for all supervised classifiers:

  • Use the constructor of the class to instantiate a classification object, such as SVC = svm.SVC(). Each classifier has its own keyword parameters for instantiation.
  • Train the classification model with the training samples, using the fit method of the model object, such as SVC.fit(X, y). X holds the feature values of the training samples and is an m×n two-dimensional array (m is the number of training samples, n is the number of features); y holds the category values of the training samples and is a one-dimensional array of m elements.
  • Use the trained classification model object to predict the categories of other samples, such as SVC.predict(X), where X holds the feature values of the samples to be classified and is likewise an m×n two-dimensional array (m is the number of samples to be classified, n is the number of features). The prediction returns a one-dimensional array of m elements, each element being a predicted category. A complete sketch follows below.
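
Putting the three steps together, a minimal sketch using SVC on the iris dataset (the train/test split via train_test_split and the chosen test size are illustrative additions):

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# hold out part of the samples to check the trained model
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)

# step 1: instantiate a classification object
SVC = svm.SVC()
# step 2: train the model with the training samples
SVC.fit(X_train, y_train)
# step 3: predict the categories of other samples
print(SVC.predict(X_test[:5]))
print(SVC.score(X_test, y_test))   # fraction of correct predictions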
