Machine Learning: Sklearn Study Notes (General Overview)

Preface

Sklearn's official documentation is quite detailed, but it is not very beginner-friendly.
This series of study notes is compiled with reference to the official Scikit-learn documentation. The structure is kept basically unchanged, and little content is removed (only parts that are overly detailed or off-topic), so that the notes can be used for later reference and review.


0. Getting Started

Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

0.1 Fitting and prediction: estimator basics (output)

(Fitting and predicting: estimator basics)

  • Scikit-learn provides a large number of built-in machine learning algorithms and models, which are called estimators.
  • Each estimator can be fitted to data using its fit() method.
  • fit(X, y): the fit() method generally receives two inputs:
  • The shape of X is (n_samples, n_features), i.e. samples as rows and features as columns.
  • The values of y are real numbers for regression tasks, or integers (class labels) for classification tasks.
  • Both X and y should be numpy arrays or equivalent array-like data types.

Fitting:

>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier(random_state=0)
>>> X = [[ 1,  2,  3],  # 2 samples, 3 features
...      [11, 12, 13]]
>>> y = [0, 1]  # classes of each sample
>>> clf.fit(X, y)
RandomForestClassifier(random_state=0)

Prediction:

>>> clf.predict(X)  # predict classes of the training data
array([0, 1])
>>> clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
array([0, 1])

0.2 Transformers and pre-processors (input)

(Transformers and pre-processors)

A typical machine learning workflow (pipeline) consists of two parts:
Pre-processors/transformers as input: they transform or impute the data;
Predictors as output: they predict the target values.

>>> from sklearn.preprocessing import StandardScaler
>>> X = [[0, 15],
...      [1, -10]]
>>> StandardScaler().fit(X).transform(X)
array([[-1.,  1.],
       [ 1., -1.]])
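
The workflow above also mentions imputing data. As a minimal sketch (not part of the original example), SimpleImputer from sklearn.impute is such a transformer; here it fills missing values with the column mean:

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> SimpleImputer(strategy='mean').fit_transform(X)  # replace NaN with column means
array([[6.5, 2. ],
       [6. , 4. ],
       [7. , 6. ]])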

0.3 Pipelines: chaining pre-processors and estimators

Transformers and estimators (predictors) can be chained into a single unified object: a Pipeline. That is:

Transformers + estimators (predictors) = Pipeline

Note: in scikit-learn, predictors are themselves estimators, so the two terms are treated as equivalent here (written as "estimators (predictors)").

Example:

>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
...
>>> # create a pipeline object
>>> pipe = make_pipeline(
...     StandardScaler(),
...     LogisticRegression(random_state=0)
... )
...
>>> # load the iris dataset and split it into train and test sets
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
...
>>> # fit the whole pipeline
>>> pipe.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression(random_state=0))])
>>> # we can now use it like any other estimator
>>> accuracy_score(pipe.predict(X_test), y_test)
0.97...
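
A fitted pipeline also lets you inspect its individual steps. This is a small usage sketch (not in the original post), reusing the pipe object fitted above; named_steps uses the lowercase class names that make_pipeline assigns:

>>> pipe.named_steps['logisticregression']  # access a fitted step by its name
LogisticRegression(random_state=0)
>>> pipe[-1]  # or by position
LogisticRegression(random_state=0)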

0.4 Model evaluation

(Model evaluation)

Fitting a model to some data does not mean that it will predict well on new, unseen data; the model should be evaluated before it is used for prediction.
Above we used train_test_split() to split the dataset into a training set and a test set, but scikit-learn provides many other tools for model evaluation, in particular cross-validation.

Take 5-fold cross-validation as an example:

>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import cross_validate
...
>>> X, y = make_regression(n_samples=1000, random_state=0)
>>> lr = LinearRegression()
...
>>> result = cross_validate(lr, X, y)  # defaults to 5-fold CV
>>> result['test_score']  # r_squared score is high because dataset is easy
array([1., 1., 1., 1., 1.])
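
Besides the test scores, cross_validate also reports timing information, and cross_val_score is a simpler helper when only the scores are needed. A short sketch (not in the original post), reusing lr, X, y from above; since make_regression generates noise-free data here, the R² scores are again (close to) 1.0:

>>> sorted(result.keys())  # cross_validate also records fit and score times
['fit_time', 'score_time', 'test_score']
>>> from sklearn.model_selection import cross_val_score
>>> cross_val_score(lr, X, y, cv=10)  # only the per-fold scores
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])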

0.5 Automatic parameter search

All estimators have parameters (often called hyperparameters) that can be tuned. In general, we do not know in advance which parameter values are best, because the optimal values depend on the data at hand.

Scikit-learn provides tools that automatically find the best parameter combination (via cross-validation). Below we use RandomizedSearchCV as an example: once the search is finished, the RandomizedSearchCV object behaves like a RandomForestRegressor that has been fitted with the best parameter combination.

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from sklearn.model_selection import train_test_split
>>> from scipy.stats import randint
...
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
...
>>> # define the parameter space that will be searched over
>>> param_distributions = {'n_estimators': randint(1, 5),
...                        'max_depth': randint(5, 10)}
...
>>> # now create a searchCV object and fit it to the data
>>> search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
...                             n_iter=5,
...                             param_distributions=param_distributions,
...                             random_state=0)
>>> search.fit(X_train, y_train)
RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                   param_distributions={'max_depth': ...,
                                        'n_estimators': ...},
                   random_state=0)
>>> search.best_params_
{'max_depth': 9, 'n_estimators': 4}

>>> # the search object now acts like a normal random forest estimator
>>> # with max_depth=9 and n_estimators=4
>>> search.score(X_test, y_test)
0.73...

Note: search over a pipeline rather than a single estimator.

Note:
In practice, you almost always want to search over a pipeline instead of a single estimator. One of the main reasons is that if you apply a pre-processing step to the whole dataset without using a pipeline and then perform any kind of cross-validation, you break the fundamental assumption of independence between training and testing data. Since the data was pre-processed using the whole dataset, some information about the test sets is available to the train sets. This leads to over-estimating the generalization power of the estimator (you can read more in this kaggle post).
Using a pipeline for cross-validation and searching will largely keep you from this common pitfall.
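
As a minimal sketch of what this looks like in code (not part of the original post; it reuses X_train, y_train, randint, RandomForestRegressor and RandomizedSearchCV from the example above, with the printed repr of the fitted search abbreviated): parameters of a pipeline step are addressed as '<step name>__<parameter name>', where make_pipeline names each step after its lowercase class name. The scaler is only there to illustrate the mechanics; tree models do not really need feature scaling.

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
...
>>> pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=0))
>>> param_distributions = {'randomforestregressor__n_estimators': randint(1, 5),
...                        'randomforestregressor__max_depth': randint(5, 10)}
>>> search = RandomizedSearchCV(estimator=pipe,
...                             n_iter=5,
...                             param_distributions=param_distributions,
...                             random_state=0)
>>> search.fit(X_train, y_train)  # scaling is now re-fitted inside every CV fold
RandomizedSearchCV(...)

This way the StandardScaler only sees the training folds during cross-validation, so no information from the held-out folds leaks into the preprocessing.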

1. Preprocessing

Sklearn study notes (1)-data preprocessing

2. Model selection

Sklearn study notes (2) model selection and evaluation

3. Algorithm

3.1 Classification

3.2 Regression

3.3 Clustering

3.4 Dimensionality reduction

API

API Reference

Use sklearn for data mining
to be studied~

Closing remarks

Going forward, I will continue to update this series with "learning summary" posts that digest and summarize the material.

References:

  1. Scikit-learn
  2. Scikit-learn in Chinese
  3. Python machine learning notes: learning sklearn library
