Machine Learning (11): Basics and Use of Scikit-learn Library

The full text runs to more than 15,000 words; estimated reading time is about 30 to 50 minutes. It is full of useful information, so bookmarking it is recommended!


Download address for the code used in this article

1. Introduction

1. 1 The development history and definition of Scikit-learn

Development of Scikit-learn began in 2007, initiated by David Cournapeau as a Google Summer of Code project. The project subsequently received contributions from many developers, including people from INRIA (the French National Institute for Research in Computer Science and Automation), the University of Waikato and other institutions.

The project is named Scikit-Learn because the algorithm library is built on SciPy: "Scikit" is short for "SciPy Toolkit", i.e., a tool suite derived from SciPy.

Scikit-learn is currently the most complete and most influential algorithm library in the field of machine learning. It is built on NumPy, SciPy and Matplotlib, and contains a large number of machine learning algorithm implementations covering classification, regression, clustering and dimensionality reduction, as well as many methods for model evaluation and selection. Scikit-learn's API is designed to be very clear, easy to use and easy to understand, suitable for novices to get started with while also meeting the needs of professionals solving practical problems.

1.2 Understand the differences and connections between algorithm packages, algorithm libraries and algorithm frameworks

Algorithm package: contains pre-written algorithm implementations for a specific problem or a series of related problems. Algorithm packages can be used to perform specific tasks or operations, such as numerical analysis, machine learning or image processing, and users can call these algorithms directly without writing them from scratch. An example is the tabular data analysis package Pandas.

Algorithm library: algorithm libraries and algorithm packages are very similar, and the terms are often used interchangeably. A library also contains a collection of pre-written algorithms for solving specific problems, but it usually refers to code modules with a higher degree of encapsulation and a more complete implementation of machine learning functionality, sometimes even defining their own data structures. An example is the scientific computing library NumPy.

Algorithm framework: a framework is a larger concept that provides a system for developing, building and implementing algorithms, usually including a set of standard programming interfaces (APIs), tools, libraries and specifications. Its main purpose is to simplify and standardize the development process so that developers can focus on implementing specific functions or algorithms without having to deal with a large amount of infrastructure. An example is the machine learning library Scikit-Learn.

To put it simply, a restaurant metaphor helps to understand these three concepts:

  • Algorithm Package: Like a specific dish on a restaurant menu. Each dish has a specific preparation method, and various ingredients are combined in a specific way to create a specific dish. For example, if you want to eat fish-flavored shredded pork, you can order this dish directly without telling the chef how to make it.
  • Algorithm Library: Like an entire restaurant menu. The menu contains many dishes, whether you want main course, soup or dessert, you can find it on the menu. You just need to choose what you want from the menu without knowing how to make it.
  • Algorithm Framework: Just like the entire restaurant’s operating model. This includes not only the menu, but also the decoration style of the restaurant, the service attitude of the waiters, the way the food is cooked, and the time the food is served, etc. It provides a convenient way to enjoy the complete dining experience in one place, not just the food itself.

Therefore, in programming, we directly use algorithm packages to solve specific problems, use algorithm libraries to solve a series of problems, and use algorithm frameworks to help us better organize and build code and solve problems more effectively.

2. Scikit-learn official website structure

For most popular open source projects, the official website is an excellent learning resource, and this is especially true for Scikit-Learn. Even among today's top open source projects, the Scikit-Learn official website is second to none in how detailed and complete its documentation is. Whether it is installing and updating Scikit-Learn, using a specific algorithm, or even finding the source papers for an algorithm's core principles and examples of its use, the Scikit-Learn official website provides detailed introductions. Its official website address:

Scikit-Learn official website: https://scikit-learn.org/. Its main functions are introduced as follows:


  1. Navigation Bar


  2. Six functional modules

Scikit-learn divides all estimators and functions into six categories: classification models (Classification), regression models (Regression), clustering models (Clustering), dimensionality reduction methods (Dimensionality reduction), model selection (Model selection) and data preprocessing (Preprocessing).

The division into these six functional modules actually involves quite a bit of overlap: many models can handle both classification and regression problems, and many clustering algorithms can also be used as dimensionality reduction methods. The organization of the entries matters as well. The linear regression estimator, for example, is found under Regression, whereas the utility functions for computing model evaluation metrics, since those metrics ultimately guide model selection, are found by entering through Model selection.


  3. User Guide: A collection of documents for all sklearn content

The User Guide column at the top leads to a page collecting all of sklearn's content, organized in order of use. If you click Other versions in the upper left, you can download the PDF version of the User Guide for any version of sklearn.


  4. API: Query documentation by interface, sorted alphabetically by second-level module

If you want to find the relevant API documentation based on the name of an estimator or utility function, you can click the API column at the top to enter the API reference, which is sorted alphabetically by second-level module. A second-level module means, for example, the linear_model module that contains linear regression, or the metrics module that contains MSE.


3. Installation and settings

3.1 Installation and configuration of Python environment

If you still don't know how to install the basic Python environment, we recommend reading the article below, which explains in detail how to download, install and start Anaconda, the basic operations of Jupyter and its powerful Notebook editing environment, and how to easily upgrade the Python version and maintain and manage third-party libraries. It is full of useful information!

Several ways to install and deploy Python packages | Full of useful information

3.2 Installation of Scikit-learn

After completing the installation and configuration of the Python environment, you can install Scikit-learn. The installation instructions can also be found in the Install section of the official website.


Scikit-learn requires Python (>= 3.6) and pip.

Install Scikit-learn's dependency packages, including NumPy and SciPy. If these packages are already installed, you can skip this step; if not, you can install them using the following command:

pip install numpy scipy

Next, you can install Scikit-learn. Use the following command to install:

pip install -U scikit-learn

This command will install or upgrade Scikit-learn to the latest version.

If you are using Anaconda, it is easier to install Scikit-learn, just use the following command:

conda install scikit-learn

To confirm whether Scikit-learn has been successfully installed, you can try to import it in the Python environment:

import sklearn
sklearn.__version__

If no error message appears, Scikit-learn has been successfully installed.

The above are the basic installation steps. Different operating systems and Python environments may differ slightly, so adjust according to your actual situation. If you encounter any problems during installation, you can consult Scikit-learn's official documentation or search for solutions online.

To update to a new version:

pip install --upgrade scikit-learn

4. Quick Start with Scikit-learn

4.1 Import and processing of data sets

Scikit-learn provides a lot of built-in data sets, and also provides some methods to create data sets. These data sets are often used to demonstrate the use of various machine learning algorithms. These data sets are divided into two types: small-scale toy data sets (Toy Datasets) and large-scale real-world data sets (Real-World Datasets).

Here are a few common toy data sets:

  1. Iris: A data set for classification problems, containing four features of three types of iris flowers. The goal is to predict the type of iris flower based on these features.
  2. Digits (Handwritten Digits): A data set for a multi-class classification problem, containing 8x8-pixel images of handwritten digits. The goal is to identify the digit corresponding to each image.
  3. Boston House Prices: A data set for a regression problem, containing house prices and 13 other features for various areas of Boston. The goal is to predict house prices.
  4. Breast Cancer: A data set for a binary classification problem, containing 30 features of breast tumors. The goal is to predict whether a tumor is benign or malignant.

The dataset-related functions in sklearn are all under the datasets module. You can get an overview of all datasets and methods of creating datasets through the content contained in the datasets module in the API documentation.


To load these data sets in Scikit-learn, you can use the relevant functions in the sklearn.datasets module, for example:

from sklearn.datasets import load_iris

iris = load_iris()

This function returns a Bunch object containing the data, targets, and other information. For example, iris.data is a two-dimensional array containing the features, and iris.target is a one-dimensional array containing the targets.

The returned Bunch object mainly contains the following attributes:

  • data: feature matrix of the data set
  • target: label array of the data set
  • feature_names: names of the feature columns
  • target_names: name of each category
  • frame: when the object is generated as a DataFrame, the complete DataFrame

These attributes can be inspected with the following code:

import pandas as pd

# The data set contains four features
print("Features: ", iris.feature_names)
# The data set has three class labels
print("Labels: ", iris.target_names)

# Convert the data to a DataFrame for easier viewing
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add the class labels to the DataFrame
iris_df['label'] = iris.target

# Display the first five rows of the data
print(iris_df.head())
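
As a side note, the data set loaders also accept a couple of convenient options in newer scikit-learn versions: return_X_y=True returns the feature matrix and label array directly, and as_frame=True (available since scikit-learn 0.23) fills the frame attribute with a pandas DataFrame. A minimal sketch:

from sklearn.datasets import load_iris

# Return the feature matrix and label array directly instead of a Bunch
X, y = load_iris(return_X_y=True)

# as_frame=True (scikit-learn >= 0.23) exposes the data as a pandas DataFrame
iris_bunch = load_iris(as_frame=True)
print(iris_bunch.frame.head())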


Scikit-learn also provides some real-world data sets, but due to their large size they usually need to be downloaded first. These data sets can be used for more complex tasks and for testing algorithms. For example, the fetch_20newsgroups function downloads the 20 Newsgroups text data set for tasks such as text classification.
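
As an illustration, fetch_20newsgroups can be called as in the sketch below (note that the data is downloaded on first use, so this assumes an internet connection; the chosen categories are just examples):

from sklearn.datasets import fetch_20newsgroups

# Download (or load from cache) the training portion of two newsgroups
newsgroups = fetch_20newsgroups(subset='train',
                                categories=['sci.space', 'rec.autos'])

# .data is a list of raw text documents, .target holds the class indices
print(len(newsgroups.data))
print(newsgroups.target_names)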

4.2 Data set segmentation

In Scikit-learn, the original data set is usually split into a training set and a test set so that the model's performance on unseen data can be evaluated. The purpose of splitting the data set is to evaluate model performance better, and better evaluation in turn supports better model selection. Scikit-learn provides the train_test_split function for this task; it lives in the model_selection module.


It can be called and used like this:

from sklearn.model_selection import train_test_split

# Assume X holds the features and y the target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The main parameters of the train_test_split function are:

  • X, y: The data that needs to be split.
  • test_size: Represents the proportion of the test set. In the above example, we use 20% of the data as the test set.
  • random_state: Random seed, which ensures that the data is split in the same way every time the code is run.

In the code, you can append ? to view the details of the function:

# Consult the help documentation of this function
train_test_split?


There are two parameters here that need attention:

  • random_state: the random seed. Different seed values produce different splits, while fixing the value ensures the data is split the same way every time the code is run.
  • stratify: controls the proportion of samples from each class in the training and test sets. If you want the class ratio in the split training and test sets to match the original data (e.g., 1:1), you can pass stratify=y, as in the sketch below.
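
A minimal sketch of stratified splitting, using the built-in iris data purely as an example:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions of y in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Each class keeps roughly the same share in the training and test sets
print(np.bincount(y_train), np.bincount(y_test))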

4.3 Standardization of numerical data

The sklearn.preprocessing module in Scikit-learn provides many practical feature scaling functions, including data normalization (Normalization) and standardization (Standardization). Both techniques change the scale of features so that they fall within the same range when training machine learning models.

One thing to note here: from a functional perspective, feature scaling in Scikit-learn is actually divided into two categories, Standardization and Normalization. Z-score standardization and 0-1 (min-max) scaling both belong to Standardization, while Normalization specifically refers to scaling a single sample (a row of data) by its norm.

  1. Data normalization: Normalization usually means scaling the data to the range [0, 1], or making all the data fall between [-1, 1]. This can be achieved using Scikit-learn's MinMaxScaler.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.model_selection import train_test_split
    
    X = np.arange(30).reshape(5, 6)
    
    X_train, X_test = train_test_split(X)
    
    scaler = MinMaxScaler()
    X_train_normalized = scaler.fit_transform(X_train)
    X_test_normalized = scaler.transform(X_test)
    
    X_test_normalized
    

This code first creates a MinMaxScaler object, then uses the fit_transform method to fit and transform the training data, and finally uses the transform method to transform the test data.

  2. Data standardization: Standardization scales the data so that the mean is 0 and the standard deviation is 1. This can be achieved with Scikit-learn's StandardScaler.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(5, 6)

X_train, X_test = train_test_split(X)

scaler = StandardScaler()

X_train_standardized = scaler.fit_transform(X_train)

# Standardize the test set using the mean and variance of the training set
X_test_standardized = scaler.transform(X_test)

X_test_standardized

One point needs to be explained: why use fit_transform on the training set, but only transform on the test set?

This is because: in machine learning, the training set and the test set should be processed separately. Specifically, the model should be trained on the training set, and the test set should simulate real-world data that the model has not seen to evaluate the true performance of the model. Therefore, any form of preprocessing (including feature scaling) should only be done based on the training set data.

When fit_transform is called on the training set, the fit step computes the mean and standard deviation of the training data, and the transform step then uses these computed parameters (mean and standard deviation) to standardize the training set.

Then, when transform is called on the test set, Scikit-learn uses the mean and standard deviation previously computed on the training set to standardize it. The reason is that the test set is assumed to be new data the model has never seen, so no information about the test set (including its mean and standard deviation) may be used to influence the model. In other words, the test set data must be treated as invisible during the preprocessing stage.

In general, when preprocessing data, the training set should use the fit_transform method and the test set should only use the transform method, so that no information from the test set is "leaked" during the preprocessing stage.
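
To make this concrete, here is a small sketch (reusing the toy array from above purely as an example) showing that the statistics stored by fit come from the training set only and are reused, unchanged, on the test set:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = np.arange(30, dtype=float).reshape(5, 6)
X_train, X_test = train_test_split(X, random_state=42)

scaler = StandardScaler()
X_train_standardized = scaler.fit_transform(X_train)

# Statistics learned by fit: training-set column means and standard deviations
print(scaler.mean_)
print(scaler.scale_)

# transform on the test set reuses exactly those training-set statistics
X_test_standardized = scaler.transform(X_test)
print(np.allclose(X_test_standardized, (X_test - scaler.mean_) / scaler.scale_))  # True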

4.4 Normalization of numerical data

In Scikit-learn, preprocessing.normalize provides another kind of "normalization".

The preprocessing.normalize function converts feature vectors according to the vector space model so that the Euclidean length (L2 norm) of each feature vector equals 1, or the sum of the absolute values of its elements (L1 norm) equals 1. In other words, unlike standardization, normalization in Scikit-learn specifically refers to scaling a single sample (a row of data) to unit norm (a 1-norm or 2-norm equal to 1). This operation is common in kernel methods or when measuring the similarity between samples.

Assume a vector $x = [x_1, x_2, \dots, x_n]^T$. The 1-norm of $x$ is computed as:

$$||x||_1 = |x_1| + |x_2| + \dots + |x_n| \tag{1}$$
In mathematics, a norm is a function that maps vectors to non-negative values. Intuitively, a norm can be understood as the "length" or "size" of a vector.

That is, the sum of the absolute values of each component. The 2-norm of the vector $x$ is computed as:

$$||x||_2 = \sqrt{|x_1|^2 + |x_2|^2 + \dots + |x_n|^2} \tag{2}$$

That is, the square root of the sum of the squares of each component.

The Normalization process in Scikit-learn actually treats each row of data as a vector and then divides the row by its 1-norm or 2-norm. Which norm is used is determined by the norm parameter passed to the preprocessing.normalize function.

from sklearn.preprocessing import normalize
import numpy as np

# Create a numpy array
X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Normalize the data, using the default L2 norm
X_normalized = normalize(X, norm='l2')

In the above code, each row's feature vector is normalized to unit norm (length 1), which means the sum of the squares of all feature values in each sample is 1. You can also perform L1-norm normalization by setting the norm parameter to 'l1', so that the sum of the absolute values of all feature values in each sample is 1.
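
Continuing from the block above, a quick check (a sketch that assumes X, X_normalized and normalize are still in scope) confirms the unit-norm property:

import numpy as np

# Each row of X_normalized now has Euclidean (L2) length 1
print(np.linalg.norm(X_normalized, axis=1))   # -> [1. 1. 1.]

# With norm='l1', the absolute values of each row sum to 1 instead
X_l1 = normalize(X, norm='l1')
print(np.abs(X_l1).sum(axis=1))               # -> [1. 1. 1.]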

4.5 Core object type: the estimator

Many powerful third-party libraries define their own core object types, which are actually instances of specific classes defined in the source code. For example, the core of NumPy is an array (Array), the core of Pandas is a DataFrame, and the core of PyTorch is a tensor (Tensor). These object types provide powerful tools for data analysis and machine learning.

For Scikit-learn, the core object type is the estimator. An estimator can be thought of as a tool that encapsulates the various machine learning models, and the model training process in Scikit-learn revolves around these estimators.

In general, the core object types of these different libraries make it convenient to handle specific tasks, allowing you to focus on solving the problem without having to dig into the underlying layers to deal with complex details.

Using an estimator basically involves two steps: first instantiate the object, and then train the model on some data, as in the sketch below.
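
A minimal sketch of this two-step pattern, using the built-in diabetes data purely as an example:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Step 1: instantiate the estimator (hyperparameters are set here)
model = LinearRegression()

# Step 2: train the model on the data
model.fit(X, y)

# A fitted estimator can then predict on new samples
print(model.predict(X[:3]))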

4.6 Advanced features: Pipeline

In Scikit-learn, a Pipeline is a tool for conveniently chaining multiple steps together, and it is often used for data preprocessing and modeling workflows that consist of several steps. A Pipeline ensures that the steps are executed in order, keeps the code clean, and prevents data leakage when performing cross-validation.

The Pipeline workflow is similar to a production line. Each step is independent, but all steps are connected in series, and the output of the previous step is used as the input of the next step. A typical Pipeline may include steps such as data scaling (such as normalization or standardization), feature selection, dimensionality reduction, and final model training.

Let’s look directly at the code:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load the diabetes data set
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, random_state=0)

# Create a Pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),  # step 1: standardization
    ('regressor', LinearRegression())  # step 2: linear regression
])

# Train with the Pipeline
pipe.fit(X_train, y_train)

# Predict with the Pipeline
y_pred = pipe.predict(X_test)

y_pred

In this example, a Pipeline is created that contains two steps: a StandardScaler used to standardize the data, and a LinearRegression used to perform regression prediction. When fit is called on the training set, the Pipeline trains each step in turn (that is, it first learns the standardization on the data and then uses the standardized data to train the regression model). When predict is called on the test set, the Pipeline applies each step in turn (i.e., it standardizes first, then predicts with the trained regression model).
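
Continuing the example above, a fitted Pipeline also exposes its individual steps and an overall score; a brief sketch (assuming pipe, X_test and y_test from the code above are still in scope):

# Each fitted step can be inspected by the name given when building the Pipeline
print(pipe.named_steps['scaler'].mean_[:3])      # column means learned by StandardScaler
print(pipe.named_steps['regressor'].coef_[:3])   # coefficients of the fitted regression

# score() runs the whole chain (scaling, then prediction); for a regressor it returns R^2
print(pipe.score(X_test, y_test))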

4.7 Model saving

Model persistence is a technique that saves a trained machine learning model to disk and then loads and uses it at a later point in time (perhaps in a different environment). This is very useful because usually training a good model can require a lot of time and computing resources. Once a model is trained, we may want to reuse it in the future rather than retraining it every time it is needed.

In Scikit-learn, you can use Python's built-in pickle library, or the joblib library (a pickle-like tool better suited to large data), to save and load models.

Straight to the code, which demonstrates how to use joblib to save and load a model:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from joblib import dump, load

# Load the iris data set and train a random forest classifier
iris = load_iris()
clf = RandomForestClassifier()
clf.fit(iris.data, iris.target)

# Save the model to disk
dump(clf, 'randomforest_model.joblib') 

# Load the model when needed
clf_loaded = load('randomforest_model.joblib') 

# Use the loaded model to make predictions
y_pred = clf_loaded.predict(iris.data)

In the above code, the dump function saves the model to the specified file, and the load function loads the model from the file. Note that the code for saving and the code for loading the model would typically not run in the same script or session; they are combined here only for demonstration.

If the model contains a large number of numpy arrays (for example, models such as neural networks or random forests), joblib may be more efficient than pickle. Therefore, the official Scikit-learn documentation recommends using joblib to save and load models.
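
For completeness, here is a sketch of the same save/load cycle with the standard-library pickle module (assuming clf and iris from the block above; the file name is arbitrary):

import pickle

# Save the trained model with pickle
with open('randomforest_model.pkl', 'wb') as f:
    pickle.dump(clf, f)

# Load it back later
with open('randomforest_model.pkl', 'rb') as f:
    clf_pickle = pickle.load(f)

# The reloaded model predicts exactly as before
print(clf_pickle.predict(iris.data[:5]))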

5. Practical operation: Use Scikit-learn to implement linear regression modeling

5.1 Modeling process

  • Step 1: Prepare the data: generate a regression data set of 1,000 samples that basically follows the rule $y = 2x_1 - x_2 + 1$
# Scientific computing modules
import numpy as np
import pandas as pd

# Plotting modules
import matplotlib as mpl
import matplotlib.pyplot as plt

# Regression data creation function
def arrayGenReg(num_examples = 1000, w = [2, -1, 1], bias = True, delta = 0.01, deg = 1):
    """Create a regression-style data set.

    :param num_examples: number of samples in the data set
    :param w: feature coefficient vector, including the intercept (if present)
    :param bias: whether an intercept term is needed
    :param delta: value of the perturbation term
    :param deg: highest degree of the equation
    :return: the generated feature array and label array
    """
    
    if bias == True:
        num_inputs = len(w)-1                                                           # number of features in the data set
        features_true = np.random.randn(num_examples, num_inputs)                       # original features
        w_true = np.array(w[:-1]).reshape(-1, 1)                                        # coefficients of the independent variables
        b_true = np.array(w[-1])                                                        # intercept
        labels_true = np.power(features_true, deg).dot(w_true) + b_true                 # labels that strictly satisfy the artificial rule
        features = np.concatenate((features_true, np.ones_like(labels_true)), axis=1)    # features with an all-ones column appended
    else: 
        num_inputs = len(w)
        features = np.random.randn(num_examples, num_inputs) 
        w_true = np.array(w).reshape(-1, 1)         
        labels_true = np.power(features, deg).dot(w_true)
    labels = labels_true + np.random.normal(size = labels_true.shape) * delta
    return features, labels

The purpose of this code is to create a regression-style data set. It defines a function, arrayGenReg, for generating regression data sets that follow a specified pattern. The function generates feature and label data based on the given parameters, optionally adding an intercept term. The features are drawn from a normal distribution, while the labels are computed according to the specified rule, with a normally distributed perturbation term added. The function makes it easy to generate artificial data sets for regression problems.

  • Step 2: Generate feature and label data based on the function.
# Set the random seed
np.random.seed(24)   

# Perturbation term set to 0.01
features, labels = arrayGenReg(delta=0.01)


In this step, np.random.seed(24) sets the random seed to 24. This ensures that the subsequent random generation process is repeatable, i.e., you will get the same sequence of random numbers every time you run the code. Then the arrayGenReg function is called to generate the features and labels of the regression data set; in this example, the perturbation term is set to 0.01, i.e., delta=0.01.

  • Step 3: Draw two subplots and observe the distribution of the data set along different feature dimensions.
# Visualize the data distribution
plt.subplot(121)
plt.plot(features[:, 0], labels, 'o')
plt.subplot(122)
plt.plot(features[:, 1], labels, 'o')

This plots the relationship between the first column of the feature matrix (features[:, 0]) and the labels, and between the second column (features[:, 1]) and the labels.


  • Step 4: Call the linear regression estimator in Scikit-learn

First, import the linear regression estimator from the Scikit-learn library; the LinearRegression estimator is used to perform linear regression modeling.

from sklearn.linear_model import LinearRegression

Then, create a linear regression model object and assign it to a variable named model.

model = LinearRegression()

Next, extract the feature matrix and labels from the previously generated data set: select the first two columns (features[:, :2]) as the feature matrix and assign them to the variable X, and assign the label array to the variable y.

X = features[:, :2]  # feature matrix: select the first two features
y = labels  # label array

Finally, the model is trained by calling the estimator's fit() method:

model.fit(X, y)

Through these steps, the linear regression model will be trained and learn patterns and associations in the data set.

In machine learning, an estimator is an object used to learn patterns from data and make predictions. The linear regression estimator (LinearRegression) is an estimator used to fit linear models.

Instantiating an estimator creates an estimator object that can be used. Through instantiation, the parameters and properties of the estimator can be set for subsequent training and prediction. In this code, LinearRegression() creates an instance of the linear regression estimator and assigns it to the variable model.

The fit() method is a key method of the estimator and is used to train the model. During training, the estimator adjusts the model's parameters by minimizing a loss function on the provided feature matrix and label data, so that the model fits the data better. Through this process the model learns the relationship between features and labels and becomes a predictive model.

To sum up, by instantiating the estimator, providing the feature matrix and label data, and calling the fit() method to train, you can use the estimator to fit the data and obtain a linear regression model that can predict unknown samples.

  • Step 5: View model training parameters
print("自变量参数:", model.coef_)
print("模型截距:", model.intercept_)

The printed parameter values are interpreted in the next step.

  • Step 6: Interpretation of results

Independent variable parameters: the coefficients learned by the model are [[1.99961892, -0.99985281]], which is close to the true values [2, -1] in the generating rule. This means the model has learned the data-generating rule well and accurately models the relationship between the features and the label.

Model intercept: the intercept learned by the model is [0.99970541], which is close to the true value 1 in the generating rule. This means that when all feature inputs are zero, the model's predicted output is still close to 1.

Therefore, based on the model's coefficients and intercept, it can be concluded that the linear regression model successfully learned the relationship defined by the generating rule and is able to accurately predict unknown samples.

  • Step 7: Use MSE for model evaluation

You can use the Mean Squared Error (MSE) calculation function in the Scikit-learn library to calculate the mean squared error between the predicted value and the true label.

# Import the MSE calculation function from the metrics module
from sklearn.metrics import mean_squared_error

# Pass in the data and compute the error
mean_squared_error(model.predict(X), y)
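
Note that the MSE above is computed on the training data itself. Tying this back to Section 4.2, here is a sketch of evaluating on a held-out split instead (reusing X, y and LinearRegression from the steps above; the variable names are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model_holdout = LinearRegression()
model_holdout.fit(X_train, y_train)

# MSE on data the model has not seen during training
print(mean_squared_error(y_test, model_holdout.predict(X_test)))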

This completes the modeling process of calling Scikit-learn's linear regression model.

5.2 What are hyperparameters?

Important: Concept you must know: Hyperparameters

Hyperparameters are factors that cannot be solved for optimally through a mathematical procedure, yet greatly affect the form of the model and the modeling results. For example, in linear regression the coefficients of the independent variables and the intercept term are optimal solutions obtained by least squares or by gradient descent, whereas whether to include an intercept term, whether to normalize the data, and so on also affect the model and its results; these choices, however, are made by human judgment, and they are the so-called hyperparameters.

In Scikit-learn, hyperparameters for each estimator are set when the estimator class is instantiated. You can consult the documentation of the LinearRegression estimator, where the Parameters section describes the model's hyperparameters:


In Step 4 above, what was used directly is:

model = LinearRegression()

This is because the default parameters are used, and these hyperparameters can be set and modified during the instantiation process. For example, a linear equation model that does not include an intercept term can be created:

model1 = LinearRegression(fit_intercept=False)
model1.get_params()


For an instantiated estimator, you can obtain the parameters used for modeling through the get_params method, as sketched below.
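
Hyperparameters can also be changed on an existing estimator with set_params; a small sketch continuing from model1 above:

# Switch the intercept back on for the already-created estimator
model1.set_params(fit_intercept=True)
print(model1.get_params())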

During model instantiation, hyperparameters must be chosen carefully to achieve the desired result of the final model training. Different models have different hyperparameters, which is a very important point in the subsequent learning and modeling process.

5.3 How to find model documentation on the official website

Finding the description of the relevant estimator (model) on the official website is very important for understanding its principles and usage. Take the LinearRegression estimator as an example.

There are many ways to compute linear regression parameters: the least squares method solves for them in one step, while gradient descent solves for them iteratively. If you want to learn more about how the parameters are solved during training, you need to go back to the official website and consult the estimator's documentation. We already know where to start looking:

LinearRegression is a regression model, so it must be explained in the Regression section of the sklearn official website.


Clicking in, you can see the description of the LinearRegression estimator in section 1.1.1. Ordinary Least Squares of this module. For any estimator (algorithm model), the documentation first introduces the basic principle of the algorithm, the algorithm's formula (often the loss function expression) and a simple example; when necessary, it also links to the papers in which the algorithm was proposed, so that users can get started quickly.


The documentation also discusses some characteristics of the algorithm (often issues that need attention during use). For ordinary least squares, for example, the biggest problem is that when the feature matrix has severe multicollinearity, the prediction results show large errors. The documentation then illustrates a complete usage workflow for the algorithm through examples interspersed in the text, and finally discusses several points that are often of concern when using the model; for linear regression, two common issues are listed: how to implement non-negative least squares, and the computational complexity of the least squares method.


The official documentation of Scikit-learn is very detailed and complete. When using other models, you can fully understand the principles and use of the model by learning and retrieving them in the same way.

6. Summary

This article explains in detail some basic usage of Scikit-learn, including its definition, installation, core object type (the estimator) and key features (such as data preprocessing, data set splitting, standardization and normalization), and walks through implementing a linear regression model, including the concept of hyperparameters and how to save and load a model. I hope this article helps everyone gain a deeper understanding of Scikit-learn.

Finally, thank you for reading this article! If you feel you have gained something, don't forget to like, bookmark and follow me; this is my motivation to keep creating. If you have any questions or suggestions, you can leave a message in the comment area and I will do my best to answer and take your feedback on board. If there is a particular topic you would like to learn about, please let me know and I will be happy to write an article on it.

Thank you for your support and look forward to growing with you!

Origin: blog.csdn.net/Lvbaby_/article/details/131518929