Mastering Scikit-Learn: Getting Started with Machine Learning Libraries in Python

Overview

Machine Learning is a popular field that has appeared frequently in technology news, research reports, industry analyses, and practical applications in recent years, and it is affecting our lives at an unprecedented speed: speech recognition in smart speakers, face unlocking on mobile phone cameras, risk assessment in finance, and predictive analytics in healthcare. The application of machine learning has already penetrated every aspect of life. For those of us who are beginners, Scikit-Learn is the best place to start.

Master Scikit-Learn

Scikit-Learn, often abbreviated as sklearn, is an open-source machine learning library for Python. Since its release, it has gradually become a standard library in the field of machine learning. Sklearn provides a rich selection of algorithms, from basic linear regression (Linear Regression) and classification to advanced ensemble methods and model optimization, covering almost every aspect of machine learning. More importantly, Scikit-Learn's design philosophy is to provide users with simple, efficient, and reliable tools to help us complete our own tasks.

Today I will take you through a comprehensive introduction to the basics of Scikit-Learn. Starting from installation and configuration, we will discuss its core components, estimators, and model training and evaluation, and then move on to practical application cases, gradually exploring this powerful library. By the end of the article, I hope readers will be equipped to use Scikit-Learn to solve practical problems.

The perfect combination of machine learning and Python

Let’s first explore a question: why choose Python as the programming language for machine learning? The answer can be viewed from many angles. First of all, Python is a general-purpose high-level programming language with concise, clear syntax, making it suitable for beginners to learn. In addition, Python has a wealth of open-source libraries and frameworks covering everything from data analysis and visualization to deep learning, which gives Python a significant advantage in the fields of data science and machine learning.

Secondly, Python has an active community with a large number of online resources, tutorials, and examples, which provides us with valuable learning material. On top of that, Python’s cross-platform nature allows developers to easily deploy and run their programs on different operating systems.

So, why choose Scikit-Learn as the machine learning library in Python? Compared with other libraries, Scikit-Learn has clear advantages. First, its API design is unified and clear: whether for data preprocessing, model training, or evaluation, users can get the job done with a few short lines of code. In addition, Scikit-Learn has thorough documentation with a large number of examples and guides, which greatly lowers the learning curve. Finally, Scikit-Learn ships as a self-contained package with prebuilt binaries, which means we don't need to install a lot of complicated dependencies or worry about compatibility issues with other libraries.

Core components and structure of Scikit-Learn

Installation and configuration

Scikit-Learn depends on NumPy and SciPy, two libraries that provide scientific computing functionality for Python. Therefore, before installing Scikit-Learn, make sure these two libraries are installed.

Install Scikit-Learn with pip:

pip install scikit-learn

Or with conda:

conda install scikit-learn

Verify installation

Check whether the installation is successful:

import sklearn

print(sklearn.__version__)

Data representation and preprocessing

In Scikit-Learn, data is represented as NumPy arrays: a feature matrix in which the samples are the rows and the features are the columns.

Feature matrix and target vector

Feature matrix:

  • Usually denoted "X", with shape [n_samples, n_features]

Target vector:

  • When dealing with supervised learning problems, we also have an array of labels, usually denoted "y", with shape [n_samples]
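As a minimal sketch, here is what these two structures look like for the iris dataset that ships with Scikit-Learn:

from sklearn.datasets import load_iris

# Load the iris dataset as an example
data = load_iris()
X = data.data    # feature matrix, shape (150, 4): 150 samples, 4 features
y = data.target  # target vector, shape (150,): one label per sample

print(X.shape, y.shape)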

Data preprocessing

Scikit-Learn provides a variety of practical tools to help us preprocess data:

  • Scaling: e.g. "StandardScaler" standardizes data to zero mean and unit variance
  • Encoding: e.g. "OneHotEncoder" converts categorical features into numbers
  • Imputation: "SimpleImputer" fills in missing data
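A short sketch of these three tools on made-up toy data:

import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Numeric feature with a missing value
num = np.array([[1.0], [2.0], [np.nan], [4.0]])

# Fill the missing value with the column mean
imputed = SimpleImputer(strategy='mean').fit_transform(num)

# Standardize to zero mean and unit variance
scaled = StandardScaler().fit_transform(imputed)

# Categorical feature encoded as one-hot vectors
cat = np.array([['red'], ['green'], ['red']])
encoded = OneHotEncoder().fit_transform(cat).toarray()

print(scaled.ravel())
print(encoded)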

Estimators

Scikit-Learn's estimators provide a consistent interface for different machine learning applications.

The core concepts are as follows:

  • Estimator: every algorithm implementation is an estimator; for example, linear regression is an estimator, and k-nearest neighbors is also an estimator
  • Transformer: an estimator that can perform some transformation on the data
  • Predictor: an estimator that can make predictions based on input data

Basic steps include:

  1. Select a model and import the corresponding estimator class
  2. Choose model parameters, instantiate the class, and set the hyperparameter values
  3. Organize the data into a feature matrix and a target vector
  4. Call the estimator's fit() method to train the model
  5. Use the predict() method to make predictions

Example:

"""
@Module Name: Scikit-Learn Estimator.py
@Author: CSDN@我是小白呀
@Date: October 15, 2023

Description:
Scikit-Learn estimator demo
"""
from sklearn.linear_model import LinearRegression

# Create the data
X = [[1], [2], [3]]
y = [2, 4, 6]

# Instantiate the model
model = LinearRegression()

# Train the model
model.fit(X, y)

# Predict
prediction = model.predict([[4]])
print("Prediction:", prediction)

Output result:

Prediction: [8.]

Model selection

In machine learning, model selection is one of the most critical steps. An appropriate model can greatly improve the accuracy of prediction, while an inappropriate model may cause the prediction results to seriously deviate from the results we want.

Therefore, before choosing a model, we must first clarify the problem we want to solve. Machine learning problems fall mainly into two categories: classification problems (Classification) and regression problems (Regression).

Think about the nature of the problem

Before choosing a model, we must first clarify the nature of the problem. Some problems may appear to be classification problems at first glance, but are actually better served by a regression model. For example, predicting whether students pass their final exam can be framed as a classification problem (pass/fail), but in practice we often get better results by using a regression model to predict the scores themselves.

Study the distribution of data

Data distribution and features are also crucial to model selection, since different models perform differently on different data. For example, with imbalanced classification data (where some classes have far more samples than others), we may need to rebalance the classes through up-sampling or down-sampling, as sketched below.
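A minimal sketch of up-sampling the minority class with Scikit-Learn's resample utility; the toy labels below are invented for illustration:

import numpy as np
from sklearn.utils import resample

# Imbalanced toy data: 6 samples of class 0, 2 samples of class 1
X = np.arange(8).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Up-sample the minority class to match the majority class size
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True, n_samples=6, random_state=42)

# Recombine into a balanced dataset
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_bal))  # [6 6]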

Determine task complexity

Problem complexity is also an important factor to consider. Simple linear models, such as linear regression or logistic regression, may be suitable for linearly separable problems. But for more complex, nonlinear problems, you may need decision trees (Decision Tree), random forests (Random Forest), and other more expressive models.

Linear vs Non-Linear:

  • Linear:
    • A linear relationship means that the relationship between two or more variables (Variable) is linear
    • In mathematics, a linear relationship can be written as y = wx + b, where w is the weight (Weight) and b is the bias (Bias)
    • In a linear system, the relationship between input and output is proportional
  • Non-Linear:
    • Nonlinearity means that the relationship between variables is not linear, e.g. a curve
    • Nonlinear equations include polynomials, exponentials, and logarithms
    • In nonlinear systems, the relationship between input and output is more complex; a small sketch follows this list
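To make the distinction concrete, here is a tiny sketch with made-up numbers: a linear relation follows y = wx + b exactly, while a quadratic one does not:

import numpy as np

x = np.array([1, 2, 3, 4])

# Linear: y = 2x + 1, the output changes proportionally with the input
y_linear = 2 * x + 1    # [3, 5, 7, 9]

# Nonlinear: y = x^2, the input-output relationship is a curve
y_nonlinear = x ** 2    # [1, 4, 9, 16]

print(y_linear, y_nonlinear)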

Classification problem

Classification problems (Classification) are those where the predicted output variable is a category, such as the genre of a movie or whether an object is a cat or a dog.

Machine learning models for classification problems include:

  • Linear model: Logistic Regression
  • Nonlinear models: k-nearest neighbor (KNN), decision tree (Decision Tree), support vector machine (SVM)

Regression problem

Regression problem (Regression) means that the predicted output variable is a continuous value, such as box office prediction, house price prediction, etc.

Machine learning models for regression problems include:

  • Linear model: Linear Regression
  • Nonlinear models: Decision Tree Regressor, Random Forest

Supervised learning

Supervised Learning is a core branch of machine learning. The goal is to learn a model from labeled data and predict the label of unknown data.

Classification algorithm

Logistic Regression: although the name contains "regression", logistic regression is a classification model, mainly used for binary classification problems. It passes its output through the sigmoid function to produce a probability, which is then used to classify the input sample.

Example:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)

Decision Tree: an intuitive, tree-structured algorithm that is easy to understand and interpret, and can be used for both classification and regression tasks.

Example:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)

Regression algorithm

Linear Regression: finds the best-fitting straight line through the data and predicts a continuous output.

Example:

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)

Support Vector Regression (SVR): applies the support vector machine machinery to regression tasks.

Example:

from sklearn.svm import SVR

svr = SVR()
svr.fit(X_train, y_train)

Unsupervised learning

Unsupervised Learning differs from supervised learning in that it finds patterns in unlabeled data.

K-Means: Divide the data into k clusters.

Example:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

Principal Component Analysis (PCA): Reduces the dimensionality of the data while trying to retain as much information as possible.

Example:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

Hyperparameter tuning

Grid Search: systematically traverses combinations of parameter values and uses cross-validation to determine the best-performing combination.

Example:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01]}
grid = GridSearchCV(SVR(), param_grid, refit=True)
grid.fit(X_train, y_train)

Randomized Search: Similar to Grid Search, but instead of trying all parameters, a given number of parameter combinations are randomly sampled.
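A minimal sketch mirroring the grid search example above, assuming the same SVR model and the same X_train and y_train; the parameter lists here are chosen only for illustration:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

param_dist = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}

# Sample 5 random parameter combinations instead of trying all 16
rand_search = RandomizedSearchCV(SVR(), param_dist, n_iter=5, random_state=42)
rand_search.fit(X_train, y_train)
print(rand_search.best_params_)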

Model evaluation

In the process of machine learning, it is crucial to choose an appropriate model. Likewise, it is necessary to evaluate the performance of the model to understand whether the model meets our expectations to avoid under-fitting and over-fitting problems.

Training set and validation set

To evaluate a model, we need a benchmark, so we split the data into a training set and a test set: the training set (Train) is used to train the model, while part of the data is held back for evaluating it, the validation set (Valid). We never use all of the data to train the model.
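In Scikit-Learn this split is typically done with train_test_split; a minimal sketch, assuming X and y hold the features and labels:

from sklearn.model_selection import train_test_split

# Hold back 20% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)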

Classification model evaluation

Confusion Matrix: summarizes the relationship between the true values and the model's predicted values, counting true positives, true negatives, false positives, and false negatives.

Example:

from sklearn.metrics import confusion_matrix

y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

Precision, Recall, and F1-score (F1): evaluate different aspects of a classifier, namely its exactness, its coverage, and the balance between the two.

Example:

from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred)
print(report)

Regression model evaluation

Mean Squared Error (MSE): measures the accuracy of the model's predictions as the average squared difference between predicted and true values.

Example:

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
print(mse)

R^2 score (R-squared): measures how much of the variance in the target the model explains; the closer the value is to 1, the better.

Example:

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(r2)

Cross-Validation divides the data into multiple subsets and performs several training/testing rounds over them. For example, the common k-fold (k-Fold) cross-validation method divides the data into k subsets; in each round, one subset serves as the validation set and the remaining subsets serve as the training set. In this way we obtain k different performance estimates, and their average provides a more reliable assessment of model performance.

Example:

"""
@Module Name: 模型的评估.py
@Author: CSDN@我是小白呀
@Date: October 16, 2023

Description:
模型的评估
"""
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score

# 加载 iris 数据集
data = load_iris()
X = data.data
y = data.target

# 实例化随机森林
clf = RandomForestClassifier(n_estimators=50, random_state=42)

# 分割训练集 & 验证集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 使用 KFold 进行5折交叉验证
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=kf)
print("交叉验证平均得分:", scores.mean())

# 训练模型
clf.fit(X_train, y_train)

# 评估指标
y_pred = clf.predict(X_test)
print("精度:", accuracy_score(y_test, y_pred))
print("召回率:", recall_score(y_test, y_pred, average='macro'))  # 多分类问题使用宏平均
print("F1分数:", f1_score(y_test, y_pred, average='macro'))  # 多分类问题使用宏平均

Output result:

交叉验证平均得分: 0.9600000000000002
精度: 1.0
召回率: 1.0
F1分数: 1.0

Feature engineering

Feature Engineering is a key step in machine learning. By creating and selecting appropriate features, the performance of the model can be greatly improved.

Feature selection

Feature Selection is the process of selecting features that are relevant to the target variable, while eliminating irrelevant or redundant features.

The benefits of feature selection are:

  1. Reduce model complexity
  2. Reduce the risk of overfitting
  3. Improve model training speed

For example:
Suppose we have a data set for predicting housing prices, which contains many features, such as the number of rooms, geographical location, year of construction, whether it is close to a subway station, etc. However, it may also contain some less relevant features, such as the name of the landlord, whether there is a swimming pool, etc. Through feature selection, we can only select the features most relevant to house prices to train the model.

Code:

"""
@Module Name: 特征选择.py
@Author: CSDN@我是小白呀
@Date: October 16, 2023

Description:
通过波士顿房价数据集, 说明特征选择
"""
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectKBest, f_regression

# 使用 Boston 房价数据集作为示例
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['PRICE'] = data.target

# 假设我们增加了一些不相关的特征
df['LANDLORD_NAME'] = np.random.choice(['Alice', 'Bob', 'Charlie'], df.shape[0])
df['HAS_POOL'] = np.random.choice([0, 1], df.shape[0])

# 将分类特征转化为数字
df['LANDLORD_NAME'] = df['LANDLORD_NAME'].astype('category').cat.codes

# 分割数据
X = df.drop('PRICE', axis=1)
y = df['PRICE']

# 使用 SelectKBest 进行特征选择
# 为了确定与房价最相关的特征, 我们可以使用f _regression 作为评分函数
selector = SelectKBest(score_func=f_regression, k=2)
X_new = selector.fit_transform(X, y)

# 打印被选中的特征
selected_features = pd.DataFrame(selector.inverse_transform(X_new),
                                 columns=X.columns)
selected_columns = selected_features.columns[selected_features.var() != 0]
print('选择的特征:', selected_columns)

Output result:

选择的特征: Index(['RM', 'LSTAT'], dtype='object')

Feature extraction

Feature Extraction is the process of converting raw data into a smaller set of representative features. Unlike feature selection, feature extraction creates new features. Principal Component Analysis (PCA) is a commonly used feature extraction method.

Example:

"""
@Module Name: 特征提取.py
@Author: CSDN@我是小白呀
@Date: October 16, 2023

Description:
人脸识别数据集, 说明特征提取
"""
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# 加载人脸数据集
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X = lfw_people.data
n_samples, n_features = X.shape

# 原始图像的维度
h, w = lfw_people.images.shape[1:3]

# PCA 转换, 提取 150 个主要成分
n_components = 150
pca = PCA(n_components=n_components, whiten=True).fit(X)
X_pca = pca.transform(X)

# 可视化主要成分的效果
def plot_gallery(images, titles, h, w, n_row=5, n_col=5):
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())

eigenfaces = pca.components_.reshape((n_components, h, w))
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)

# 绘图
plt.show()

Output result: a gallery of the leading eigenface images extracted by PCA.
