Machine Learning with Python: A Beginner's Guide to Scikit-learn

1. Introduction to Scikit-learn

1. What is Scikit-learn?

Scikit-learn is an open-source machine learning library for Python. It provides a consistent API for common machine learning tasks such as classification, regression, and clustering, along with tools for data preprocessing, model selection, and evaluation. It is built on NumPy and SciPy and uses matplotlib for its plotting helpers, so it integrates easily with these libraries.

2. Advantages and application scenarios of Scikit-learn

Scikit-learn provides rich, mature, and easy-to-use algorithms and tools for a wide range of machine learning tasks. It supports a complete workflow, from data preprocessing through model selection, training, and evaluation. It is widely used in data mining and predictive modeling, and it also appears in machine vision and natural language processing work.

3. Scikit-learn installation

You can install or upgrade Scikit-learn with pip:

pip install -U scikit-learn
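To confirm that the installation worked, a quick check is to import the package and print its version. This is a minimal sketch; the version string you see depends on your environment.

```python
import sklearn

# If the import succeeds, the package is installed; print its version
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```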

2. Data preparation

1. Data characteristics

In this step, you first explore and describe the data with your specific task in mind, to decide which features can be used to build the model. You can use the Pandas library to load the dataset into a DataFrame, and then use methods such as head and describe to get a basic picture of the data:

import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv('data.csv')

# Preview the first rows
print(df.head())

# Print summary statistics for the numeric columns
print(df.describe())
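Beyond head and describe, it often helps to check column types, missing values, and class balance before modeling. The sketch below uses a small in-memory DataFrame as a stand-in for data.csv; the column names and values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data.csv
df = pd.DataFrame({
    "value": [12.5, np.nan, 47.0, 150.0],
    "label": [0, 1, 0, 1],
})

# Column dtypes show which features are numeric
print(df.dtypes)

# Count missing values per column
missing = df.isna().sum()
print(missing)

# Class balance of the target column
counts = df["label"].value_counts()
print(counts)
```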

2. Data cleaning

In the data cleaning phase, we remove useless columns and deal with missing data and outliers. You can use Pandas methods such as drop and fillna to process the data:

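import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv('data.csv')

# Drop columns that carry no predictive information
df = df.drop(['id', 'timestamp'], axis=1)

# Fill missing values with each numeric column's mean
df = df.fillna(df.mean(numeric_only=True))

# Remove outliers: keep rows whose 'value' lies in [0, 100]
df = df[(df['value'] >= 0) & (df['value'] <= 100)]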
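Scikit-learn also ships its own imputation tool, SimpleImputer, which can replace the fillna step when you want the fill values learned once and reused on new data. The sketch below is one possible approach on a hypothetical toy array, not the only way to handle missing values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with two missing entries
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# Learn each column's mean from the observed values, then fill the gaps
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Because the imputer stores the learned means, calling imputer.transform on later data fills gaps with the same values.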

3. Data partitioning

In machine learning tasks, you usually need to split the dataset into a training set and a test set. Scikit-learn provides the train_test_split function for this:

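from sklearn.model_selection import train_test_split
import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv('data.csv')

# Drop columns that carry no predictive information
df = df.drop(['id', 'timestamp'], axis=1)

# Fill missing values with each numeric column's mean
df = df.fillna(df.mean(numeric_only=True))

# Remove outliers: keep rows whose 'value' lies in [0, 100]
df = df[(df['value'] >= 0) & (df['value'] <= 100)]

# Split into training and test sets (features X, target y)
X = df.drop('label', axis=1)
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)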
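For classification tasks, it can be useful to control the test fraction and keep class proportions the same in both splits. The sketch below uses the bundled iris dataset in place of data.csv, with test_size and stratify set explicitly; these parameter choices are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris has 150 samples, 50 per class
X, y = load_iris(return_X_y=True)

# Hold out 20% for testing and preserve the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
# Number of test samples per class
print(np.bincount(y_test))
```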

3. Model training

1. Model selection

When selecting a model, choose the one that best fits the nature of the task, the distribution of the data, and the performance requirements. Scikit-learn provides many commonly used machine learning algorithms to choose from, such as:

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

clf1 = SVC()
clf2 = KNeighborsClassifier()
clf3 = RandomForestClassifier()
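One common way to choose among candidate models is to compare their cross-validated scores on the same data. The sketch below does this for the three classifiers above on the bundled iris dataset; the dataset and the 5-fold setting are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

candidates = {
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each candidate
results = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")
```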

2. Model training

After selecting a model, you train it on the training data. Every Scikit-learn estimator exposes a fit method for this:

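from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv('data.csv')

# Drop columns that carry no predictive information
df = df.drop(['id', 'timestamp'], axis=1)

# Fill missing values with each numeric column's mean
df = df.fillna(df.mean(numeric_only=True))

# Remove outliers: keep rows whose 'value' lies in [0, 100]
df = df[(df['value'] >= 0) & (df['value'] <= 100)]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('label', axis=1), df['label'], random_state=42)

# Initialize the SVM model
clf = SVC(kernel='linear', C=1)

# Train the model
clf.fit(X_train, y_train)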

3. Model Evaluation

To evaluate the model, measure its performance on the test set. Scikit-learn provides the estimator's score method as well as functions in sklearn.metrics such as accuracy_score and confusion_matrix:

from sklearn.metrics import accuracy_score, confusion_matrix

# Evaluate the trained model (clf from the previous step) on the test set
y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))
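Accuracy alone can hide per-class differences, so a per-class summary is often worth printing as well. The self-contained sketch below uses the bundled iris dataset (an assumption, since data.csv is not available here) and shows classification_report alongside the estimator's score method.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = SVC(kernel="linear", C=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Precision, recall, and F1 for each class
report = classification_report(y_test, y_pred)
print(report)

# For classifiers, score() returns the mean accuracy on the given data
test_accuracy = clf.score(X_test, y_test)
print("score():", test_accuracy)
```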

4. Machine Learning Algorithms

1. Supervised learning algorithm

1.1 Linear regression

A linear regression model is a machine learning algorithm for modeling linear relationships. It can be used to predict continuous numerical variables such as sales, stock prices, etc. Here is an example of implementing linear regression using the scikit-learn library:

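from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# X (feature matrix) and y (continuous target) are assumed to be loaded already

# Create the linear regression model
model = LinearRegression()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)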
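To see what the fitted model has learned, you can inspect its coef_ and intercept_ attributes. The sketch below fits synthetic data generated from y = 3x + 2 plus a little noise; the data and the true coefficients are assumptions made so the result can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3x + 2 with small Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=100)

model = LinearRegression().fit(X, y)

# The learned slope and intercept should be close to 3 and 2
print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
```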

1.2 Logistic regression

Despite its name, logistic regression is a linear model for classification, most often binary classification. It estimates the probability that a sample belongs to a class, so it can be used to predict the probability of an event occurring. Here is an example of implementing logistic regression using the scikit-learn library:

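from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X (feature matrix) and y (class labels) are assumed to be loaded already

# Create the logistic regression model
model = LogisticRegression()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the model
model.fit(X_train, y_train)

# Predict class labels for the test set
y_pred = model.predict(X_test)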
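Since logistic regression models probabilities, predict_proba is often more informative than predict. The sketch below uses the bundled breast cancer dataset and adds a StandardScaler step so the solver converges easily; both choices are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Scale the features, then fit the classifier
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1
proba = model.predict_proba(X_test[:3])
print(proba)
```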

1.3 Decision tree

A decision tree is a machine learning algorithm that makes decisions based on a tree structure. It can be used for classification and regression problems and its advantage is that it is easy to understand and explain. Here is an example of a decision tree implemented using the scikit-learn library:

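from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# X (feature matrix) and y (class labels) are assumed to be loaded already

# Create the decision tree model, limited to depth 2
model = DecisionTreeClassifier(max_depth=2)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)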
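The claim that trees are easy to explain can be illustrated by printing the learned rules as text. The sketch below trains a shallow tree on the bundled iris dataset (an assumed dataset) and uses export_text to show its splits.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the rule set short and readable
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# Render the decision rules as indented text
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```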

1.4 Support Vector Machine

A support vector machine (SVM) classifies samples by finding the decision boundary with the largest margin between classes. SVMs can be used for both classification (SVC) and regression (SVR), and they often perform well on small to medium-sized datasets. Here is an example of a support vector machine implemented using the scikit-learn library:

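from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# X (feature matrix) and y (class labels) are assumed to be loaded already

# Create the support vector machine model with a linear kernel
model = SVC(kernel='linear')

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)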
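The kernel parameter determines the shape of the decision boundary. The sketch below compares a linear and an RBF kernel on make_moons, a synthetic dataset whose two classes are not linearly separable; the dataset and its settings are assumptions chosen to make the difference visible.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-circles: a curved boundary is needed
X, y = make_moons(n_samples=400, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit one SVC per kernel and record test accuracy
scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(kernel, scores[kernel])
```

On data like this, the RBF kernel typically scores higher because it can bend the boundary around each half-circle.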

1.5 Random Forest

Random forest is an ensemble algorithm that combines many decision trees, each trained on a random sample of the data, and can be used for classification and regression. It copes reasonably well with high-dimensional and fairly large datasets. Here is an example of a random forest implemented using the scikit-learn library:

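from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X (feature matrix) and y (class labels) are assumed to be loaded already

# Create the random forest model with 100 trees
model = RandomForestClassifier(n_estimators=100)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)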
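A fitted random forest also reports how much each feature contributed to its splits through the feature_importances_ attribute. The sketch below shows this on the bundled iris dataset, which is an assumed stand-in for your own data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(iris.data, iris.target)

# Impurity-based importances: one value per feature, summing to 1
importances = model.feature_importances_
for name, importance in zip(iris.feature_names, importances):
    print(f"{name}: {importance:.3f}")
```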

2. Unsupervised Learning Algorithms

2.1 Principal component analysis

Principal component analysis (PCA) is an algorithm for reducing the dimensionality of data. It projects high-dimensional data onto a lower-dimensional space through a linear transformation while preserving as much of the variance as possible. Here is an example of principal component analysis using the scikit-learn library:

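from sklearn.decomposition import PCA

# X (feature matrix) is assumed to be loaded already

# Create the PCA model, keeping 2 components
model = PCA(n_components=2)

# Project the data onto the 2 principal components
X_pca = model.fit_transform(X)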
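To check how much variability the reduced data keeps, you can read explained_variance_ratio_ after fitting. The sketch below applies PCA to the bundled iris dataset, used here as an assumed example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Reduced from 4 features to 2
print(X_pca.shape)

# Fraction of the total variance captured by each component
retained = pca.explained_variance_ratio_
print(retained, "total:", retained.sum())
```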

2.2 Cluster analysis

Cluster analysis groups the samples in a dataset by similarity, without using labels. It can be used to discover patterns and groups in the data. Here is an example of cluster analysis with the K-means algorithm from the scikit-learn library:

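from sklearn.cluster import KMeans

# X (feature matrix) is assumed to be loaded already

# Create the K-means model with 3 clusters
model = KMeans(n_clusters=3, n_init=10, random_state=0)

# Fit the model
model.fit(X)

# Assign each sample to its nearest cluster
y_pred = model.predict(X)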
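K-means needs the number of clusters up front, and one common heuristic for choosing it is the silhouette score. The sketch below tries several values of k on synthetic blobs; the data generator settings are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data drawn around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Silhouette score ranges from -1 to 1; higher means better-separated clusters
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```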

5. Practical cases

1. Classification problems

Here is an example of solving the iris classification problem with a decision tree:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()

# Create the decision tree classifier
model = DecisionTreeClassifier()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the accuracy
print('Accuracy:', accuracy_score(y_test, y_pred))

2. Regression problem

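Here is an example of a house price regression problem using the random forest algorithm. The Boston dataset used in older tutorials (load_boston) was removed in scikit-learn 1.2, so this example uses the California housing dataset instead:

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Load the California housing dataset (downloaded on first use)
housing = fetch_california_housing()

# Create the random forest regression model
model = RandomForestRegressor()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=0)

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the mean absolute error
print('MAE:', mean_absolute_error(y_test, y_pred))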

6. Advanced Scikit-learn

1. Pipeline

A typical workflow involves loading data, preprocessing it, and then selecting and training different models. The preprocessing may include steps such as standardization or normalization. Scikit-learn provides a Pipeline API that chains these steps into a single estimator, so the whole process can be fitted and called as one object.

Here is an example pipeline:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris

# Load the iris dataset (4 features)
iris = load_iris()

# Build the pipeline: dimensionality reduction followed by a classifier
pipeline = Pipeline([
    ('reduce_dim', PCA()),
    ('classify', LogisticRegression(max_iter=1000))
])

# Define the search space; n_components cannot exceed the 4 iris features
param_grid = {
    'reduce_dim__n_components': [2, 3, 4],
    'classify__C': [0.1, 1, 10]
}

# Tune the whole pipeline with GridSearchCV
grid = GridSearchCV(pipeline, cv=5, n_jobs=-1, param_grid=param_grid)
grid.fit(iris.data, iris.target)

# Print the best parameters
print(grid.best_params_)
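A pipeline is also a convenient place for the scaling step mentioned above, because the scaler is then refitted on the training portion of every cross-validation fold and never sees the held-out fold. The sketch below is one way to set this up on the iris dataset; the step names are arbitrary labels chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize features, then classify; the step names are arbitrary labels
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression(max_iter=1000)),
])

# Each fold fits the scaler and the classifier on its training part only
scores = cross_val_score(pipeline, X, y, cv=5)
print("mean accuracy:", scores.mean())
```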

2. Model tuning

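When we use a model in Scikit-learn, we usually need to tune its hyperparameters to get the best performance. Scikit-learn provides several search strategies, such as grid search (GridSearchCV) and random search (RandomizedSearchCV). Grid search tries every combination and suits small search spaces, while random search samples a fixed number of combinations and scales better to large ones. Which method to use depends on the size of your data and your requirements.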

Here is an example of tuning a model using Grid Search:

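from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error

# Load the California housing dataset (downloaded on first use);
# load_boston was removed in scikit-learn 1.2
housing = fetch_california_housing()

# Hold out a test set so the tuned model is scored on unseen data
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=0)

# Create the random forest regression model
model = RandomForestRegressor(random_state=0)

# Define the search space
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 2, 5]
}

# Tune the model with GridSearchCV on the training set only
grid = GridSearchCV(model, cv=5, n_jobs=-1, param_grid=param_grid)
grid.fit(X_train, y_train)

# Predict on the test set with the best estimator found
y_pred = grid.predict(X_test)

# Compute the mean absolute error
print('MAE:', mean_absolute_error(y_test, y_pred))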
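Random search, the other strategy mentioned above, can be sketched with RandomizedSearchCV. The example below samples 5 parameter combinations instead of trying all of them; the bundled diabetes dataset, the sampling ranges, and the n_iter and cv values are assumptions chosen to keep the run short.

```python
from scipy.stats import randint
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = load_diabetes(return_X_y=True)

# n_estimators is drawn from the integers 10..100; max_depth from a fixed list
param_distributions = {
    "n_estimators": randint(10, 101),
    "max_depth": [None, 2, 5],
}

# Evaluate only n_iter randomly sampled combinations
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
```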

3. Feature Selection

Not all features are of equal importance in practical applications. Some features may be more predictive than others. Feature selection is a technique used to select the most important features. Scikit-learn provides many feature selection tools, such as SelectKBest, Recursive Feature Elimination, and SelectFromModel.

Here is an example of selecting features using SelectKBest:

from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Load the California housing dataset (downloaded on first use);
# load_boston was removed in scikit-learn 1.2
housing = fetch_california_housing()

# Keep the 5 features with the highest F-scores
selector = SelectKBest(f_regression, k=5)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=0)

# Fit the selector on the training set only, then apply it to both sets
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Create the linear regression model
model = LinearRegression()

# Train the model
model.fit(X_train_selected, y_train)

# Predict on the test set
y_pred = model.predict(X_test_selected)

# Compute the mean absolute error
print('MAE:', mean_absolute_error(y_test, y_pred))
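Recursive Feature Elimination, also mentioned above, takes a different approach: it repeatedly fits a model and drops the weakest feature until the requested number remains. The sketch below uses the RFE class with a linear model on the bundled diabetes dataset; the dataset and the choice of 5 features are assumptions for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

data = load_diabetes()

# Repeatedly fit the model and remove the weakest feature until 5 remain
selector = RFE(LinearRegression(), n_features_to_select=5)
selector.fit(data.data, data.target)

# support_ is a boolean mask marking the features that were kept
selected = [name for name, keep in zip(data.feature_names, selector.support_) if keep]
print(selected)
```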

7. Summary and review

1. Advantages and disadvantages of Scikit-learn

Scikit-learn has the following advantages:

  • It has a comprehensive toolkit for general machine learning problems;
  • Almost all algorithms are available through a unified API, which makes them easier to understand and use;
  • It has extensive documentation and examples, which make it easier to learn;
  • It supports multi-core parallelism on a single machine through the n_jobs parameter, although it has no built-in distributed computing, so very large datasets usually need external tools;
  • The code for Scikit-learn is open source.

Scikit-learn also has some disadvantages:

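  • Because the library is driven from Python, it can be slower than hand-tuned C++ or Java implementations, although its performance-critical routines are compiled;
  • It is not designed for large-scale deep learning: it offers only basic multi-layer perceptron models and is not built around GPU training.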

2. Future development direction

As artificial intelligence advances, machine learning and data science will continue to develop as well. Scikit-learn will likely remain one of the main tools people use to get started with machine learning. In the future, we can expect more algorithms to be added to the Scikit-learn toolkit, and it may offer better support for large-scale and high-performance computing.

Origin blog.csdn.net/u010349629/article/details/130663015