[Master Python in 100 Days] Day 73: Python Machine Learning Introductory Algorithms Explained, with Code Examples

Table of contents

1. Supervised learning algorithms:

1.1 Linear Regression:

1.2 Logistic Regression:

1.3 Decision Tree:

1.4 Support Vector Machine:

1.5 Random Forest:

2. Unsupervised learning algorithms:

2.1 Clustering algorithm (Clustering):

2.2 Principal component analysis (PCA):

2.3 K-means Clustering:

3. Ensemble learning algorithms:

3.1 Random Forest:

3.2 Gradient Boosting:

3.3 AdaBoost (Adaptive Boosting):


1. Supervised learning algorithms:

  • Linear Regression: models linear relationships between continuous variables. Example: predicting house prices.
  • Logistic Regression: models binary classification problems. Example: determining whether an email is spam or legitimate.
  • Decision Tree: classifies or regresses by building a tree structure. Example: predicting which users will purchase a product.
  • Support Vector Machine: classifies data by finding an optimal hyperplane. Example: predicting whether a tumor is malignant.
  • Random Forest: an ensemble algorithm that combines multiple decision trees and makes predictions by voting. Example: predicting whether a customer will churn.

1.1 Linear Regression:

  • Detailed explanation: Linear regression is used to model linear relationships between continuous variables.

  • Sample code:

from sklearn.linear_model import LinearRegression

# Prepare the training data
X_train = [[1], [2], [3], [4], [5]]  # training data for the independent variable
y_train = [2, 4, 6, 8, 10]           # training data for the dependent variable

# Create the model object
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

# Predict
X_test = [[6], [7], [8]]             # test data for the independent variable
y_pred = model.predict(X_test)       # predict the dependent variable

# Print the predictions
print("Predictions:", y_pred)

1.2 Logistic Regression:

  • Detailed explanation: Logistic regression models binary classification problems; its output is a probability value between 0 and 1, which is thresholded to produce a class label.
  • Sample code:
from sklearn.linear_model import LogisticRegression

# Prepare the training data
X_train = [[1, 2], [2, 1], [2, 3], [4, 5]]    # training data for the independent variables
y_train = [0, 0, 1, 1]                        # training data for the dependent variable

# Create the model object
model = LogisticRegression()

# Fit the model
model.fit(X_train, y_train)

# Predict
X_test = [[3, 4], [1, 1], [5, 6]]             # test data for the independent variables
y_pred = model.predict(X_test)                # predict the class labels

# Print the predictions
print("Predictions:", y_pred)

1.3 Decision Tree:

  • Detailed explanation: Decision trees classify or regress by building a tree structure, selecting at each node the feature and threshold that best split the data.
  • Sample code:
from sklearn.tree import DecisionTreeClassifier

# Prepare the training data
X_train = [[1, 2], [2, 1], [2, 3], [4, 5]]    # training data for the independent variables
y_train = [0, 0, 1, 1]                        # training data for the dependent variable

# Create the model object
model = DecisionTreeClassifier()

# Fit the model
model.fit(X_train, y_train)

# Predict
X_test = [[3, 4], [1, 1], [5, 6]]             # test data for the independent variables
y_pred = model.predict(X_test)                # predict the class labels

# Print the predictions
print("Predictions:", y_pred)

1.4 Support Vector Machine:

  • Detailed explanation: Support vector machines classify data by finding an optimal separating hyperplane; with kernel functions they can handle both linear and nonlinear problems.
  • Sample code:
from sklearn.svm import SVC

# Prepare the training data
X_train = [[1, 2], [2, 1], [2, 3], [4, 5]]    # training data for the independent variables
y_train = [0, 0, 1, 1]                        # training data for the dependent variable

# Create the model object
model = SVC()

# Fit the model
model.fit(X_train, y_train)

# Predict
X_test = [[3, 4], [1, 1], [5, 6]]             # test data for the independent variables
y_pred = model.predict(X_test)                # predict the class labels

# Print the predictions
print("Predictions:", y_pred)

1.5 Random Forest:

Random forest is an ensemble learning algorithm built on decision trees. It improves the model's accuracy and generalization ability by training multiple decision trees and combining their predictions.

  • A random forest consists of multiple decision trees, each trained independently of the others.
  • Each tree is built on a new training set drawn from the original training set by random sampling with replacement (bootstrap sampling).
  • When splitting a node during tree construction, the random forest considers only a random subset of all features.
  • Finally, the forest classifies or regresses by combining the predictions of all trees, using voting or averaging.

Sample code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_iris()
X = data.data
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the random forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

2. Unsupervised learning algorithms:

  • Clustering algorithms: group similar data points together. Example: market segmentation analysis.
  • Principal component analysis (PCA): reduces high-dimensional data to a low-dimensional space through a linear transformation. Example: feature extraction in image recognition.
  • K-means Clustering: partitions the data into K clusters so that each data point belongs to the nearest cluster. Example: customer segmentation.

2.1 Clustering algorithm (Clustering):

  • Detailed explanation: Clustering is an unsupervised learning method that groups similar data points into clusters. Clustering algorithms can reveal the inherent structure and patterns in a data set and are used in fields such as market segmentation, recommendation systems, and image segmentation.
  • Sample code:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Create sample data
X, y = make_blobs(n_samples=100, centers=3, random_state=42)

# Create the clustering model
model = KMeans(n_clusters=3, random_state=42)

# Perform clustering
y_pred = model.fit_predict(X)

# Print the cluster assignments
print("Cluster assignments:", y_pred)

2.2 Principal component analysis (PCA):

  • Detailed explanation: Principal component analysis is a commonly used dimensionality reduction method that maps high-dimensional data into a low-dimensional space through a linear transformation. It finds the direction of maximum variance in the data (the first principal component), then the direction of maximum remaining variance orthogonal to the first (the second principal component), and so on. PCA can be used to reduce feature dimensionality, remove redundant information, and visualize high-dimensional data.
  • Sample code:
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load the iris dataset
data = load_iris()
X = data.data

# Create the PCA model
model = PCA(n_components=2)

# Perform principal component analysis
X_pca = model.fit_transform(X)

# Print the dimensionality-reduced data
print("Reduced data:", X_pca)

2.3 K-means Clustering:

  • Detailed explanation: K-means clustering is an iterative clustering algorithm that partitions the data set into K clusters, assigning each data point to the nearest cluster center. It alternates between assigning points to their nearest center and recomputing each center as the mean of its assigned points, until the assignments stabilize. K-means is commonly used in data analysis, image segmentation, text mining, and other tasks.
  • Sample code:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Create sample data
X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)

# Create the K-means clustering model
model = KMeans(n_clusters=4, random_state=42)

# Perform clustering
model.fit(X)

# Get the cluster labels
y_pred = model.labels_

# Print the cluster assignments
print("Cluster assignments:", y_pred)

3. Ensemble learning algorithms:

  1. Random Forest: classifies or regresses by combining multiple decision trees. Example: predicting house prices.

  2. Gradient Boosting: optimizes the overall model by training multiple weak models, each one fitting the residuals of those before it. Example: forecasting sales.

  3. AdaBoost (Adaptive Boosting): trains multiple weak classifiers by iteratively adjusting sample weights, then combines them by weighted voting. Example: face recognition.

3.1 Random Forest:

  • Detailed explanation: Random forest classifies or regresses by combining multiple decision trees. Each tree considers a random subset of the features when splitting, and bootstrap sampling is used to construct a separate training set for each tree. The forest arrives at the final prediction by voting (classification) or averaging (regression) over the trees' results.
  • Sample code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the random forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error:", mse)

3.2 Gradient Boosting:

  • Detailed explanation: Gradient boosting optimizes the overall model by training multiple weak models, each fitted to the residuals of the current ensemble. In each training round, the residuals from the previous round become the target variable of the current round. The final prediction is a weighted sum of the weak models.
  • Sample code:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the gradient boosting model
model = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error:", mse)

3.3 AdaBoost (Adaptive Boosting):

  • Detailed explanation: AdaBoost classifies by iteratively adjusting sample weights (increasing the weights of misclassified samples), training a weak classifier in each round, and combining the classifiers by weighted voting. Each weak classifier's weight depends on its classification error: lower-error classifiers receive larger votes. The result is a strong classifier assembled from many weak ones.
  • Sample code:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
data = fetch_olivetti_faces()
X = data.data
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the AdaBoost model
model = AdaBoostClassifier(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
