Classification algorithm

Notes compiled from the Dark Horse (Heima) machine learning tutorial

sklearn transformers and estimators

Transformer

Transformers are used in the feature engineering steps. The general usage is as follows:

Instantiate a transformer
Call fit_transform()
fit_transform() actually combines two steps. Here we take standardization as an example: $\displaystyle x' = \frac{x - E(x)}{\sigma}$
1. fit: compute the mean and variance of each column
2. transform: convert the values of each column using the statistics computed by fit
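The two steps can be checked on a small array. This is a minimal sketch with made-up toy data (the array `x` is an assumption, not from the tutorial); it compares the output of `StandardScaler` with the formula applied by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 samples, 2 features on very different scales
x = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

scaler = StandardScaler()
scaler.fit(x)                    # step 1: learn the mean and variance of each column
x_scaled = scaler.transform(x)   # step 2: apply (x - mean) / std to each column

# The same result computed by hand (np.std uses the population formula by default,
# which is also what StandardScaler uses)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(scaler.mean_, scaler.var_)
print(np.allclose(x_scaled, manual))
```

Calling `fit_transform(x)` is equivalent to calling `fit(x)` followed by `transform(x)` on the same data.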

Estimator

In sklearn, an estimator is a class of API that implements an algorithm.

Estimators for classification:
- sklearn.neighbors K-nearest neighbors
- sklearn.naive_bayes naive Bayes
- sklearn.linear_model.LogisticRegression logistic regression
- sklearn.tree decision trees
- sklearn.ensemble random forests
Estimators for regression:
- sklearn.linear_model.LinearRegression linear regression
- sklearn.linear_model.Ridge ridge regression
Estimators for unsupervised learning:
- sklearn.cluster.KMeans clustering

Usage

Instantiate an estimator class
estimator.fit(x_train, y_train) -> build a model from the feature values and target values of the training set
Model evaluation
1. Compare the true values with the predicted values directly
2. Compute the accuracy on the test set: estimator.score(x_test, y_test)

KNN algorithm

K-nearest neighbor algorithm: if most of the K samples most similar to a given sample (that is, its nearest neighbors in the feature space) belong to a certain category, then the sample also belongs to that category.
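The rule can be sketched in a few lines. The function `knn_predict` and the toy points below are hypothetical and only illustrate the majority vote; this is not how sklearn implements it internally.

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(x_train - x_new, axis=1)
    # Indices of the k closest samples
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups
x_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(x_train, y_train, np.array([1.1, 1.0])))
print(knn_predict(x_train, y_train, np.array([5.1, 5.0])))
```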

How to Calculate Feature Distances

Several distance formulas:

Euclidean distance: $\sqrt{(a_1-b_1)^2+(a_2-b_2)^2+(a_3-b_3)^2}$
Manhattan (taxicab) distance: $|a_1-b_1| + |a_2-b_2| + |a_3-b_3|$
Minkowski distance: $\left(\sum_{i=1}^n |x_i-y_i|^p\right)^{\frac{1}{p}}$

Both Euclidean distance (p = 2) and Manhattan distance (p = 1) are special cases of Minkowski distance.
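A quick numerical check of this relationship, as a sketch with arbitrary example vectors (the helper `minkowski` is written here for illustration only):

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)


a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

manhattan = minkowski(a, b, 1)   # |1-4| + |2-6| + |3-3|
euclidean = minkowski(a, b, 2)   # sqrt(3^2 + 4^2 + 0^2)

print(manhattan, np.sum(np.abs(a - b)))
print(euclidean, np.linalg.norm(a - b))
```

In `KNeighborsClassifier` the same choice is exposed through the `p` parameter of the default Minkowski metric.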

How to handle data

KNN is sensitive to the value of K:

If the value of K is too small, the result is easily affected by outliers
If the value of K is too large, the result is easily affected by class imbalance

In addition, when features have different scales, the features with large values dominate the distance calculation.
Therefore, we need to standardize the data in advance
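The effect of feature scale on distances can be seen with a small sketch. The three samples below (height in meters, income in currency units) are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: [height in m, income]
x = np.array([[1.60, 30000.0],
              [1.90, 30100.0],
              [1.61, 35000.0]])


def dist(u, v):
    return np.linalg.norm(u - v)


# Raw data: the income column decides everything, height is effectively ignored
raw_01, raw_02 = dist(x[0], x[1]), dist(x[0], x[2])
print(raw_01, raw_02)

# After standardization both features contribute on a comparable scale
x_std = StandardScaler().fit_transform(x)
std_01, std_02 = dist(x_std[0], x_std[1]), dist(x_std[0], x_std[2])
print(std_01, std_02)
```

On the raw data, sample 0 looks about fifty times closer to sample 1 than to sample 2, even though its height is almost identical to that of sample 2. After standardization the two distances are of similar size, because a large height difference now counts as much as a large income difference.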

API, taking iris as an example

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Load the data
iris = load_iris()

# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=23)

# Standardize
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# Apply the same transformation to the test set, using the statistics of the training set (very important!)
x_test = transfer.transform(x_test)

# KNN estimator: build the model
estimator = KNeighborsClassifier(n_neighbors=3)
estimator.fit(x_train, y_train)

# Model evaluation
# 1. Compare the true values with the predicted values directly
y_predict = estimator.predict(x_test)
print(y_predict == y_test)

# 2. Compute the accuracy
score = estimator.score(x_test, y_test)
print(score)

The result depends on how the dataset is split (the accuracy is 1 when the random seed is 23)

summary

Advantages
- Simple, easy to understand, easy to implement, no training required
Disadvantages
- Lazy algorithm: classifying test samples requires a large amount of computation and has high memory overhead
- Results are sensitive to the value of K
Use cases
- Small-data scenarios

Model selection and tuning

In the KNN algorithm, we need to determine a K value. Model selection and tuning can help us find a suitable K value.

Cross-validation

Cross-validation makes the evaluation of a model more accurate and reliable.
Take 4-fold cross-validation as an example: the training set is divided into four groups, one of which is used as the validation set. Training is repeated four times, with a different group serving as the validation set each time, and the average of the four results is taken as the final result.
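A minimal sketch of 4-fold cross-validation with sklearn's `cross_val_score`, reusing the iris split from the KNN example (the choice of `n_neighbors=3` is only an illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=23
)

# cv=4: the training set is split into 4 folds, each fold is the validation set once
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), x_train, y_train, cv=4)

print(scores)          # one accuracy per fold
print(scores.mean())   # the averaged result
```

Only the training set is used here; the test set stays untouched for the final evaluation.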

Hyperparameter Search - Grid Search

Many parameters usually have to be specified manually (such as the K value in KNN); such parameters are called hyperparameters.
We can list the candidate values of these hyperparameters in a grid, then evaluate every combination in the grid to find the best one.
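The traversal can be written out by hand to show what a grid search does. This is a sketch under assumptions: the candidate values and `cv=5` are arbitrary choices, and `GridSearchCV` performs the same loop with more bookkeeping.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)

# Candidate values for the hyperparameter K (illustrative choice)
grid = [1, 3, 5, 7, 9, 11]

# Evaluate every candidate with cross-validation and keep the mean score
mean_scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), x, y, cv=5).mean()
    for k in grid
}

# The candidate with the highest mean cross-validation score wins
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(best_k)
```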

api

from sklearn.model_selection import GridSearchCV
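# KNN estimator: build the model
estimator = KNeighborsClassifier(n_neighbors=3)
# Add grid search with cross-validation
param_dict = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=10)
# Fit on the training set; the best_* attributes below only exist after fit()
estimator.fit(x_train, y_train)
# Inspect the best parameters and results
print("Best parameters:\n", estimator.best_params_)
print("Best score:\n", estimator.best_score_)
print("Best estimator:\n", estimator.best_estimator_)
print("Cross-validation results:\n", estimator.cv_results_)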

Why does the best result differ from the accuracy?

Because the two are computed on different data: best_score_ is the mean cross-validation score on the training set, while the accuracy from score() is measured on the test set

Naive Bayes Algorithm

Premise: the features are assumed to be independent of each other

Bayes' formula: $P(A|B) = \frac {P(B|A)P(A)}{P(B)}$

Usage scenario: text classification
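A small text classification sketch with word counts and `MultinomialNB`. The six training sentences and their labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training corpus
train_docs = [
    "win money now",
    "free money prize",
    "win free prize",
    "meeting schedule today",
    "project meeting notes",
    "schedule project review",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Turn each document into a vector of word counts
vectorizer = CountVectorizer()
x_train_counts = vectorizer.fit_transform(train_docs)

model = MultinomialNB()
model.fit(x_train_counts, train_labels)

# New documents must go through the same fitted vectorizer
new_docs = ["free prize money", "meeting notes today"]
predictions = model.predict(vectorizer.transform(new_docs))
print(predictions)
```

Every word of the first new document appears only in the spam examples, and every word of the second only in the ham examples, so the two are classified as spam and ham respectively.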

api

from sklearn.naive_bayes import MultinomialNB
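# Naive Bayes. MultinomialNB expects non-negative features such as word counts,
# so do not pass it the standardized x_train from the KNN example
estimator = MultinomialNB()
estimator.fit(x_train, y_train)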

summary

Advantages:
- Stable classification efficiency
- Not very sensitive to missing data, often used for text classification
- High classification accuracy and fast speed
Disadvantages:
- It assumes that sample attributes are independent; if the features are strongly correlated, the results are not ideal

decision tree

The classification principle is similar to if-else statements in a program; the key is to find the most efficient decision order.

How to Find the Most Efficient Decision Order

It mainly involves two points:

Selecting the most suitable feature for branching among the many input features
Finding the threshold that serves as the best split point for the selected feature

For this purpose, information entropy is defined: it measures the uncertainty of a random variable.
Entropy of a random variable:
$H(X) = - \sum_{i=1}^{n} p(X_i) \log_2 p(X_i)$
For a sample set D of d samples divided into k classes, where class i contains $C_i$ samples, the entropy of the sample set is:
$H(D) = - \sum_{i=1}^{k} \frac{C_i}{d} \log_2 \frac{C_i}{d}$
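The entropy of a sample set can be computed directly from its class counts. A minimal sketch (the function `entropy` and the label lists are illustrative, not part of sklearn):

```python
import numpy as np


def entropy(labels):
    """Entropy H(D), in bits, of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()            # C_i / d for every class
    return float(-np.sum(p * np.log2(p)))


balanced = entropy([0, 0, 1, 1])   # two equally likely classes
pure = entropy([0, 0, 0, 0])       # a single class, no uncertainty
skewed = entropy([0, 0, 0, 1])     # somewhere in between

print(balanced, pure, skewed)
```

A balanced two-class set gives 1 bit, a pure set gives 0, and a skewed set falls between the two. A decision tree prefers splits that move the child nodes toward the pure case.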

Building on information entropy, classification can be done with three decision tree algorithms, which will be covered in another article.

api

from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Decision tree
estimator = DecisionTreeClassifier(criterion="entropy")
estimator.fit(x_train, y_train)
# Compute the accuracy
score = estimator.score(x_test, y_test)
print(score)
# Visualize the decision tree (writes a Graphviz .dot file)
export_graphviz(estimator, out_file="tree.dot", feature_names=iris.feature_names)

summary

Advantages
- Simple to understand and explain; the tree can be visualized
Disadvantages
- Can create overly complex trees that do not generalize well to new data (overfitting)
Improvements
- Pruning
- Random forest

random forest

A random forest is a classifier that contains multiple decision trees; its output is the mode of the classes predicted by the individual trees.

Principle process

Randomness manifests itself in two ways:

Random training set: bootstrap sampling (random sampling with replacement)
Random features: randomly draw m features from the M features (with M >> m, this also achieves dimensionality reduction)
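Both kinds of randomness can be sketched with NumPy. The sizes `n_samples`, `M`, and `m` and the random data are assumptions for illustration. The sketch draws the feature subset once per tree, as described above; scikit-learn's implementation instead draws a fresh candidate subset at every split.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, M, m = 10, 6, 2             # hypothetical sizes
x = rng.normal(size=(n_samples, M))    # hypothetical training matrix

# Bootstrap sampling: draw n_samples row indices with replacement
row_idx = rng.integers(0, n_samples, size=n_samples)

# Random features: draw m distinct column indices out of M
col_idx = rng.choice(M, size=m, replace=False)

# Training data seen by one tree of the forest
x_tree = x[row_idx][:, col_idx]
print(row_idx, col_idx, x_tree.shape)
```

Because rows are drawn with replacement, some samples usually appear more than once and others not at all, so every tree is trained on a slightly different dataset.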

api

Because there are several hyperparameters, we can add grid search to find the best results

from sklearn.ensemble import RandomForestClassifier

estimator = RandomForestClassifier()
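# Grid search
param_dict = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
# Fit on the training set (every parameter combination is trained, so this can be slow)
estimator.fit(x_train, y_train)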

summary

Among current algorithms, it has excellent accuracy
Runs efficiently on large datasets and can handle input samples with high-dimensional features without dimensionality reduction (because features are randomly sampled)
Can assess the importance of individual features in a classification problem (also because features are randomly sampled)
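Feature importance can be read from a fitted forest through its `feature_importances_` attribute. A minimal sketch on iris (the values `n_estimators=100` and `random_state=0` are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# One importance value per feature; the values sum to 1
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```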
