sklearn classification algorithm

Classification algorithm

Notes organized from the "Dark Horse" (Heima) machine learning tutorial.

sklearn transformers and estimators

Transformer

Transformers are used in the feature engineering steps. The general usage workflow is as follows:

  1. Instantiate a transformer
  2. Call fit_transform()
    In fact, fit_transform() can be split into two steps. Taking standardization as an example, $x' = \frac{x - E(x)}{\sigma}$:
    1. fit: compute the mean and standard deviation of each column
    2. transform: convert each column's values using the statistics computed by fit
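
A minimal sketch of the two steps with sklearn's StandardScaler (the toy array below is made up purely for illustration); the single fit_transform() call is equivalent to fit() followed by transform():

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: 3 samples, 2 feature columns (arbitrary values)
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X)                   # fit: compute the per-column mean and standard deviation
print(scaler.mean_)             # [ 2. 20.]
X_scaled = scaler.transform(X)  # transform: (x - mean) / std for each column

# One-step equivalent
X_scaled_2 = StandardScaler().fit_transform(X)
print(np.allclose(X_scaled, X_scaled_2))  # True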

Estimator

In sklearn, an estimator is the category of API that implements a learning algorithm.

  • Estimators for classification:
    • sklearn.neighbors k-nearest neighbors (KNN)
    • sklearn.naive_bayes naive Bayes
    • sklearn.linear_model.LogisticRegression logistic regression
    • sklearn.tree decision trees (random forests are in sklearn.ensemble)
  • Estimators for regression:
    • sklearn.linear_model.LinearRegression linear regression
    • sklearn.linear_model.Ridge ridge regression
  • Estimators for unsupervised learning:
    • sklearn.cluster.KMeans clustering

Usage workflow

  1. Instantiate an estimator class
  2. estimator.fit(x_train, y_train) -> build a model from the feature values and target values of the training set
  3. Model evaluation
    1. Directly compare the true values with the predicted values
    2. Compute the accuracy: estimator.score(x_test, y_test)

KNN algorithm

K-Nearest Neighbors algorithm: a sample is assigned to a category if the majority of its K most similar samples (i.e. its nearest neighbors in feature space) belong to that category.

How to Calculate Feature Distances

Several common distance calculation formulas:

  1. Euclidean distance: $\sqrt{(a_1-b_1)^2+(a_2-b_2)^2+(a_3-b_3)^2}$
  2. Manhattan (taxicab) distance: $|a_1-b_1| + |a_2-b_2| + |a_3-b_3|$
  3. Minkowski distance: $\left(\sum_{i=1}^n |x_i-y_i|^p\right)^{\frac{1}{p}}$

It can be seen that both the Euclidean distance (p = 2) and the Manhattan distance (p = 1) are special cases of the Minkowski distance.
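
As a small illustrative sketch (the function name and sample points are my own, not from the tutorial), all three distances can be computed from the Minkowski formula with NumPy:

import numpy as np

def minkowski(a, b, p):
    """Minkowski distance: (sum of |a_i - b_i|^p) raised to 1/p."""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(minkowski(a, b, p=1))  # Manhattan (taxicab) distance: 7.0
print(minkowski(a, b, p=2))  # Euclidean distance: 5.0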

How to handle data

KNN is quite sensitive to the value of K:

  • If K is too small, the result is easily affected by outliers
  • If K is too large, the result is easily affected by class imbalance in the samples

At the same time, features with larger numeric ranges dominate the distance calculation over features with smaller ranges.
To sum up: we need to standardize the data in advance.

api (using the iris dataset as an example)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Load the data
iris = load_iris()

# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=23)

# Standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# Apply the same transformation to the test set as to the training set (important!)
x_test = transfer.transform(x_test)

# KNN estimator: build the model
estimator = KNeighborsClassifier(n_neighbors=3)
estimator.fit(x_train, y_train)

# Model evaluation
# 1. Directly compare the true values with the predicted values
y_predict = estimator.predict(x_test)
print(y_predict == y_test)

# 2. Compute the accuracy
score = estimator.score(x_test, y_test)
print(score)

The result is affected by how the dataset is split (with random_state=23, the accuracy happens to be 1.0).

summary

  • Advantages
    • Simple, easy to understand and implement, no training phase required
  • Disadvantages
    • Lazy algorithm: classifying test samples requires a large amount of computation and memory
    • The result is sensitive to the choice of K
  • Use cases
    • Small-data scenarios

Model selection and tuning

In the KNN algorithm, we need to determine a K value. Model selection and tuning can help us find a suitable K value.

Cross-validation

Cross-validation makes the model evaluation more reliable.
Take 4-fold cross-validation as an example: the training set is split into four folds; in each of four training rounds a different fold is held out as the validation set while the rest are used for training, and the average of the four validation scores is taken as the final result.
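
A minimal sketch of this idea using sklearn's cross_val_score (reusing the standardized iris training data from the KNN example above):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 4-fold cross-validation of a KNN model on the training set
knn = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(knn, x_train, y_train, cv=4)
print(scores)         # one validation score per fold
print(scores.mean())  # averaged result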

Hyperparameter Search - Grid Search

Usually, many parameters need to be specified manually (such as the K value in KNN); such parameters are called hyperparameters.
We can list candidate values for these hyperparameters in a grid, then evaluate every combination (with cross-validation) and keep the best one.

api

from sklearn.model_selection import GridSearchCV

# KNN estimator: build the model
estimator = KNeighborsClassifier(n_neighbors=3)
# Wrap it with grid search + cross-validation
param_dict = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=10)
estimator.fit(x_train, y_train)

# Inspect the best parameters, best score, etc.
print("Best parameters:\n", estimator.best_params_)
print("Best CV score:\n", estimator.best_score_)
print("Best estimator:\n", estimator.best_estimator_)
print("Cross-validation results:\n", estimator.cv_results_)

Why is the best CV score different from the test accuracy?

Because the two are evaluated on the training set (via cross-validation) and on the test set, respectively.

Naive Bayes Algorithm

Premise assumption: features are independent of each other

$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$

Usage scenario: text classification

api

from sklearn.naive_bayes import MultinomialNB
# Naive Bayes estimator
estimator = MultinomialNB()
estimator.fit(x_train, y_train)
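
Since the typical use case is text classification, here is a minimal sketch on sklearn's 20 newsgroups dataset (my own example, not part of the original notes), using TF-IDF features with MultinomialNB:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load the text data (downloaded on first use)
news = fetch_20newsgroups(subset="all")
x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, random_state=23)

# Extract TF-IDF features; fit the vectorizer on the training set only
transfer = TfidfVectorizer()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# Train and evaluate the Naive Bayes classifier
estimator = MultinomialNB()
estimator.fit(x_train, y_train)
print(estimator.score(x_test, y_test))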

summary

  • Advantages:
    • Stable classification efficiency
    • Not very sensitive to missing data; commonly used for text classification
    • High classification accuracy and fast speed
  • Disadvantages:
    • Because of the feature-independence assumption, the results are not ideal when features are strongly correlated

decision tree

The classification principle is similar to if-else branches in a program; the key is to find the most efficient order in which to test the conditions.

How to Find the Most Efficient Decision Order

It mainly involves two points:

  1. Among the many input features, select the most suitable feature to split on
  2. For the chosen feature, find the threshold that gives the best split point

For this purpose, information entropy is defined: it measures the uncertainty of a random variable.

Entropy of a random variable $X$:
$H(X) = -\sum_{i=1}^{n} p(X_i)\log_2 p(X_i)$

For a sample set $D$ containing $d$ samples divided into classes $C_1, \dots, C_k$, where $|C_i|$ is the number of samples in class $C_i$, the entropy of the sample set is:
$H(D) = -\sum_{i=1}^{k} \frac{|C_i|}{d}\log_2\frac{|C_i|}{d}$
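
A tiny NumPy sketch of the entropy formula (the 9-vs-5 label split below is just an illustrative example):

import numpy as np

def entropy(labels):
    """Shannon entropy H(D) of a list of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Example: 9 samples of one class and 5 of another
print(entropy([1] * 9 + [0] * 5))  # about 0.940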

Based on information entropy, decision trees can be built by three classic algorithms (ID3, C4.5, and CART), which will be covered in another article.

api

from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Decision tree estimator
estimator = DecisionTreeClassifier(criterion="entropy")
estimator.fit(x_train, y_train)
# Compute the accuracy
score = estimator.score(x_test, y_test)
print(score)
# Visualize the decision tree
export_graphviz(estimator, out_file="tree.dot", feature_names=iris.feature_names)
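
The exported tree.dot file is plain Graphviz source; it can be rendered to an image with the Graphviz dot tool (for example, dot -Tpng tree.dot -o tree.png) or pasted into an online Graphviz viewer.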

summary

  • Advantages
    • Simple to understand and interpret; the tree can be visualized
  • Disadvantages
    • Tends to build overly complex trees that do not generalize well (overfitting)
  • Improvements
    • Pruning
    • Random forest

random forest

A random forest is a classifier made up of multiple decision trees; its output class is the mode of the classes predicted by the individual trees.

Principle process

Randomness appears in two ways (see the sketch after this list):

  1. Random training set: bootstrap sampling (random sampling with replacement)
  2. Random features: randomly draw m features from the M available (with M >> m, this also serves as dimensionality reduction)
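
A minimal NumPy sketch of the two kinds of randomness (the array shapes and sizes are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
n_samples, M, m = 100, 20, 4          # 100 samples, 20 features, draw 4 features per tree
X = rng.normal(size=(n_samples, M))   # toy feature matrix

# 1. Bootstrap sample: draw n_samples row indices with replacement
row_idx = rng.choice(n_samples, size=n_samples, replace=True)
# 2. Feature subset: draw m of the M feature indices without replacement
col_idx = rng.choice(M, size=m, replace=False)

X_tree = X[row_idx][:, col_idx]       # training data for one tree
print(X_tree.shape)                   # (100, 4)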

api

Because the random forest has several hyperparameters, we can add a grid search to find the best combination:

from sklearn.ensemble import RandomForestClassifier

estimator = RandomForestClassifier()
# Grid search over the number of trees and the maximum depth
param_dict = {"n_estimators": [120, 200, 300, 500, 800, 1200],
              "max_depth": [5, 8, 15, 25, 30]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
estimator.fit(x_train, y_train)

summary

  • Among current algorithms, it offers excellent accuracy
  • It runs efficiently on large datasets and can handle high-dimensional input samples without explicit dimensionality reduction (because features are randomly sampled)
  • It can assess the importance of individual features in a classification problem (also thanks to random feature sampling)

Source: blog.csdn.net/qq_43550173/article/details/116614775