Classification algorithm
Notes compiled from the Heima ("Dark Horse") machine learning tutorial
sklearn transformers and estimators
Transformer
Transformers are used in the feature engineering steps. The general usage is as follows:
- Instantiate a transformer
- Call fit_transform()
In fact, fit_transform() can be divided into two steps. Taking standardization as an example, $x' = \frac{x - E(x)}{\sigma}$:
- fit: compute the mean and variance (and hence the standard deviation $\sigma$) of each column
- transform: convert the values of each column using the statistics obtained by fit
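The two steps can be checked with a small sketch. The toy array below is made up for illustration; the point is only that fit() followed by transform() gives the same result as fit_transform(), and that both match the formula applied by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 samples, 2 features
x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaler = StandardScaler()
scaler.fit(x)                       # step 1: learn the per-column mean and variance
x_two_steps = scaler.transform(x)   # step 2: apply (x - mean) / std column by column

# The same thing in one call
x_one_step = StandardScaler().fit_transform(x)

# The formula applied by hand (StandardScaler uses the population std, ddof=0)
x_manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(scaler.mean_)
print(x_two_steps)
```

Keeping fit and transform separate matters later: the statistics are learned on the training set only and then reused on the test set.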
Estimator
In sklearn, an estimator is a class of API that implements an algorithm.
- Estimator for classification:
- sklearn.neighbors K-nearest neighbors
- sklearn.naive_bayes naive Bayes
- sklearn.linear_model.LogisticRegression logistic regression
- sklearn.tree decision trees
- sklearn.ensemble random forests
- Estimator for regression:
- sklearn.linear_model.LinearRegression Linear regression
- sklearn.linear_model.Ridge Ridge Regression
- Estimator for unsupervised learning:
- sklearn.cluster.KMeans K-means clustering
Usage
- Instantiate an estimator class
- estimator.fit(x_train, y_train) -> generates a model from the feature values and target values of the training set
- Model evaluation
- Compare the true values with the predicted values directly: y_predict = estimator.predict(x_test)
- Calculate the accuracy: estimator.score(x_test, y_test)
KNN algorithm
K-nearest neighbors (KNN) algorithm: if most of the K most similar samples of a given sample (that is, its K nearest neighbors in the feature space) belong to a certain category, then the sample also belongs to that category.
How to Calculate Feature Distances
Several distance formulas are commonly used (the first two are shown for three-dimensional points a and b):
- Euclidean distance: $\sqrt{(a_1-b_1)^2+(a_2-b_2)^2+(a_3-b_3)^2}$
- Manhattan (taxicab) distance: $|a_1-b_1| + |a_2-b_2| + |a_3-b_3|$
- Minkowski distance: $\left(\sum_{i=1}^n |x_i-y_i|^p\right)^{\frac{1}{p}}$
Both Euclidean distance (p = 2) and Manhattan distance (p = 1) are special cases of Minkowski distance.
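As a quick check of the special cases, here is a minimal sketch of the Minkowski distance. The function name minkowski_distance and the two sample points are made up for illustration.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance between two points; p=1 is Manhattan, p=2 is Euclidean."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Hypothetical points; the coordinate differences are 3, 4 and 0
a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

manhattan = minkowski_distance(a, b, p=1)  # 3 + 4 + 0 = 7
euclidean = minkowski_distance(a, b, p=2)  # sqrt(9 + 16 + 0) = 5
print(manhattan, euclidean)
```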
How to handle data
KNN is sensitive to the value of K:
- If K is too small, the result is easily affected by outliers
- If K is too large, the result is easily affected by class imbalance
In addition, features measured on different scales contribute unequally to the distance: features with larger numeric ranges dominate.
To sum up: we need to standardize the data in advance
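The effect of scale can be seen in a small sketch. The three samples below (height in meters, income in dollars) are invented for illustration: on the raw data the income column decides which neighbor is nearest, while after standardization the height column counts as well and the nearest neighbor changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is height in meters, column 1 is income in dollars
x = np.array([
    [1.60, 30000.0],   # sample 0 (the query)
    [1.90, 30500.0],   # sample 1: very different height, similar income
    [1.61, 30600.0],   # sample 2: almost the same height, slightly larger income gap
])

def euclidean(u, v):
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Raw distances: income dominates, so sample 1 looks closer (about 500 vs about 600)
raw_to_1 = euclidean(x[0], x[1])
raw_to_2 = euclidean(x[0], x[2])

# After standardization both columns are on a comparable scale
x_scaled = StandardScaler().fit_transform(x)
scaled_to_1 = euclidean(x_scaled[0], x_scaled[1])
scaled_to_2 = euclidean(x_scaled[0], x_scaled[2])

print(raw_to_1, raw_to_2)
print(scaled_to_1, scaled_to_2)
```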
API, using the iris dataset as an example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Load the data
iris = load_iris()
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=23)
# Standardize
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# Apply the same transformation to the test set as to the training set (very important!)
x_test = transfer.transform(x_test)
# KNN estimator: build the model
estimator = KNeighborsClassifier(n_neighbors=3)
estimator.fit(x_train, y_train)
# Model evaluation
# Method 1: compare the true values with the predicted values directly
y_predict = estimator.predict(x_test)
print(y_predict == y_test)
# Method 2: compute the accuracy
score = estimator.score(x_test, y_test)
print(score)
The result depends on how the dataset is split (the accuracy is 1 when the random seed is 23)
Summary
- Advantages
- Simple, easy to understand, easy to implement, no training required
- Disadvantages
- Lazy algorithm: classifying test samples is computationally expensive and memory-intensive
- Results are affected by the choice of K
- Use cases
- Small-data scenarios
Model selection and tuning
In the KNN algorithm, we need to determine a K value. Model selection and tuning can help us find a suitable K value.
Cross-validation
Cross-validation makes the evaluation of a model more reliable. Take 4-fold cross-validation as an example: the training set is divided into four groups, and one group is used as the validation set. The model is trained four times, with a different group serving as the validation set each time, and the average of the four scores is taken as the final result.
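The 4-fold procedure can be sketched with sklearn's cross_val_score. This is only an illustration on the iris data: the pipeline (standardization plus KNN with K = 3) is an assumption chosen to mirror the earlier example, and wrapping the scaler in a pipeline means it is refit on the training part of every fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=23
)

# Standardization and KNN combined, so each fold scales with its own training part
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# cv=4: four trainings, each validated on a different quarter of the training set
fold_scores = cross_val_score(model, x_train, y_train, cv=4)
mean_score = fold_scores.mean()

print(fold_scores)   # one accuracy per fold
print(mean_score)    # the average is the final cross-validation result
```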
Hyperparameter Search - Grid Search
Many parameters usually need to be specified manually (such as the K value in KNN); such parameters are called hyperparameters.
We can list candidate values of these hyperparameters in a grid, then evaluate every combination in the grid with cross-validation to find the best one.
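API
from sklearn.model_selection import GridSearchCV
# KNN estimator: build the model
estimator = KNeighborsClassifier(n_neighbors=3)
# Add grid search with cross-validation
param_dict = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=10)
# Run the search on the training set; the best_* attributes only exist after fit()
estimator.fit(x_train, y_train)
# Inspect the best parameters, etc.
print("Best parameters:\n", estimator.best_params_)
print("Best score:\n", estimator.best_score_)
print("Best estimator:\n", estimator.best_estimator_)
print("Cross-validation results:\n", estimator.cv_results_)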
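Why does the best result (best_score_) differ from the accuracy?
Because the two are measured on different data: best_score_ is the mean cross-validation score computed on validation folds taken from the training set, while the accuracy is computed on the test set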
Naive Bayes Algorithm
Assumption: features are conditionally independent of each other given the class
Bayes' theorem: $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
Usage scenario: text classification
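API
from sklearn.naive_bayes import MultinomialNB
# Naive Bayes (MultinomialNB expects non-negative features such as word counts, so do not feed it standardized data)
estimator = MultinomialNB()
estimator.fit(x_train, y_train)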
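Since text classification is the typical use case, here is a minimal sketch of it. The tiny corpus and the "spam"/"ham" labels are invented for illustration, and CountVectorizer is just one possible way to turn text into the word counts that MultinomialNB expects.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, invented for illustration
train_texts = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "project status and meeting notes",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Turn each text into a vector of word counts (non-negative features)
vectorizer = CountVectorizer()
x_counts = vectorizer.fit_transform(train_texts)

# alpha=1.0 (the default) applies Laplace smoothing to the word probabilities
clf = MultinomialNB()
clf.fit(x_counts, train_labels)

# New texts must go through the same fitted vectorizer
new_texts = ["claim your free prize", "notes for the monday meeting"]
predictions = clf.predict(vectorizer.transform(new_texts))
print(predictions)
```

Words that never appeared in the training texts are simply ignored by the fitted vectorizer.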
Summary
- Advantages:
- Stable classification efficiency
- Relatively insensitive to missing data; often used for text classification
- High classification accuracy and fast speed
- Disadvantages:
- Relies on the assumption that sample attributes are independent; if the features are strongly correlated, the results are poor
Decision tree
The classification principle is similar to if-else statements in a program; the key is to find the most efficient order of decisions.
How to Find the Most Efficient Decision Order
It mainly involves two points:
- Select the most suitable feature for branching among many input features
- Find a threshold that serves as the best split point for the chosen feature
For this purpose we use information entropy, which measures the uncertainty of a random variable.
Entropy of a random variable:
$H(X) = -\sum_{i=1}^n p(X_i) \log_2 p(X_i)$
For a sample set D containing d samples, of which $C_i$ belong to class i (i = 1, ..., k), the entropy of the sample set is:
$H(D) = -\sum_{i=1}^k \frac{C_i}{d} \log_2 \frac{C_i}{d}$
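The sample-set entropy can be computed directly from class counts. The helper name entropy and the label lists below are made up for illustration; this is a sketch of the formula, not of how sklearn computes it internally.

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum_i (C_i / d) * log2(C_i / d), where C_i is the count of class i."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical label sets
balanced = entropy(["yes", "yes", "no", "no"])    # two equally likely classes: 1 bit
pure = entropy(["yes", "yes", "yes", "yes"])      # a single class: no uncertainty, 0
skewed = entropy(["yes", "yes", "yes", "no"])     # somewhere in between

print(balanced, pure, skewed)
```

A split is good when it produces subsets whose entropy is lower than that of the parent set.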
Based on information entropy, classification can be carried out with three decision tree algorithms, which will be covered in a separate article.
API
from sklearn.tree import DecisionTreeClassifier, export_graphviz
# Decision tree
estimator = DecisionTreeClassifier(criterion="entropy")
estimator.fit(x_train, y_train)
# Compute the accuracy
score = estimator.score(x_test, y_test)
print(score)
# Visualize the decision tree
export_graphviz(estimator, out_file="tree.dot", feature_names=iris.feature_names)
Summary
- Advantages
- Simple to understand and interpret; the tree can be visualized
- Disadvantages
- Can create overly complex trees that do not generalize well to the data (overfitting)
- Improvements
- Pruning
- Random forest
Random forest
A random forest is a classifier that contains multiple decision trees; its output class is the mode of the classes output by the individual trees.
How it works
Randomness manifests itself in two ways:
- Random training set: bootstrap sampling (random sampling with replacement)
- Random features: randomly draw m features from the M features (M >> m, which has a dimensionality-reduction effect)
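The two sources of randomness can be sketched with NumPy. The sizes and the random data are made up, and the sketch draws one feature subset per tree as described above; as far as I know, sklearn's RandomForestClassifier instead draws a fresh feature subset at every split (controlled by max_features), so treat this as a simplified picture.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 10, 6   # N samples and M features (hypothetical sizes)
m = 2                           # number of features drawn for one tree, m < M

x = rng.normal(size=(n_samples, n_features))

# Bootstrap sampling: draw N row indices WITH replacement, so rows can repeat
row_idx = rng.integers(0, n_samples, size=n_samples)

# Feature sampling: draw m distinct column indices WITHOUT replacement
col_idx = rng.choice(n_features, size=m, replace=False)

# The training data that one tree of the forest would see
x_tree = x[row_idx][:, col_idx]

print(sorted(row_idx.tolist()))
print(col_idx, x_tree.shape)
```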
API
Because there are several hyperparameters, we can add grid search to find the best combination
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
estimator = RandomForestClassifier()
# Grid search
param_dict = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
# Run the search on the training set
estimator.fit(x_train, y_train)
Summary
- Among current algorithms, it has excellent accuracy
- Runs efficiently on large datasets and can handle samples with high-dimensional features without dimensionality reduction (because features are randomly sampled)
- Can assess the importance of individual features in a classification problem (also thanks to the random sampling of features)
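Feature importance can be read from a fitted forest through its feature_importances_ attribute. A minimal sketch on the iris data follows; n_estimators=100 and random_state=0 are arbitrary choices made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# Arbitrary settings for illustration; random_state only makes the run repeatable
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# One importance value per feature; the values sum to 1
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```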