Face Dataset Classification Based on Random Forest Algorithm

1. About the authors

Li Jiamin, female, School of Electronic Information, Xi'an Polytechnic University, graduate student, class of 2021
Research direction: pattern recognition and artificial intelligence
Email: [email protected]

Wu Yanzi, female, School of Electronic Information, Xi'an Polytechnic University, graduate student, class of 2021, Zhang Hongwei Artificial Intelligence Research Group
Research direction: pattern recognition and artificial intelligence
Email: [email protected]

2. Theoretical background

Random forest

Random forest is in essence a special case of bagging that uses decision trees as the base models. First, the bootstrap method is used to generate m training sets. Then a decision tree is built on each training set. When a node searches for a feature to split on, it does not examine all features for the one that maximizes the splitting criterion (such as information gain); instead, it randomly draws a subset of the features, finds the best split within that subset, and applies it at the node. Random forest therefore combines the bagging (ensemble) idea with feature subsampling, which amounts to sampling both the samples and the features (if the training data is viewed as a matrix, as is common in practice, then both its rows and its columns are sampled), and this helps avoid overfitting.
In short: random forest builds multiple decision trees and merges their outputs to obtain more accurate and stable predictions.
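To make the two levels of randomness concrete, here is a minimal from-scratch sketch using scikit-learn's DecisionTreeClassifier as the base learner (fit_toy_forest and predict_toy_forest are illustrative names, not a library API; the actual experiment below simply uses RandomForestClassifier):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_toy_forest(X, y, n_trees=10, seed=0):
    # bootstrap the rows; max_features="sqrt" makes each split search only
    # a random subset of the features, as described above
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n_samples, n_samples)  # sample rows with replacement
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 31)))
        forest.append(tree.fit(X[rows], y[rows]))
    return forest

def predict_toy_forest(forest, X):
    votes = np.array([t.predict(X) for t in forest])  # shape (n_trees, n_samples)
    # merge the trees by majority vote, one column (sample) at a time
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)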

3. Experimental procedure

3.1 Dataset introduction

Labeled Faces in the Wild (LFW) is a database of face photographs designed for studying the problem of unconstrained face recognition. The dataset contains more than 13,000 face images collected from the web, each labeled with the name of the person pictured; 1,680 of the people have two or more distinct photos in the dataset. The images themselves are JPEG pictures of famous people collected on the Internet.
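As a quick sanity check before the experiment, the scikit-learn loader exposes the retained class names and image shapes; a minimal sketch (the first call downloads the data):

from sklearn.datasets import fetch_lfw_people

# keep only people with at least 230 photos, matching the experiment below
lfw = fetch_lfw_people(min_faces_per_person=230)
print(lfw.target_names)   # names of the retained people
print(lfw.images.shape)   # (n_samples, height, width)
print(lfw.data.shape)     # flattened pixel features, one row per image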
The code below is written in Python; it strives to be well commented and is aimed at beginners. The only packages required are numpy and scikit-learn.

3.2 Experimental code

import numpy as np
from sklearn.datasets import fetch_lfw_people

X, labels = fetch_lfw_people(return_X_y=True, min_faces_per_person=230)
print(X.shape, labels.shape)
print(len(set(labels)))  # number of classes

from sklearn.decomposition import PCA

pca = PCA(n_components=200, random_state=2022)  # PCA dimensionality reduction: compress the pixel features to 200 components
X = pca.fit_transform(X)  # fit the PCA and transform the data
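# Optional sanity check (a sketch, not part of the original script): how much
# of the pixel variance the 200 retained components explain
print("explained variance ratio:", pca.explained_variance_ratio_.sum())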

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=2022, test_size=0.25, shuffle=True)
# split the dataset; the test set takes 25% of the samples

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1,
                            random_state=2022,
                            max_depth=9)  # random forest model; n_estimators = number of trees, n_jobs=-1 uses all CPU cores, random_state fixes the seed so results are reproducible
rf.fit(X_train, y_train)  # train
print("Training set accuracy", rf.score(X_train, y_train))
print("Test set accuracy", rf.score(X_test, y_test))

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

pre = rf.predict(X_test)  # predict class labels
print("Test set accuracy", accuracy_score(y_test, pre))
print("Test set recall", recall_score(y_test, pre, average="weighted"))
print("Test set precision", precision_score(y_test, pre, average="weighted"))
print("Test set F1", f1_score(y_test, pre, average="weighted"))

from sklearn.model_selection import KFold

metrics = []  # evaluation metrics
kf = KFold(n_splits=5, shuffle=True, random_state=2022)  # n_splits = number of folds; shuffle the data; random_state fixes the seed so results are reproducible
for train, test in kf.split(X):
    X_train, X_test, y_train, y_test = X[train], X[test], labels[train], labels[test]  # re-split the dataset for this fold
    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=2022, max_depth=9)
    rf.fit(X_train, y_train)
    pre = rf.predict(X_test)  # predict class labels
    metrics.append([accuracy_score(y_test, pre),
                    recall_score(y_test, pre, average="weighted"),
                    precision_score(y_test, pre, average="weighted"),
                    f1_score(y_test, pre, average="weighted"),
                    ])
metrics = np.mean(metrics, axis=0)  # average the metrics over the folds

print("5-fold cross-validation accuracy, recall, precision and F1:", metrics)

3.3 Running Results

[Figure: screenshot of the console output produced by the code above]

3.4 Experimental summary

One advantage of random forest is that it can be used for both regression and classification tasks, and it makes it easy to see the relative importance of the model's input features. Random forest is also considered a very convenient, easy-to-use algorithm because its default hyperparameters usually already produce good predictions; the number of hyperparameters is small, and their meanings are intuitive and easy to understand. Overfitting is a major problem in machine learning, but most of the time it does not trouble a random forest classifier: as long as there are enough trees in the forest, the classifier tends not to overfit the model.
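To see the feature-importance point on this experiment, a minimal sketch (assuming the fitted rf and pca from the code above are still in scope; note that each "feature" here is a PCA component, not a raw pixel):

import numpy as np

# indices of the five PCA components the forest relied on most
top = np.argsort(rf.feature_importances_)[::-1][:5]
print(top, rf.feature_importances_[top])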
The main limitation of random forest is that a large number of trees makes the algorithm slow, which can rule out real-time prediction. In general these models are fast to train but comparatively slow to predict: more accurate predictions require more trees, and more trees mean a slower model. In most real-world applications the random forest algorithm is fast enough, but there will certainly be cases with strict real-time requirements, where other methods have to be preferred.
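Whether prediction latency actually matters for a given setup is easy to measure; a minimal sketch using Python's standard time module on the fitted rf above:

import time

start = time.perf_counter()
rf.predict(X_test)
print("prediction time per sample (s):",
      (time.perf_counter() - start) / len(X_test))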
