Computer Vision: Scene Recognition

Please follow this link to download the complete program.

Scene recognition

In this project, I train and test on a 15-category scene database (Bedroom, Coast, Forest, Highway, Industrial, InsideCity, Kitchen, LivingRoom, Mountain, Office, OpenCountry, Store, Street, Suburb, TallBuilding), using HOG feature extraction to build a bag-of-words model and an ensemble learning classifier to assign each scene to one of the 15 categories.

Image classification

Image classification is an important problem in computer vision. Its basic idea is to automatically assign images to specific conceptual categories by means of an algorithm. An image classification algorithm is generally divided into two stages, training and testing, and its basic workflow is shown in the figure below.
[Figure: image classification workflow (training and testing stages)]

Feature extraction

HOG stands for Histogram of Oriented Gradients. It is a feature descriptor used for object detection in computer vision and image processing. Feature extraction proceeds as follows:
(1) convert the image to grayscale;
(2) standardize the color space of the input image with gamma correction;
(3) compute the gradient (magnitude and direction) of every pixel in the image;
(4) divide the image into cells, each composed of several pixels, and accumulate a gradient-orientation histogram per cell to form the feature descriptor of each cell;
(5) group several cells into a block, concatenate the features of all cells in the block, and normalize the gradient histogram to form the feature descriptor of the block;
(6) slide the detection window over the blocks and concatenate the feature descriptors of all blocks to generate the HOG feature vector.
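As an illustration, below is a minimal sketch of HOG extraction using scikit-image. The fixed input size, cell size, block size and number of orientation bins are assumptions made for the example, not the exact settings used in the project.

```python
# Minimal HOG-extraction sketch with scikit-image; the resize target and the
# cell/block/orientation settings are illustrative assumptions, not the
# project's exact parameters.
from skimage import color, transform
from skimage.feature import hog

def extract_hog(image, size=(128, 128)):
    """Return the HOG descriptor of one image as a 1-D feature vector."""
    gray = color.rgb2gray(image) if image.ndim == 3 else image   # (1) grayscale
    gray = transform.resize(gray, size, anti_aliasing=True)      # fixed input size
    return hog(
        gray,
        orientations=9,             # (3) 9 gradient-orientation bins
        pixels_per_cell=(8, 8),     # (4) cell size in pixels
        cells_per_block=(2, 2),     # (5) cells per block
        block_norm="L2-Hys",        # (5) block normalization
        transform_sqrt=True,        # (2) gamma (power-law) compression
        feature_vector=True,        # (6) concatenate block descriptors
    )
```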

Bag-of-words model

The bag-of-words model is a popular image classification technique inspired by text recognition and classification in natural language processing. It converts an image into a set of local features, ignores the spatial layout of those features within the image, and classifies based on a histogram of visual-word frequencies. The "vocabulary" of visual words is built by clustering a large number of local features; once the "vocabulary" is constructed, the classifier is trained on the training set.
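As a sketch of this idea, the code below builds a visual-word vocabulary by clustering dense local HOG descriptors with Mini Batch K-Means and represents an image as a normalized word-frequency histogram. The patch size, sampling step and vocabulary size are illustrative assumptions, and word assignment here uses the nearest cluster center rather than an explicit similarity threshold.

```python
# Bag-of-visual-words sketch: cluster dense local HOG descriptors into a
# vocabulary and encode each image as a word-frequency histogram. Patch size,
# sampling step and vocabulary size are illustrative assumptions.
import numpy as np
from skimage.feature import hog
from skimage.util import view_as_windows
from sklearn.cluster import MiniBatchKMeans

def local_hog_descriptors(gray, patch=32, step=16):
    """One HOG descriptor per densely sampled patch of a grayscale image."""
    windows = view_as_windows(gray, (patch, patch), step).reshape(-1, patch, patch)
    return np.array([hog(w, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                     for w in windows])

def build_vocabulary(descriptor_sets, n_words=200):
    """Cluster all local descriptors into n_words visual words."""
    vocab = MiniBatchKMeans(n_clusters=n_words, batch_size=1024, random_state=0)
    return vocab.fit(np.vstack(descriptor_sets))

def bow_histogram(descriptors, vocab):
    """Represent one image as a normalized histogram of visual-word counts."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```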

Ensemble Learning Classifier

Pedestrian detection using HOG feature extraction and an SVM classifier was proposed by Dalal and Triggs (INRIA, France) at CVPR 2005 and has since been widely used in image recognition. Although many pedestrian detection algorithms have been proposed, most are still based on the "HOG+SVM" idea.
In machine learning, we can gather several learning algorithms, let each of them predict the same problem, and take a majority vote on the result. This approach is called ensemble learning. Ensemble learning combines multiple base learners to obtain better generalization performance than any single learner.
In the scene recognition algorithm based on the bag-of-words model, a linear support vector machine, a random forest and a histogram gradient boosting classifier, each of which achieves high scene recognition accuracy on its own, are combined by voting into an ensemble learning classifier.
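A minimal sketch of how such a voting ensemble might be assembled with scikit-learn is shown below; the hyperparameter values are illustrative assumptions rather than the project's tuned settings.

```python
# Hard-voting ensemble over the three base learners; hyperparameters are
# illustrative assumptions, not the project's tuned values.
from sklearn.ensemble import (HistGradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.svm import LinearSVC

def make_ensemble():
    """Combine a linear SVM, a random forest and a histogram gradient
    boosting classifier by majority (hard) voting."""
    return VotingClassifier(
        estimators=[
            ("svm", LinearSVC(C=1.0)),
            ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
            ("hgb", HistGradientBoostingClassifier(random_state=0)),
        ],
        voting="hard",  # LinearSVC exposes no predict_proba, so hard voting is the simple choice
    )

# Hypothetical usage, assuming X_train/X_test hold bag-of-words histograms:
# clf = make_ensemble()
# clf.fit(X_train, y_train)
# print(clf.score(X_test, y_test))
```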

Algorithm design

For this project, I extract HOG features from the images, use the bag-of-words model to construct a vocabulary, and design a scene recognition algorithm around an ensemble learning classifier composed of a random forest classifier, a histogram gradient boosting classifier and a linear support vector machine classifier.
The scene recognition algorithm consists of the following five steps (a wiring sketch on stand-in data follows the list):
(1) Extract local HOG feature vectors from each image. To achieve good classification, the extracted feature vectors should be invariant, to varying degrees, to rotation, scaling, translation, illumination and other changes.
(2) Using the feature vector set from the previous step, extract representative vectors as visual words with the Mini Batch K-Means clustering algorithm to form the "vocabulary". Because the training set and the number of features are large, traditional K-means clustering would take too long; Mini Batch K-Means samples the data in small batches on top of K-means, preserving clustering accuracy as far as possible while greatly reducing computation time.
(3) Count the visual words in each image, generally by checking whether the similarity between a local region of the image and a word exceeds a threshold. The image can then be represented as a distribution over words, which completes the image representation.
(4) Design and train the ensemble learning classifier, and classify each image from its word distribution.
(5) Build the confusion matrix, then score and visualize the recognition algorithm.
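The sketch below only shows how steps 2-5 connect, using random stand-in data in place of real HOG descriptors and labels so that it stays self-contained; every size and hyperparameter is an illustrative assumption.

```python
# Wiring sketch of steps 2-5 on synthetic stand-in data (random "descriptors"
# and labels); in the real project these come from HOG extraction on the
# 15-scene database. All sizes and hyperparameters are illustrative.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import (HistGradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_images, n_local, dim, n_words, n_classes = 60, 50, 324, 50, 15

# Step 2: build the visual vocabulary from all local descriptors.
descriptors = [rng.normal(size=(n_local, dim)) for _ in range(n_images)]
vocab = MiniBatchKMeans(n_clusters=n_words, batch_size=1024, random_state=0)
vocab.fit(np.vstack(descriptors))

# Step 3: encode every image as a visual-word histogram.
def encode(desc):
    hist = np.bincount(vocab.predict(desc), minlength=n_words).astype(float)
    return hist / hist.sum()

X = np.array([encode(d) for d in descriptors])
y = rng.integers(0, n_classes, size=n_images)

# Step 4: train the voting ensemble on the word distributions.
ensemble = VotingClassifier(
    estimators=[("svm", LinearSVC(C=1.0)),
                ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("hgb", HistGradientBoostingClassifier(random_state=0))],
    voting="hard")
ensemble.fit(X[:40], y[:40])

# Step 5: score on the held-out part (confusion-matrix code appears in the results section).
print(accuracy_score(y[40:], ensemble.predict(X[40:])))
```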

Result analysis

I trained multiple classifiers and measured their accuracy on the test set. After comparison, the classifier I designed has the highest accuracy:
Nearest Neighbor Classifier: Accuracy (55.0%)
Random Forest Classifier: Accuracy (69.1%)
Histogram Gradient Boosting Classifier: Accuracy (72.1%)
Linear SVM Classifier: Accuracy (72.7%)
Ours: Accuracy (74.2%)

The confusion matrix is the most basic, intuitive and simplest way to measure the accuracy of a classification model. It counts, for each category, the number of observations the model classifies correctly and incorrectly, and displays the results in a table. The figure below shows the confusion matrix of our model's classification results. The brighter the squares on the diagonal of the confusion matrix and the darker the other squares, the better the classification. The figure shows that Street has the highest recognition accuracy and Kitchen the lowest.

[Figure: confusion matrix of the ensemble classifier]
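The accuracy score and the confusion matrix plot can be produced with scikit-learn and matplotlib, as in the sketch below, which assumes integer test labels `y_test` and predictions `y_pred` that follow the category order listed above.

```python
# Scoring and confusion-matrix visualization sketch; assumes y_test / y_pred
# are integer labels (0-14) ordered like CATEGORIES below.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, confusion_matrix

CATEGORIES = ["Bedroom", "Coast", "Forest", "Highway", "Industrial",
              "InsideCity", "Kitchen", "LivingRoom", "Mountain", "Office",
              "OpenCountry", "Store", "Street", "Suburb", "TallBuilding"]

def report(y_test, y_pred):
    """Print overall accuracy and plot the 15x15 confusion matrix."""
    print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")
    cm = confusion_matrix(y_test, y_pred, labels=list(range(len(CATEGORIES))))
    disp = ConfusionMatrixDisplay(cm, display_labels=CATEGORIES)
    disp.plot(xticks_rotation=45, cmap="viridis")  # bright diagonal = good classification
    plt.tight_layout()
    plt.show()
```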
Applying the model to the test set, its classification results are visualized in the figure below, which summarizes the accuracy of each category and shows examples of correct and incorrect recognitions (true positives TP, false positives FP and false negatives FN).
[Figure: per-category accuracy with example true positives, false positives and false negatives]

Summary and Outlook

Summary

For the 15-category scene database (Bedroom, Coast, Forest, Highway, Industrial, InsideCity, Kitchen, LivingRoom, Mountain, Office, OpenCountry, Store, Street, Suburb, TallBuilding), I extracted HOG features from the images, built a vocabulary with the bag-of-words model, and designed a scene recognition algorithm using an ensemble learning classifier composed of a random forest classifier, a histogram gradient boosting classifier and a linear support vector machine classifier. The final algorithm reaches an accuracy of 74.2%.

Outlook

In order to improve the accuracy of the algorithm, the following aspects can be considered:
(1) In terms of features, consider adding spatial location information or extracting multiple kinds of features to build the visual-word vocabulary.
(2) In terms of the model, consider enlarging the dataset and using deep learning methods for scene recognition.
