Movie video portrait scene classification algorithm based on OpenCV (source code & tutorial)

1. Research Background

In recent years, with the rapid development of multimedia technology, video data has grown explosively, and content-based video retrieval has become an urgent need. In the field of film in particular, simple playback can no longer satisfy users' growing demands: how to accurately and quickly retrieve the movie clips that reflect changes in the audience's emotions has become a research hotspot. Against this background, the detection of scene scales in film video emerged. Based on a systematic study of directors' creative techniques, and on the principle that combinations of shots of different scales affect the audience's visual psychology, a scene-scale recognition and detection method for movie video is designed.

Analysis of previous work shows that most existing shot-scale classification features were designed for sports video and perform poorly on movies. Combining knowledge of the film domain, this work constructs features such as the local motion occupancy rate, camera motion, and inter-shot similarity; together with other common features, a Bayesian classifier is used to classify movie shots into long shots, medium shots and close shots, and the result is compared with other classification algorithms. According to the relationship between shot-scale changes and audience emotion, five scene scales that can stimulate audience emotion are designed, and scene-scale detection is implemented on top of shot-scale recognition.

Experimental results show that the new features better reflect the characteristics of different shot scales: compared with existing methods, the recognition of long shots and close shots improves in both precision and recall to varying degrees. However, because the medium shot is inherently ambiguous and the constructed features do not capture its characteristics well, its recognition rate remains unsatisfactory. Since shot-scale recognition directly determines the quality of scene-scale detection, improving the recognition accuracy of shot scales is the focus of future research.
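As a toy illustration of the classification step just described, the sketch below feeds hypothetical per-shot feature vectors (local motion occupancy rate, camera motion, inter-shot similarity) to a Gaussian naive Bayes classifier. All numbers are invented placeholders, not the features or model trained in this work.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical training set: one row per shot, columns =
# [local motion occupancy rate, camera motion, inter-shot similarity]
X_train = np.array([
    [0.05, 0.80, 0.90],  # long shots: little local motion, much camera motion
    [0.10, 0.70, 0.85],
    [0.35, 0.40, 0.60],  # medium shots
    [0.30, 0.45, 0.55],
    [0.70, 0.10, 0.20],  # close shots: the subject fills the frame
    [0.65, 0.15, 0.25],
])
y_train = np.array([0, 0, 1, 1, 2, 2])  # 0 = long, 1 = medium, 2 = close

clf = GaussianNB().fit(X_train, y_train)
print(clf.predict([[0.60, 0.15, 0.30]]))  # expected: class 2 (close shot)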

2. Function introduction

[Figure: 4.png]

3. Scene classification process & result presentation

[Figure: 1.png]

[Figure: 3.png]

4. Video display

Video portrait scene classification algorithm based on OpenCV (source code & tutorial), on bilibili

5. Scene classification of film works

In order to meet the above needs, we studied related fields such as cinematography and visual psychology, and found that the shot scales of a film can effectively induce emotional changes in the audience. Therefore, by detecting and identifying specific scene scales in video clips, it is possible to locate the clips in a movie where the audience's emotions change. This can be applied to video retrieval and video-on-demand, and is also one of the key technologies for movie summary generation. It has high academic value and broad application prospects, but it is also challenging.
From the perspective of photography, shot scale refers to the change in the size of the subject's image on the screen, caused either by camera movement relative to a stationary subject or by the movement of the subject itself. There is no strict standard for dividing shot scales; the usual division is into the long shot, the medium shot, and the close-up shot. Some specific occasions require a finer division; for example, close-ups may be further divided into ordinary close-ups and extreme close-ups. The three scales are defined as follows:

(1) Long shot

A long shot is taken from a relatively long distance and therefore reflects more scene information. In movie videos it is mainly used to explain the time, place and background of the scene, and it generally serves as a transition shot at the beginning or end of an entire movie scene. In long shots, the human body occupies only a small area of the frame.

(2) Medium shot

The medium shot is the most commonly used type of shot in film video. Unlike the long shot, which attends to the whole and ignores details, and unlike the close-up, which emphasizes details at the expense of the whole, the medium shot puts the overall information in a secondary position. A character filmed in medium shot occupies a larger part of the frame than in a long shot, generally covering the human body from the knees up; some treat a shot from the shoulders up as a medium shot as well. Although this division is somewhat subjective, the shape of the human body can be seen clearly in a medium shot.

(3) Close-up shot

Taking the characters in the picture as reference, a close-up shot frames only the part of the character above the neck, or some particular detail of the human body. Compared with medium shots, close-up shots reflect more of the details of people or objects; their function is to make these details stand out from the surrounding environment. Close-up shots are usually more expressive and attract the audience's attention.
Different shot scales give the audience different feelings. For example, long shots often explain the time, place or background, giving an objective impression, while close-up shots describe more details and tend to attract the audience's attention. Therefore, a shot sequence of long shot-medium shot-medium shot-close shot shifts the audience's emotions from detachment to concentration; the reverse sequence, close shot-medium shot-long shot, eases tense emotions; and a run of consecutive close-ups (close-up-close-up-close-up) may be a fight scene that some audiences love. The shot combinations of certain scenes can often arouse the audience's emotional resonance. A combination of shots of different scales that can arouse emotional changes is called a scene scale (Scene Scale). These shots of different scales are like piano keys of different pitches, playing flowing notes on the elegant modern piano of the movie.
[Figure: image.png]
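As a minimal sketch of how scene-scale detection can sit on top of shot-scale recognition, the function below scans a sequence of per-shot labels for hand-picked combination patterns. The three patterns listed are illustrative examples only, not the five scene scales designed in this work.

def find_scene_scales(shot_labels, patterns=("LMMC", "CML", "CCC")):
    """Scan a shot-label string (L = long, M = medium, C = close) for
    combination patterns; return (pattern, starting shot index) pairs."""
    s = "".join(shot_labels)
    hits = []
    for p in patterns:
        start = s.find(p)
        while start != -1:
            hits.append((p, start))
            start = s.find(p, start + 1)
    return sorted(hits, key=lambda hit: hit[1])

# Example: one objective-to-focused progression and one run of close-ups
print(find_scene_scales("LLMMCCC"))  # -> [('LMMC', 1), ('CCC', 4)]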

6. Algorithm flow chart

[Figure: image.png]

7. Facial features

The method proposed in this blog builds on object detection, of which face detection has long been an important, heavily studied branch. People are the most frequently appearing element in movie scenes; it is no exaggeration to say that film is an art that expresses human behavior. Therefore, if the position and size of a face, or even of a whole human body, can be determined, judging the shot scale of a movie video becomes relatively easy. Face detection determines whether a face is present in a video frame; face localization finds the exact position of the face in the frame, on the premise that a face is present, so localization builds on detection. What this work cares about is whether a face appears and what its size and position in the picture are, which requires combining face detection and face localization in the implementation. Specifically, the AdaBoost algorithm (the Viola-Jones detector) is adopted, because it is both accurate and fast.
Humans are the most frequent element in movie videos, and the face is the most distinctive human feature. In most cases the whole body does not appear in the picture, but the upper body, including the face, often does, and the face generally appears in a prominent position in the frame. In close-up dialogue shots the face usually takes up a relatively large proportion of the frame and is generally frontal, so it is easy to detect. In medium shots there are often several faces in one frame; each occupies a relatively small proportion and is not necessarily in the center. Long shots generally do not show faces at all, mostly conveying temporal and spatial transition information; even when faces do appear in long shots, they are usually difficult to detect.
To sum up, identifying the shot scale requires extracting face information from a video frame: specifically, the position where each face appears, the size of each face, and the number of faces. The figure below gives some examples of face detection, and a code sketch follows the figure.
[Figure: image.png]
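As a minimal sketch of extracting this face information, the snippet below uses OpenCV's bundled Haar cascade, which is trained with AdaBoost, to report the face count, positions and relative sizes, and then applies a toy threshold heuristic echoing the observations above. The frame path and the 5% size threshold are illustrative assumptions, not the classifier trained in this work.

import cv2

# Load OpenCV's bundled AdaBoost-trained frontal-face Haar cascade
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("frame.jpg")  # placeholder path: one sampled video frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

h, w = gray.shape
ratios = [(fw * fh) / float(w * h) for (fx, fy, fw, fh) in faces]

# Toy heuristic: no face suggests a long shot, a dominant face a close shot,
# anything in between a medium shot (threshold is illustrative only)
if len(faces) == 0:
    label = "long shot"
elif max(ratios) > 0.05:
    label = "close shot"
else:
    label = "medium shot"
print(len(faces), "face(s) ->", label)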

8. Code implementation
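The snippet below is a self-contained demo of two-class Fisher linear discriminant analysis (LDA) on synthetic data: it computes the within-class scatter matrix, derives the projection direction w, and plots both the original two-dimensional samples and their one-dimensional projections. It illustrates the linear classification step in isolation and is not tied to any particular video.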

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification


def LDA(X, y):
    # Split the samples by class label
    X1 = np.array([X[i] for i in range(len(X)) if y[i] == 0])
    X2 = np.array([X[i] for i in range(len(X)) if y[i] == 1])

    len1 = len(X1)
    len2 = len(X2)

    # Class means (centroids)
    mju1 = np.mean(X1, axis=0)
    mju2 = np.mean(X2, axis=0)

    # Within-class scatter matrix Sw
    cov1 = np.dot((X1 - mju1).T, (X1 - mju1))
    cov2 = np.dot((X2 - mju2).T, (X2 - mju2))
    Sw = cov1 + cov2

    # Projection direction w = Sw^-1 (mju1 - mju2)
    w = np.dot(np.linalg.inv(Sw), (mju1 - mju2).reshape((len(mju1), 1)))
    X1_new = project(X1, w)
    X2_new = project(X2, w)
    y1_new = [1 for i in range(len1)]
    y2_new = [2 for i in range(len2)]
    return X1_new, X2_new, y1_new, y2_new


def project(x, w):
    # Project the samples onto the discriminant direction w
    return np.dot(x, w)


if __name__ == '__main__':
    # Generate a toy two-class 2-D dataset
    X, y = make_classification(n_samples=500, n_features=2, n_redundant=0, n_classes=2,
                               n_informative=1, n_clusters_per_class=1, class_sep=0.5, random_state=10)

    X1_new, X2_new, y1_new, y2_new = LDA(X, y)

    # Original 2-D samples, colored by class
    plt.scatter(X[:, 0], X[:, 1], marker='o', c=y)
    plt.show()

    # 1-D projections of the two classes
    plt.plot(X1_new, y1_new, 'b*')
    plt.plot(X2_new, y2_new, 'ro')
    plt.show()

9. System Integration

The complete source code, an environment-deployment video tutorial, and the custom UI interface shown in the picture below are provided:
[Figure: 5.png]
Refer to the blog "OpenCV-Based Movie Video Portrait Scene Classification Algorithm (Source Code & Tutorial)"

