Design and implementation of a movie sharing social platform based on a Spark cluster



Preface

  This paper addresses the difficulty of choosing among the vast amount of information on the Internet by designing and implementing a movie sharing social platform based on a Spark cluster. The platform uses the SpringBoot framework to build the Web application and uses the Spark cluster with a collaborative filtering algorithm to process the data: it calculates and recommends movies that users may like, filters out unwanted spam, and provides personalized services. It uses Socket technology to implement real-time online chat so that users can interact with each other, and it adds a movie review function so that, before choosing a movie to watch, users can quickly judge whether it suits them from its ratings and other people's reviews. The platform thereby eases the user's selection problem and improves noticeably on traditional movie websites in terms of intelligent recommendation.

Keywords: recommendation system; collaborative filtering algorithm; Spark; Socket

1. Research Roadmap

Figure 1.1 Research Roadmap
Related theories: logistic regression, K nearest neighbor method, decision tree.
Data preprocessing: Handle missing values and parameter variables in the data.
Model training: The three models are trained separately.
Parameter comparison: Compare the result parameters of the three models and analyze their relative strengths.

2. Development environment

2.1 Introduction to framework and basic technology

2.1.1 Introduction to Spark and Hadoop

  Spark is a general-purpose, in-memory parallel computing framework that provides fast iterative computation over large amounts of data. Although it was released relatively recently, it has become one of the mainstream cluster computing platforms; it is well suited to large-scale data processing and reduces the problem of high latency. Another important cluster computing platform is the open source distributed computing platform Hadoop, whose MapReduce played a key role in the development of Spark. Spark's computing module keeps the advantages of MapReduce's distributed computing and discards its weaknesses. On this basis, it uses RDDs to make computation more flexible: data is computed directly in memory and can be iterated over, whereas MapReduce can only save intermediate results to disk, so computing efficiency improves greatly. According to official figures, Spark can be more than 10 times faster than Hadoop MapReduce when data is read from disk, and up to 100 times faster when data is read from memory. A second major feature of Spark is the RDD (Resilient Distributed Dataset) [6], which Spark treats as its most basic unit. RDDs are read-only collections that cannot be changed. They are called resilient because, even if the data is lost, the collection can be reconstructed by replaying the derivation that produced it, which ensures Spark's fault tolerance. The advantages of RDDs do not stop there. An RDD does not store the real data itself. Instead, it is an abstract dataset that stores indexes, uses them to locate where the data is really stored, and then fetches the data through interfaces for computation.
Features: As much data as possible is kept in memory, where it can be processed directly without taking up hard disk space, which greatly reduces I/O overhead. Data is written to the hard disk only when memory is exhausted. Thanks to RDDs, Spark far surpasses Hadoop in performance. This system runs its data computation on Spark integrated into the Hadoop ecosystem.
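The way a lost RDD is rebuilt from its derivation can be illustrated with a toy model. The following is a minimal sketch in plain Python, not Spark's API: the class `ToyRDD` and its methods are hypothetical names chosen for illustration. Each dataset is read-only and remembers only its parent and the transformation that produced it, so a lost in-memory result can be recomputed by replaying that lineage.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Hypothetical, simplified stand-in for an RDD (not Spark's real API)."""

    def __init__(self,
                 source: Optional[Iterable[int]] = None,
                 parent: Optional["ToyRDD"] = None,
                 transform: Optional[Callable[[List[int]], List[int]]] = None):
        self._source = tuple(source) if source is not None else None  # read-only input
        self._parent = parent          # lineage: where the data came from
        self._transform = transform    # lineage: how it was derived
        self._cache: Optional[List[int]] = None  # in-memory result, may be lost

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation returns a new dataset; the original is never modified.
        return ToyRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        return ToyRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self) -> List[int]:
        # Recompute from the lineage only when no in-memory result is available.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return list(self._cache)

    def lose_cache(self) -> None:
        # Simulates losing the in-memory data, for example when a node fails.
        self._cache = None


ratings = ToyRDD(source=[1, 2, 3, 4, 5])
high = ratings.filter(lambda r: r >= 3).map(lambda r: r * 2)

first = high.collect()
high.lose_cache()          # the computed data is gone
second = high.collect()    # rebuilt by replaying filter and map
```

In this sketch `second` equals `first` even though the cached result was discarded in between, which is the fault-tolerance property the text attributes to RDDs.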

2.2.1 Collaborative filtering recommendation algorithm

  The collaborative filtering recommendation algorithm [8] (CF, Collaborative Filtering) infers a user's behavioral habits, interests, and hobbies from the pages the user has browsed, the items the user has rated highly, and the items the user often follows, and then recommends items the user may be interested in. After the raw data is obtained, it is preprocessed into a user-item rating matrix, on which the algorithm runs. The calculation yields a predicted score for the current user's preference for each other item, and the items with high predicted scores, which are likely to match the user's preferences, are recommended. This algorithm is well suited to commercial e-commerce platforms, where it can increase click-through and purchase rates and thereby turnover.
  Depending on the model and calculation method used, collaborative filtering algorithms can be divided into the item-based collaborative filtering algorithm (ItemCF) and the user-based collaborative filtering algorithm (UserCF).
The item-based collaborative filtering algorithm collects each user's ratings and reviews of items on the platform, takes them as the user's like-or-dislike judgment of each item, calculates the similarity between items, and then recommends the items most similar to the ones the user already likes. An example is shown in Table 2-1 below:
Table 2-1 Item-based collaborative filtering user-item rating matrix

  From users' simple ratings of the same items, it can be seen that users who like item A are also very likely to like item C, so item C is judged to be similar to item A. Item C is therefore recommended to users who like item A but have not yet come across item C. The approach relies on users' preference for some particular quality of an item: that quality is taken to be what appeals to a specific group of users, and on this basis the other items that interest users who like this item are counted and scored, and the items with higher scores are recommended. The above is a brief introduction to the item-based collaborative filtering algorithm.
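The item-based procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's implementation: the ratings below are hypothetical (the data of Table 2-1 is not reproduced here), cosine similarity is one possible similarity measure, and the function names are invented for the example.

```python
import math
from typing import Dict, List

# Hypothetical like data (1 = the user likes the item); a missing entry means
# the user has not rated that item yet.
ratings: Dict[str, Dict[str, float]] = {
    "U1": {"A": 1, "C": 1},
    "U2": {"A": 1, "B": 1, "C": 1},
    "U3": {"A": 1},
}


def item_vector(item: str) -> Dict[str, float]:
    """Column of the user-item matrix: every user's rating of one item."""
    return {user: row[item] for user, row in ratings.items() if item in row}


def cosine(x: Dict[str, float], y: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors keyed by user."""
    dot = sum(x[k] * y[k] for k in x if k in y)
    norm = (math.sqrt(sum(v * v for v in x.values()))
            * math.sqrt(sum(v * v for v in y.values())))
    return dot / norm if norm else 0.0


def recommend(user: str, top_n: int = 1) -> List[str]:
    """Score each unseen item by its similarity to the items the user rated."""
    seen = ratings[user]
    all_items = {item for row in ratings.values() for item in row}
    scores = {}
    for candidate in all_items - set(seen):
        cand_vec = item_vector(candidate)
        scores[candidate] = sum(
            cosine(item_vector(liked), cand_vec) * rating
            for liked, rating in seen.items()
        )
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


recommended_for_u3 = recommend("U3")
```

With this data, item C is co-liked with item A more often than item B is, so C ranks first for U3, who likes A and has not rated C.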
  The user-based collaborative filtering algorithm first preprocesses the data collected by the system into a concise table of users' ratings of items, then converts the data into a matrix and derives a vector for each user. It calculates the distance between these vectors to decide whether two users are neighbors, computes a recommendation score for the items that users with the same tastes have rated highly, and recommends the items with higher scores to the user. An example is shown in Table 2-2 below:
Table 2-2 User-based collaborative filtering user-item rating matrix

  From this simple analysis of users' preferences, it can be seen that user U1 and user U3 give the same judgments on the same items, so they can be called neighbor users. Item C, which U1 has not rated but which U1's neighbor U3 likes, is therefore recommended to U1, and vice versa.
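The neighbor-based reasoning above can be sketched as follows. The ratings are hypothetical (the contents of Table 2-2 are not reproduced here), the Euclidean distance over co-rated items is one of several possible distance measures, and the function names and the `min_score` threshold are assumptions made for the example.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical ratings: 1 = likes, 0 = dislikes, missing = not rated.
ratings: Dict[str, Dict[str, float]] = {
    "U1": {"A": 1, "B": 0},
    "U2": {"A": 0, "B": 1, "C": 0},
    "U3": {"A": 1, "B": 0, "C": 1},
}


def similarity(u: str, v: str) -> float:
    """Turn the Euclidean distance over co-rated items into a similarity in (0, 1]."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dist = math.sqrt(sum((ratings[u][i] - ratings[v][i]) ** 2 for i in shared))
    return 1.0 / (1.0 + dist)


def predict(user: str, item: str) -> float:
    """Similarity-weighted average of the other users' ratings of the item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        weight = similarity(user, other)
        num += weight * their[item]
        den += weight
    return num / den if den else 0.0


def recommend(user: str, min_score: float = 0.5) -> List[Tuple[str, float]]:
    """Return unseen items whose predicted score reaches the threshold."""
    unseen = {i for row in ratings.values() for i in row} - set(ratings[user])
    scored = [(item, predict(user, item)) for item in unseen]
    return sorted([p for p in scored if p[1] >= min_score],
                  key=lambda p: (-p[1], p[0]))


u1_recommendations = recommend("U1")
```

Here U3 agrees with U1 on every co-rated item and so gets the largest weight, which pulls the predicted score for item C above the threshold even though U2 dislikes it.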

2.2.2 Content recommendation algorithm

  The content-based recommendation algorithm [9] (CB, Content-Based Recommendations) analyzes the attributes and features of the items on the platform and distills them into item characteristics. It then records the user's historical behavior, counts the user's ratings of specific attributes to work out the user's preferences, and recommends items whose characteristics match those preferences.
  There are two ways to extract features from item content. The first is label features: the genre of a movie, for example, is a label feature that can be extracted concisely and clearly. The second is high-frequency word features: high-frequency words are extracted from the item's description, and a word with a larger weight is more important to the content, so the high-frequency words with larger weights are taken as the item's features. Common algorithms used with these features include the nearest neighbor method (KNN, K-Nearest Neighbor), the decision tree algorithm (DT, Decision Tree), and the Naive Bayes algorithm (NB, Naive Bayes). Because this system does not use the content recommendation algorithm, the detailed calculations are not covered here. The advantage of the content recommendation algorithm is that each user is handled independently, without relying on other users' data. Its disadvantage is that it places high demands on the text and requires detailed feature extraction from item descriptions, which limits its use. An example is shown in Table 2-3 below:
Table 2-3 User-item rating matrix based on content recommendation

  As can be seen from the table above, item A and item D have similar (identical) attributes, so item D can be recommended to users who like item A and, likewise, item A can be recommended to users who like item D. In this table, item D can be recommended to user A. This is a simple model of recommendation based on item content.
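A label-feature version of this model can be sketched as follows. The genre tags, the user name `user_a`, and the use of Jaccard similarity between tag sets are all assumptions made for illustration; the contents of Table 2-3 are not reproduced here.

```python
from typing import Dict, List, Set

# Hypothetical label features (genre tags) for four items.
item_tags: Dict[str, Set[str]] = {
    "A": {"action", "sci-fi"},
    "B": {"romance"},
    "C": {"comedy", "romance"},
    "D": {"action", "sci-fi"},
}

# Hypothetical history: the items each user has liked.
liked: Dict[str, Set[str]] = {"user_a": {"A"}}


def jaccard(x: Set[str], y: Set[str]) -> float:
    """Share of tags the two sets have in common."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def recommend(user: str, top_n: int = 1) -> List[str]:
    """Rank unseen items by how well their tags match the user's profile."""
    # The profile is the union of the tags of everything the user liked.
    profile: Set[str] = set().union(*(item_tags[i] for i in liked[user]))
    candidates = [i for i in item_tags if i not in liked[user]]
    ranked = sorted(candidates,
                    key=lambda i: (-jaccard(profile, item_tags[i]), i))
    return ranked[:top_n]


content_based_result = recommend("user_a")
```

Only the target user's own history is used, which reflects the independence from other users' data noted in the text.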

2.2.3 Similar recommendation algorithm

  The similar recommendation algorithm mainly addresses the case where a user has little behavior on record and there is not enough basis for recommendation. For an item the user has shown interest in, it finds a set of similar items and recommends them. Each item's detailed attributes are treated as an attribute vector, items are compared pairwise, and the weighted similarities of the attributes are summed to give a similarity weight for each item; the set with the higher weights is recommended to the user. The disadvantage is that the recommendations are not very accurate: accuracy depends on the user's historical behavior, and it improves only after a certain amount of behavior data has accumulated.
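The weighted comparison of attribute vectors can be sketched as below. The movie identifiers, attribute values, and per-attribute weights are hypothetical, and treating an attribute as either matching or not matching is a simplification chosen for the example.

```python
from typing import Dict, List

# Hypothetical attribute vectors for four movies.
movies: Dict[str, Dict[str, str]] = {
    "M1": {"genre": "sci-fi", "director": "Dir A", "region": "US", "decade": "2010s"},
    "M2": {"genre": "sci-fi", "director": "Dir B", "region": "US", "decade": "2010s"},
    "M3": {"genre": "comedy", "director": "Dir A", "region": "US", "decade": "2000s"},
    "M4": {"genre": "romance", "director": "Dir C", "region": "FR", "decade": "1990s"},
}

# Hypothetical weights: how much each matching attribute contributes.
weights: Dict[str, float] = {
    "genre": 0.4, "director": 0.3, "region": 0.2, "decade": 0.1,
}


def similarity(a: str, b: str) -> float:
    """Sum the weights of the attributes on which the two movies agree."""
    return sum(w for attr, w in weights.items()
               if movies[a].get(attr) == movies[b].get(attr))


def similar_to(movie: str, top_n: int = 2) -> List[str]:
    """Return the movies with the highest similarity weight to the given one."""
    others = [m for m in movies if m != movie]
    return sorted(others, key=lambda m: (-similarity(movie, m), m))[:top_n]


similar_movies = similar_to("M1")
```

Because the calculation needs only one item the user has paid attention to, it can produce a result for users with very little recorded behavior.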

2.2.4 Association rule recommendation algorithm

  The association rule recommendation algorithm [10], put simply, infers correlations between events. If event B is very likely to occur whenever event A occurs, then there is an association rule between event A and event B. Finding such rules requires a certain amount of historical user behavior data. The most famous example is "diapers and beer": two seemingly unrelated products that turn out to be frequently bought together. Once this association has been found by analyzing historical behavior data, recommending beer and diapers to customers together can promote the sales of both, giving customers better service and increasing sales. The association rule recommendation algorithm uses this principle to make recommendations. It draws on probability theory, data mining, and statistics to discover frequently co-occurring sets of events and then, according to their weights, to infer the operations associated with them. The algorithm most often used to compute association rules is the Apriori algorithm.
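The two quantities behind such a rule, support (how often the items occur together) and confidence (how often B occurs given A), can be computed as in the sketch below. It covers only item pairs, so it corresponds to the pair-counting step rather than the full Apriori algorithm; the transactions and thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set, Tuple

# Hypothetical shopping baskets.
transactions: List[Set[str]] = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"diapers", "beer", "chips"},
]


def pair_rules(min_support: float = 0.5,
               min_confidence: float = 0.7) -> List[Tuple[str, str, float, float]]:
    """Return rules (lhs, rhs, support, confidence) for frequent item pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(pair for t in transactions
                          for pair in combinations(sorted(t), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]   # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


rules = pair_rules()
```

With this data, diapers and beer appear together in 3 of 5 baskets (support 0.6), and 3 of the 4 baskets containing diapers also contain beer (confidence 0.75), so both directions of the rule pass the thresholds.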

3. Web social platform design

  

4.1 Overall system framework

This system is divided into two core modules, the offline computing recommendation system and the Web social system, and four layers: the data layer, computing layer, result layer, and presentation layer. Different modules provide different services, and each layer has its own responsibilities and passes data on in an orderly manner. The system architecture is shown in Figure 4-1 below:
Figure 4-1 System architecture diagram

4.2 System programming

4.2.1 Functional design

The functions of the Web side of this system are as follows:
(1) User module: On entering the website, users can register an account, log in, edit their personal information, change their avatar, write a profile, become friends with other users, and log out.
(2) Home page module: Users who have not registered can still view popular movies and use the query module and blog module from the home page. Users who have registered and logged in can also use the user module, rating module, and personal recommendation module.
(3) Query module: Users can enter keywords in the search box to search globally for movie information, or use advanced filtering by year, region, type, and so on to find movies that match their criteria.
(4) Rating module: Users can select movies they have watched, rate them, and write their thoughts as a subjective evaluation. They can also pick movies they have not watched and read other people's evaluations to help them judge the movie.
(5) Recommendation module: After users rate movies, the system collects their behavior records and makes personalized movie recommendations. Each user has a personal recommendation module.
(6) Details module: Users can click a movie's picture or name to view its details, including cast, director, region, release year and month, overall rating, and plot synopsis. The module also covers blog details, including publication date, large picture display, and full text display.
(7) Blog module: Users can publish blogs on the platform, insert pictures, view other people's blogs, and post comments.
(8) Chat module: When users come across film reviews that interest them and want to discuss them with like-minded users, they can chat with those users directly in addition to replying under the comments.
The specific details are shown in Figure 4-2 below:
Figure 4-2 System functional architecture diagram
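The combination of keyword search and advanced filtering described for the query module can be sketched as below. This is a language-neutral illustration written in Python rather than the platform's SpringBoot code; the `Movie` fields, the `search` function, and the sample titles are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Movie:
    title: str
    year: int
    region: str
    genre: str


# Hypothetical catalogue with made-up titles.
catalogue: List[Movie] = [
    Movie("Star Voyage", 2019, "US", "sci-fi"),
    Movie("Star Harbor", 2021, "CN", "drama"),
    Movie("Paris Letters", 2019, "FR", "romance"),
]


def search(movies: List[Movie],
           keyword: Optional[str] = None,
           year: Optional[int] = None,
           region: Optional[str] = None,
           genre: Optional[str] = None) -> List[Movie]:
    """Keep the movies that satisfy every criterion that was supplied."""
    results = []
    for movie in movies:
        if keyword and keyword.lower() not in movie.title.lower():
            continue
        if year is not None and movie.year != year:
            continue
        if region is not None and movie.region != region:
            continue
        if genre is not None and movie.genre != genre:
            continue
        results.append(movie)
    return results


by_keyword = search(catalogue, keyword="star")
by_filter = search(catalogue, keyword="star", year=2019)
```

Leaving a criterion as `None` means that filter is not applied, so the same function serves both the global keyword search and the advanced filtering case.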

4.2.1 Database design

This project uses the MySQL database for data storage. The following database tables were designed according to the system's data logic. The structure is shown in Figure 4-3 below:
Figure 4-3 Overall database architecture diagram

4. System implementation


5.2 System function implementation

5.2.1 User module functions

Login and registration are the most basic functions of an interactive platform. To register, users fill in the required information as prompted. The front end validates the format of each field, and the account and password are set, with the password hashed using MD5 before it is stored in the database. The identifying key field, the primary key, is set to auto-increment and serves as the unique identity (the "DNA") of the user account. At login, the account and password are verified; after verification succeeds, the user object is retrieved and stored in the session. This completes both registration, which stores the user account, and login, which keeps the user object for the duration of the session. The interface design is shown in Figure 5-1:
Figure 5-1 Login interface
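The registration and login flow described in this section can be sketched as follows. This is a simplified illustration in Python, not the platform's code: the in-memory dictionaries stand in for the MySQL user table and the Web session, and every function and field name is hypothetical. MD5 is used only because the text names it; a salted, deliberately slow hash is the usual choice for passwords today.

```python
import hashlib
from typing import Dict

users: Dict[int, dict] = {}    # stands in for the user table
session: Dict[str, dict] = {}  # stands in for the Web session
_next_id = 1                   # stands in for the auto-increment primary key


def md5_hex(password: str) -> str:
    """Hash the password with MD5, as the text describes."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def register(account: str, password: str) -> int:
    """Store a new account and return its auto-incremented id."""
    global _next_id
    if any(u["account"] == account for u in users.values()):
        raise ValueError("account already exists")
    user_id = _next_id
    _next_id += 1
    users[user_id] = {
        "id": user_id,
        "account": account,
        "password_md5": md5_hex(password),  # the plain password is never stored
    }
    return user_id


def login(account: str, password: str) -> bool:
    """Verify the credentials and keep the user object in the session."""
    for user in users.values():
        if user["account"] == account and user["password_md5"] == md5_hex(password):
            session["user"] = user
            return True
    return False


user_id = register("alice", "s3cret")
wrong_password_ok = login("alice", "wrong")
login_ok = login("alice", "s3cret")
```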

5.2.2 Home page module function

The home page mainly displays several movie lists: popular movies, latest movies, featured movies, and recommended movies. The recommended list is user-specific and has to be generated for each user, so it is displayed only to users who have logged in. The implementation uses the Spring framework for page navigation, data transfer, and database operations. The main flow is shown in Figure 5-2 below:
Figure 5-2 Page jump business process
The home page also contains links to other small modules: the full movie list, the blog display, advanced filtering, search, the personal center, and so on. The home page is shown in Figure 5-3 below:
Figure 5-3 Partial display of the home page

5.2.3 Details module function

This module includes the movie details page, the blog details page, a comment area with paginated display, and a function for posting comments. The pages are shown in Figures 5-4 and 5-5 below:
Figure 5-4 Partial display of the movie details page

Figure 5-5 Partial display of blog details page
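The paginated comment display can be sketched as below. The function name `paginate`, the default page size, and the shape of the returned dictionary are assumptions made for illustration, not the platform's actual interface.

```python
import math
from typing import Dict, List


def paginate(comments: List[str], page: int, page_size: int = 10) -> Dict[str, object]:
    """Return one page of comments together with the paging information."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    total_pages = max(1, math.ceil(len(comments) / page_size))
    page = min(max(page, 1), total_pages)  # clamp out-of-range page requests
    start = (page - 1) * page_size
    return {
        "page": page,
        "total_pages": total_pages,
        "items": comments[start:start + page_size],
    }


comments = [f"comment {i}" for i in range(1, 26)]  # 25 hypothetical comments
last_page = paginate(comments, page=3)
clamped = paginate(comments, page=99)
```

Clamping the requested page keeps the comment area usable when a stale link asks for a page that no longer exists.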

Table of contents

Chapter 1 Introduction
1.1 Research background and significance
1.2 Development status of domestic and foreign movie recommendation systems
1.3 Structure of this article
1.4 Summary of this chapter
Chapter 2 Introduction to basic technology
2.1 Introduction to framework and basic technologies
2.1.1 Introduction to Spark and Hadoop
2.1.2 Introduction to SpringBoot framework
2.1.3 WebSocket technology principle
2.2 Introduction to recommendation system algorithm
2.2.1 Collaborative filtering recommendation algorithm
2.2.2 Content recommendation algorithm
2.2.3 Similarity recommendation algorithm
2.2.4 Association rule recommendation algorithm
2.2.5 Recommended system evaluation index
2.3 Similarity calculation formula
2.3.1 Euclidean distance
2.3.2 Manhattan distance
2.3.3 Cosine similarity
2.4 Summary of this chapter
Chapter 3 Offline Recommendation System Design
3.1 Cluster construction design
3.2 Recommended algorithm calculation steps
3.2.1 Sample processing
3.2.2 Model construction
3.3 Principle of item-based collaborative filtering algorithm
3.4 Summary of this chapter
Chapter 4 Web Social Platform Design
4.1 General System Framework
4.2 System Programming
4.2.1 Function Design
4.2.1 Database design
4.3 Summary of this chapter
Chapter 5 Specific function implementation and testing
5.1 Environment construction
5.2 System function implementation
5.2.1 User module function
5.2.2 Home page module function
5.2.3 Details module function
5.2.4 Search module function
5.2.5 Chat module function
5.2.6 Recommended result calculation function
5.3 System test
5.4 Summary of this chapter
Chapter 6 Summary and Outlook
6.1 Summary of thesis
6.2 Outlook of the paper
References
Acknowledgments












































Origin blog.csdn.net/QQ2743785109/article/details/134063314