Python-based large-scale Douban movie project: data acquisition, data preprocessing, data analysis, visualization, and large-screen dashboard design (database included)


Project Introduction

If you need the code, documentation, or other resources of this project, or help with deployment and debugging, you can contact the blogger via private message!

This paper crawls data from the Douban movie website with a Python web crawler. It analyzes the page structure of the Douban site, designs rules to obtain the JSON packets that carry the movie data, and applies a normally distributed delay between requests so that a large volume of data can be fetched. Python's Pandas data analysis library is then used to preprocess the acquired data, cleaning unstructured records into tidy data for subsequent big data analysis: null-value detection and handling, string constraints, field-value expansion, and general data cleaning.

The cleaned data is then stored in a structured MySQL database for big data analysis. Combining multi-dimensional fields, the movie data is analyzed structurally, including preference analysis and trend analysis, and the analysis results are finally displayed as Pyecharts web visualizations, with a large-screen visualization designed to tie everything together.

This study uses an automated crawler program to obtain a large amount of Douban movie data, and after data cleaning and preprocessing, the cleaned data is stored in a MySQL database. By querying the structured data in the database, it analyzes multi-dimensional data such as the distribution of movie rating indicators, user preferences, movie review text, and region, and uses the pyecharts front-end visualization library for display.

Finally, the Page module of pyecharts is used to display the charts of each analysis dimension together, building a large-screen visualization of the Douban movie analysis. Through data analysis and visualization techniques, this study provides valuable reference and support for the film industry.

Research Background

Douban Movies is currently one of the most popular movie review websites in China. Users can rate, comment on, and collect movies on the site. Since the data on Douban Movies is very rich, analyzing and visualizing it helps us gain a deep understanding of the development trends of the movie market and users' evaluation preferences, and provides useful references for movie production and marketing.

With the continuous development of society, the film industry is also growing, and more and more people begin to pay attention to the cultural and commercial value of films. The ratings and reviews on Douban Movies have become one of the important criteria for measuring the quality and popularity of movies. By analyzing and visualizing the data on Douban Movies, we can gain an in-depth understanding of users' evaluations and preferences for movies, explore the development trends and business opportunities of the movie market, and provide more targeted suggestions and strategies for movie production and marketing.

The main purpose of this research is to analyze the movie data on Douban Movies, explore the ratings and user reviews of different types of movies, and use Python data analysis and visualization tools such as Pandas, Matplotlib, and Seaborn to process and visualize these data, so as to present the analysis results more intuitively.

omitted here...

Analysis of Research Status at Home and Abroad

Douban Movie is one of the largest movie communities in China. The platform has a large amount of movie information, so it has become one of the most popular platforms for many movie lovers. At the same time, Douban Movies is also an important movie evaluation platform, where users can rate and comment on movies. Therefore, the data analysis and visualization research of Douban movies has become one of the popular research directions.

omitted here...

Research purposes

This research aims to analyze and visualize Douban movie data through Python to explore the characteristics of Douban movie viewers, movie evaluation, movie duration, movie type, etc., and put forward some useful conclusions and suggestions based on this.
Research content:

1. Data Acquisition and Cleaning

This study will use Python crawlers to collect data from the movie information on the Douban movie website, and ensure the accuracy and integrity of the data through data cleaning and processing. The main content of data collection includes movie name, director, actor, rating, number of reviews, movie type, country/region of production, release date, duration, etc.

2. Analysis of the characteristics of moviegoers

Through the data collection and processing of user information on the Douban Movie website, this study will explore the characteristics of Douban movie viewers in terms of gender, age, region, occupation, etc., in order to understand the viewing preferences and evaluations of different groups of people.

3. Movie evaluation analysis

omitted here...

Significance

With the popularity of Internet technology and smart phones, movies have become an indispensable part of modern people's entertainment life. As a very famous movie evaluation platform, Douban Movie has a large amount of user evaluation data and movie information, which can provide important reference and decision support for movie lovers. Therefore, the analysis and visualization research on Douban movie data has high research value and practical significance.

This study aims to gain an in-depth understanding of the changing trends, popularity, and user evaluations of the film market through the analysis and visualization of Douban film data, and to provide useful references and suggestions for film practitioners, film lovers, and film researchers.

Overall Study Design

This project carries out data analysis and visualization research on Douban Movies in Python. Crawler programs are designed in Python, including an automated crawler with an intelligent delay function, to ensure the effective acquisition of a large amount of Douban movie data.

After the Douban movie data pages are fetched automatically, Python's pandas and numpy libraries are used to clean and preprocess the data, including multi-dimensional field cleaning and expansion, and the cleaned data is stored in a MySQL database. With a data analysis mindset, the structured data in the database is queried and analyzed along different dimensions, such as the distribution of movie rating indicators, user preference analysis, movie review text analysis, and regional analysis. The pyecharts front-end visualization library is then used to draw multi-dimensional charts, discuss the actual analysis results, and draw data analysis conclusions.

Finally, the charts of each analysis dimension are displayed together through the Page module of pyecharts, building a large-screen visualization based on the Douban movie analysis.

The specific steps and plans are as follows:

1. Design an automated crawler program for Douban Movies to automatically obtain movie data

An automated crawler program must be designed. Douban's anti-crawling measures are relatively strict, and the data on its movie pages is loaded dynamically. Preliminary analysis shows that the URLs of Douban movies must be obtained from JSON data; each specific movie page is then requested, parsed, and the specific movie fields located.

When obtaining data, the crawler must simulate a browser request to the website by adding request headers, and then analyze the parameters in the different JSON packets. Once the specific pattern is found, the program can be set up to fetch the data set. If one IP visits the website too frequently, it not only puts load pressure on the target site but is also likely to be recognized as a malicious crawler. Therefore, the crawler program needs a delay function that uses a normal distribution to simulate the pace of human clicks and visits, which enhances the stability of the crawler.
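A minimal sketch of such a normally distributed delay, with made-up parameter values (the mean, spread, and clipping bounds are illustrative assumptions, not the project's actual settings):

```python
import random
import time

def human_delay(mean=3.0, sigma=1.0, floor=0.5, ceil=8.0):
    """Sleep for a normally distributed interval to mimic human browsing pace.

    The delay is drawn from N(mean, sigma) and clipped to [floor, ceil]
    so a rare extreme sample can neither stall nor hammer the crawler.
    """
    delay = random.gauss(mean, sigma)
    delay = max(floor, min(ceil, delay))
    time.sleep(delay)
    return delay
```

Calling `human_delay()` between page requests makes the access intervals vary like a person's clicks instead of arriving at a fixed machine-like rate.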

Secondly, some movies are missing certain fields. To ensure that the program runs stably, robustly, and continuously, the crawler must be made intelligent: the value of each field is checked, and if no data is obtained, a null value is assigned automatically to avoid interrupting the program.
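One way to implement this missing-field guard is a small helper that walks the JSON record and falls back to a null value instead of raising; the record shape below is a hypothetical example, not Douban's actual schema:

```python
def safe_get(record, *keys, default=None):
    """Walk nested dict/list data safely; return default if any key is missing."""
    current = record
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current

# Hypothetical JSON record for one movie.
movie = {"title": "Example", "rating": {"value": 8.7}}

score = safe_get(movie, "rating", "value")      # 8.7
director = safe_get(movie, "directors", 0)      # None -- field absent, no crash
```

With this pattern a movie that lacks a field simply yields `None` in the data set, and the crawl continues uninterrupted.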

2. Clean and preprocess the crawled data, including multi-dimensional data field cleaning and expansion

The large amount of data we obtain contains some irregular fields, such as actors, release time, and movie duration, which include extra Chinese characters. These need to be cleaned in a structured way to ensure the validity of the data and facilitate subsequent analysis. There are also some null values in the data, which must be handled before the result is saved as a new data set.

Secondly, when processing the time field, after removing the Chinese characters, the field is expanded into year, month, day, week number, and so on. This facilitates subsequent data analysis, increases the number of analysis dimensions, and ensures effective processing of the data.
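A sketch of this time-field cleanup and expansion with pandas, assuming raw dates in the common Chinese "YYYY年MM月DD日" form (the sample values are hypothetical):

```python
import pandas as pd

# Hypothetical raw release dates as scraped.
df = pd.DataFrame({"release_date": ["2019年07月26日", "2020年10月01日"]})

# Strip the Chinese characters, leaving a parseable date string.
df["release_date"] = (
    df["release_date"]
    .str.replace(r"[年月]", "-", regex=True)
    .str.replace("日", "", regex=False)
)
df["release_date"] = pd.to_datetime(df["release_date"])

# Expand the single field into extra analysis dimensions.
df["year"] = df["release_date"].dt.year
df["month"] = df["release_date"].dt.month
df["day"] = df["release_date"].dt.day
df["weekday"] = df["release_date"].dt.day_name()
```

Each derived column (year, month, day, weekday) then becomes its own grouping dimension in the later analysis.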

3. Store the cleaned data in the MySQL database

Storing the preprocessed data in MySQL makes subsequent management and retrieval convenient. As a structured database, MySQL can store a large amount of data and lets us use SQL statements for querying and data analysis, which is very efficient.
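A self-contained sketch of this load-then-query step using pandas' `to_sql`. The project targets MySQL (which would typically be reached via SQLAlchemy plus a driver such as pymysql); an in-memory sqlite3 connection is substituted here only so the example runs without a database server, and the sample rows are hypothetical:

```python
import sqlite3
import pandas as pd

# Hypothetical cleaned movie records.
df = pd.DataFrame({
    "movie_name": ["肖申克的救赎", "霸王别姬"],
    "score": [9.7, 9.6],
    "region": ["美国", "中国大陆"],
})

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
df.to_sql("movies", conn, index=False, if_exists="replace")

# SQL statements can now drive the analysis directly.
high_rated = pd.read_sql("SELECT movie_name FROM movies WHERE score > 9.5", conn)
```

For the real MySQL database only the connection object changes; the `to_sql` / `read_sql` calls stay the same.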

4. Douban movie data field multi-dimensional data analysis

Using a data analysis mindset, the structured data in the database is queried and the Douban movie data is analyzed in depth from multiple dimensions, such as movie score analysis, regional analysis, time-dimension analysis, and movie genre analysis.

5. Use the pyecharts visualization library to draw various multi-dimensional charts

Python's third-party visualization library pyecharts is used to query the data in the database and display it on a web page with front-end visual analysis, producing attractive interactive graphics that make it easy to discover patterns and provide a basis for decision support.

6. Discuss the analysis results and provide data analysis conclusions

The analysis results are examined to draw correlation and regularity conclusions, providing descriptive conclusions for the Douban film domain and highlighting the central role of data analysis in this work.

7. Big screen visualization

The charts of each analysis dimension are displayed together through the Page module of pyecharts, building a large-screen visualization based on the Douban movie analysis.

In short, this research uses Python web crawler technology and big data analysis technology. Through reasonable data acquisition, data cleaning and preprocessing, data storage, and data analysis, it realizes the full pipeline for Douban movies: from data acquisition and cleaning, through loading the data into the hive warehouse, to big data analysis and visual display.


Introduction to Web Crawlers

A web crawler is an automated tool that automatically searches and grabs information on the Internet. It can automatically browse the web, extract data and save it to the local computer for subsequent data analysis, mining and processing. In today's era of information explosion, web crawlers have become one of the important means for people to obtain and process information.

The principle of a web crawler is to send a request to a website through the HTTP or HTTPS protocol, and obtain the HTML source code returned by the website. Then, by parsing the HTML source code, the web crawler can extract various information in the web page, including text, pictures, links, audio, video, and so on. Web crawlers can automatically crawl the entire website or specific webpages according to their own needs and set rules, thereby realizing automated data acquisition.
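A minimal illustration of this request-then-parse principle, using Python's standard-library HTML parser on a static snippet. In a real crawler the HTML string would come from an HTTP response (e.g. via the requests library), and the sample link is hypothetical:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect every href from a page's <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Hypothetical HTML as it might be returned by an HTTP request.
html = '<html><body><a href="/subject/1292052/">肖申克的救赎</a></body></html>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/subject/1292052/']
```

The same pattern generalizes: subclass the parser (or use a library such as lxml or BeautifulSoup) to pull out whatever text, links, or attributes the crawl rules target.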

Web crawlers have a wide range of application scenarios. For example, search engines use web crawlers to automatically crawl and index web pages across the Internet so that users can more easily search for the information they need; e-commerce websites use web crawlers to automatically obtain competitors' product and price information in order to set a more reasonable pricing strategy; news media use web crawlers to automatically grab, organize, and classify news in order to provide better news services; and so on.

omitted here...

Douban movie data collection

After analyzing the structure of the webpage, the detailed information below can be obtained by crawling each movie's data, as shown in the figure. Note that since the data of Douban Movies is loaded dynamically, continuous clicks are needed to obtain complete information. Analysis shows that when transmitting data, the website sends a JSON packet containing the data fields, i.e. the page information.


The next step is to write the web crawler in Python and implement counter-measures against anti-crawling, including request headers and parameter settings. The program cleans the JSON data, obtains the URLs we need, and iterates over the fields under the data. The program design includes the following innovations:

  1. Intelligent crawling module: to avoid overly frequent visits, the program delays automatically and simulates human clicking behavior on the website. If a data field is empty, the program automatically assigns a null value and reports it.
  2. Real-time data writing: the program writes records to the CSV file as they are fetched, so that if the crawl fails at some point, the data collected up to then is not lost.
  3. Reusable program structure: the program has a clear structure and strong logic, and can serve as a reference for similar projects.
  4. Intelligent anti-blocking measures: the program adapts to the website's anti-crawling measures to ensure its IP will not be banned.
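The real-time writing point (item 2) can be sketched with the standard csv module; the records and file name below are illustrative placeholders:

```python
import csv

# Hypothetical fetched records; in the crawler each row is written the
# moment it is scraped, so an interruption cannot lose earlier data.
rows = [
    {"movie_name": "肖申克的救赎", "score": 9.7},
    {"movie_name": "霸王别姬", "score": 9.6},
]

with open("douban_movies.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["movie_name", "score"])
    writer.writeheader()
    for row in rows:          # in the real crawler: one iteration per page fetched
        writer.writerow(row)
        f.flush()             # push the row to disk immediately
```

Flushing after every row trades a little throughput for the guarantee that a crash mid-crawl leaves a valid CSV containing everything fetched so far.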


Data Preprocessing

The data obtained by the crawler basically meets the conditions for big data analysis, but some fields need further processing. For example, commas in movie names must be removed, because the csv format is used for splitting when importing into the hive warehouse; if this is not handled during preprocessing, the imported columns will be misaligned and the analysis results affected. In addition, some field values need to be expanded and constrained: for example, the movie duration contains Chinese characters, and the actor field lists a large number of people. The data is therefore preprocessed and structured as follows:

First, for the movie_name field, the commas it contains must be removed to avoid column misalignment when the data is loaded later.
Secondly, the yanyuanData field looks like a list but is in fact a string, so the replace method is used to strip the brackets, and Python's split function then splits it on a specific separator. Since this field holds actor information, counting the number of actors can replace the original field value, which facilitates subsequent exploration and analysis. The actor names are also saved to a text file for text analysis and visualization, where they are displayed as a word cloud.
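A sketch of that yanyuanData cleanup, assuming the scraped value is a list-shaped string (the sample value is hypothetical):

```python
def clean_actor_field(raw):
    """Turn a list-like string such as "['张国荣', '张丰毅', '巩俐']"
    (a string, not a real list) into (actor_list, actor_count)."""
    stripped = raw.replace("[", "").replace("]", "").replace("'", "")
    actors = [name.strip() for name in stripped.split(",") if name.strip()]
    return actors, len(actors)

# Hypothetical raw value as scraped.
actors, count = clean_actor_field("['张国荣', '张丰毅', '巩俐']")
print(count)  # 3
```

The count replaces the unwieldy original field for numeric analysis, while the recovered name list can be written to the text file that feeds the word cloud.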


Big Data Analysis and Visualization

Structural analysis of Douban film reviews

Querying the Douban movie data with SQL statements and visualizing it through Pyecharts shows that scores of 8-9 are the most numerous, with an approximately normal distribution. By contrast, low-scoring movies are relatively few, as are movies scoring above 9, although a certain number of high-scoring movies remain.


The analysis initially treated Mainland China, Hong Kong, and Taiwan as different countries/regions, when in fact they all belong to the same country. We therefore use a CASE statement to merge these regions under "China", and then analyze each country's proportion of movies scoring above 9.0.


We found that the top three countries by share of movies rated above 9.0 are the United States, Japan, and China. Therefore, when choosing what to watch on Douban, we can give priority to movies from these countries.
For different genres, we can query the highest-rated, most popular, and least popular movies in each genre; that is, we can find the highest- and lowest-rated movies per genre.

The following data analysis shows the charts only; if you need the details, you can contact the blogger via private message!


In total there are about 20 movie data visualization analyses and their conclusions.

Big screen visualization

Large-screen visualization refers to a data display method that displays a large amount of data on a large screen through visual methods such as charts, tables, and maps. Using the page component of pyecharts to realize large-screen visualization has the following advantages:

(1) Data visualization is intuitive and easy to understand: displaying data through charts and other means allows users to more intuitively understand data distribution and trends, quickly obtain data insights, and avoid tedious data analysis processes.

(2) A variety of chart types: pyecharts supports a variety of commonly used chart types, such as line charts, bar charts, and pie charts, and also supports map and heat-map display, which can meet different users' needs for data display.

(3) Strong customizability: The page components of pyecharts can be flexibly customized, and the page layout and style can be customized to meet the different needs and preferences of users.

(4) Update data in real time: Using the page component of pyecharts, you can update data in real time by timing refresh or asynchronously requesting data, so that users can keep abreast of data changes.

(5) Easy to use: The page component of pyecharts is easy to use, and complex visual pages can be realized through simple code writing, allowing users to focus more on data analysis and insight mining.


Text Visualization

In the data captured earlier we saved the names of movie actors and the movie descriptions. Now we perform word-cloud text analysis on these data. This analysis is meaningful in the following ways:
(1) Marketing and promotion: relevant organizations in the entertainment industry, such as film companies or movie theaters, may turn actors' names into word clouds and display them on promotional posters or websites to attract the audience's attention and increase the film's popularity and reputation.
(2) Film review analysis: Film critics or enthusiasts can analyze the word cloud of movie actor names to help them understand the movie's cast and predict the quality and style of the movie.
(3) Social media analysis: Some fans may use the actor's name to make a word cloud and share it with others through social media to show their love and support for a certain actor or movie.
(4) Academic research: Scholars can understand the development trend of the film industry, the popularity of actors, and the industry structure by analyzing the word cloud of the names of film actors for further research and discussion.

We have written an intelligent word cloud display program, and users can input parameters to display the word cloud.

omitted here...

Summary

This study uses Python web crawler to capture data from Douban movie website, and uses Pandas data analysis library to preprocess and clean the data. Through multi-dimensional analysis and visual display of the cleaned data, valuable conclusions and insights have been obtained.

First, we analyzed the distribution of movie rating indicators. Through the statistics and visualization of movie rating data, we found that the ratings of Douban movies showed a normal distribution feature, and the vast majority of movie ratings were concentrated between 7 and 8 points. In addition, we also analyzed the relationship between movie ratings and box office, and found that the correlation between movie ratings and box office is weak, and movies with high box office do not necessarily have high ratings.

Secondly, we analyzed the degree of user preference, using text sentiment analysis technology to analyze the sentiment of movie review texts, and compared the degree of user preference for different types of movies. The results show that Douban users prefer literary films and documentaries, while they have lower ratings for comedies and action films. In addition, we also analyzed the geographical distribution of users, and found that the geographical distribution of Douban users is relatively wide, and there are certain differences in the preferences of different regions for movie genres.

Finally, we use the pyecharts front-end visualization library to visualize the analysis results, and use the Page module to build a large-screen visualization based on Douban movie analysis. In this way, not only can the analysis results be displayed more intuitively, but also multi-dimensional visual display and unified display can be realized.

To sum up, this study obtained a large amount of Douban movie data through an automated crawler program, and through data cleaning and preprocessing, the unstructured data was cleaned into structured data and stored in the MySQL database. Through multi-dimensional analysis and visual display of the cleaned data, we have obtained in-depth understanding and insight into the Douban movie market, user reviews, and movie content. At the same time, the Python language and data visualization tools used in this research also have high universality and application value, which can provide reference and inspiration for data analysis and research in other fields.

every word

Meeting a new beginning is the best memory!

Origin blog.csdn.net/weixin_47723732/article/details/131416752