Big data movie visualization

This project takes movie data as its theme and follows the full pipeline of data collection, processing, analysis, and visualization, enabling offline processing and computation over millions of movie records.

Project link: https://github.com/GoAlers/Bigdata-movie

Development environment: IDEA + PyCharm + CentOS 7.0 + Hadoop 2.8 + Hive 2.3.0 + HBase 2.0.0 + MySQL 5.7 + Sqoop

  • 1. Install and deploy the Hadoop big data platform: basic configuration, parameter tuning, and deployment and management of related components.
  • 2. Build the distributed big data platform; write crawlers to collect massive movie data and clean it; produce word clouds, matplotlib charts, and ECharts visualizations.
  • 3. Write MapReduce programs in Python/Java for offline statistics on specific movies; transfer data with the Sqoop tool and perform the related analysis with Hive.
  • 4. Apply machine learning algorithms and related libraries to movie review sentiment analysis, to predicting users' rating ranges and box office, and to computing the top N of total scores.

1. Data collection and cleaning (pachong.py): collect Douban movie data and scrape the all-time total box office ranking (top 20); remove redundant entries and empty fields; then use Python's PyMySQL library to connect to a local MySQL database and import the data into the movies table. The data can be kept locally for visualization, or imported into the Hive data warehouse tool for big data analysis. (A minimal collection-and-import sketch follows the ranking table below.)

Rank  Movie                   Genre      Total box office (10,000 CNY)  Avg. attendance per screening  Release date
1     Wolf Warrior 2          Action     567928                         38                             2017/7/27
2     Ne Zha                  Animation  501324                         24                             2019/7/26
3     The Wandering Earth     Sci-fi     468433                         29                             2019/2/5
4     Avengers: Endgame       Action     425024                         23                             2019/4/24
5     Operation Red Sea       Action     365079                         33                             2018/2/16
6     Detective Chinatown 2   Comedy     339769                         39                             2018/2/16
7     The Mermaid             Comedy     339211                         44                             2016/2/8
8     My People, My Country   Drama      317152                         36                             2019/9/30
9     Dying to Survive        Drama      309996                         27                             2018/7/5
10    The Captain             Drama      291229                         27                             2019/9/30
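A minimal sketch of this collection-and-import step, with a placeholder URL, placeholder credentials, and an assumed movies table schema (pachong.py in the repository is the authoritative version):

#encoding:utf-8
import pymysql
import requests

# Fetch the ranking page (the URL is a placeholder, not the real endpoint).
resp = requests.get("https://example.com/boxoffice/top20", timeout=10)
resp.raise_for_status()
# ... parse resp.text into (name, genre, box_office, per_show, release_date) rows ...
rows = [(u"战狼2", u"动作", 567928, 38, "2017/7/27")]  # one parsed example row

# Connect to the local MySQL database with PyMySQL and fill the movies table.
conn = pymysql.connect(host="localhost", user="root", password="123456",
                       database="movie", charset="utf8mb4")
with conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO movies (name, genre, box_office, per_show, release_date) "
        "VALUES (%s, %s, %s, %s, %s)", rows)
conn.commit()
conn.close()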

2. Data visualization: visualization makes data more intuitive and easier to analyze; it is arguably the most important part of data analysis and mining. Matplotlib is an open-source project for the Python language that provides a professional, feature-rich data plotting package.

(1) Movie box office ranking
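For illustration, a minimal matplotlib sketch that charts the top-10 figures from the table above (the project's own plotting script may differ):

#encoding:utf-8
import matplotlib.pyplot as plt

# Top-10 movies and total box office (10,000 CNY) from the table above.
movies = ["Wolf Warrior 2", "Ne Zha", "The Wandering Earth", "Avengers: Endgame",
          "Operation Red Sea", "Detective Chinatown 2", "The Mermaid",
          "My People, My Country", "Dying to Survive", "The Captain"]
box_office = [567928, 501324, 468433, 425024, 365079,
              339769, 339211, 317152, 309996, 291229]

plt.figure(figsize=(10, 6))
plt.barh(movies[::-1], box_office[::-1], color="steelblue")  # highest bar on top
plt.xlabel("Total box office (10,000 CNY)")
plt.title("Movie box office ranking (top 10)")
plt.tight_layout()
plt.show()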

(2) Movie rating ranking douanscore.py

(3) ECharts: recently released movies

ECharts is an open-source JavaScript library for data visualization and display, compatible with most current browsers. In Python it is wrapped as pyecharts, a data visualization library that provides intuitive, rich, customizable charts, including the usual line charts, bar charts, scatter plots, and pie charts.
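A minimal pyecharts (v1+) sketch with made-up data, since the post does not list the recently released movies:

from pyecharts import options as opts
from pyecharts.charts import Bar

# Hypothetical recently released movies and their box office (10,000 CNY).
titles = ["Movie A", "Movie B", "Movie C"]
box_office = [1200, 950, 640]

bar = (
    Bar()
    .add_xaxis(titles)
    .add_yaxis("Box office (10,000 CNY)", box_office)
    .set_global_opts(title_opts=opts.TitleOpts(title="Recently released movies"))
)
bar.render("recent_movies.html")  # writes an interactive HTML chart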

(4) Short comments on the film Lost in Russia (囧妈)

The New Year film Lost in Russia (囧妈) premiered online; as of this writing its Douban score is 6.0. This case study analyzes the film's popular short comments on Douban. The data was crawled with the Octopus (八爪鱼) scraping tool; the collected fields include username, rating, number of likes, and comment content, with regular expressions used to match the field tags. Douban's star rating system distinguishes five levels: highly recommended (力荐), recommended (推荐), okay (还行), poor (较差), and very poor (很差), with a maximum of five stars.
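The field matching might look like the sketch below; the HTML class names (Douban's allstarNN rating classes, the short-comment span, and the votes counter) are assumptions about the page structure, and the project itself collected these fields with the Octopus GUI tool:

#encoding:utf-8
import re

# Douban's five rating levels, keyed by the page's allstarNN classes (assumed markup).
STAR_LABELS = {'50': u'力荐', '40': u'推荐', '30': u'还行', '20': u'较差', '10': u'很差'}

def parse_comment(fragment):
    """Extract (username, rating label, likes, comment) from one comment's HTML."""
    user = re.search(r'class="comment-info">\s*<a[^>]*>([^<]+)</a>', fragment)
    star = re.search(r'allstar(\d0)', fragment)
    votes = re.search(r'class="votes[^"]*">\s*(\d+)', fragment)
    text = re.search(r'<span class="short">(.*?)</span>', fragment, re.S)
    return (user.group(1) if user else None,
            STAR_LABELS.get(star.group(1)) if star else None,
            int(votes.group(1)) if votes else 0,
            text.group(1).strip() if text else None)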

(5) Python word frequency statistics (wordcount.py)
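A minimal sketch of what such a word-frequency script might do, using jieba for Chinese word segmentation (the input file name is an assumption):

#encoding:utf-8
from collections import Counter

import jieba  # Chinese word segmentation

# Read the collected comment text (file name is an assumption).
with open("comments.txt", encoding="utf-8") as f:
    text = f.read()

# Segment into words and drop whitespace and single-character tokens.
words = [w for w in jieba.lcut(text) if len(w.strip()) > 1]

# Print the 20 most common words with their counts.
for word, count in Counter(words).most_common(20):
    print("%s\t%d" % (word, count))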

3. Big data analysis:

The core of big data processing is data analysis, which generally falls into two categories: batch processing and stream processing. Batch processing handles massive offline data accumulated over a period of time in one pass; the corresponding frameworks include MapReduce and Spark. Stream processing handles dynamic data in real time, processing records as they arrive; the corresponding frameworks include Storm, Spark Streaming, and Flink. This article focuses on offline computation for the movie data analysis.

(1) MapReduce offline computation (mapreduce_hive file)

The MapReduce word-frequency program follows the classic wordcount pattern: text is split into words according to a prescribed format, and the occurrences of each word are counted. The input here is the release information of historical movies. The map stage performs the word segmentation: each line is split into words, and each word is emitted as a (word, 1) key-value pair. The shuffle step then partitions and sorts the map output and merges the records of each partition onto disk, yielding one sorted, partitioned file. Finally, the reduce stage sums the counts for each word, and the results are stored on HDFS. This article takes movie words as the statistical objects; the map and reduce code is as follows.

Map stage code:

import sys

# Mapper: read lines from standard input, split each line into words,
# and emit one tab-separated (word, 1) pair per word.
for line in sys.stdin:
    for word in line.strip().split():
        if word:
            print("%s\t%s" % (word, 1))

Reduce phase code:

import sys

current_word = None
count_pool = []
total = 0  # running count (renamed to avoid shadowing the built-in sum)

for line in sys.stdin:
    word, val = line.strip().split('\t')
    if current_word is None:
        current_word = word
    if current_word != word:
        # A new word has arrived: flush the counts of the previous word.
        for count in count_pool:
            total += count
        print("%s\t%s" % (current_word, total))
        current_word = word
        count_pool = []
        total = 0
    count_pool.append(int(val))

# Flush the counts of the final word.
if current_word is not None:
    for count in count_pool:
        total += count
    print("%s\t%s" % (current_word, total))
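Before submitting to the cluster, these streaming scripts can be sanity-checked locally with a shell pipeline that mimics the shuffle's sort step (the file and script names here are assumptions, since the post does not give them):

cat movie_words.txt | python map.py | sort -k1,1 | python red.py

On the cluster, the same pair is submitted through the Hadoop Streaming jar via its -mapper and -reducer options.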

(2) Hive data warehouse

Hive is a data warehouse tool built on Hadoop, mainly used for statistics over massive structured logs. It maps structured data files to tables and lets you query and analyze those tables with SQL-like statements. The Sqoop transfer tool is used to import the MySQL database into the Hive data warehouse.

Hive makes it possible to analyze massive data sets and supports user-defined functions, so no MapReduce programming is required. This article runs statistics over historical Douban movie data; after cleaning, which removes null values and redundant entries, more than 100,000 movie records remain.
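A sketch of the transfer step, assuming a local MySQL database named movie, a movies table, and placeholder credentials (none of these names are given in the post):

sqoop import \
  --connect jdbc:mysql://localhost:3306/movie \
  --username root --password 123456 \
  --table movies \
  --hive-import --hive-table movies

Once the data is in Hive, an analysis such as total box office per genre reduces to a single SQL-like statement, e.g. SELECT genre, SUM(box_office) FROM movies GROUP BY genre; (column names assumed).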

(3) Movie genre and box office statistics (movietype.py)
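A minimal pandas sketch of this kind of statistic, assuming the cleaned data sits in movies.csv with 类型 (genre) and 总票房 (total box office) columns; the actual movietype.py may differ:

#encoding:utf-8
import pandas as pd

# Assumed input: the cleaned movie table with genre and box office columns.
df = pd.read_csv("movies.csv", encoding="utf-8")

# Total, average, and count of box office per genre, highest total first.
stats = (df.groupby(u"类型")[u"总票房"]
           .agg(["sum", "mean", "count"])
           .sort_values("sum", ascending=False))
print(stats.head(10))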

(4) The relationship between director and film genre (director.py)

(5) Movie box office forecast (E-ticket box office forecast.xls)

(6) Movie score prediction (scorepredict.py)

A regression model is built with the machine-learning library scikit-learn: five users are selected at random, a linear model is trained on each user's historical ratings, each model then predicts that user's rating for a new film, and the maximum, minimum, median, and mean of the predicted ratings are output.

#encoding:utf-8
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # used by the plotting part (omitted here)
from sklearn.linear_model import LinearRegression

# Read the movie details file (fields separated by ';').
data = pd.read_csv('lianxi/film-csv.txt', encoding='utf-8', delimiter=';')
data = data.iloc[:, :-1]  # drop the malformed last column
# Drop the empty first row, remove duplicates, and reset the index.
data = data.drop(0).drop_duplicates().reset_index(drop=True)

# Split the genre field on the various separators that occur in the data.
t = []
for i in range(len(data)):
    for j in re.split(u' / |/|,|，| ', data[u'电影类型'][i]):
        t.append(j)
t = set(t)  # de-duplicate the genres

# Keep only well-formed genre names (at most two characters, plus 合家欢).
tt = []
for g in t:
    if (len(g) <= 2) | (g == u'合家欢'):
        tt.append(g)

# Score prediction: five sample user ids.
ids = [1050, 1114, 1048, 1488, 1102]
data1 = pd.read_csv('lianxi/score.log', delimiter=',', encoding='utf-8',
                    header=0, names=[u'电影名', u'userid', u'评分'])
data1 = data1[data1[u'userid'].isin(ids)]  # keep only the five users' records
data1[u'电影名'] = data1[u'电影名'].str.strip()  # strip spaces from movie names

all_preds = []  # holds one predicted score per user
for k in range(len(ids)):
    # Build and train one model per user.
    dfp1 = data1[data1[u'userid'] == ids[k]].reset_index(drop=True)
    # Merge the movie details with this user's ratings on the movie name.
    datamerge = pd.merge(data, dfp1, on=u'电影名')
    lst, lsd, lsr = [], [], []
    # Expand each (genre, director, score) combination into one sample.
    for i in range(len(datamerge)):
        for j in tt:
            if j in datamerge[u'电影类型'][i]:
                for name in re.split(u',|，|/', datamerge[u'导演'][i]):
                    lsd.append(name.replace(u' ', u''))
                    lst.append(j)
                    lsr.append(datamerge[u'评分'][i])
    # Encode director and genre names as integer codes for training.
    lsd1 = list(set(lsd))
    for i in range(len(lsd1)):
        for j in range(len(lsd)):
            if lsd1[i] == lsd[j]:
                lsd[j] = i + 1
    for i in range(len(tt)):
        for j in range(len(lst)):
            if tt[i] == lst[j]:
                lst[j] = i + 1
    lsd = pd.DataFrame(lsd, columns=[u'导演'])
    lst = pd.DataFrame(lst, columns=[u'影片类型'])
    lsr = pd.DataFrame(lsr, columns=[u'评分'])

    a = pd.concat([lsd, lst, lsr], axis=1)
    print(a)

    trainx = a.iloc[:, 0:2]  # director and genre as features
    trainy = a.iloc[:, 2:3]  # score as the target
    model = LinearRegression()
    model.fit(trainx, trainy)  # train

    # Predict the score of a hypothetical film (director code 5, genre code 10).
    anstest = pd.DataFrame([[5, 10]], columns=[u'导演', u'影片类型'])
    ans = model.predict(anstest)
    all_preds.append(ans[0][0])

print(u'maximum predicted score: ' + '%.2f' % max(all_preds))
print(u'minimum predicted score: ' + '%.2f' % min(all_preds))
print(u'median predicted score: ' + '%.2f' % np.median(all_preds))
print(u'mean predicted score: ' + '%.2f' % np.mean(all_preds))

Source: blog.csdn.net/qq_36816848/article/details/112861158