Based on Hadoop Douban movie data analysis (comprehensive experiment)

Hadoop is an important distributed architecture for processing big data. It is very important to be proficient in every component and knowledge point. With the huge amount of information generated by modern society, big data is no longer just the field of investigation: it is a powerful force that changes business practices and marketing strategies. According to BCG, big data can help dispersed retailers increase sales by 3% to 4%.

Hadoop was launched by the Apache Software Foundation in 2006. It is a set of open source software that can process and store data across computer clusters. Hadoop is mainly developed as an analysis tool. Facts have proved that it is particularly effective for big data analysis. It can handle structured and unstructured data, has massive storage capabilities, and allows to handle almost unlimited parallel tasks.

Hadoop consists of four main modules:

Distributed file system one is also called HDFS, which can store data across a network of linked storage devices; MapReduce one reads, converts and analyzes data from the database; Hadoop Common is a set of tools and libraries that can supplement other modules and ensure that it is compatible with users’ computers System compatibility; YARN a cluster system manager.

The cluster storage system can run on many devices at the same time, so it can speed up data processing. This makes Hadoop essential for any project that must process large data sets. Moreover, the framework has great flexibility and can be extended to the needs of any company.

Uses of Hadoop:

Customer analysis can provide personalized services, quotations and advertisements based on insights from user data; enterprise projects can effectively manage and process data stored on various servers; data lake-Hadoop supports the creation of original data from different information streams Extend storage, which can be structured and analyzed later.

The following shows a Hadoop comprehensive experiment as an important resource for reviewing Hadoop

Insert picture description here

Introduction to the experiment

Douban users evaluate the movies they "watched" from "very bad" to "highly recommended" every day. Douban uses algorithmic analysis to generate Douban Movie Top based on the number of people who have watched each movie and the evaluation of the movie. 250.
In order to analyze the development trend of the film industry, you need to do a statistical analysis of this information.
The data format of Douban website is a text file (must be imported into hive for processing).
The content of the file is as follows:
Insert picture description here

The indicators to be analyzed are as follows:
1. What type of movie has the highest average rating.
Required output: average genre score
2, which country is the king of bad movies (a country with an average score less than 6 points).
Required output: National average score.
All statistical indicators need to be exported to hbase for easy query. Export to 2 tables (one indicator and one table), and display the result data you wrote in the hbase shell.

In addition, you need to leave a log for this operation and upload your own operation records under /log in hdfs.
The format of the operation record is:

Number name Operation time
01 2020-12-21 10:52:12

Download the data set here

Prepare the environment

Start hdfs

start-all.sh

Start hive

hive

Insert picture description here
Create database and data table

create database douban;
use douban;
create table `douban`.`data`  (
  `id` varchar(255) ,
  `name` varchar(255) ,
  `nop` varchar(255) ,
  `typle` varchar(255),
  `pop` varchar(255),
  `rtime` varchar(255),
  `longtime` varchar(255),
  `ageyear` varchar(255),
  `grade` varchar(255),
  `plocation` varchar(255)
) 
row format delimited fields terminated by ','
stored as textfile;
分别代表:id,名字,投票人数,类型,产地,上映时间,时长,年代,评分,首映地点

Insert picture description here
Insert picture description here
Start habase

start-hbase.sh
hbase shell

Insert picture description here
Data import and load and view

LOAD DATA LOCAL INPATH '/home/hadoop/douban_movie.txt' INTO TABLE data;
select * from data;

Insert picture description here
data analysis

What type of movie has the highest average rating.
Required output: average score of type

select typle,AVG(grade) as t from data GROUP BY typle ORDER BY t DESC LIMIT 1;

Insert picture description here

哪个国家是烂片之王(平均评分小于6分的国家)。
要求输出:国家  平均分
select pop,AVG(grade) as t from data GROUP BY pop HAVING t<6;

Insert picture description here

Export data to hbase (manually insert it)

The trouble of comparing the mutual conductance between hive and hbase, we will introduce it in detail in subsequent articles

Create data sheet 1

create 'result','info'

put 'result','1','info:西部','9.1'

Insert picture description here
Insert picture description here
We found that hbase cannot be displayed in Chinese, then we will use English for the next one

create 'result_1','info'
put 'result_1','1','info:moxige,bolan','5.8,5.7'
scan 'result_1'

Insert picture description here
Upload log records to /log under hdfs

First create a folder in hdfs

hdfs dfs -mkdir /log
vi data
编号		姓名		操作时间     
01			王小王		2020-12-21 10:52:12    

Insert picture description here
Upload log record

hdfs dfs -copyFromLocal /home/hadoop/data /log/
hdfs dfs -ls /
hdfs dfs  -cat /log/data

Insert picture description here
OK, the experiment is over here, and finally I wish you all a happy Christmas Eve!
Insert picture description here

One word per text

The beginning and the end are chemical reactions together. May you have a warm self every day, come on

Guess you like

Origin blog.csdn.net/weixin_47723732/article/details/111657756