Hadoop introductory study notes - 8. Comprehensive cases of data analysis

Video course address: https://www.bilibili.com/video/BV1WY4y197g7
Course material link: https://pan.baidu.com/s/15KpnWeKpvExpKmOC8xjmtQ?pwd=5ay8

Hadoop introductory study notes (summary)

8. Comprehensive cases of data analysis

8.1. Requirements analysis

8.1.1. Background introduction

A large number of users are online on the chat platform every day, generating a large volume of chat data. Statistical analysis of this chat data makes it possible to build accurate user portraits, provide users with better services, achieve high-ROI platform operation and promotion, and supply accurate data support for the company's development decisions.
We will complete the statistical analysis of relevant indicators based on the user data of a social platform App and use BI tools to visually display the indicators.

8.1.2. Goals

Implement statistical analysis of the chat data and build a chat data analysis report based on Hadoop and Hive.

8.1.3. Requirements

  • Total message volume today
  • Hourly message volume and hourly sending/receiving user counts today
  • Volume of message data sent from each region today
  • Number of users sending and receiving messages today
  • Top 10 users who sent the most messages today
  • Top 10 users who received the most messages today
  • Distribution of senders' phone models
  • Distribution of senders' device operating systems

8.1.4. Data content

  • Data size: 300,000 records
  • Column separator: Hive's default delimiter '\001'
  • Data dictionary and sample data

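Since '\001' is a non-printing character, it helps to see it in action. A minimal Python sketch (the row below is a shortened, made-up sample, not real course data) shows how splitting on '\x01' recovers the columns:

```python
# A shortened, hypothetical sample row using Hive's default delimiter '\001'
# (real rows in this data set have 20 columns, per the data dictionary).
row = "2021-11-01 15:11:33\x01Zhang San\x0114747877194\x01male\x01196.35.123.32"
fields = row.split("\x01")
print(fields[0])    # the message send time column
print(len(fields))  # 5 columns in this shortened sample
```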

8.2. Loading data

1. Create the database and table

-- Drop the database if it already exists
drop database if exists db_msg cascade;

-- Create the database
create database db_msg;

-- Switch to the database
use db_msg;

-- Drop the table if it already exists
drop table if exists db_msg.tb_msg_source;

-- Create the source table
create table db_msg.tb_msg_source(
msg_time string comment "message send time",
sender_name string comment "sender nickname",
sender_account string comment "sender account",
sender_sex string comment "sender gender",
sender_ip string comment "sender IP address",
sender_os string comment "sender operating system",
sender_phonetype string comment "sender phone model",
sender_network string comment "sender network type",
sender_gps string comment "sender GPS location",
receiver_name string comment "receiver nickname",
receiver_ip string comment "receiver IP address",
receiver_account string comment "receiver account",
receiver_os string comment "receiver operating system",
receiver_phonetype string comment "receiver phone model",
receiver_network string comment "receiver network type",
receiver_gps string comment "receiver GPS location",
receiver_sex string comment "receiver gender",
msg_type string comment "message type",
distance string comment "distance between the two parties",
message string comment "message content"
);

2. Data import:
Upload the chat_data-30W.csv file in the course materials to the /home/hadoop directory of the node1 server;

Execute the following commands in the Linux system:

# Switch to the working directory
cd /home/hadoop

# Create the /chatdemo/data directory in HDFS
hadoop fs -mkdir -p /chatdemo/data

# Upload chat_data-30W.csv from Linux to HDFS
hadoop fs -put chat_data-30W.csv /chatdemo/data

Execute the following command in DBeaver:

-- Load the data from HDFS into the Hive table
load data inpath '/chatdemo/data/chat_data-30W.csv' into table tb_msg_source;

-- Verify that the data loaded
SELECT * FROM tb_msg_source tablesample(100 rows);

-- Verify the number of rows in the table
SELECT COUNT(*) FROM tb_msg_source;
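As an optional sanity check (a sketch, not part of the course steps), the Hive COUNT(*) result can be compared with a local record count of the file before it was uploaded; the commented-out path below is an assumption:

```python
def count_records(lines) -> int:
    """Count non-empty records in an iterable of lines."""
    return sum(1 for line in lines if line.strip())

# Usage against the local copy of the file (path is an assumption):
# with open("/home/hadoop/chat_data-30W.csv", encoding="utf-8") as f:
#     print(count_records(f))  # should match Hive's COUNT(*), i.e. 300,000
```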

8.3. ETL data cleaning and transformation

Since the original data contains some non-compliant records, it needs to be cleaned.
1. Problems with the original data

  • Problem 1: In the current data, some fields (such as sender_gps) are empty, which is not valid data;
  • Problem 2: The requirements call for message counts per day and per hour, but the data has only a full timestamp field, with no separate day and hour fields, which is hard to work with;
  • Problem 3: The requirements call for a regional map visualization based on longitude and latitude, but the GPS coordinates are stored in a single field, which is hard to work with;

2. Data cleaning requirements

  • Requirement 1: Filter out invalid records with empty fields;
  • Requirement 2: Derive day and hour fields from the time field;
  • Requirement 3: Extract longitude and latitude from the GPS field;
  • Requirement 4: Save the ETL results to a new Hive table.

-- Create the table that stores the cleaned data
create table db_msg.tb_msg_etl(
msg_time string comment "message send time",
sender_name string comment "sender nickname",
sender_account string comment "sender account",
sender_sex string comment "sender gender",
sender_ip string comment "sender IP address",
sender_os string comment "sender operating system",
sender_phonetype string comment "sender phone model",
sender_network string comment "sender network type",
sender_gps string comment "sender GPS location",
receiver_name string comment "receiver nickname",
receiver_ip string comment "receiver IP address",
receiver_account string comment "receiver account",
receiver_os string comment "receiver operating system",
receiver_phonetype string comment "receiver phone model",
receiver_network string comment "receiver network type",
receiver_gps string comment "receiver GPS location",
receiver_sex string comment "receiver gender",
msg_type string comment "message type",
distance string comment "distance between the two parties",
message string comment "message content",
msg_day string comment "message day",
msg_hour string comment "message hour",
sender_lng double comment "longitude",
sender_lat double comment "latitude"
);


-- Filter and transform the source data per the requirements, then insert into the new table
INSERT OVERWRITE TABLE db_msg.tb_msg_etl
SELECT 
	*,
	DATE(msg_time) as msg_day,
	HOUR(msg_time) as msg_hour,
	SPLIT(sender_gps, ',')[0] as sender_lng,
	SPLIT(sender_gps, ',')[1] as sender_lat
FROM db_msg.tb_msg_source
WHERE LENGTH(sender_gps) > 0;
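To make the transformation concrete, here is a small Python sketch (illustrative only, not course code) that mirrors the three cleaning rules on a single record: drop rows with an empty GPS field, derive day and hour from the timestamp, and split the GPS string into longitude and latitude:

```python
from datetime import datetime

def etl_row(msg_time: str, sender_gps: str):
    """Mirror the Hive ETL for one record; returns None for filtered rows."""
    if len(sender_gps) == 0:             # WHERE LENGTH(sender_gps) > 0
        return None
    dt = datetime.strptime(msg_time, "%Y-%m-%d %H:%M:%S")
    lng, lat = sender_gps.split(",")     # SPLIT(sender_gps, ',')[0] / [1]
    return {
        "msg_day": dt.strftime("%Y-%m-%d"),  # DATE(msg_time)
        "msg_hour": dt.hour,                 # HOUR(msg_time)
        "sender_lng": float(lng),
        "sender_lat": float(lat),
    }

print(etl_row("2021-11-01 15:11:33", "116.397128,39.916527"))
```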

After the execution completes, open the tb_msg_etl table and you will see data like the following.

Extended knowledge: ETL
The operation above, which queries data from the table tb_msg_source, filters and transforms it, and writes the results into the tb_msg_etl table, is essentially a simple ETL process.
ETL:

  • E: Extract
  • T: Transform
  • L: Load

Extract data from source A (E), transform and filter it (T), and load the result into target B (L): that is ETL.
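The three stages can be sketched as three small functions (a toy illustration with made-up rows, not the course pipeline):

```python
def extract(source):
    """E: read raw rows from the source (here, just a list)."""
    return list(source)

def transform(rows):
    """T: drop rows with an empty GPS field and normalize the OS name."""
    return [{**r, "sender_os": r["sender_os"].upper()}
            for r in rows if r["sender_gps"]]

def load(rows, sink):
    """L: write the transformed rows into the target (here, another list)."""
    sink.extend(rows)

raw = [{"sender_os": "android 8.0", "sender_gps": "116.3,39.9"},
       {"sender_os": "ios 10", "sender_gps": ""}]
target = []
load(transform(extract(raw)), target)
print(len(target))  # 1 row survives the empty-GPS filter
```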

8.4. Indicator statistics

1. Indicator 1: Statistics of the total number of messages sent daily

-- Total daily message volume
CREATE table db_msg.tb_rs_total_msg_cnt comment 'total daily message volume' as
SELECT msg_day, COUNT(*) as total_msg_cnt FROM db_msg.tb_msg_etl GROUP BY msg_day;

2. Indicator 2: Statistics of message volume and sending/receiving user counts per hour

-- Hourly message volume and sending/receiving user counts
CREATE table db_msg.tb_rs_hour_msg_cnt comment 'hourly message volume' as
SELECT
	msg_hour,
	COUNT(*) as total_msg_cnt,
	COUNT(DISTINCT sender_account) as sender_user_cnt,
	COUNT(DISTINCT receiver_account) as receiver_user_cnt
FROM
	db_msg.tb_msg_etl
GROUP BY
	msg_hour;
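The difference between COUNT(*) and COUNT(DISTINCT ...) in this query can be reproduced in plain Python (a hypothetical mini data set; only the field roles match the table):

```python
from collections import defaultdict

# Hypothetical rows: (msg_hour, sender_account, receiver_account)
rows = [
    ("15", "u1", "u2"),
    ("15", "u1", "u3"),
    ("16", "u2", "u3"),
]
stats = defaultdict(lambda: {"msgs": 0, "senders": set(), "receivers": set()})
for hour, sender, receiver in rows:
    stats[hour]["msgs"] += 1                # COUNT(*)
    stats[hour]["senders"].add(sender)      # COUNT(DISTINCT sender_account)
    stats[hour]["receivers"].add(receiver)  # COUNT(DISTINCT receiver_account)

for hour, st in sorted(stats.items()):
    print(hour, st["msgs"], len(st["senders"]), len(st["receivers"]))
# hour "15" has 2 messages but only 1 distinct sender
```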

3. Indicator 3: Statistics of the total number of messages sent in each region every day

-- Total daily messages sent per region
CREATE table db_msg.tb_rs_loc_cnt comment 'total daily messages sent per region' as
SELECT
	msg_day,
	sender_lng,
	sender_lat,
	COUNT(*) as total_msg_cnt
FROM db_msg.tb_msg_etl 
GROUP BY msg_day, sender_lng, sender_lat;

4. Indicator 4: Statistics of daily sending and receiving users

-- Daily sending and receiving user counts
CREATE table db_msg.tb_rs_user_cnt comment 'daily counts of users sending and receiving messages' as
SELECT 
	msg_day,
	COUNT(DISTINCT sender_account) as sender_user_cnt,
	COUNT(DISTINCT receiver_account) as receiver_user_cnt 
FROM db_msg.tb_msg_etl
GROUP BY msg_day;

5. Indicator 5: Statistics of the TOP10 users who sent the most messages

-- Top 10 users by number of messages sent
CREATE table db_msg.tb_rs_sneder_user_top10 comment 'top 10 users by messages sent' as
SELECT 
	sender_name,
	COUNT(*) as sender_msg_cnt 
FROM db_msg.tb_msg_etl
GROUP BY sender_name
ORDER BY sender_msg_cnt DESC
LIMIT 10;
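This query is the classic top-N aggregation: group, count, sort descending, take 10. In Python the same logic is Counter.most_common (toy data, illustrative only):

```python
from collections import Counter

# Hypothetical sender names; the real query groups 300,000 rows by sender_name.
senders = ["Wang", "Li", "Wang", "Zhang", "Wang", "Li"]
top2 = Counter(senders).most_common(2)  # top-N by count, descending
print(top2)  # [('Wang', 3), ('Li', 2)]
```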

6. Indicator 6: Statistics of the TOP10 users who received the most messages

-- Top 10 users by number of messages received
CREATE table db_msg.tb_rs_receiver_user_top10 comment 'top 10 users by messages received' as
SELECT 
	receiver_name,
	COUNT(*) as receiver_msg_cnt 
FROM db_msg.tb_msg_etl
GROUP BY receiver_name 
ORDER BY receiver_msg_cnt DESC
LIMIT 10;

7. Indicator 7: Statistics of the distribution of senders' phone models

-- Sender phone model distribution
CREATE table db_msg.tb_rs_sender_phone comment 'sender phone model distribution' as
SELECT 
	sender_phonetype,
	COUNT(*) as cnt
FROM db_msg.tb_msg_etl
GROUP BY sender_phonetype;

8. Indicator 8: Statistics on the distribution of the sender’s device operating system

-- Sender device operating system distribution
CREATE table db_msg.tb_rs_sender_os comment 'sender device operating system distribution' as
SELECT 
	sender_os,
	COUNT(*) as cnt
FROM db_msg.tb_msg_etl
GROUP BY sender_os;

8.5. FineBI installation & configuration

8.5.1. Download and installation of FineBI

1. Open the official FineBI website https://www.finebi.com/ , register, and download the FineBI personal trial edition client;

2. Install the downloaded client on your local physical machine (the same as installing any other software). After the installation completes, start the FineBI client;

3. After starting, enter the activation code provided on the FineBI official website, then click the "Use BI" button. The FineBI client now starts up (this can take a while, so be patient; during the process, OpenJDK may ask for firewall permission, which you must grant);
4. Once the FineBI client has started, the browser automatically opens http://localhost:37799/webroot/decision/login/initialization , where you can configure the administrator account and password for the BI software;
5. After the account is set up, configure the FineBI database. Like Hive, FineBI has metadata that needs to be managed. For personal use, the built-in database is fine; in a production environment, an external database is recommended;
6. After clicking "Log in directly", the BI system jumps to the login page; log in with the administrator account and password you just set;
7. After logging in to the FineBI system, you will find some built-in templates, sample data and a beginner's guide in its directory, which you can use as references when building your own templates.
At this point, the FineBI client installation is complete.

8.5.2. Configure the connection between FineBI and Hive

1. First, configure the driver isolation plug-in that FineBI needs to connect to Hive. In the FineBI system, go to "Management System - Plug-in Management - My Plug-ins - Install from Local", select fr-plugin-hive-driver-loader-3.0.zip from the FineBI folder in the course materials, and the system will install the Hive driver isolation plug-in;
2. Then, use Notepad to open the db.script file in the webapps\webroot\WEB-INF\embed\finedb directory under the FineBI installation directory, and change INSERT INTO FINE_CONF_ENTITY VALUES('SystemConfig.driverUpload','false') to INSERT INTO FINE_CONF_ENTITY VALUES('SystemConfig.driverUpload','true'). Only then can the Hive driver be uploaded.
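The change is just a flag flip inside one SQL line of db.script; a minimal sketch of the substitution (file reading and writing omitted):

```python
# The db.script line before and after the edit: only the flag value changes.
before = "INSERT INTO FINE_CONF_ENTITY VALUES('SystemConfig.driverUpload','false')"
after = before.replace("'false'", "'true'")
print(after)
```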

3. Then restart the FineBI client: close it first, then reopen it from the desktop;
4. After logging in to the system again, install the Hive driver. Open the "Hadoop Hive data connection" chapter of the official FineBI help manual at https://help.fanruan.com/finebi/doc-view-301.html , and download the matching driver package and log jar package as instructed;
5. After the download completes, unzip all the jar files from the two archives into a single folder;
6. In the system, click "Management System-Data Connection-Data Connection Management-Driver Management" to enter the driver management interface;
7. Click the "New Driver" button in the driver management interface and fill in the Driver name, and then click the "OK" button;
8. Then click the "Upload File" button and select all the jar files you just decompressed to upload;
9. After the upload completes, select the Hive driver in the "Driver" column, then click the "Save" button in the upper right corner to finish adding the Hive driver. Once it has been added successfully, click the "Exit Driver Management" button in the upper left corner;
10. Click the "New Data Connection" button in the data connection management interface to open the new data connection interface;
11. Select the "All" tab on the opened page, then click "Hadoop Hive";
12. On the Hadoop Hive page, fill in the details of the virtual machine's Hive service (i.e. the hiveserver2 service). When done, click the "Test Connection" button in the upper right corner; a "Connection Successful" prompt indicates the configuration works. Then click the "Save" button in the upper right corner, and the Hive connection is created.

At this point, the data connection configuration from FineBI to Hive is completed. The visualization panel will be configured later.

8.6. Visual display

The goal of this section is to use FineBI to configure the following visual dashboard.
1. Create the folder. After logging in to the system, click "Public Data - New Folder" to create the folder used in this case, and name it "Hive Data Analysis".
2. Select the newly created "Hive Data Analysis" folder, then click the "New Data Set" button above and select "Database Table".
3. Then select the eight indicator result tables created in the previous section, and click the "OK" button in the upper right corner;
4. After clicking "OK", the selected tables appear under the "Hive Data Analysis" folder (named after their table comments);
5. Click on each table in turn, and then click the "Update Data" button to pull the data from Hive;
6. Click "My Analysis - New Folder" and name the new folder "Hive Data Analysis";
7. Select "Hive Data Analysis" and click "New Analysis Topic". The analysis topic page will open in another browser window;
8. On the analysis topic page, select "Public Data" - the newly created "Hive Data Analysis" folder - the "Number of people sending and receiving messages every day" data set, then click the "OK" button to build the data;
9. After the data is built, click the "Components" tab at the bottom to enter component configuration, select the "KPI indicator card", drag the "sender_user_cnt" field on the left into the "Text" column, and click the text column's configuration button;
10. In the pop-up text column configuration, uncheck "Fix font size", then edit the content and change its prefix to "Number of people sending messages:";
11. Rename the component tab to "Number of people sending messages", then click the "Add Dashboard" button at the bottom of the page to add a dashboard;
12. In the dashboard, drag the newly configured "Number of people sending messages" component onto the canvas, adjust its position and size, click the drop-down button next to the component, and uncheck "Show Title";
13. Then click "Dashboard Style" in the upper right corner and select "Default Dark" to modify the background of the entire data dashboard;
14. In the same way, create a "Number of people receiving messages" component and place it on the dashboard;
15. Select the "Data" tab, click the "+" button above, select "Public Data - Hive Data Analysis - Total Daily Messages", and click OK;
16. Then add the "Total number of messages" component, configure it as above, and place it on the dashboard;
17. Follow a similar method to create the "TOP10 users who send the most messages" component;
18. Follow a similar method to create the "Sending users' operating system distribution" component;
19. Follow a similar method to create the map component;
20. Follow a similar method to create the "TOP10 users who received the most messages" component;
21. Follow a similar method to create the "Distribution of sending users' phone models" component;
22. Follow a similar method to create the "Hourly message volume trend" component;
That concludes this course.


Origin blog.csdn.net/whh306318848/article/details/135361970