Video course address: https://www.bilibili.com/video/BV1WY4y197g7
Course material link: https://pan.baidu.com/s/15KpnWeKpvExpKmOC8xjmtQ?pwd=5ay8
Hadoop introductory study notes (summary)
Table of contents
8. Comprehensive cases of data analysis
8.1. Requirements analysis
8.1.1. Background introduction
A chat platform has a large number of users online every day, generating large volumes of chat data. Statistical analysis of this chat data makes it possible to build accurate user profiles, provide users with better services, operate and promote the platform at a high ROI, and supply accurate data to support the company's business decisions.
We will complete the statistical analysis of relevant indicators based on the user data of a social platform App and use BI tools to visually display the indicators.
8.1.2. Goals
Implement statistical analysis of chat data based on Hadoop and Hive, and build a chat data analysis report.
8.1.3. Requirements
- Statistics of the total message volume today
- Statistics of the message volume and the number of sending and receiving users per hour today
- Statistics of the message volume sent from each region today
- Count the number of users who sent and received messages today
- Statistics of the Top 10 users who sent the most messages today
- Statistics of the Top 10 users who received the most messages today
- Statistics on the distribution of mobile phone models of the sender
- Statistics on the distribution of the sender’s device operating system
8.1.4. Data content
- Data size: 300,000 records
- Column separator: Hive's default delimiter '\001'
- Data dictionary and sample data
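Hive's default delimiter '\001' is the non-printable SOH control character. As an illustration only (not part of the course materials), a Python sketch of how such a record splits into fields, assuming the column order follows the tb_msg_source table defined in the next section; the sample values are hypothetical:

```python
# Illustration: splitting a Hive-default-delimited ('\001') record.
# '\001' is the SOH control character, written "\x01" in Python.
SEP = "\x01"

def parse_record(line: str) -> dict:
    """Split one raw line into named fields (first few columns only)."""
    fields = line.rstrip("\n").split(SEP)
    return {
        "msg_time": fields[0],
        "sender_name": fields[1],
        "sender_account": fields[2],
    }

sample = "2021-11-01 15:11:33\x01Alice\x0113800000000"
print(parse_record(sample))
```

This is only meant to show why the column separator matters: Hive uses the same splitting rule when it reads the file into the table.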
8.2. Loading data
1. Create database table
-- Drop the database if it already exists
drop database if exists db_msg cascade;
-- Create the database
CREATE database db_msg;
-- Switch to the database
use db_msg;
-- Drop the table if it already exists
drop table if exists db_msg.tb_msg_source;
-- Create the table
create table db_msg.tb_msg_source(
msg_time string comment "message send time",
sender_name string comment "sender nickname",
sender_account string comment "sender account",
sender_sex string comment "sender gender",
sender_ip string comment "sender IP address",
sender_os string comment "sender operating system",
sender_phonetype string comment "sender phone model",
sender_network string comment "sender network type",
sender_gps string comment "sender GPS location",
receiver_name string comment "receiver nickname",
receiver_ip string comment "receiver IP",
receiver_account string comment "receiver account",
receiver_os string comment "receiver operating system",
receiver_phonetype string comment "receiver phone model",
receiver_network string comment "receiver network type",
receiver_gps string comment "receiver GPS location",
receiver_sex string comment "receiver gender",
msg_type string comment "message type",
distance string comment "distance between the two parties",
message string comment "message content"
);
2. Data import:
Upload the chat_data-30W.csv file in the course materials to the /home/hadoop directory of the node1 server;
Execute the following commands in the Linux system:
# Switch to the working directory
cd /home/hadoop
# Create the /chatdemo/data directory in HDFS
hadoop fs -mkdir -p /chatdemo/data
# Upload chat_data-30W.csv from Linux to HDFS
hadoop fs -put chat_data-30W.csv /chatdemo/data
Execute the following command in DBeaver:
-- Load the data from HDFS into the Hive table
load data inpath '/chatdemo/data/chat_data-30W.csv' into table tb_msg_source;
-- Verify the data load
SELECT * FROM tb_msg_source tablesample(100 rows);
-- Verify the number of rows in the table
SELECT COUNT(*) FROM tb_msg_source;
8.3. ETL data cleaning and transformation
Since the raw data contains some non-compliant records, it needs to be cleaned.
1. Problems with original data
- Problem 1: in the current data, some fields (such as sender_gps) are empty, which makes those records invalid;
- Problem 2: the requirements call for message counts per day and per hour, but the data contains only a single full timestamp field, with no separate day or hour fields, which is hard to work with;
- Problem 3: the requirements call for a map visualization based on longitude and latitude, but the GPS coordinates in the data are stored in a single field, which is hard to work with;
2. Data cleaning needs
- Requirement 1: Filter out invalid records with empty fields;
- Requirement 2: Derive day and hour fields from the time field;
- Requirement 3: Extract longitude and latitude from the GPS field;
- Requirement 4: Save the results of ETL to a new Hive table.
-- Create the table that stores the cleaned data
create table db_msg.tb_msg_etl(
msg_time string comment "message send time",
sender_name string comment "sender nickname",
sender_account string comment "sender account",
sender_sex string comment "sender gender",
sender_ip string comment "sender IP address",
sender_os string comment "sender operating system",
sender_phonetype string comment "sender phone model",
sender_network string comment "sender network type",
sender_gps string comment "sender GPS location",
receiver_name string comment "receiver nickname",
receiver_ip string comment "receiver IP",
receiver_account string comment "receiver account",
receiver_os string comment "receiver operating system",
receiver_phonetype string comment "receiver phone model",
receiver_network string comment "receiver network type",
receiver_gps string comment "receiver GPS location",
receiver_sex string comment "receiver gender",
msg_type string comment "message type",
distance string comment "distance between the two parties",
message string comment "message content",
msg_day string comment "message day",
msg_hour string comment "message hour",
sender_lng double comment "longitude",
sender_lat double comment "latitude"
);
-- Filter the rows of the source table as required, then insert them into the new table
INSERT OVERWRITE TABLE db_msg.tb_msg_etl
SELECT
*,
DATE(msg_time) as msg_day,
HOUR(msg_time) as msg_hour,
SPLIT(sender_gps, ',')[0] as sender_lng,
SPLIT(sender_gps, ',')[1] as sender_lat
FROM db_msg.tb_msg_source
WHERE LENGTH(sender_gps) > 0;
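For intuition, the same cleaning logic can be sketched in plain Python (illustration only, not part of the course code; field names mirror the Hive tables, and msg_time is assumed to be in 'YYYY-MM-DD HH:MM:SS' format):

```python
from datetime import datetime

def etl_row(row):
    """Mirror of the Hive ETL above: drop rows with an empty sender_gps,
    derive day/hour from msg_time, and split sender_gps into lng/lat."""
    if not row.get("sender_gps"):               # Requirement 1: filter empty GPS
        return None
    ts = datetime.strptime(row["msg_time"], "%Y-%m-%d %H:%M:%S")
    lng, lat = row["sender_gps"].split(",")
    out = dict(row)
    out["msg_day"] = ts.strftime("%Y-%m-%d")    # Requirement 2: day field
    out["msg_hour"] = ts.hour                   # ... and hour field
    out["sender_lng"] = float(lng)              # Requirement 3: longitude
    out["sender_lat"] = float(lat)              # ... and latitude
    return out

print(etl_row({"msg_time": "2021-11-01 15:11:33", "sender_gps": "116.39,39.90"}))
```

Each requirement maps to one line of the Hive INSERT: the WHERE clause, the DATE()/HOUR() calls, and the two SPLIT() expressions.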
After execution completes, open the tb_msg_etl table and you will see data like the following.
Extended knowledge: ETL
The operation above, which queries data from tb_msg_source, filters and transforms it, and writes the results into tb_msg_etl, is essentially a simple ETL process.
ETL:
- E: Extract
- T: Transform
- L: Load
Extract data from A (E), transform and filter it (T), and load the results into B (L): that is ETL.
8.4. Indicator statistics
1. Indicator 1: Statistics of the total number of messages sent daily
-- Total number of messages per day
CREATE table db_msg.tb_rs_total_msg_cnt comment 'total number of messages per day' as
SELECT msg_day, COUNT(*) as total_msg_cnt FROM db_msg.tb_msg_etl GROUP BY msg_day;
2. Indicator 2: Statistics of message volume, number of sending and receiving users per hour
-- Hourly message volume and number of sending/receiving users
CREATE table db_msg.tb_rs_hour_msg_cnt comment 'hourly message volume' as
SELECT
msg_hour,
COUNT(*) as total_msg_cnt,
COUNT(DISTINCT sender_account) as sender_user_cnt,
COUNT(DISTINCT receiver_account) as receiver_user_cnt
FROM
db_msg.tb_msg_etl
GROUP BY
msg_hour;
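For intuition, indicator 2 can be reproduced in plain Python (illustration only): COUNT(*) becomes a running counter, and COUNT(DISTINCT ...) becomes the size of a set.

```python
from collections import defaultdict

def hourly_stats(rows):
    """Per-hour message count plus distinct sender/receiver accounts,
    mirroring the GROUP BY msg_hour query above."""
    stats = defaultdict(lambda: {"msgs": 0, "senders": set(), "receivers": set()})
    for r in rows:
        s = stats[r["msg_hour"]]
        s["msgs"] += 1                                # COUNT(*)
        s["senders"].add(r["sender_account"])         # COUNT(DISTINCT sender)
        s["receivers"].add(r["receiver_account"])     # COUNT(DISTINCT receiver)
    return {h: (v["msgs"], len(v["senders"]), len(v["receivers"]))
            for h, v in stats.items()}

rows = [
    {"msg_hour": 15, "sender_account": "u1", "receiver_account": "u2"},
    {"msg_hour": 15, "sender_account": "u1", "receiver_account": "u3"},
]
print(hourly_stats(rows))  # → {15: (2, 1, 2)}
```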
3. Indicator 3: Statistics of the total number of messages sent in each region every day
-- Total messages sent per day by region
CREATE table db_msg.tb_rs_loc_cnt comment 'total messages sent per day by region' as
SELECT
msg_day,
sender_lng,
sender_lat,
COUNT(*) as total_msg_cnt
FROM db_msg.tb_msg_etl
GROUP BY msg_day, sender_lng, sender_lat;
4. Indicator 4: Statistics of daily sending and receiving users
-- Number of sending and receiving users per day
CREATE table db_msg.tb_rs_user_cnt comment 'number of users sending and receiving messages per day' as
SELECT
msg_day,
COUNT(DISTINCT sender_account) as sender_user_cnt,
COUNT(DISTINCT receiver_account) as receiver_user_cnt
FROM db_msg.tb_msg_etl
GROUP BY msg_day;
5. Indicator 5: Statistics of the TOP10 users who sent the most messages
-- Top 10 users who sent the most messages
CREATE table db_msg.tb_rs_sender_user_top10 comment 'top 10 users by messages sent' as
SELECT
sender_name,
COUNT(*) as sender_msg_cnt
FROM db_msg.tb_msg_etl
GROUP BY sender_name
-- ORDER BY gives a global ordering; SORT BY only orders within each reducer
ORDER BY sender_msg_cnt DESC
LIMIT 10;
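The group-by / sort-descending / LIMIT 10 pattern above is a top-N aggregation. As an illustration only, the same idea in plain Python using collections.Counter:

```python
from collections import Counter

def top_senders(rows, n=10):
    """Top-N senders by message count: the Python analogue of grouping
    by sender_name and keeping the n largest counts."""
    return Counter(r["sender_name"] for r in rows).most_common(n)

rows = [{"sender_name": x} for x in ["A", "B", "A", "C", "A", "B"]]
print(top_senders(rows, n=2))  # → [('A', 3), ('B', 2)]
```

Indicator 6 below is identical except that it groups by receiver_name.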
6. Indicator 6: Statistics of the TOP10 users who received the most messages
-- Top 10 users who received the most messages
CREATE table db_msg.tb_rs_receiver_user_top10 comment 'top 10 users by messages received' as
SELECT
receiver_name,
COUNT(*) as receiver_msg_cnt
FROM db_msg.tb_msg_etl
GROUP BY receiver_name
-- ORDER BY gives a global ordering; SORT BY only orders within each reducer
ORDER BY receiver_msg_cnt DESC
LIMIT 10;
7. Indicator 7: Statistics of the sender’s mobile phone model
-- Distribution of sender phone models
CREATE table db_msg.tb_rs_sender_phone comment 'distribution of sender phone models' as
SELECT
sender_phonetype,
COUNT(*) as cnt
FROM db_msg.tb_msg_etl
GROUP BY sender_phonetype;
8. Indicator 8: Statistics on the distribution of the sender’s device operating system
-- Distribution of sender device operating systems
CREATE table db_msg.tb_rs_sender_os comment 'distribution of sender operating systems' as
SELECT
sender_os,
COUNT(*) as cnt
FROM db_msg.tb_msg_etl
GROUP BY sender_os;
8.5. FineBI installation & configuration
8.5.1. Download and installation of FineBI
1. Open the FineBI official website https://www.finebi.com/ , register, and download the FineBI personal trial client;
2. Install the downloaded client on your local machine (the same as installing any other software), then start the FineBI client;
3. After starting, enter the activation code provided on the FineBI official website, then click the "Use BI" button. The FineBI client will begin starting up (this can take a while, so be patient; during the process, OpenJDK may request firewall permissions, which must be allowed);
4. Once the FineBI client starts successfully, the browser automatically opens http://localhost:37799/webroot/decision/login/initialization and enters the configuration page, where you can set the administrator account and password for the BI software;
5. After setting up the account, you need to configure the FineBI database. Like Hive, FineBI has metadata that must be managed. For personal use, FineBI's built-in database is sufficient; for a production environment, an external database is recommended;
6. After clicking "Log in directly", the BI system jumps to the login interface; log in with the administrator account and password you just set;
7. After logging in to the FineBI system, you will find some built-in templates, sample data, and a beginner's guide in its directory, which you can use as references when configuring your own templates.
At this point, the FineBI client is installed.
8.5.2. Configure the connection between FineBI and Hive
1. Next, configure the isolation plug-in that lets FineBI connect to Hive. In the FineBI system, go to "Management System - Plug-in Management - My Plug-in - Install from Local", then select fr-plugin-hive-driver-loader-3.0.zip from the FineBI folder in the course materials; the system will install the Hive isolation plug-in;
2. Then, use Notepad to open the db.script file in the webapps\webroot\WEB-INF\embed\finedb directory under the FineBI installation directory, and change INSERT INTO FINE_CONF_ENTITY VALUES('SystemConfig.driverUpload','false') to INSERT INTO FINE_CONF_ENTITY VALUES('SystemConfig.driverUpload','true'). Only then can the Hive driver be installed;
3. Then restart the FineBI client: close it first, then reopen it from the desktop;
4. After logging in to the system again, you need to install the Hive driver. Open the Hadoop Hive data connection chapter of the official FineBI help manual: https://help.fanruan.com/finebi/doc-view-301.html , and download the matching driver package and log jar package according to its instructions;
5. After the download completes, unzip all the jar files from the two archives into one folder;
6. In the system, click "Management System - Data Connection - Data Connection Management - Driver Management" to enter the driver management interface;
7. Click the "New Driver" button in the driver management interface, fill in the driver name, and then click the "OK" button;
8. Then click the "Upload File" button and select all the jar files you just unzipped to upload them;
9. After the upload completes, select the Hive driver in the "Driver" column, then click the "Save" button in the upper right corner to finish adding the Hive driver. Once it is added successfully, click the "Exit Driver Management" button in the upper left corner;
10. Click the "New Data Connection" button in the data connection management interface to open the new data connection interface;
11. Select the "All" tab on the opened page, then click "Hadoop Hive";
12. Fill in the details of the virtual machine's Hive service (i.e. the hiveserver2 service) on the Hadoop Hive page. When done, click the "Test Connection" button in the upper right corner; a "Connection Successful" prompt indicates the configuration works. Then click the "Save" button in the upper right corner, and the Hive connection is created.
At this point, the data connection from FineBI to Hive is configured. The visualization panel will be configured next.
8.6. Visual display
The goal of this section is to use FineBI to configure the following visual dashboard.
1. Create a report. After logging in to the system, click "Public Data - New Folder" to create the folder used in this case, and name it "Hive Data Analysis";
2. Select the newly created "Hive Data Analysis" folder, then click the "New Data Set" button above and select "Database Table";
3. Select the 8 indicator tables created in the previous section, then click the "OK" button in the upper right corner;
4. After clicking "OK", the selected tables (named after their table comments) appear under the "Hive Data Analysis" folder;
5. Click each table in turn and click the "Update Data" button to pull its data from Hive;
6. Click "My Analysis - New Folder" and name the new folder "Hive Data Analysis";
7. Select "Hive Data Analysis" and click "New Analysis Topic"; the analysis topic page opens in another browser window;
8. On the analysis topic page, select the "number of users sending and receiving messages per day" data set from the newly created Hive Data Analysis folder under "Public Data", then click the "OK" button to build the data;
9. After the build completes, click the "Components" tab below to enter component configuration, select "KPI indicator card", drag the "sender_user_cnt" field on the left into the "Text" column, and then click the text column's configuration button;
10. In the pop-up text column configuration, uncheck "Fixed font size", then edit the content and set its prefix to "Number of people sending messages:";
11. Rename the component tab to "Number of people sending messages", then click the "Add Dashboard" button at the bottom of the page to add a dashboard;
12. In the dashboard, drag the newly configured "Number of people sending messages" component onto the dashboard, adjust its position and size, click the drop-down button next to the component, and uncheck "Show Title";
13. Click "Dashboard Style" in the upper right corner and select "Default Dark" to change the background of the entire data dashboard;
14. In the same way, create a "Number of people receiving messages" component and place it on the dashboard;
15. Select the "Data" tab, click the "+" button above, select "Public Data - Hive Data Analysis - Total number of messages per day", and click OK;
16. Add the "Total number of messages" component, complete its configuration as above, and place it on the dashboard;
17. Follow a similar method to create the "Top 10 users who sent the most messages" component;
18. Follow a similar method to create the "Sender operating system distribution" component;
19. Follow a similar method to create the map component;
20. Follow a similar method to create the "Top 10 users who received the most messages" component;
21. Follow a similar method to create the "Sender phone model distribution" component;
22. Follow a similar method to create the "Hourly message volume trend" component.
That is the end of this course.