Nginx + Flume network traffic log real-time data analysis in practice

Network Traffic Log Data Analysis - Overview

Except for government and public-welfare sites, most websites exist to generate revenue, or to put it bluntly, to make money. To build a website that users actually need, website analysis is required: through analysis we find out users' real needs and build a site that meets them.
Website analysis helps administrators, operators, and promotion staff obtain traffic information in real time, and provides a data basis for analysis from angles such as traffic sources, site content, and visitor characteristics. This in turn helps increase traffic, improve the user experience, convert more visitors into members or customers, and maximize revenue with less investment.


Network traffic log data analysis - data processing flow

The overall flow of website traffic log analysis follows the general data-processing pipeline. In short, it answers two questions: where does the data come from, and where does the data go? It can be divided into the following major steps:

(Figure: overall flow — data collection → data preprocessing → data loading into the warehouse → indicator analysis → result export)

Network traffic log data analysis - data acquisition

website log files

Recording website log files is the most basic data-acquisition method and is done mainly on the server side, by enabling the corresponding logging feature on the website's application server. Many web servers ship with built-in logging, such as Nginx's access.log.


Start the nginx server:

/usr/local/nginx/sbin/nginx

/usr/local/nginx/sbin/nginx -s stop #Stop the server.

Access via browser: http://192.168.88.100/


Refresh the page to view the log information:

[root@node1 logs]# tail -f /usr/local/nginx/logs/access.log
192.168.88.1 - - [03/Feb/2021:16:50:15 +0800] "GET /img/zy03.jpg HTTP/1.1" 200 90034 "http://192.168.88.100/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36"
192.168.88.1 - - [03/Feb/2021:16:50:15 +0800] "GET /img/title2.jpg HTTP/1.1" 200 1703 "http://192.168.88.100/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36"

Log Field Explanation

1. Visitor IP address: 58.215.204.118
2. Visitor user info: - -
3. Request time: [18/Sep/2018:06:51:35 +0000]
4. Request method: GET
5. Requested URL: /wp-includes/js/jquery/jquery.js?ver=1.10.2
6. Request protocol: HTTP/1.1
7. Response status code: 304
8. Bytes returned in the response: 0
9. Referer (source URL) of the visit: http://blog.fens.me/nodejs-socketio-chat/
10. Visitor's browser (User-Agent): Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0
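These ten fields match nginx's standard combined log format. For reference, a minimal sketch of an equivalent log_format definition in nginx.conf (the format name main is arbitrary; nginx's built-in combined format already produces exactly this layout):

http {
    # equivalent to nginx's built-in "combined" log format
    log_format  main  '$remote_addr - $remote_user [$time_local] '
                      '"$request" $status $body_bytes_sent '
                      '"$http_referer" "$http_user_agent"';

    access_log  logs/access.log  main;
}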

Network traffic log data analysis - data collection - Flume framework

Flume overview

Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive amounts of log data, originally provided by Cloudera.
At its core, Flume gathers data from a data source (source) and delivers it to a specified destination (sink). To guarantee delivery, the data is buffered in a channel before being sent to the sink; only after the data has actually reached the sink does Flume delete it from the buffer.
There are two generations of Flume: the 0.9.x releases are collectively called Flume OG (original generation), and the 1.x releases are called Flume NG (next generation). Flume NG went through a refactoring of its core components, configuration, and code architecture, so it differs greatly from Flume OG; be careful to distinguish the two. The rename also coincided with Flume moving under the Apache umbrella, when Cloudera Flume became Apache Flume.


Flume operation mechanism

The core role in the Flume system is the agent, which itself is a Java process that generally runs on the log collection node.

(Figure: Flume agent internals — Source → Channel → Sink)

Each agent acts as a data forwarder and contains three components:
Source: the collection source, which connects to the data source and obtains data;
Sink: the destination of the collected data, which forwards data to the next agent or writes it to the final storage system;
Channel: the internal transmission channel of the agent, used to move data from the source to the sink.
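As a minimal illustration of how these three components are wired together, the classic single-node example from the Flume user guide (netcat source, memory channel, logger sink) looks like this; it is only a sketch and not the collection configuration used later:

# example.conf: a single-node Flume agent named a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: listen for lines of text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# sink: write events to the agent's log (console)
a1.sinks.k1.type = logger

# channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1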

Flume installation and deployment

Flume has been installed in the virtual machine provided by the course.

Flume collection

Now collect the nginx log data into HDFS. The Flume agent configuration (web_log.conf, referenced in the run command below) is:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
 
# Describe/configure the source
#a1.sources.r1.type = exec
#a1.sources.r1.command = tail -F /usr/local/nginx/logs/access.log
#a1.sources.r1.channels = c1
 
a1.sources = r1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /usr/local/nginx/logs/access*.log
…..
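The channel and sink settings are omitted above ('…..'). A possible completion that buffers events in memory and writes to HDFS could look like the following; the HDFS path and roll settings are illustrative assumptions, not the course's exact values:

# channel: buffer events in memory (illustrative sizes)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# sink: write events to HDFS (the path below is an assumed example)
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/weblog/%Y-%m-%d/
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0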

Run flume collection

 /export/server/flume-1.8.0/bin/flume-ng agent -c conf -f /export/server/flume-1.8.0/conf/web_log.conf -n a1  -Dflume.root.logger=INFO,console


Data preprocessing - cleaning

The log data collected by Flume cannot be used directly for analysis; it must be preprocessed first. The preprocessing requirements are:
1: Discard log records whose field count is insufficient
2: Remove fields that are meaningless for the analysis
3: Convert the date format
4: Split the original log records, using '\001' as the field delimiter
5: Mark whether each record is valid

Data cleaning:
hadoop jar /export/data/mapreduce/web_log.jar cn.itcast.bigdata.weblog.pre.WeblogPreProcess   

Network traffic log data analysis - click stream model data

clickstream concept

Click stream (clickstream) refers to a user's continuous visit trail on a website; it focuses on the whole process of the user browsing the site. Each visit to the website consists of a series of click actions, and this click-behavior data makes up the clickstream data, which represents the entire process of the user browsing the site.


In the click stream model, there are two types of model data: PageViews and Visits.

Clickstream model pageviews

The Pageviews model focuses on identifying each session of a user, the number of steps visited within each session, and the dwell time of each step.
In website analysis, two consecutive access records whose time difference is within 30 minutes are usually counted as the same session; if the gap exceeds 30 minutes, the next visit is counted as the start of a new session.
The general steps are as follows (a HiveQL sketch of the sessionization logic follows this list):
find all access records of a user in the access logs;
sort that user's records in ascending order of time;
compute the time difference between each pair of consecutive records;
if the difference is less than 30 minutes, the later record continues the same session;
if it is more than 30 minutes, it marks the start of the next session;
the time difference between consecutive records is taken as the dwell time of the earlier step;
for the last step of a session, and for sessions with only one step, the business assigns a default page dwell time of 60 s.
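As an aside, the 30-minute sessionization logic itself can be sketched in HiveQL with window functions; the table raw_weblog and its columns here are hypothetical, and the course actually performs this step with the MapReduce job below:

-- Sketch only: assumes a cleaned table raw_weblog(remote_addr, time_local)
-- with time_local in 'yyyy-MM-dd HH:mm:ss' format.
select
  remote_addr,
  time_local,
  -- running total of "new session" flags per visitor = session number
  sum(new_session) over (partition by remote_addr order by time_local) as session_no
from (
  select
    remote_addr,
    time_local,
    case
      when lag(time_local) over (partition by remote_addr order by time_local) is null
        or unix_timestamp(time_local)
           - unix_timestamp(lag(time_local) over (partition by remote_addr order by time_local)) > 30 * 60
      then 1 else 0
    end as new_session
  from raw_weblog
) t;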

--Generate the pageviews model
hadoop jar /export/data/mapreduce/web_log.jar  cn.itcast.bigdata.weblog.clickstream.ClickStreamPageView   

Clickstream Model Visits

The Visits model focuses on the entry and exit information of each session: for example, the page and time at which the user entered the session, the page from which the user left when the session ended and the time of leaving, and how many pages were visited in the session in total.
The general steps are as follows (a HiveQL sketch follows this list):
work on top of the pageviews model;
for each session, sort all access records in ascending order of time;
the time and page of the first record are the session's start time and entry page;
by business convention, the time and page of the last record are taken as the departure time and exit page.
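Likewise, purely as a sketch (the input table pv_model with columns session, time_local, request is hypothetical; the course uses the MapReduce job below), the per-session aggregation could be written as:

-- Sketch only: pv_model(session, time_local, request) is a hypothetical pageviews-style input.
select distinct
  session,
  first_value(time_local) over (partition by session order by time_local) as inTime,
  last_value(time_local)  over (partition by session order by time_local
                                rows between unbounded preceding and unbounded following) as outTime,
  first_value(request)    over (partition by session order by time_local) as inPage,
  last_value(request)     over (partition by session order by time_local
                                rows between unbounded preceding and unbounded following) as outPage,
  count(*)                over (partition by session) as pageVisits
from pv_model;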

--Generate the visits model
hadoop jar /export/data/mapreduce/web_log.jar cn.itcast.bigdata.weblog.clickstream.ClickStreamVisit    

Web log data analysis - data loading

For analyzing the log data, the Hive warehouse is likewise organized into three layers: the ODS layer, the DW layer, and the APP layer.

create database

create database if not exists  web_log_ods;
create database if not exists  web_log_dw;
create database if not exists  web_log_app;

Create ODS layer data table

raw log data table

drop table if exists web_log_ods.ods_weblog_origin;
create table web_log_ods.ods_weblog_origin(
valid string,           --valid flag
remote_addr string,     --visitor IP
remote_user string,     --visitor user info
time_local string,      --request time
request string,         --requested URL
status string,          --response status code
body_bytes_sent string, --response bytes
http_referer string,    --referer URL
http_user_agent string  --visitor user agent
)
partitioned by (dt string)
row format delimited fields terminated by '\001';

clickstream model pageviews table

drop table if exists web_log_ods.ods_click_pageviews;
create table web_log_ods.ods_click_pageviews(
session string,         --session id
remote_addr string,     --visitor IP
remote_user string,     --visitor user info
time_local string,      --request time
request string,         --requested URL
visit_step string,      --visit step number
page_staylong string,   --page dwell time (seconds)
http_referer string,    --referer URL
http_user_agent string, --visitor user agent
body_bytes_sent string, --response bytes
status string           --response status code
)
partitioned by (dt string)
row format delimited fields terminated by '\001';

Clickstream visits model table

drop table if exists web_log_ods.ods_click_stream_visits;
create table web_log_ods.ods_click_stream_visits(
session     string, --session id
remote_addr string, --visitor IP
inTime      string, --session start time
outTime     string, --session end time
inPage      string, --session entry page
outPage     string, --session exit page
referal     string, --referer URL
pageVisits  int     --number of pages visited in the session
)
partitioned by (dt string)
row format delimited fields terminated by '\001';

Table data loading

load data inpath '/output/web_log/pre_web_log' overwrite into table  web_log_ods.ods_weblog_origin partition(dt='2021-02-01');
 
load data inpath '/output/web_log/pageviews' overwrite into table web_log_ods.ods_click_pageviews partition(dt='2021-02-01');
 
load data inpath '/output/web_log/visits' overwrite into table web_log_ods.ods_click_stream_visits partition(dt='2021-02-01');

Network log data analysis - implementation of detailed table and wide table

concept

In a fact table, several attributes are often combined into a single field; for example year, month, day, hour, minute, and second are stored together as one time field. When statistics need to be grouped by one of those attributes, the field has to be cut apart and re-spliced, which is extremely inefficient.

To make analysis easier, such a field in the fact table can be split and its parts extracted into new fields. Because the result has more fields it is called a wide table, and the original becomes a narrow table.
And because the information in the wide table is clearer and more detailed, it is also called a detail table.

drop table web_log_dw.dw_weblog_detail;
create table web_log_dw.dw_weblog_detail(
valid           string, --valid flag
remote_addr     string, --source IP
remote_user     string, --user identifier
time_local      string, --full access time
daystr          string, --access date
timestr         string, --access time
month           string, --access month
day             string, --access day
hour            string, --access hour
request         string, --requested URL
status          string, --response code
body_bytes_sent string, --bytes transferred
http_referer    string, --referer URL
ref_host        string, --referer host
ref_path        string, --referer path
ref_query       string, --referer query string
ref_query_id    string, --value of the referer query parameter id
http_user_agent string  --client user agent
)
partitioned by(dt string)
row format delimited fields terminated by '\001';

Insert data into the detail wide table dw_weblog_detail via a query. Here the Hive built-in function parse_url_tuple is used to parse the URL. Save the following SQL as /export/data/hive_sql/web_log_detail.sql:

insert into table web_log_dw.dw_weblog_detail partition(dt='2021-02-01')
select c.valid,c.remote_addr,c.remote_user,c.time_local,
substring(c.time_local,1,10) as daystr,
substring(c.time_local,12) as tmstr,
substring(c.time_local,6,2) as month,
substring(c.time_local,9,2) as day,
substring(c.time_local,12,2) as hour,
c.request,c.status,c.body_bytes_sent,c.http_referer,c.ref_host,c.ref_path,c.ref_query,c.ref_query_id,c.http_user_agent
from
(SELECT
a.valid,a.remote_addr,a.remote_user,a.time_local,
a.request,a.status,a.body_bytes_sent,a.http_referer,a.http_user_agent,b.ref_host,b.ref_path,b.ref_query,b.ref_query_id
FROM web_log_ods.ods_weblog_origin a LATERAL VIEW
parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id') b as ref_host, ref_path, ref_query,
 ref_query_id) c;

Execute the SQL file to extract and convert the fields into the wide table: hive -f '/export/data/hive_sql/web_log_detail.sql'
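For reference, here is a small standalone illustration of what parse_url_tuple extracts; the URL literal below is a made-up example:

select b.ref_host, b.ref_path, b.ref_query, b.ref_query_id
from (select 'http://192.168.88.100/category?id=20' as url) a
lateral view parse_url_tuple(a.url, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b
     as ref_host, ref_path, ref_query, ref_query_id;
-- returns: 192.168.88.100    /category    id=20    20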

Network log data analysis - APP layer data indicator development

Analysis of basic indicators

--Page views (PV)
select count(*) as pvs from  web_log_dw.dw_weblog_detail where valid = true and dt='2021-02-01';
 
--Unique visitors (UV)
select count(distinct remote_addr) as uvs from  web_log_dw.dw_weblog_detail where valid = true and dt='2021-02-01';
 
--Visits / sessions (VV)
select count(session) from web_log_ods.ods_click_stream_visits where dt='2021-02-01';


Basic indicator storage

--Store the basic indicators
drop table if exists web_log_app.app_webflow_basic_info;
create table web_log_app.app_webflow_basic_info(date_val string,pvs bigint,uvs bigint,vvs bigint) partitioned by(dt string);
 
--Allow Cartesian product (cross join)
set spark.sql.crossJoin.enabled=true;
 
insert into table web_log_app.app_webflow_basic_info partition(dt='2021-02-01')
select '2021-02-01',a.*,b.* from
( 
   select count(*) as pvs,count(distinct remote_addr) as uvs from web_log_dw.dw_weblog_detail  where dt='2021-02-01'
) a 
join 
(
 select count(session) as vvs from web_log_ods.ods_click_stream_visits where dt='2021-02-01'
) b;

Basic indicator analysis - multi-dimensional analysis

--Compute hourly PVs within this processing batch (one day)
drop table web_log_app.app_pvs_everyhour_oneday;
create table web_log_app.app_pvs_everyhour_oneday(month string,day string,hour string,pvs bigint) partitioned by(dt string);
 
insert into table web_log_app.app_pvs_everyhour_oneday partition(dt='2021-02-01')
select a.month as month,a.day as day,a.hour as hour,count(*) as pvs from web_log_dw.dw_weblog_detail  a
where  a.dt='2021-02-01' group by a.month,a.day,a.hour;
 
--Compute daily PVs
drop table web_log_app.app_pvs_everyday;
create table web_log_app.app_pvs_everyday(pvs bigint,month string,day string);
 
insert into table web_log_app.app_pvs_everyday
select count(*) as pvs,a.month as month,a.day as day from web_log_dw.dw_weblog_detail  a
group by a.month,a.day;

Composite indicator analysis

--Composite indicator statistics
 
--Pages per visitor (average visit depth)
 --Requirement: count the average number of pages requested by all of today's visitors.
 --total page requests (PV) / deduplicated visitor count (UV)
 
drop table web_log_app.app_avgpv_user_everyday;
create table web_log_app.app_avgpv_user_everyday(
day string,
avgpv string);
        
 
--Method 1:
insert into table web_log_app.app_avgpv_user_everyday
select '2021-02-01',pvs/uvs from web_log_app.app_webflow_basic_info;
 
--Method 2:
 
insert  into table web_log_app.app_avgpv_user_everyday
select '2021-02-01',sum(b.pvs)/count(b.remote_addr) from
(select remote_addr,count(*) as pvs from web_log_dw.dw_weblog_detail where dt='2021-02-01' group by remote_addr) b;


--Average visit duration
 
    --The average time each visit (session) stays on the site.
    --Reflects how attractive the site is to visitors.
    --Average visit duration = total visit duration / number of visits.
 
--First compute the dwell time of each session
 
select session, sum(page_staylong) as web_staylong from web_log_ods.ods_click_pageviews where dt='2021-02-01'
group by session;
 
 
--Compute the average visit duration
select
sum(a.web_staylong)/count(a.session)
from 
(select session, sum(page_staylong) as web_staylong from web_log_ods.ods_click_pageviews where dt='2021-02-01'
group by session) a;


Composite indicator analysis - TopN

--Hot page statistics
--Top 10 most popular pages
 
drop table web_log_app.app_hotpages_everyday;
create table web_log_app.app_hotpages_everyday(day string,url string,pvs string);
 
--Method 1
insert into table web_log_app.app_hotpages_everyday
select '2021-02-01',a.request,a.request_counts from
(select request as request,count(request) as request_counts 
from web_log_dw.dw_weblog_detail where dt='2021-02-01' group by request having request is not null
) a
order by a.request_counts desc limit 10;
 
--Method 2
insert into table web_log_app.app_hotpages_everyday
select * from
(
SELECT 
  '2021-02-01',a.request,a.request_counts,
  RANK() OVER( ORDER BY a.request_counts desc) AS rn 
  FROM 
  (
    select request as request,count(request) as request_counts 
    from web_log_dw.dw_weblog_detail where dt='2021-02-01' group by request having request is not null
  )a
)b
where b.rn <= 10
 ;

Composite indicator analysis - funnel model - conversion analysis

Conversion refers to a closed funnel in the website's business process that guides users step by step toward the final business goal (such as completing a transaction). Within this funnel we hope visitors keep moving forward, without turning back or leaving, until the conversion goal is reached. The funnel model is a visual description of how users who enter the funnel are gradually lost at each step along the way.


demand analysis

In a specified business process, find out the number of people who complete each step and the percentage relative to the previous step.


Define the page identifiers for each step of the business process. In the example below the steps are:
Step1、  /item
Step2、  /category
Step3、  /index
Step4、  /order
load data local inpath '/export/data/hivedatas/click-part-r-00000' overwrite into table web_log_ods.ods_click_pageviews2 partition(dt='2021-02-01');
 

--1. Query the total number of visitors at each step
--UNION ALL merges the result sets of multiple SELECT statements into a single result set
 
create table web_log_app.app_oute_numbs as 
select 'step1' as step,count(distinct remote_addr)  as numbs from web_log_ods.ods_click_pageviews where dt='2021-02-01' and request like '/item%'
union all
select 'step2' as step,count(distinct remote_addr)  as numbs from  web_log_ods.ods_click_pageviews where dt='2021-02-01' and request like '/category%'
union all
select 'step3' as step,count(distinct remote_addr)  as numbs from  web_log_ods.ods_click_pageviews where dt='2021-02-01' and request like '/order%'
union all
select 'step4' as step,count(distinct remote_addr)  as numbs from  web_log_ods.ods_click_pageviews where dt='2021-02-01' and request like '/index%';


Query result:

+---------------------+----------------------+--+
| dw_oute_numbs.step  | dw_oute_numbs.numbs  |
+---------------------+----------------------+--+
| step1               | 1029                 |
| step2               | 1029                 |
| step3               | 1028                 |
| step4               | 1018                 |
+---------------------+----------------------+--+

--2. Query the ratio of each step to the number of people at the starting point of the path


--Cascade query: join the table with itself
 
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from web_log_app.app_oute_numbs rn
inner join 
web_log_app.app_oute_numbs rr;

The result of the join is shown below:


+---------+----------+---------+----------+--+
| rnstep  | rnnumbs  | rrstep  | rrnumbs  |
+---------+----------+---------+----------+--+
| step1   | 1029     | step1   | 1029     |
| step2   | 1029     | step1   | 1029     |
| step3   | 1028     | step1   | 1029     |
| step4   | 1018     | step1   | 1029     |
| step1   | 1029     | step2   | 1029     |
| step2   | 1029     | step2   | 1029     |
| step3   | 1028     | step2   | 1029     |
| step4   | 1018     | step2   | 1029     |
| step1   | 1029     | step3   | 1028     |
| step2   | 1029     | step3   | 1028     |
| step3   | 1028     | step3   | 1028     |
| step4   | 1018     | step3   | 1028     |
| step1   | 1029     | step4   | 1018     |
| step2   | 1029     | step4   | 1018     |
| step3   | 1028     | step4   | 1018     |
| step4   | 1018     | step4   | 1018     |
+---------+----------+---------+----------+--+

--count at each step / count at step1 == each step's ratio to the starting point
select tmp.rnstep,tmp.rnnumbs/tmp.rrnumbs as abs_rate
from
(
  select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
  from web_log_app.app_oute_numbs rn
  inner join
  web_log_app.app_oute_numbs rr
) tmp
where tmp.rrstep='step1';
The intermediate join result filtered to rrstep='step1' looks like this:

+---------+----------+---------+----------+--+
| rnstep  | rnnumbs  | rrstep  | rrnumbs  |
+---------+----------+---------+----------+--+
| step1   | 1029     | step1   | 1029     |
| step2   | 1029     | step1   | 1029     |
| step3   | 1028     | step1   | 1029     |
| step4   | 1018     | step1   | 1029     |
+---------+----------+---------+----------+--+

--3. Query the leakage rate of each step relative to the previous step


--First, self-join the table and filter out each step paired with its previous step
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from web_log_app.app_oute_numbs rn
inner join 
web_log_app.app_oute_numbs rr
where cast(substr(rn.step,5,1) as int)=cast(substr(rr.step,5,1) as int)-1;
 
 
Note: cast is a Hive built-in function for type conversion
select cast(1 as float); --1.0  
select cast('2016-05-22' as date); --2016-05-22 
 
 
+---------+----------+---------+----------+--+
| rnstep  | rnnumbs  | rrstep  | rrnumbs  |
+---------+----------+---------+----------+--+
| step1   | 1029     | step2   | 1029     |
| step2   | 1029     | step3   | 1028     |
| step3   | 1028     | step4   | 1018     |
+---------+----------+---------+----------+--+

--Then calculating the leakage rate of each step relative to the previous step is straightforward


select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as abs_rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from web_log_app.app_oute_numbs rn
inner join 
web_log_app.app_oute_numbs rr) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1;
 


create result table

drop table if exists web_log_app.app_bounce_rate;
create table web_log_app.app_bounce_rate 
(
 step_num string,
 numbs bigint,
 abs_rate double,
 leakage_rate double
);

insert into table web_log_app.app_bounce_rate 
select abs.step,abs.numbs,abs.rate as abs_rate,rel.leakage_rate as leakage_rate
from 
(
select tmp.rnstep as step,tmp.rnnumbs as numbs,tmp.rnnumbs/tmp.rrnumbs * 100 as rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from web_log_app.app_oute_numbs rn
inner join 
web_log_app.app_oute_numbs rr) tmp
where tmp.rrstep='step1'
) abs
left outer join
(
select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs * 100 as leakage_rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from web_log_app.app_oute_numbs rn
inner join 
web_log_app.app_oute_numbs rr) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1
) rel
on abs.step=rel.step;

Create mysql result table


CREATE DATABASE web_log_result;
drop table if exists web_log_result.app_webflow_basic_info;
CREATE TABLE web_log_result.app_webflow_basic_info(MONTH VARCHAR(50),DAY VARCHAR(50),pv BIGINT,uv BIGINT,ip BIGINT,vv BIGINT);
 
 
 --Sqoop data export
/export/server/sqoop-1.4.7/bin/sqoop export \
--connect jdbc:mysql://192.168.88.100:3306/web_log_result \
--username root \
--password 123456 \
--table app_webflow_basic_info \
--input-fields-terminated-by '\001' \
--export-dir /user/hive/warehouse/web_log_app.db/app_webflow_basic_info/dt=2021-02-01
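Once the export finishes, the result can be sanity-checked on the MySQL side with a simple query:

SELECT * FROM web_log_result.app_webflow_basic_info;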



Source: https://blog.csdn.net/xianyu120/article/details/130625395