Website traffic indicator statistics

Website traffic statistics can generally be broken down along the following dimensions:

  • Page views per day
  • Unique visitors per day (counted by distinct users)
  • Independent sessions per day
  • Visits by visitor region
  • Visits by visitor IP address
  • Page analysis by source (referrer)
After collecting the above indicators, the overall state of the website can be analyzed by time period.

The statistical indicators for this project are summarized as follows:

  1. PV, page views. Every time a user opens a page it counts as one PV, and refreshes are counted as well. We count the total PV in a day.
  2. UV, the number of unique visitors, counted by distinct users: how many different users visit the site in a day. Approach: when a user visits the website, the backend generates a user id (uvid) for that user and saves it in a cookie in the user's browser, so the uvid is carried along on every subsequent visit. This indicator is therefore really a count of how many distinct uvids appear in a day.
  3. VV, the number of independent sessions. We count the number of distinct sessions in a day. A new session is generated when: 1. the browser is closed and reopened; or 2. the session times out (usually after half an hour). Approach: when a new session is generated, the backend creates a session id (ssid) for it and stores it in a cookie, so counting VV is really counting how many distinct ssids appear in a day.
  4. BR, page bounce rate (bounced sessions / total sessions (VV)). A bounced session is a session that produces only one page view. BR is therefore a measure of how well the website holds visitors: the higher this indicator, the less attractive the site is to users, and the more it needs improvement.
  5. NewCust, the number of new users. A user who appears today but never appeared in the historical data counts as a new user, i.e. we count the uvids seen today that do not appear in the historical data.
  6. NewIp, the number of new IPs. We count the IP addresses seen today that do not appear in the historical data.
  7. AvgDeep, average session depth. AvgDeep = total session access depth / total number of sessions (VV), where the total session access depth is the sum of the depths of all sessions, and the depth of one session is the number of distinct URL addresses it visits.
  8. AvgTime, average session duration. AvgTime = total session duration / total number of sessions (VV), where the total session duration is the sum of the durations of all sessions.
    We can record the timestamp at which each page is opened and compute each session's total duration from those. In production, the computed value is lower than the true value, because the dwell time on the last page of a session cannot be observed.
    The duration of one session is computed as max(timestamp) - min(timestamp); a sketch of this computation in HiveQL follows below.
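
As an illustration, this per-session duration maps directly to HiveQL. The sketch below assumes the dataclear table built later in this article (one row per page view, carrying the session id ssid and the page-open timestamp sstime); the date is just an example value:

-- Duration of one session = max timestamp - min timestamp within that session;
-- AvgTime = the average of those durations over all sessions of the day.
select round(avg(atTab.usetime), 4) as avgtime
from (select max(sstime) - min(sstime) as usetime
      from dataclear
      where reportTime = '2019-09-19'
      group by ssid) as atTab;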

Event tracking and data collection


Project structure

Offline batch processing can use MapReduce or Hive for data ETL (Extract, Transform, Load).
The result files are exported to a database, from which the front end reads the data for visualization.
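
The article does not show the export step itself. A common approach, sketched here as an assumption rather than what the original project necessarily used, is a Sqoop export of the Hive result table into a relational database such as MySQL:

# Hypothetical host, database, and credentials; adjust to the actual environment.
# The field separator matches the '|' delimiter used by the Hive tables below.
sqoop export \
  --connect jdbc:mysql://hadoop01:3306/weblog \
  --username root --password root \
  --table tongji \
  --export-dir /user/hive/warehouse/tongji \
  --input-fields-terminated-by '|'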

Flume configuration


a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 44444
a1.sources.s1.interceptors=i1
a1.sources.s1.interceptors.i1.type=timestamp

a1.channels.c1.type = memory
# Channel capacity: at most 1000 events can be staged in the channel.
# In production, 50,000-100,000 is recommended.
a1.channels.c1.capacity = 1000
# Transaction (batch) size; 1000 or more is recommended in production.
# capacity and transactionCapacity together determine Flume's throughput.
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=hdfs://hadoop01:9000/weblog/reporttime=%Y-%m-%d
# Roll a new file on a time interval, in seconds. In production, 1 hour or
# more is recommended, to avoid generating large numbers of small files.
a1.sinks.k1.hdfs.rollInterval=30
# Roll a new file once it reaches this size. The default is 1 KB; 0 disables size-based rolling.
a1.sinks.k1.hdfs.rollSize=0
# Roll after this many events have been written. The default is 10; 0 disables count-based rolling.
a1.sinks.k1.hdfs.rollCount=0
# Format of the files Flume writes to HDFS: the default (SequenceFile) is binary; DataStream writes plain text.
a1.sinks.k1.hdfs.fileType=DataStream


a1.sources.s1.channels = c1
a1.sinks.k1.channel =c1

Start Flume:
../bin/flume-ng agent -n a1 -c ../conf/ -f weblog.conf -Dflume.root.logger=INFO,console
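
To check that events are actually landing in HDFS after startup, listing the target directory is enough (assuming the Hadoop client is on the PATH; the date directory is an example):

hdfs dfs -ls /weblog/reporttime=2019-09-19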

Hive for offline data processing

  1. Create a total table (an external, partitioned table) that loads and manages all raw fields, such as url, urlname, col, etc.
  2. Add partition information to the total table (a sketch of this step follows the table-creation statement below).
  3. Create a cleaned table (an internal table) that keeps only the business fields actually needed.
  4. Insert the cleaned fields from the total table into the cleaned table.
  5. Create a business table to store the computed indicators: this project's pv, uv, vv, and so on.
  • Total table creation statement:

create external table flux (url string,urlname string,title string,chset string,scr string,col string,lg string,je string,ec string,fv string,cn string,ref string,uagent string,stat_uv string,stat_ss string,cip string) PARTITIONED BY (reportTime string) row format delimited fields terminated by '|' location '/weblog';
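
Step 2, adding partition information, is not spelled out in the original. A minimal sketch, assuming the /weblog/reporttime=%Y-%m-%d directory layout produced by the Flume sink configured above:

-- Register one day's Flume output directory as a partition of the total table.
alter table flux add partition (reportTime='2019-09-19') location '/weblog/reporttime=2019-09-19';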

  • Data cleaning table creation statement:

create table dataclear(reporttime string,url string,urlname string,uvid string,ssid string,sscount string,sstime string,cip string) row format delimited fields terminated by '|';

Insert the cleaned data (stat_ss packs the ssid, sscount, and sstime values separated by underscores, hence the split calls):

insert overwrite table dataclear
select reporttime,url,urlname,stat_uv,split(stat_ss,"_")[0],split(stat_ss,"_")[1],split(stat_ss,"_")[2],cip from flux;

  • Business table creation statement:

create table tongji(reportTime string,pv int,uv int,vv int, br double,newip int, newcust int, avgtime double,avgdeep double) row format delimited fields terminated by '|';

  • Insert statement for the final statistics table:

insert overwrite table tongji
select '2019-09-19', tab1.pv, tab2.uv, tab3.vv, tab4.br, tab5.newip, tab6.newcust, tab7.avgtime, tab8.avgdeep
from
  (select count(*) as pv from dataclear where reportTime = '2019-09-19') as tab1,
  (select count(distinct uvid) as uv from dataclear where reportTime = '2019-09-19') as tab2,
  (select count(distinct ssid) as vv from dataclear where reportTime = '2019-09-19') as tab3,
  (select round(br_taba.a/br_tabb.b,4) as br from
    (select count(*) as a from
      (select ssid from dataclear where reportTime = '2019-09-19' group by ssid having count(ssid) = 1) as br_tab) as br_taba,
    (select count(distinct ssid) as b from dataclear where reportTime = '2019-09-19') as br_tabb) as tab4,
  (select count(distinct dataclear.cip) as newip from dataclear
    where dataclear.reportTime = '2019-09-19'
      and cip not in (select dc2.cip from dataclear as dc2 where dc2.reportTime < '2019-09-19')) as tab5,
  (select count(distinct dataclear.uvid) as newcust from dataclear
    where dataclear.reportTime = '2019-09-19'
      and uvid not in (select dc2.uvid from dataclear as dc2 where dc2.reportTime < '2019-09-19')) as tab6,
  (select round(avg(atTab.usetime),4) as avgtime from
    (select max(sstime) - min(sstime) as usetime from dataclear where reportTime = '2019-09-19' group by ssid) as atTab) as tab7,
  (select round(avg(deep),4) as avgdeep from
    (select count(distinct urlname) as deep from dataclear where reportTime = '2019-09-19' group by ssid) as adTab) as tab8;
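
The date literal is hard-coded into every subquery. One way to avoid editing the script each day, sketched here using Hive's variable substitution (not shown in the original article), is to pass the date in with --hivevar and reference it as '${hivevar:day}':

-- Run as: hive --hivevar day=2019-09-19 -f tongji.sql
-- Every '2019-09-19' literal in the insert above becomes '${hivevar:day}', e.g.:
select count(*) as pv from dataclear where reportTime = '${hivevar:day}';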

Source: blog.csdn.net/yasuofenglei/article/details/101012038