Implementing website user behavior analysis metrics with Hive



Field Explanation
accessDate   //Access time, accurate to the day, String format
accessTime   //Access time, accurate to milliseconds, int format
accessHour   //Access hour, in the range 0-23, int format
requestMethod   //Request method (GET or POST; not used in the statistics), String format
requestProtocal   //Request protocol (http or https; not used in the statistics), String format
requestUrl   //Request URL, eg: http://it18zhang.com/news/view?news_id=26, String format
requestIp   //Request IP address, eg: 192.168.1.1, String format
returnStatus   //Return status (I am not sure what this is; it is not used in the statistics), String format
referUrl   //URL of the previous hop, eg: http://www.baidu.com/s?wd=ggg&xxxx, String format
referDomain   //Domain name of the previous hop, eg: baidu.com, String format
userOrigin   //User entry address, eg: http://www.baidu.com, String format (I am not sure about this one, but judging by originWord it should be a search engine)
originWord   //Entry keyword, eg: it 18zhang it18zhang, String format
browser   //Browser, eg: Firefox, Chrome, String format
browserVersion   //Browser version, eg: 51.0, 50.1, String format
operateSystem   //Operating system, eg: Windows10, macOS, Ubuntu, String format
ipNumber   //IP number (I am not sure what this is; it is not used in the statistics), int format
userProvince   //User province, String format
screenSize   //Screen size, eg: 1366x768, String format
screenColor   //Screen color, eg: red, blue, green, String format
pageTitle   //Page title, eg: Python, BigData, String format
siteType   //Site type (I am not sure about this one; the default is 0), String format
userFlag   //User flag, extracted from the cookie, equivalent to a user ID, String format
visitFlag   //Visit flag, extracted from the cookie, equivalent to a session ID, String format
sFlag   //(Not sure what it does; it takes the values 1 and 0 and is not used in the statistics), String format
timeOnPage   //Time spent on the page, accurate to milliseconds, int format

Create the table, using access_day (entry time) as the partition column:
Create external table users(
accessDate string,
accessTime int,
accessHour int,
requestMethod string,
referUrl string,
requestProtocal string,
returnStatus string,
requestUrl string,
referDomain string,
userOrigin   string,
originWord   string,
browser   string,
browserVersion   string,
operateSystem   string,
requestIp   string,
ipNumber   int,
userProvince   string,
screenSize   string,
screenColor   string,
pageTitle   string,
siteType   string,
userFlag   string,
visitFlag   string,
sFlag   string,
timeOnPage int)
partitioned by (access_day string)
row format delimited
fields terminated by ' '

stored as textfile;




load data local inpath '/home/hadoop/data/20160101.txt'  overwrite into table users partition (access_day='20160101');
load data local inpath '/home/hadoop/data/20160601.txt'  overwrite into table users partition (access_day='20160601');
load data local inpath '/home/hadoop/data/20170601.txt'  overwrite into table users partition (access_day='20170601');
load data local inpath '/home/hadoop/data/20170101.txt'  overwrite into table users partition (access_day='20170101');



Statistical methods
PV statistics
1. PV statistics by day
select accessdate, count(1) from users where access_day='20160601' group by accessdate;
2. PV statistics by hour
select accesshour, count(1) from users where access_day='20160601' group by accesshour;
3. PV per province per day
select accessdate, userprovince, count(1) from users where access_day='20160601' group by accessdate, userprovince;
4. PV per province per hour
select accessdate, userprovince, accesshour, count(1) from users where access_day='20160601' group by accessdate, userprovince, accesshour;
5. Number of visits to each page per day
select accessdate, requesturl, count(1) from users where access_day='20160601' group by accessdate, requesturl;
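The PV queries above are all group-by counts over different sets of columns. As a quick way to see the expected shape of the results without a Hive cluster, here is a minimal sketch that runs the same kind of query against an in-memory SQLite table from Python. SQLite standing in for Hive, the reduced four-column table, the toy rows, and the helper name pv_by are all assumptions made for illustration.

```python
import sqlite3

# Toy rows: (accessDate, accessHour, userProvince, requestUrl). All values are made up.
ROWS = [
    ("20160601", 9, "Beijing", "/news/view?news_id=26"),
    ("20160601", 9, "Beijing", "/news/view?news_id=27"),
    ("20160601", 10, "Hebei", "/news/view?news_id=26"),
    ("20160601", 10, "Beijing", "/index"),
]


def pv_by(columns):
    """Return PV counts grouped by the given column names (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table users "
        "(accessDate text, accessHour int, userProvince text, requestUrl text)"
    )
    conn.executemany("insert into users values (?, ?, ?, ?)", ROWS)
    cols = ", ".join(columns)
    # Same pattern as the Hive queries: select <cols>, count(1) ... group by <cols>
    sql = f"select {cols}, count(1) from users group by {cols} order by {cols}"
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


pv_per_hour = pv_by(["accessHour"])
pv_per_province = pv_by(["accessDate", "userProvince"])
print(pv_per_hour)
print(pv_per_province)
```

Each result row carries the grouping columns followed by the PV count, which is the same layout the Hive queries return.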
UV Statistics
1. Count the number of unique visitors (UV) per day
select access_day, count(distinct requestIp) from users group by access_day;
2. Count the average number of pages visited per visitor on the day (pages/person = PV/UV)
select count(1) pvSta, count(distinct requestIp) uvSta from users where access_day='20160601';
select count(1)/count(distinct requestIp) from users where access_day='20160601';
3. Count the number of visitors to each page and its earliest and latest access times
select requestUrl, count(distinct requestIp) visitCount, min(accessTime) firstAccessTime, max(accessTime) recentAccessTime from users where access_day='20160601' group by requestUrl order by visitCount desc;
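To make the PV/UV ratio and the per-page visitor counts concrete, the following sketch reproduces the UV queries on a handful of made-up rows. It again assumes SQLite as a stand-in for Hive and a reduced three-column table; neither comes from the original setup.

```python
import sqlite3

# Toy rows: (requestIp, requestUrl, accessTime in ms). All values are made up.
ROWS = [
    ("192.168.1.1", "/a", 1000),
    ("192.168.1.1", "/b", 5000),
    ("192.168.1.2", "/a", 3000),
    ("192.168.1.3", "/a", 9000),
    ("192.168.1.3", "/b", 15000),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table users (requestIp text, requestUrl text, accessTime int)")
conn.executemany("insert into users values (?, ?, ?)", ROWS)

# PV, UV and pages per visitor. SQLite divides two integers as integers,
# hence the "* 1.0"; Hive's "/" operator already returns a double.
pv, uv, pages_per_visitor = conn.execute(
    "select count(1), count(distinct requestIp), "
    "count(1) * 1.0 / count(distinct requestIp) from users"
).fetchone()

# Visitors per page with the earliest and latest access time.
per_page = conn.execute(
    "select requestUrl, count(distinct requestIp) visitCount, "
    "min(accessTime), max(accessTime) "
    "from users group by requestUrl order by visitCount desc"
).fetchall()
conn.close()

print(pv, uv, pages_per_visitor)
print(per_page)
```

Note that UV here is counted by requestIp, as in the queries above, so several users behind one IP address count as a single visitor.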
Website stay time statistics
Average website stay time = total website stay time / number of sessions (visits)
//Note: accessTime is stored as an int. To keep the explanation simple, the two time values are subtracted directly. A real deployment would need a user-defined function (UDF) to convert the int to a time value and operate on it.
1. Daily website stay time of each visitor = last access time - first access time
//This view groups by requestIp rather than by session (visitFlag), so it measures the time span per IP address per day, in seconds.
create view pageTime (id, visitKeepTime) as select requestIp, ceil((max(accessTime) - min(accessTime))/1000) visitKeepTime from users where access_day='20170601' group by requestIp;
select * from pageTime;
2. Using the view from step 1, compute the average stay time of users on the website
select avg(visitKeepTime) from pageTime;
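The stay-time calculation above can be traced on a few rows. The sketch below is an illustration under assumptions: SQLite replaces Hive, the rows are made up, and the rounding is done with Python's math.ceil because SQLite builds do not always include a ceil function.

```python
import math
import sqlite3

# Toy rows: (requestIp, accessTime in ms). All values are made up.
ROWS = [
    ("192.168.1.1", 1000),
    ("192.168.1.1", 61500),
    ("192.168.1.2", 2000),
    ("192.168.1.3", 10000),
    ("192.168.1.3", 12000),
    ("192.168.1.3", 40000),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table users (requestIp text, accessTime int)")
conn.executemany("insert into users values (?, ?)", ROWS)

# Last access time minus first access time per visitor, in milliseconds.
spans = conn.execute(
    "select requestIp, max(accessTime) - min(accessTime) "
    "from users group by requestIp order by requestIp"
).fetchall()
conn.close()

# Round each span up to whole seconds, as ceil(.../1000) does in the view.
stay_seconds = {ip: math.ceil(ms / 1000) for ip, ms in spans}
average_stay = sum(stay_seconds.values()) / len(stay_seconds)

print(stay_seconds)
print(average_stay)
```

A visitor with a single page view has the same first and last access time, so they contribute a stay time of 0 and pull the average down. Whether such visitors should be excluded is a business decision the queries above leave open.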
Client device statistics
1. Browser share and traffic. Count how many visitors use each version of each browser
select browser, browserVersion, count(distinct requestIp) staCount from users where siteType='0' and access_day='20160601' group by browser, browserVersion order by browser, browserVersion;
2. Operating system statistics. Count how many users use each operating system
select operateSystem, count(distinct userflag) from users where siteType='0' and access_day='20160601' group by operateSystem order by operateSystem;
3. Screen color statistics. Count how many users use each screen color
select screenColor, count(distinct userflag) from users where siteType='0' and access_day='20160601' group by screenColor order by screenColor;
4. Screen size statistics. Count how many users use each screen size
select screenSize, count(distinct userflag) from users where siteType='0' and access_day='20160601' group by screenSize order by screenSize;
Source statistics
1. Source keyword statistics. Count the number of times each keyword is used
select originWord, count(1) staCount from users where siteType='0' and access_day='20160601' and originWord!='-' group by originWord order by staCount desc;
2. Popular entry addresses. Count the number of visits from each user origin
select userOrigin, count(1) staCount from users where siteType='0' and access_day='20160601' group by userOrigin order by staCount desc;
3. Popular pages. Count the number of visits per page title
select pageTitle, count(1) staCount from users where siteType='0' and access_day='20160601' group by pageTitle order by staCount desc;
User region distribution
1. Count the number of visits (PV) from each province
select userProvince, count(1) from users where siteType='0' and access_day='20160601' group by userProvince;
2. Count the number of visitors from each province
select userProvince, count(distinct requestIp) from users where siteType='0' and access_day='20160601' group by userProvince;




User visit and retention statistics
1. Return visitors on the day: the number of unique visitors who visit the website multiple times in one day (producing multiple sessions).
select count(distinct requestIp) from (select requestIp, count(distinct visitFlag) visitNum from users where siteType='0' and access_day='20160601' group by requestIp) a where a.visitNum > 1;

2. Average visit frequency: the average number of times each unique visitor visits the website in one day (the number of sessions generated). Average visit frequency = number of visits / number of unique visitors.
select count(distinct visitFlag)/count(distinct requestIp) from users where siteType='0' and access_day='20160601';
3. Average visit duration: the average time spent on the website per visit (session). Average visit duration = total visit duration / number of visits. It reflects the attractiveness of the website to visitors.
select sum(visitKeepTime)/count(1) from (select visitFlag, ceil((max(accessTime) - min(accessTime))/1000) visitKeepTime from users where access_day='20170601' group by visitFlag) a;
4. Average visit depth
Average PV generated per visit (session). Average visit depth = page views / visits. It reflects the attractiveness of the website to visitors.
select a.pv/b.visitNum from (select count(1) pv from users where access_day='20160601') a, (select count(distinct visitFlag) visitNum from users where access_day='20160601') b;

5. Average page views per unique visitor (PV/UV)
select a.pv/b.userNum from (select count(1) pv from users where access_day='20160601') a, (select count(distinct requestIp) userNum from users where access_day='20160601') b;
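The visit-based metrics above differ only in what they divide by: visits (sessions) or unique visitors. The sketch below computes them side by side on made-up rows, treating visitFlag as the session ID (as the field explanation describes it) and requestIp as the visitor. SQLite in place of Hive and the two-column table are assumptions for illustration.

```python
import sqlite3

# Toy rows: (requestIp, visitFlag). One row per page view; all values are made up.
ROWS = [
    ("192.168.1.1", "s1"),
    ("192.168.1.1", "s1"),
    ("192.168.1.1", "s2"),
    ("192.168.1.2", "s3"),
    ("192.168.1.2", "s3"),
    ("192.168.1.2", "s3"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table users (requestIp text, visitFlag text)")
conn.executemany("insert into users values (?, ?)", ROWS)

# Page views, visits (distinct sessions) and unique visitors (distinct IPs).
pv, visits, visitors = conn.execute(
    "select count(1), count(distinct visitFlag), count(distinct requestIp) from users"
).fetchone()

# Return visitors: IPs that produced more than one session.
return_visitors = conn.execute(
    "select count(1) from (select requestIp from users "
    "group by requestIp having count(distinct visitFlag) > 1) a"
).fetchone()[0]
conn.close()

visit_frequency = visits / visitors  # visits per unique visitor
visit_depth = pv / visits            # page views per visit
pv_per_visitor = pv / visitors       # page views per unique visitor

print(return_visitors, visit_frequency, visit_depth, pv_per_visitor)
```

With these rows, one visitor opens two sessions and the other opens one, so the three ratios come out different even though they are built from the same three counts.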
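6. New unique visitors
select count(distinct requestIp) from (select requestIp, min(cast(accessDate as int)) edate from users group by requestIp) a where a.edate = 20160601;
Alternatively, if sFlag='1' marks a new visitor (an assumption; the meaning of sFlag is not confirmed in the field explanation):
select count(distinct requestIp) from users where sFlag='1';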
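The first new-visitor query works by finding each IP's earliest access date across all partitions and keeping the IPs whose earliest date is the day in question. A minimal sketch of that logic, assuming SQLite in place of Hive, accessDate values in yyyyMMdd form, and a hypothetical helper named new_visitors:

```python
import sqlite3

# Toy rows: (requestIp, accessDate as a yyyyMMdd string). All values are made up.
ROWS = [
    ("192.168.1.1", "20160101"),
    ("192.168.1.1", "20160601"),
    ("192.168.1.2", "20160601"),
    ("192.168.1.3", "20160601"),
    ("192.168.1.3", "20170101"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table users (requestIp text, accessDate text)")
conn.executemany("insert into users values (?, ?)", ROWS)


def new_visitors(day):
    """Count IPs whose earliest access date equals `day` (hypothetical helper)."""
    row = conn.execute(
        "select count(1) from ("
        "select requestIp, min(cast(accessDate as integer)) edate "
        "from users group by requestIp) a where a.edate = ?",
        (day,),
    ).fetchone()
    return row[0]


new_on_20160101 = new_visitors(20160101)
new_on_20160601 = new_visitors(20160601)
new_on_20170101 = new_visitors(20170101)
print(new_on_20160101, new_on_20160601, new_on_20170101)
```

The inner query must scan every day of data, which is why the Hive version has no access_day filter. An IP seen on 20160101 is not counted as new on 20160601 even though it also visited that day.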


7. Referrer domains
select referDomain, count(*) from users group by referDomain;






