Web Log Traffic Analysis System: Offline Analysis with Automated Scripts

I. Overview

  The offline-analysis pipeline of the web log traffic analysis system already has a data-cleaning process, but the partition value (the date reportTime) is hard-coded, and the HQL statements still have to be run by hand. No real project would tolerate that, so making the program execute those HQL statements automatically is the problem we need to solve for offline analysis.

II. Automated Scripts

(1) Write a script, logdemo.hql:

use logdb;

alter table logdemo add partition (reportTime='${tday}') location '/logdemo/reportTime=${tday}';

insert into dataclear partition(reportTime='${tday}') select split(url,'-')[2],urlname,ref,uagent,stat_uv,split(stat_ss,'_')[0],split(stat_ss,'_')[1],split(stat_ss,'_')[2],cip from logdemo where reportTime = '${tday}';

create table if not exists tongji1_temp(reportTime string,k string,v string);

insert into tongji1_temp select '${tday}','pv',pv_tab.pv from (select count(*) as pv from dataclear where reportTime='${tday}') as pv_tab;

insert into tongji1_temp select '${tday}','uv',uv_tab.uv from (select count(distinct uvid) as uv from dataclear where reportTime='${tday}') as uv_tab;

insert into tongji1_temp select '${tday}','vv',vv_tab.vv from (select count(distinct ssid) as vv from dataclear where reportTime='${tday}') as vv_tab;

insert into tongji1_temp select '${tday}','br',br_tab.br from (select round(br_left_tab.br_count / br_right_tab.ss_count,4) as br from (select count(*) as br_count from (select ssid from dataclear where reportTime='${tday}' group by ssid having count(*) = 1) as br_intab) br_left_tab,(select count(distinct ssid) as ss_count from dataclear where reportTime='${tday}') as br_right_tab) as br_tab;

insert into tongji1_temp select '${tday}','newip',newip_tab.newip from (select count(distinct out_dc.cip) as newip from dataclear as out_dc where out_dc.reportTime='${tday}' and out_dc.cip not in (select in_dc.cip from dataclear as in_dc where datediff('${tday}',in_dc.reportTime)>0)) as newip_tab;

insert into tongji1_temp select '${tday}','newcust',newcust_tab.newcust from (select count(distinct out_dc.uvid) as newcust from dataclear as out_dc where out_dc.reportTime='${tday}' and out_dc.uvid not in (select in_dc.uvid from dataclear as in_dc where datediff('${tday}',in_dc.reportTime)>0)) as newcust_tab;

insert into tongji1_temp select '${tday}','avgtime',avgtime_tab.avgtime from (select avg(at_tab.usetime) as avgtime from (select max(sstime) - min(sstime) as usetime from dataclear where reportTime='${tday}' group by ssid) as at_tab) as avgtime_tab;

insert into tongji1_temp select '${tday}','avgdeep',avgdeep_tab.avgdeep from (select avg(ad_tab.deep) as avgdeep from (select count(distinct url) as deep from dataclear where reportTime='${tday}' group by ssid) as ad_tab) as avgdeep_tab;

insert into tongji1 select  '${tday}', pv_tab.pv, uv_tab.uv, vv_tab.vv, newip_tab.newip, newcust_tab.newcust, avgtime_tab.avgtime, avgdeep_tab.avgdeep from (select v as pv from tongji1_temp where reportTime='${tday}' and k='pv') as pv_tab, (select v as uv from tongji1_temp where reportTime='${tday}' and k='uv') as uv_tab, (select v as vv from tongji1_temp where reportTime='${tday}' and k='vv') as vv_tab, (select v as newip from tongji1_temp where reportTime='${tday}' and k='newip') as newip_tab, (select v as newcust from tongji1_temp where reportTime='${tday}' and k='newcust') as newcust_tab, (select v as avgtime from tongji1_temp where reportTime='${tday}' and k='avgtime') as avgtime_tab, (select v as avgdeep from tongji1_temp where reportTime='${tday}' and k='avgdeep') as avgdeep_tab; 

  Note: do not include Chinese comments in the script.

 (2) Go to Hive's bin directory and run the script. -d defines a variable, -f specifies the file to execute (you may have noticed the date is still typed by hand here; don't worry, that is fixed next):

[root@hadoopalone bin]# ./hive -d tday=2019-09-07 -f /home/software/logdemo.hql
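Before wiring this into cron, the hand-typed date can be removed with a small wrapper script. This is only a sketch: the Hive and HQL paths are taken from the command above, and the actual hive call is left commented out so the wrapper is safe to dry-run on any machine.

```shell
#!/bin/bash
# Compute today's date in the same format the HQL partition expects,
# so the value of ${tday} no longer has to be typed by hand.
TDAY=$(date "+%Y-%m-%d")
echo "running offline analysis for reportTime=${TDAY}"
# Paths assumed from the example above; uncomment to actually run:
# /home/software/hive/bin/hive -d tday="${TDAY}" -f /home/software/logdemo.hql
```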

(3) Set up a Linux scheduled task that calls the script every day shortly after midnight: ./hive -d tday=`date "+%Y-%m-%d"` -f /home/software/logdemo.hql

  1. Scheduled tasks on Linux are configured in the file /etc/crontab. Each entry consists of five time fields followed by the user and the command:

minute  hour  day-of-month  month  day-of-week  user  command

 The time fields accept the following wildcards:

  ① Asterisk (*): matches every value; for example, an asterisk in the month field means the command runs every month.

  ② Comma (,): separates a list of discrete values, e.g. "1,3,5,7,9".

  ③ Dash (-): specifies a range, e.g. "2-6" means "2,3,4,5,6".

  ④ Forward slash (/): specifies a step within a range, e.g. "0-23/2" means every two hours. The slash can also be combined with an asterisk: "*/10" in the minute field means every ten minutes.

 Example:

## run the test.sh script at 3:30 and 12:30 every day
30 3,12 * * * /home/test.sh

## run the test.sh script at minute 30 of every 6th hour
30 */6 * * * /home/test.sh

## run the test.sh script at minute 30 of every 2nd hour between 8:00 and 18:00
30 8-18/2 * * * /home/test.sh

## run the test.sh script at 21:30 every day
30 21 * * * /home/test.sh

## run the test.sh script at 4:45 on the 1st, 10th and 22nd of every month
45 4 1,10,22 * * /home/test.sh

## run the test.sh script at 1:10 every Saturday and Sunday in August
10 1 * 8 6,0 /home/test.sh

## run the test.sh script on the hour, every hour
00 */1 * * * /home/test.sh
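The range-plus-step syntax from ③ and ④ above can be sanity-checked from the shell: seq can enumerate the values a field like "8-18/2" matches (start 8, step 2, end 18).

```shell
# Enumerate the hours matched by the cron field "8-18/2":
# seq FIRST INCREMENT LAST walks 8, 10, ..., 18.
seq 8 2 18 | tr '\n' ' '
# prints: 8 10 12 14 16 18
```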

  2. Managing the crond service

service crond start    // start the service
service crond stop     // stop the service
service crond restart  // restart the service
service crond reload   // reload the configuration

  3. The project configuration

   

 

  4. Start crond so the configured task takes effect: service crond start
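Putting steps (3) and (4) together, the /etc/crontab entry for this project might look like the following sketch. The Hive install path and the 00:30 run time are assumptions, not from the original (which only says "shortly after midnight").

```shell
# Assumed /etc/crontab entry: every day at 00:30, as root, run the
# offline-analysis HQL with the current date substituted into ${tday}.
30 0 * * * root cd /home/software/hive/bin && ./hive -d tday=`date "+%Y-%m-%d"` -f /home/software/logdemo.hql
```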

III. Summary

  This completes the automated execution of the HQL script. The next step is to visualize the results of the data-cleaning process: the visual display of the web log traffic analysis system.

Origin www.cnblogs.com/rmxd/p/11482247.html