I. Overview
In the previous stage of the web-log traffic analysis system we built the offline data-cleaning process, but the partition date (reportTime) was hard-coded and every HQL statement had to be executed by hand. No real development project would tolerate that, so the problem we must solve next for the offline analysis is making the program execute those HQL statements automatically.
II. Automated scripts
(1) Write the script logdemo.hql:
use logdb;

alter table logdemo add partition (reportTime='${tday}') location '/logdemo/reportTime=${tday}';

insert into dataclear partition(reportTime='${tday}')
select split(url,'-')[2],urlname,ref,uagent,stat_uv,split(stat_ss,'_')[0],split(stat_ss,'_')[1],split(stat_ss,'_')[2],cip
from logdemo where reportTime='${tday}';

create table if not exists tongji1_temp(reportTime string, k string, v string);

insert into tongji1_temp select '${tday}','pv',pv_tab.pv
from (select count(*) as pv from dataclear where reportTime='${tday}') as pv_tab;

insert into tongji1_temp select '${tday}','uv',uv_tab.uv
from (select count(distinct uvid) as uv from dataclear where reportTime='${tday}') as uv_tab;

insert into tongji1_temp select '${tday}','vv',vv_tab.vv
from (select count(distinct ssid) as vv from dataclear where reportTime='${tday}') as vv_tab;

insert into tongji1_temp select '${tday}','br',br_tab.br
from (select round(br_left_tab.br_count / br_right_tab.ss_count, 4) as br
      from (select count(*) as br_count
            from (select ssid from dataclear where reportTime='${tday}' group by ssid having count(*) = 1) as br_intab) as br_left_tab,
           (select count(distinct ssid) as ss_count from dataclear where reportTime='${tday}') as br_right_tab) as br_tab;

insert into tongji1_temp select '${tday}','newip',newip_tab.newip
from (select count(distinct out_dc.cip) as newip from dataclear as out_dc
      where out_dc.reportTime='${tday}'
        and out_dc.cip not in (select in_dc.cip from dataclear as in_dc where datediff('${tday}',in_dc.reportTime) > 0)) as newip_tab;

insert into tongji1_temp select '${tday}','newcust',newcust_tab.newcust
from (select count(distinct out_dc.uvid) as newcust from dataclear as out_dc
      where out_dc.reportTime='${tday}'
        and out_dc.uvid not in (select in_dc.uvid from dataclear as in_dc where datediff('${tday}',in_dc.reportTime) > 0)) as newcust_tab;

insert into tongji1_temp select '${tday}','avgtime',avgtime_tab.avgtime
from (select avg(at_tab.usetime) as avgtime
      from (select max(sstime) - min(sstime) as usetime from dataclear where reportTime='${tday}' group by ssid) as at_tab) as avgtime_tab;

insert into tongji1_temp select '${tday}','avgdeep',avgdeep_tab.avgdeep
from (select avg(ad_tab.deep) as avgdeep
      from (select count(distinct url) as deep from dataclear where reportTime='${tday}' group by ssid) as ad_tab) as avgdeep_tab;

insert into tongji1 select '${tday}', pv_tab.pv, uv_tab.uv, vv_tab.vv, newip_tab.newip, newcust_tab.newcust, avgtime_tab.avgtime, avgdeep_tab.avgdeep
from (select v as pv from tongji1_temp where reportTime='${tday}' and k='pv') as pv_tab,
     (select v as uv from tongji1_temp where reportTime='${tday}' and k='uv') as uv_tab,
     (select v as vv from tongji1_temp where reportTime='${tday}' and k='vv') as vv_tab,
     (select v as newip from tongji1_temp where reportTime='${tday}' and k='newip') as newip_tab,
     (select v as newcust from tongji1_temp where reportTime='${tday}' and k='newcust') as newcust_tab,
     (select v as avgtime from tongji1_temp where reportTime='${tday}' and k='avgtime') as avgtime_tab,
     (select v as avgdeep from tongji1_temp where reportTime='${tday}' and k='avgdeep') as avgdeep_tab;

(Two fixes relative to the hand-run version: the create table statement is moved before the first insert into tongji1_temp and given "if not exists" so the script can run daily, and the bounce-rate inner query now filters on reportTime='${tday}' so it only counts the current day's sessions.)
Note: do not include Chinese comments in the script.
(2) Go to Hive's bin directory and run the script: -d defines a variable, -f specifies the script file. (You may have noticed that the date is still typed by hand here; don't worry, we will automate that next.)
[root@hadoopalone bin]# ./hive -d tday=2019-09-07 -f /home/software/logdemo.hql
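If several historical days also need to be processed, the same invocation can be replayed over a date range with a small shell loop. This is only a sketch: the hive call is commented out, and the path /home/software/hive/bin/hive is an assumption to be adjusted to your installation.

```shell
#!/bin/bash
# Hypothetical backfill loop: run logdemo.hql once per day over a date range.
start="2019-09-01"
end="2019-09-07"
d="$start"
while [[ ! "$d" > "$end" ]]; do    # ISO dates compare correctly as strings
  echo "processing $d"
  # /home/software/hive/bin/hive -d tday="$d" -f /home/software/logdemo.hql
  d=$(date -d "$d + 1 day" "+%Y-%m-%d")   # GNU date: add one day
done
```

Because the dates use the YYYY-MM-DD format, plain string comparison orders them chronologically, so no extra date arithmetic is needed for the loop condition.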
(3) Set up a Linux scheduled task that calls the script shortly after midnight every day: ./hive -d tday=$(date "+%Y-%m-%d") -f /home/software/logdemo.hql
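The one-liner above can also be wrapped in a small shell script that cron calls, which keeps the crontab entry simple. This is a sketch: the script name run_logdemo.sh and the hive path are assumptions, and the actual hive call is commented out.

```shell
#!/bin/bash
# Hypothetical wrapper (run_logdemo.sh): compute today's date, then hand it to Hive.
TDAY=$(date "+%Y-%m-%d")               # e.g. 2019-09-07
echo "running logdemo.hql for ${TDAY}"
# The hive path below is an assumption -- adjust it to your installation:
# /home/software/hive/bin/hive -d tday="${TDAY}" -f /home/software/logdemo.hql
```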
1. Linux scheduled tasks can be configured in the file /etc/crontab. Each entry consists of five time fields (minute, hour, day of month, month, day of week) followed by the command to run (in /etc/crontab a user field precedes the command).
The time fields accept the following wildcards:
① Asterisk (*): matches every value; for example, an asterisk in the month field means the command runs every month.
② Comma (,): separates a list of values, e.g. "1,3,5,7,9".
③ Dash (-): specifies a range, e.g. "2-6" means "2,3,4,5,6".
④ Forward slash (/): specifies a step interval, e.g. "0-23/2" means once every two hours. It can also be combined with an asterisk: "*/10" in the minute field means once every ten minutes.
Examples:
## Run test.sh every day at 3:30 and 12:30
30 3,12 * * * /home/test.sh
## Run test.sh every 6 hours at minute 30 (0:30, 6:30, 12:30, 18:30)
30 */6 * * * /home/test.sh
## Every day from 8:00 to 18:00, every 2 hours at minute 30 (8:30, 10:30, ..., 18:30)
30 8-18/2 * * * /etc/init.d/network restart
## Every day at 21:30
30 21 * * * /etc/init.d/network restart
## At 4:45 on the 1st, 10th, and 22nd of every month
45 4 1,10,22 * * /etc/init.d/network restart
## At 1:10 on every Saturday and Sunday in August
10 1 * 8 6,0 /etc/init.d/network restart
## Every hour on the hour
00 */1 * * * /etc/init.d/network restart
2. Managing the crond service
service crond start    // start the service
service crond stop     // stop the service
service crond restart  // restart the service
service crond reload   // reload the configuration
3. Project configuration: add a crontab entry that runs the HQL script shortly after midnight each day.
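Assuming Hive is installed under /home/software/hive (adjust the path and the run time to your environment), a possible /etc/crontab entry is shown below. Note that in crontab the % character is special (it starts a new line in the command field), so it must be escaped as \%:

## Run the offline analysis at 00:05 every day (path, time, and log file are illustrative)
5 0 * * * root /home/software/hive/bin/hive -d tday=$(date "+\%Y-\%m-\%d") -f /home/software/logdemo.hql >> /home/software/logdemo.log 2>&1

Redirecting the output to a log file is optional but makes it much easier to diagnose a failed nightly run.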
4. Start the crond service so the scheduled task takes effect: service crond start
III. Summary
This completes the automated execution of the HQL script. The next step is to visualize the results of the data-cleaning process: the visual display of the web-log traffic analysis system's data.