Original link:
https://www.toutiao.com/i6773241257528394248/
Previously we implemented site PV analysis with Java MapReduce ("Java MapReduce to implement site PV analysis"); this time we can compute similar metrics with Hive.
Requirement: compute PV and UV for each of the 24 hours of the day.
Analysis:
(1) PV: total page views — count(url)
(2) UV: unique visitors — count(distinct guid)
(3) Extract the date and hour from the time field (partitioned table)
The expected final result (screenshot omitted)
Next we work through each stage in turn: acquisition, cleaning, and analysis.
Prepare the data and review the data dictionary to understand the structure and meaning of each field (the sample data and data dictionary are omitted here). At this point we can consider acquisition complete; in practice, the collected data is usually handed over to us by the acquisition team.
Log in with the beeline client.
Start the server: bin/hiveserver2 &
Start the client:
bin/beeline -u jdbc:hive2://mastercdh:10000 -n root -p password
According to the data dictionary, create a data table
Create a database
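No statement is shown for this step; a minimal sketch, assuming the database is named exp_track_log (inferred from the Sqoop export path used later in this article):

```sql
-- Create and switch to the database; the name is an assumption
-- based on the warehouse path /user/hive/warehouse/exp_track_log.db/
create database if not exists exp_track_log;
use exp_track_log;
```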
Create a data table
create table track_log_source(
id string,
url string,
referer string,
keyword string,
type string,
guid string,
pageId string,
moduleId string,
linkId string,
attachedInfo string,
sessionId string,
trackerU string,
trackerType string,
ip string,
trackerSrc string,
cookie string,
orderCode string,
trackTime string,
endUserId string,
firstLink string,
sessionViewNo string,
productId string,
curMerchantId string,
provinceId string,
cityId string,
fee string,
edmActivity string,
edmEmail string,
edmJobId string,
ieVersion string,
platform string,
internalKeyword string,
resultSum string,
currentPage string,
linkPosition string,
buttonPosition string
)row format delimited fields terminated by '\t';
Prepare the data.
Import the prepared data:
load data local inpath '/data/test/data1' into table track_log_source;
load data local inpath '/data/test/data2' into table track_log_source;
Verify the load:
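A quick sanity check of the load can be done with a couple of queries (a sketch; the counts depend on your sample files):

```sql
-- Peek at a few rows and count what was loaded
select id, url, guid, trackTime from track_log_source limit 5;
select count(*) from track_log_source;
```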
After acquisition is complete, the data needs to be cleaned, as in the earlier "MapReduce for data deduplication".
Based on the analysis above, create a table holding only the fields we need to extract:
create table track_log_qingxi(
id string,
url string,
guid string,
date string,
hour string
)row format delimited fields terminated by '\t';
Insert the data. Assuming trackTime has a format like 2015-08-28 18:00:00, substring(trackTime,9,2) extracts the day ("28") and substring(trackTime,12,2) the hour ("18"):
insert into table track_log_qingxi select id,url,guid,substring(trackTime,9,2) date,substring(trackTime,12,2) hour from track_log_source;
Partition table: partition on the time fields.
create table track_log_part1(
id string,
url string,
guid string
)partitioned by(date string,hour string)
row format delimited fields terminated by '\t';
Insert the data, one partition at a time:
insert into table track_log_part1 partition(date='20150828',hour='18') select id,url,guid from track_log_qingxi where date='28' and hour='18';
insert into table track_log_part1 partition(date='20150828',hour='19') select id,url,guid from track_log_qingxi where date='28' and hour='19';
Written this way, the conditions have to be spelled out for every single partition, which is very inconvenient.
Let's look at a concept: dynamic partitioning.
Hive's hive-site.xml configuration file has two relevant properties.
The first indicates whether dynamic partitioning is enabled (it is enabled by default):
<property>
<name>hive.exec.dynamic.partition</name>
<value>true</value>
</property>
To use dynamic partitioning, the mode must be set to non-strict; the default is strict:
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>strict</value>
</property>
We change it with a command rather than editing the configuration file directly:
set hive.exec.dynamic.partition.mode=nonstrict;
Then we re-create the partition table
create table track_log_part2(
id string,
url string,
guid string
)partitioned by(date string,hour string)
row format delimited fields terminated by '\t';
Insert again (this is where the dynamic partitioning feature is used):
insert into table track_log_part2 partition(date,hour) select * from track_log_qingxi;
Viewing the data, we find that Hive separated the partitions for us automatically, so if there were more time periods they would also be partitioned automatically.
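The generated partitions can be confirmed with standard HiveQL (a sketch; note that the dynamic insert takes the partition values from the cleaned table, so here date is the two-digit day):

```sql
-- List the partitions created by the dynamic insert
show partitions track_log_part2;
-- Spot-check one partition
select * from track_log_part2 where date='28' and hour='18' limit 5;
```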
Data analysis
PV analysis:
select date,hour,count(url) pv from track_log_part2 group by date,hour;
UV analysis:
select date,hour,count(distinct guid) uv from track_log_part2 group by date,hour;
Write the final results into a result table:
create table result as select date,hour,count(url) pv,count(distinct guid) uv from track_log_part2 group by date,hour;
Data output
The final result is saved into MySQL.
Create a table in MySQL:
create table track_pv_uv_save(
date varchar(30),
hour varchar(30),
pv varchar(30),
uv varchar(30),
primary key (date,hour)
);
Export with Sqoop (Hive to MySQL):
bin/sqoop export \
--connect jdbc:mysql://mastercdh:3306/track_log_mysql \
--username root \
--password password \
--table track_pv_uv_save \
--export-dir /user/hive/warehouse/exp_track_log.db/result \
-m 1 \
--input-fields-terminated-by '\001'
Check the result in MySQL:
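A quick verification query on the MySQL side (a sketch):

```sql
-- Verify the rows exported by Sqoop
select * from track_pv_uv_save order by date, hour;
```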
We can also download the data to the local filesystem:
bin/hdfs dfs -get /user/hive/warehouse/exp_track_log.db/result/000000_0 /data/test
Viewing the downloaded data, everything looks correct.